Re: Janusgraph Hadoop Spark standalone cluster - Janusgraph job always creates constant number 513 of Spark tasks


Hi Dimitar,

The number 513 is probably the number of Cassandra partitions. You can inspect the number of partitions in the tables of the Cassandra cluster with:
$ nodetool tablestats <your_keyspace>

Involving SparkGraphComputer only helps for a large number of vertices (100,000+) because there is a lot of one-off overhead for instantiating the JVMs for the Spark executors. Even then, the 25 minutes you mention is excessive. Are you sure your k8s Spark cluster was actually used? By default, JanusGraph runs Spark in local mode inside your JanusGraph container; see the docs for how to configure JanusGraph for a Spark standalone cluster.
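As a rough sketch of what that configuration might look like, the Spark-related lines in the HadoopGraph properties file would point at the standalone master rather than at local mode. The master host, port, memory setting and jar path here are placeholders I am assuming for illustration, not values taken from your setup, so adjust them to your cluster:

```properties
# Sketch only: host, port, memory and jar path are assumed placeholders.
# The default in the shipped example properties is spark.master=local[*].
spark.master=spark://<spark-master-host>:7077
spark.executor.memory=1g
# The JanusGraph jars must be available on every Spark worker at this (assumed) path.
spark.executor.extraClassPath=/opt/lib/janusgraph/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
```

If the job then shows up in the UI of the Spark standalone master, you know the cluster is being used rather than the local executor.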

HTH, Marc

On Friday, 18 October 2019 16:19:19 UTC+2, dim...@... wrote:


I have set up JanusGraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to JanusGraph from the Gremlin console and execute:
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()

It takes 25 minutes to do the count! It took the same time when there were no vertices, i.e. when the result was 0. The Spark job shows that 513 tasks were run! The number of tasks is always 513, regardless of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" in the Spark job's environment, but the number of Spark tasks was still 513! My assumption is that JanusGraph somehow specifies this number of tasks when it submits the job to Spark.
The questions are:
- Why is a JanusGraph job submitted to Spark always parallelized into 513 tasks?
- How can I control the number of tasks created for a JanusGraph job?
- How can I minimize the execution time of an OLAP query on this small graph (the OLTP query takes less than a second to execute)?

