Janusgraph Hadoop Spark standalone cluster - Janusgraph job always creates constant number 513 of Spark tasks


Varun Ganesh <operatio...@...>
 

Thank you Marc. I was able to reduce the tasks by adjusting the `num_tokens` settings on Cassandra. Still unsure about why each task takes so long though. Hoping that this a per-task overhead that stays the same as we process larger datasets.

On Saturday, December 5, 2020 at 3:20:17 PM UTC-5 HadoopMarc wrote:
Hi Varun,

Not a solution, but someone in the thread below explained the 257 magic number for OLAP on a Cassandra cluster:

Marc


Op vrijdag 4 december 2020 om 20:48:28 UTC+1 schreef Varun Ganesh:
Hi,

I am facing this same issue. I am using SparkGraphComputer to read from Janusgraph backed by cassandra. `g.V().count()` takes about 3 minutes to load just two rows that I have in the graph.

I see that about 257 tasks are created. In my case, I am seeing parallelism in the spark cluster that I am using but each task seems to take about ~5 seconds on average and there is no obvious reason why.

(I can attach a page from the Spark UI and also the properties file I am using, but I am unable to find the option to)

Would appreciate any input on solving this. Thank you!

Varun

On Friday, December 6, 2019 at 5:04:55 AM UTC-5 s...@... wrote:
Hi Dimitar,

I'm experiencing the same problem of having some seemingly uncontrollable static number of Spark task - did you ever figure out how to fix this?

Thanks,
Sture


On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@... wrote:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


HadoopMarc <bi...@...>
 

Hi Varun,

Not a solution, but someone in the thread below explained the 257 magic number for OLAP on a Cassandra cluster:
https://groups.google.com/g/janusgraph-users/c/IdrRyIefihY

Marc


Op vrijdag 4 december 2020 om 20:48:28 UTC+1 schreef Varun Ganesh:

Hi,

I am facing this same issue. I am using SparkGraphComputer to read from Janusgraph backed by cassandra. `g.V().count()` takes about 3 minutes to load just two rows that I have in the graph.

I see that about 257 tasks are created. In my case, I am seeing parallelism in the spark cluster that I am using but each task seems to take about ~5 seconds on average and there is no obvious reason why.

(I can attach a page from the Spark UI and also the properties file I am using, but I am unable to find the option to)

Would appreciate any input on solving this. Thank you!

Varun

On Friday, December 6, 2019 at 5:04:55 AM UTC-5 s...@... wrote:
Hi Dimitar,

I'm experiencing the same problem of having some seemingly uncontrollable static number of Spark task - did you ever figure out how to fix this?

Thanks,
Sture


On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@... wrote:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


Varun Ganesh <operatio...@...>
 

Hi,

I am facing this same issue. I am using SparkGraphComputer to read from Janusgraph backed by cassandra. `g.V().count()` takes about 3 minutes to load just two rows that I have in the graph.

I see that about 257 tasks are created. In my case, I am seeing parallelism in the spark cluster that I am using but each task seems to take about ~5 seconds on average and there is no obvious reason why.

(I can attach a page from the Spark UI and also the properties file I am using, but I am unable to find the option to)

Would appreciate any input on solving this. Thank you!

Varun

On Friday, December 6, 2019 at 5:04:55 AM UTC-5 s...@... wrote:
Hi Dimitar,

I'm experiencing the same problem of having some seemingly uncontrollable static number of Spark task - did you ever figure out how to fix this?

Thanks,
Sture


On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@... wrote:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


sly...@...
 

Hi Dimitar,

I'm experiencing the same problem of having some seemingly uncontrollable static number of Spark task - did you ever figure out how to fix this?

Thanks,
Sture


On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@... wrote:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


marc.d...@...
 

Hi Dimitar,

Your spark screenshots do not show any parallelism. You state that your spark cluster only has a single worker. It seems that this worker also has only one core available (the spark.executor.cores property is not specified, so by default all available worker cores would be available to SparkGraphComputer). Without any parallelism a spark job will never be faster that an application without spark.

That being said, I do not understand why a single task takes 2 seconds. Retrieving 4 rows by average from Cassandra should rather take some 40 ms, so we are two orders of speed away from that. Apparently each task has some unexplained overhead, for setting up ........???? I would expect that the spark worker keeps its JVM, and that SparkGraphComputer keeps its classes loaded and its Cassandra connection established between tasks.

What I also do not understand is why the different 23 minute jobs are scheduled with a large delay in between. Is the underlying cloud not available? Would that also mean that the vcores used in the spark worker have a very low performance?  I would first try some simple spark jobs for a test application (no janusgraph, no cassandra) and be sure that you have a standalone spark cluster that behaves as expected: parallelism visible in the executor tab of the spark UI and no strange waiting periods between jobs of a single application.

Cheers,    Marc






Op maandag 21 oktober 2019 11:23:26 UTC+2 schreef Dimitar Tenev:

Hi Marc,

The output of nodetool gives: Number of partitions (estimate): 967 the whole output is attached as "nodetool_log.txt". 
Regarding the Spark configuration - Yes I have used the guides from the link you have provided, and "janus-spark.properties" (attached) is the graph configuration which I use for "og". I have also attached the html pages from Spark UI for the janusgraph job (Stage, Job, Environment) as spark.zip. Spark is configured with one master and one worker node, and yes the worker node output shows that the tasks are processed by it. Any help is appreciated!

Thanks,
Dimitar 

On Monday, October 21, 2019 at 10:48:00 AM UTC+3, ma...@... wrote:
Hi Dimitar,

The number 513 is probably the number of Cassandra partitions. You can inspect the number of partitions in the tables of the Cassandra cluster with:
$ nodetool tablestats <your_keyspace>

Involving SparkGraphComputer only helps for a large number of vertices (100.000+) because there is a lot of one-off overhead for instantiating the JVM's for the Spark executors. Even then, the 25 minutes you mention is excessive. Are you sure your k8s spark cluster was used? The janusgraph default is to use spark local inside your janusgraph container, see the docs for how to configure JanusGraph for a Spark standalone cluster.

HTH,     Marc


Op vrijdag 18 oktober 2019 16:19:19 UTC+2 schreef dim...@...:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


Dimitar Tenev <dimitar....@...>
 

Hi Marc,

The output of nodetool gives: Number of partitions (estimate): 967 the whole output is attached as "nodetool_log.txt". 
Regarding the Spark configuration - Yes I have used the guides from the link you have provided, and "janus-spark.properties" (attached) is the graph configuration which I use for "og". I have also attached the html pages from Spark UI for the janusgraph job (Stage, Job, Environment) as spark.zip. Spark is configured with one master and one worker node, and yes the worker node output shows that the tasks are processed by it. Any help is appreciated!

Thanks,
Dimitar 

On Monday, October 21, 2019 at 10:48:00 AM UTC+3, ma...@... wrote:
Hi Dimitar,

The number 513 is probably the number of Cassandra partitions. You can inspect the number of partitions in the tables of the Cassandra cluster with:
$ nodetool tablestats <your_keyspace>

Involving SparkGraphComputer only helps for a large number of vertices (100.000+) because there is a lot of one-off overhead for instantiating the JVM's for the Spark executors. Even then, the 25 minutes you mention is excessive. Are you sure your k8s spark cluster was used? The janusgraph default is to use spark local inside your janusgraph container, see the docs for how to configure JanusGraph for a Spark standalone cluster.

HTH,     Marc


Op vrijdag 18 oktober 2019 16:19:19 UTC+2 schreef dim...@...:
Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


marc.d...@...
 

Hi Dimitar,

The number 513 is probably the number of Cassandra partitions. You can inspect the number of partitions in the tables of the Cassandra cluster with:
$ nodetool tablestats <your_keyspace>

Involving SparkGraphComputer only helps for a large number of vertices (100.000+) because there is a lot of one-off overhead for instantiating the JVM's for the Spark executors. Even then, the 25 minutes you mention is excessive. Are you sure your k8s spark cluster was used? The janusgraph default is to use spark local inside your janusgraph container, see the docs for how to configure JanusGraph for a Spark standalone cluster.

HTH,     Marc


Op vrijdag 18 oktober 2019 16:19:19 UTC+2 schreef dim...@...:

Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar


dimitar....@...
 

Hello,

I have setup Janusgraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to Janusgraph from gremlin console and execute: 
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889

It takes 25min to do the count! The same time took when there were no vertices - e.g. -> 0.  Spark job shows that there were 513 tasks run! Number of task is always constant 513 no matter of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" at spark job's environment, but again the number of Spark tasks was 513! My assumption is that Janusgraph somehow specifies this number of tasks when submits the job to Spark.
The questions are:
- Why Janusgraph job submitted to Spark is always palatalized to 513 tasks? 
- How to manage the number of tasks which are created for a Janusgrap job? 
- How to minimize the execution time of OLAP query for this small graph (OLTP query takes less than a second to execute)?

Thanks,
Dimitar