Re: JanusGraph Hadoop Spark standalone cluster - JanusGraph job always creates a constant number (513) of Spark tasks


HadoopMarc <bi...@...>
 

Hi Varun,

Not a solution, but someone in the thread below explained the 257 magic number for OLAP on a Cassandra cluster:
https://groups.google.com/g/janusgraph-users/c/IdrRyIefihY
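
If I read that thread correctly, the input format creates one Spark partition per Cassandra token range, and with the default num_tokens: 256 a single node exposes 256 ranges plus one extra split, hence 257; 513 would then match two nodes' worth of vnodes (2 x 256 + 1). So the knob sits on the Cassandra side, not in Spark. A sketch of what one could try on a test cluster that can be re-provisioned (num_tokens only applies to newly bootstrapped nodes, and 16 is an arbitrary example value):

# cassandra.yaml on each Cassandra node
num_tokens: 16    # fewer vnodes -> fewer token ranges -> fewer input splits/tasks

Also note that spark.sql.shuffle.partitions will not help here: it only controls Spark SQL shuffle stages, while these 257/513 tasks come from the Hadoop input splits of the scan itself.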

Marc


On Friday, December 4, 2020 at 8:48:28 PM UTC+1, Varun Ganesh wrote:

Hi,

I am facing the same issue. I am using SparkGraphComputer to read from JanusGraph backed by Cassandra. `g.V().count()` takes about 3 minutes even though I have just two rows in the graph.

I see that about 257 tasks are created. The tasks do run in parallel on the Spark cluster I am using, but each one takes about 5 seconds on average, and there is no obvious reason why.
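The numbers do roughly add up, though: 257 tasks at ~5 s each is about 21 minutes of total task time, which at the concurrency I am getting comes out to the ~3 minutes of wall-clock time above.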

(I would attach a page from the Spark UI and the properties file I am using, but I cannot find the option to do so.)

Would appreciate any input on solving this. Thank you!

Varun

On Friday, December 6, 2019 at 5:04:55 AM UTC-5 s...@... wrote:
Hi Dimitar,

I'm experiencing the same problem of a seemingly uncontrollable, static number of Spark tasks - did you ever figure out how to fix this?

Thanks,
Sture


On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@... wrote:
Hello,

I have set up JanusGraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to JanusGraph from the Gremlin Console and execute:
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
==>1889
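
For context, the graph is opened with a properties file along the lines of the read-cassandra example that ships with JanusGraph (a sketch; hostname, keyspace, and Spark master are placeholders for my actual values):

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=cassandra-host
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.master=spark://spark-master:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer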

It takes 25 minutes to do the count! It took the same time when there were no vertices at all (a count of 0). The Spark job shows that 513 tasks were run, and the number of tasks is always 513, regardless of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" in the Spark job's environment, but the number of Spark tasks was again 513! My assumption is that JanusGraph somehow specifies this number of tasks when it submits the job to Spark.
The questions are:
- Why is a JanusGraph job submitted to Spark always parallelized into 513 tasks?
- How can I control the number of tasks created for a JanusGraph job?
- How can I minimize the execution time of an OLAP query on this small graph (the equivalent OLTP query takes less than a second)?

Thanks,
Dimitar
