Thank you, Marc. I was able to reduce the number of tasks by adjusting the `num_tokens` setting on Cassandra. I am still unsure why each task takes so long, though. I am hoping that this is a per-task overhead that stays the same as we process larger datasets.
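For anyone reading along, here is a rough sketch of the arithmetic that seems to link `num_tokens` to the task count. It assumes one Spark input task per Cassandra token range plus one extra for the wrap-around range; that rule is an assumption taken from the "257 magic number" explanation, not confirmed JanusGraph behaviour, and `expected_input_splits` is a hypothetical helper name.

```python
def expected_input_splits(nodes: int, num_tokens: int) -> int:
    """Estimate the number of Spark input tasks for an OLAP read.

    ASSUMPTION: one split per Cassandra token range (nodes * num_tokens)
    plus one extra split for the wrap-around range.
    """
    if nodes < 1 or num_tokens < 1:
        raise ValueError("nodes and num_tokens must be positive")
    return nodes * num_tokens + 1


# A single node with num_tokens=256 would give the 257 tasks seen here.
print(expected_input_splits(1, 256))  # 257
# Two such nodes would give 513; the thread does not confirm the cluster size.
print(expected_input_splits(2, 256))  # 513
# Lowering num_tokens shrinks the task count accordingly.
print(expected_input_splits(1, 16))   # 17
```

If the assumption holds, lowering `num_tokens` is the main lever on the task count, which is consistent with what I observed.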
On Saturday, December 5, 2020 at 3:20:17 PM UTC-5 HadoopMarc wrote:
Not a solution, but someone in the thread below explained the 257 magic number for OLAP on a Cassandra cluster:
On Friday, December 4, 2020 at 8:48:28 PM UTC+1 Varun Ganesh wrote:
I am facing the same issue. I am using SparkGraphComputer to read from JanusGraph backed by Cassandra. `g.V().count()` takes about 3 minutes to load just the two rows that I have in the graph.
I see that about 257 tasks are created. In my case, I do see parallelism in the Spark cluster that I am using, but each task takes about 5 seconds on average and there is no obvious reason why.
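To show how these numbers fit together, here is a back-of-the-envelope sketch. It assumes tasks run in equal-length "waves" across a fixed number of executor slots; the slot count of 8 is purely hypothetical (it was not measured on this cluster), and `estimated_wall_seconds` is an illustrative helper, not part of Spark or JanusGraph.

```python
def estimated_wall_seconds(tasks: int, seconds_per_task: float, slots: int) -> float:
    """Estimate job wall time when tasks run in waves of `slots` at a time.

    ASSUMPTION: every task takes the same time and scheduling is ideal.
    """
    if tasks < 0 or slots < 1:
        raise ValueError("tasks must be >= 0 and slots must be >= 1")
    waves = (tasks + slots - 1) // slots  # ceiling division
    return waves * seconds_per_task


# 257 tasks at ~5 s each on a hypothetical 8 slots: 33 waves.
print(estimated_wall_seconds(257, 5, 8))  # 165
```

Under these assumptions the total comes out close to the 3 minutes observed, which suggests the run time is dominated by the fixed per-task cost rather than by the two rows being read.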
(I can attach a page from the Spark UI and also the properties file I am using, but I am unable to find the option to)
Would appreciate any input on solving this. Thank you!
On Friday, December 6, 2019 at 5:04:55 AM UTC-5 s...@...
I'm experiencing the same problem of having a seemingly uncontrollable, static number of Spark tasks. Did you ever figure out how to fix this?
On Friday, October 18, 2019 at 4:19:19 PM UTC+2, dim...@...
I have set up JanusGraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to JanusGraph from the Gremlin console and execute:
It takes 25 minutes to do the count! It took the same time when there were no vertices, i.e. when the result was 0. The Spark job shows that 513 tasks were run! The number of tasks is always 513, regardless of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" in the Spark job's environment, but the number of Spark tasks was still 513! My assumption is that JanusGraph somehow specifies this number of tasks when it submits the job to Spark.
The questions are:
- Why is a JanusGraph job submitted to Spark always parallelized into 513 tasks?
- How can I control the number of tasks that are created for a JanusGraph job?
- How can I minimize the execution time of an OLAP query for this small graph (the OLTP query takes less than a second to execute)?