Re: Janusgraph Hadoop Spark standalone cluster - Janusgraph job always creates constant number 513 of Spark tasks


Hi Dimitar,

Your Spark screenshots do not show any parallelism. You state that your Spark cluster has only a single worker. It seems that this worker also has only one core available (the spark.executor.cores property is not specified, so by default all available worker cores would be available to SparkGraphComputer). Without any parallelism, a Spark job will never be faster than an application without Spark.

That being said, I do not understand why a single task takes 2 seconds. Retrieving 4 rows on average from Cassandra should take more like 40 ms, so we are off by a factor of 50, almost two orders of magnitude. Apparently each task has some unexplained overhead for setting up ...? I would expect the Spark worker to keep its JVM alive between tasks, and SparkGraphComputer to keep its classes loaded and its Cassandra connection established.
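
To make the numbers above concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model that is not taken from Spark itself: all tasks take the same time, they run in "waves" of one task per core, and scheduling delays are ignored. The function name and the 4-core scenario are hypothetical, chosen only for illustration.

```python
import math


def estimated_wall_time_s(num_tasks: int, task_time_s: float, cores: int) -> float:
    """Estimate wall-clock time when tasks run in waves of `cores` tasks at a time.

    Simplified model: equal task durations, no scheduling or startup delays.
    """
    waves = math.ceil(num_tasks / cores)
    return waves * task_time_s


# Figures from this thread: 513 tasks, about 2 s per task, one core.
observed = estimated_wall_time_s(513, 2.0, 1)

# Same job if each task took the 40 ms one would expect for a few rows.
expected = estimated_wall_time_s(513, 0.040, 1)

# Hypothetical scenario: the same 2 s tasks spread over 4 cores.
four_cores = estimated_wall_time_s(513, 2.0, 4)

print(f"1 core, 2 s/task:   {observed / 60:.1f} min")
print(f"1 core, 40 ms/task: {expected:.1f} s")
print(f"4 cores, 2 s/task:  {four_cores / 60:.1f} min")
```

Under this model the per-task overhead, not the data volume, dominates the runtime: adding cores divides the total, but only removing the roughly 2 s per task would bring the job down to seconds.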

What I also do not understand is why the different 23-minute jobs are scheduled with a large delay in between. Is the underlying cloud not available? Would that also mean that the vcores used by the Spark worker have very low performance? I would first try some simple Spark jobs in a test application (no JanusGraph, no Cassandra) to be sure that you have a standalone Spark cluster that behaves as expected: parallelism visible in the Executors tab of the Spark UI and no strange waiting periods between jobs of a single application.

Cheers,    Marc

On Monday, October 21, 2019 at 11:23:26 UTC+2, Dimitar Tenev wrote:

Hi Marc,

The output of nodetool gives "Number of partitions (estimate): 967"; the whole output is attached as "nodetool_log.txt".
Regarding the Spark configuration: yes, I have used the guides from the link you provided, and "" (attached) is the graph configuration which I use for "og". I have also attached the HTML pages from the Spark UI for the JanusGraph job (Stage, Job, Environment). Spark is configured with one master and one worker node, and yes, the worker node output shows that the tasks are processed by it. Any help is appreciated!


On Monday, October 21, 2019 at 10:48:00 AM UTC+3, ma...@... wrote:
Hi Dimitar,

The number 513 is probably the number of Cassandra partitions. You can inspect the number of partitions in the tables of the Cassandra cluster with:
$ nodetool tablestats <your_keyspace>
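
If the output is long, a small script can pull out just the partition estimates per table. This is only a sketch: it assumes the output contains "Table: <name>" lines followed by a "Number of partitions (estimate): <n>" line, as in the line quoted elsewhere in this thread. The sample text and the table names in it are made up for illustration and are not real output from this cluster.

```python
import re

# Patterns assume the layout described above; adjust if your nodetool version differs.
TABLE_RE = re.compile(r"^\s*Table:\s*(\S+)")
PARTITIONS_RE = re.compile(r"^\s*Number of partitions \(estimate\):\s*(\d+)")


def partition_estimates(tablestats_output: str) -> dict:
    """Map each table name to its estimated partition count."""
    estimates = {}
    current_table = None
    for line in tablestats_output.splitlines():
        table_match = TABLE_RE.match(line)
        if table_match:
            current_table = table_match.group(1)
            continue
        partitions_match = PARTITIONS_RE.match(line)
        if partitions_match and current_table is not None:
            estimates[current_table] = int(partitions_match.group(1))
    return estimates


# Hypothetical sample, shaped like the assumed layout (not real cluster output).
SAMPLE = """\
Keyspace : my_keyspace
\t\tTable: edgestore
\t\tNumber of partitions (estimate): 967
\t\tTable: graphindex
\t\tNumber of partitions (estimate): 12
"""

for table, count in partition_estimates(SAMPLE).items():
    print(f"{table}: {count}")
```

In practice you would save the nodetool output to a file and pass its contents to partition_estimates instead of SAMPLE.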

Involving SparkGraphComputer only helps for a large number of vertices (100,000+), because there is a lot of one-off overhead for instantiating the JVMs for the Spark executors. Even then, the 25 minutes you mention is excessive. Are you sure your k8s Spark cluster was used? The JanusGraph default is to run Spark locally inside your JanusGraph container; see the docs for how to configure JanusGraph for a Spark standalone cluster.

HTH,     Marc

On Friday, October 18, 2019 at 16:19:19 UTC+2, dim...@... wrote:

I have set up JanusGraph 0.4.0 with Hadoop 2.9.0 and Spark 2.4.4 in a K8s cluster.
I connect to JanusGraph from the Gremlin console and execute:
gremlin> og
==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()

It takes 25 min to do the count! It took the same time when there were no vertices, i.e. when the result was 0. The Spark job shows that 513 tasks were run! The number of tasks is always 513, regardless of the number of vertices.
I have set "spark.sql.shuffle.partitions=4" in the Spark job's environment, but the number of Spark tasks was still 513! My assumption is that JanusGraph somehow specifies this number of tasks when it submits the job to Spark.
The questions are:
- Why is a JanusGraph job submitted to Spark always parallelized into 513 tasks?
- How can I control the number of tasks created for a JanusGraph job?
- How can I minimize the execution time of an OLAP query for this small graph (an OLTP query takes less than a second to execute)?

