Re: OLAP, Hadoop, Spark and Cassandra

Mladen Marović <mladen...@...>

I know I'm quite late to the party, but for future reference: the number of input partitions in Spark depends on the partitioning of the source. In the case of Cassandra, partitioning is determined by the number of tokens each node gets (as configured by `num_tokens` in `cassandra.yaml`), which is set to 256 by default. So, if you have a 3-node Cassandra cluster, by default each node should get 256 tokens, which would result in 3*256 = 768 tokens total. Since Spark reads directly from Cassandra (if you're using `org.janusgraph.hadoop.formats.cql.CqlInputFormat`), that translates to 768 partitions in the input Spark RDD, or 768 tasks during processing. Add to that 1 task that collects results, or something similar, and you end up at 769. At least that was my experience.
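
The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate of the relationship described in this post, not something taken from the JanusGraph or Spark source; the function name `estimated_spark_tasks` and the `extra_tasks` parameter are hypothetical and exist purely for illustration.

```python
def estimated_spark_tasks(nodes, num_tokens=256, extra_tasks=1):
    """Rough estimate of Spark tasks for a full scan over Cassandra.

    Assumption: one input partition per token, plus `extra_tasks`
    for the step that collects results (as described above).
    """
    input_partitions = nodes * num_tokens
    return input_partitions + extra_tasks


# 3-node cluster with the default num_tokens of 256
print(estimated_spark_tasks(3))
```

Treat the result as an approximation: the extra task is an observation from one setup, not a documented rule.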

The default value of 256 for `num_tokens` made sense in older versions, but in Cassandra 3.x a new token allocation algorithm was implemented to improve performance for operations requiring token-range scans, which is precisely what Spark does. I experimented a bit with smaller values (e.g. 16) and managed to drastically reduce the number of tasks when scanning the entire graph. For further reading, I recommend this article.
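
To get a feel for the reduction, here is a minimal sketch comparing the two `num_tokens` values mentioned above for a 3-node cluster. It assumes the same one-partition-per-token relationship described earlier in this post; the helper name `input_partitions` is hypothetical.

```python
def input_partitions(nodes, num_tokens):
    # Assumption: one Spark input partition per Cassandra token.
    return nodes * num_tokens


nodes = 3  # cluster size used as the example in this thread
estimates = {tokens: input_partitions(nodes, tokens) for tokens in (256, 16)}

for tokens, partitions in estimates.items():
    print(f"num_tokens={tokens}: about {partitions} input partitions")
```

Actual numbers on a real cluster may differ, so it is worth checking the task count in the Spark log after changing the setting.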

On Thursday, December 5, 2019 at 9:28:26 AM UTC+1 s...@... wrote:
Answering my own question - turned out I had had a mixup of keyspaces used between the two instances

Default the conf/hadoop-graph/ reads

While for CQL it should read

Also - as I made a 'named' (ve_graph) graph I had to point to that one rather than the janusgraph keyspace.

Problem 1 solved. Now to the next - how can I lower the number of 'partitions' Spark is using (here 769, '... on localhost (executor driver) (769/769)')?

On Wednesday, December 4, 2019 at 11:46:42 PM UTC+1, Sture Lygren wrote:

I'm trying to get JanusGraph 0.4.0 with a Cassandra (CQL) backend set up and running as OLAP while still keeping OLTP active in order to do graph updates. I've been searching high and low for some guidance, but so far without any luck. Hopefully someone here could chime in and help?

Here's where I'm at currently

  • local Hadoop running according to
  • gremlin server started as /bin/ conf/gremlin-server/gremlin-server-configuration.yaml
  • gremlin-server-configuration.yaml points to init.groovy script doing the traversal mappings for OLTP and OLAP
def globals = [:]
ve ="ve_graph")
OLAPGraph ='conf/hadoop-graph/')
globals << [g : ve.traversal(), sg: OLAPGraph.traversal().withComputer(]
  • conf/hadoop-graph/ reads
  • Running the gremlin shell I have
         (o o)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sture/Scripts/janusgraph-0.4.0-hadoop2/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sture/Scripts/janusgraph-0.4.0-hadoop2/lib/logback-classic-1.1.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
plugin activated: tinkerpop.server
plugin activated: tinkerpop.tinkergraph
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.utilities
plugin activated: janusgraph.imports
gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/[655848fc-b46e-40be-8174-f0dc42cdabd4]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/]-[655848fc-b46e-40be-8174-f0dc42cdabd4] - type ':remote console' to return to local mode
gremlin> g
==>graphtraversalsource[standardjanusgraph[cql:[]], standard]
gremlin> sg
==>graphtraversalsource[hadoopgraph[cqlinputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().has('lbl','System').count()
gremlin> sg.V().has('lbl','System').count()
  • The job is running for some time and while finishing the gremlin-server.log reads
253856 [Executor task launch worker for task 768] INFO  org.apache.spark.executor.Executor  - Finished task 768.0 in stage 0.0 (TID 768). 2388 bytes result sent to driver
253858 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSetManager  - Finished task 768.0 in stage 0.0 (TID 768) in 6809 ms on localhost (executor driver) (769/769)
253861 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler  - ResultStage 0 (fold at finished in 161.427 s
253861 [task-result-getter-1] INFO  org.apache.spark.scheduler.TaskSchedulerImpl  - Removed TaskSet 0.0, whose tasks have all completed, from pool
253876 [SparkGraphComputer-boss] INFO  org.apache.spark.scheduler.DAGScheduler  - Job 0 finished: fold at, took 161.598267 s
253888 [SparkGraphComputer-boss] INFO  org.apache.spark.rdd.MapPartitionsRDD  - Removing RDD 1 from persistence list
253901 [block-manager-slave-async-thread-pool-0] INFO  - Removing RDD 1
  • However - the count (==> ) reads 0 for the sg traversal
I've most likely missed some crucial point here, but I'm not able to spot it. Please help.
