Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR


asivieri@...
 

Hi,

here are the properties that I am setting so far (in addition to the ones set in the TinkerPop example, such as the classpath for the executors and the driver):
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
 
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
 
schema.default=none
 
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.batch-loading=true
janusgraphmr.ioformat.conf.storage.buffer-size=10000
janusgraphmr.ioformat.conf.storage.cql.keyspace=...
 
janusgraphmr.ioformat.conf.storage.hostname=...
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.username=...
janusgraphmr.ioformat.conf.storage.password=...
cassandra.output.native.port=9042
 
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
 
spark.master=yarn
spark.executor.memory=20g
spark.executor.cores=4
spark.driver.memory=20g
spark.driver.cores=8
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistContext=true
gremlin.spark.persistStorageLevel=MEMORY_AND_DISK
spark.default.parallelism=1000
In the Spark UI, the first job has as many tasks as there are tokens in our Scylla cluster (256 tokens per node * 3 nodes = 768), but only two executors are spawned. I tried this on a cluster with 96 cores and 768 GB of RAM, which, given the driver and executor settings in the properties above, should be able to allocate many more than 2.
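A possible explanation, not yet verified on this cluster: none of the properties listed above sets an executor count. As far as I know, Spark on YARN defaults to spark.executor.instances=2 when static allocation is in effect, which would match the two executors observed. A minimal sketch of what could be added to the same properties file follows; the value 20 is purely illustrative, and Option B assumes the external shuffle service is available on the nodes.

```properties
# Option A (static allocation): request executors explicitly.
# 20 is a hypothetical value for illustration, not a tested setting.
spark.executor.instances=20

# Option B (dynamic allocation): let Spark scale executors with the task backlog.
# Assumption: the external shuffle service is running on the cluster nodes.
# spark.dynamicAllocation.enabled=true
# spark.shuffle.service.enabled=true
```

Only one of the two options should be needed. Either way, the resulting executor count can be checked on the Executors tab of the Spark UI.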

Moreover, I wrote a dedicated Java application that replicates the first step of the SparkGraphComputer, the step where the entire vertex list is read into an RDD. In other words, I skipped the Gremlin console entirely, started a "normal" Spark session as we do in our applications, and read the entire vertex list from Scylla. In this case the job has the same number of tasks as before, but the number of executors is what I expected. It seems to me that something in the Spark context creation performed by Gremlin is limiting this number, so maybe I am missing a configuration.
The problem of empty results, however, remains: in this test the output RDD is completely empty, even though the DEBUG logs show that it connects to the correct keyspace, which does contain data. There are no exceptions, so I am not sure why nothing is being read. In your experience, am I missing some properties?

Best regards,
Alessandro
