Re: Strange behaviors for JanusGraph 0.5.3 on AWS EMR
asivieri@...
Hi,
here are the properties that I am setting so far (plus the same ones that are set in the TinkerPop example, such as the classpath for the executors and the driver):
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
schema.default=none
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.batch-loading=true
janusgraphmr.ioformat.conf.storage.buffer-size=10000
janusgraphmr.ioformat.conf.storage.cql.keyspace=...
janusgraphmr.ioformat.conf.storage.hostname=...
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.username=...
janusgraphmr.ioformat.conf.storage.password=...
cassandra.output.native.port=9042
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
spark.master=yarn
spark.executor.memory=20g
spark.executor.cores=4
spark.driver.memory=20g
spark.driver.cores=8
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistContext=true
gremlin.spark.persistStorageLevel=MEMORY_AND_DISK
spark.default.parallelism=1000
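For context, I am then starting the job from the Gremlin Console in the standard way (the properties file path is a placeholder):

graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()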
On the Spark UI I can see that the first job has the same number of tasks as the number of tokens in our Scylla cluster (256 tokens per node * 3 nodes), but only two executors are spawned, even though I tried on a cluster with 96 cores and 768 GB of RAM, which, given the driver and executor settings in the properties above, should allow far more than 2.
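Could it be that, with dynamic allocation disabled, Spark on YARN falls back to spark.executor.instances, which as far as I know defaults to 2? That would match exactly the two executors I am seeing. If so, something like this should help (the instance count is just an example):

spark.executor.instances=20

or, alternatively, enabling dynamic allocation together with the external shuffle service:

spark.dynamicAllocation.enabled=true
spark.shuffle.service.enabled=true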
Moreover, I wrote a dedicated Java application that replicates the first step of SparkGraphComputer, i.e. the step where the entire vertex list is read into an RDD. Basically, I skip the Gremlin Console entirely, start a "normal" Spark session as we do in our other applications, and read the entire vertex list from Scylla. In this case the job has the same number of tasks as before, but the number of executors is the one I expected, so it seems that something in the Spark context creation performed by Gremlin is limiting this number; maybe I am missing a configuration.
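The test application is essentially the following sketch (hosts and keyspace are placeholders, username/password omitted; I am keeping only the janusgraphmr/cassandra settings relevant to the read path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable;
import org.janusgraph.hadoop.formats.cql.CqlInputFormat;

public class VertexScanTest {
    public static void main(String[] args) {
        // Plain Spark session, exactly as in our other applications
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("janusgraph-vertex-scan"));

        // Hadoop configuration mirroring the janusgraphmr/cassandra properties above
        Configuration conf = new Configuration();
        conf.set("janusgraphmr.ioformat.conf.storage.backend", "cql");
        conf.set("janusgraphmr.ioformat.conf.storage.hostname", "<scylla-hosts>"); // placeholder
        conf.set("janusgraphmr.ioformat.conf.storage.port", "9042");
        conf.set("janusgraphmr.ioformat.conf.storage.cql.keyspace", "<keyspace>"); // placeholder
        conf.set("cassandra.input.partitioner.class",
                "org.apache.cassandra.dht.Murmur3Partitioner");
        conf.set("cassandra.input.widerows", "true");

        // Same read path as SparkGraphComputer's first stage:
        // one input split (task) per token range
        JavaPairRDD<NullWritable, VertexWritable> vertices =
                sc.newAPIHadoopRDD(conf, CqlInputFormat.class,
                        NullWritable.class, VertexWritable.class);

        // Here the executor count is the expected one,
        // but the count comes back as 0
        System.out.println("vertices: " + vertices.count());
        sc.stop();
    }
}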
The problem of empty results, however, remained: in this test the output RDD is completely empty, even though the DEBUG logs show that it connects to the correct keyspace, which does contain data. There are no exceptions, so I am not sure why nothing is being read. Am I missing some properties, in your opinion/experience?
Best regards,
Alessandro