I keep running into an error when bulk loading vertices into JanusGraph with BulkLoaderVertexProgram on SparkGraphComputer.
Script
start = new Date()
// stage the input CSV and the parse script in HDFS
hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv", "data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy", "scripts/script_challenge.groovy")
// HadoopGraph that reads the CSV via ScriptInputFormat
graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
// one-time bulk load into the JanusGraph defined by janusgraph-cassandra-es.properties
blvp = BulkLoaderVertexProgram.build().
        bulkLoader(OneTimeBulkLoader).
        writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").
        create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date()
The input file contains about 1.7 million vertices, each with 3 properties.
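The layout is one vertex per line: an id followed by the three property values, roughly like this (placeholder values, not the real data):

id1,propval1,propval2,propval3
id2,propval1,propval2,propval3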
hadoop-script.properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=data/ctest.csv
gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy
gremlin.hadoop.outputLocation=output
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
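One thing I was not sure about in this file: since spark.master is local[4], I believe the executors run inside the driver JVM, so spark.executor.memory may be ignored and the driver heap would be what actually matters, e.g. (my assumption, not something I have confirmed):

spark.driver.memory=10g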
Error
graph.compute(SparkGraphComputer).program(blvp).submit().get()
09:20:48 WARN org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:21:54 ERROR org.apache.spark.executor.Executor - Managed memory leak detected; size = 78887762 bytes, TID = 2
09:21:55 WARN org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000). Use the ManagementSystem interface instead of the local configuration to control this setting.
09:22:25 ERROR org.apache.spark.executor.Executor - Managed memory leak detected; size = 349007340 bytes, TID = 3
09:22:25 ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 5.0 (TID 3)
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager - Task 0 in stage 5.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
I am not sure why I am getting this. Is this a Spark configuration setting?
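Separately, regarding the ids.block-size warning: I assume the globally managed value has to be changed through the ManagementSystem on the target JanusGraph rather than via the local properties file. Is something like this untested sketch the right approach?

// open the target JanusGraph (the graph the bulk loader writes to), not the HadoopGraph
g = JanusGraphFactory.open("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties")
mgmt = g.openManagement()
mgmt.set("ids.block-size", 1000000)  // GLOBAL_OFFLINE: only changeable while this is the sole open instance
mgmt.commit()
g.close()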
Also, I am only using factory.vertex, not the graph.addVertex method, to add vertices.
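For context, the parse function in script_challenge.groovy follows the usual ScriptInputFormat pattern; simplified (the property names and label here are placeholders), it looks roughly like this:

// scripts/script_challenge.groovy -- simplified sketch; real property names differ
def parse(line, factory) {
    def fields = line.split(",")
    // ScriptElementFactory supplies factory.vertex(id, label)
    def v = factory.vertex(fields[0], "node")
    v.property("prop1", fields[1])
    v.property("prop2", fields[2])
    v.property("prop3", fields[3])
    return v  // returning null would skip the line
}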
This only happens when I try to load a large number of vertices at a time.
I am running this on an EC2 c4.4xl instance (Ubuntu Xenial) with 8 cores and 30 GB RAM.