Managed memory leaks using the BulkLoaderVertexProgram


amark...@...
 

I keep running into the error shown below.

Script:

start = new Date();

hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv", "data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy", "scripts/script_challenge.groovy")

graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date();


The input file contains about 1.7 million vertices, each with 3 properties.

hadoop-script.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=data/ctest.csv
gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy
gremlin.hadoop.outputLocation=output

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer

Error:

graph.compute(SparkGraphComputer).program(blvp).submit().get()

09:20:48 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.
09:21:54 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 78887762 bytes, TID = 2
09:21:55 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.
09:22:25 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 349007340 bytes, TID = 3
09:22:25 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException


I am not sure why I am getting the managed memory leak errors. Is this a Spark configuration setting I need to change?

Also, I am only using factory.vertex(), not graph.addVertex(), to add vertices in the input script.
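For reference, the parse script follows the usual ScriptInputFormat pattern; a minimal sketch of what scripts/script_challenge.groovy does, with hypothetical column names standing in for my real ones:

// sketch of scripts/script_challenge.groovy; ScriptInputFormat calls parse()
// once per input line, and factory (a ScriptElementFactory) builds a StarVertex
def parse(line, factory) {
    def parts = line.split(",")                     // hypothetical layout: id,prop1,prop2,prop3
    def v = factory.vertex(Long.valueOf(parts[0]))  // first column as the vertex id
    v.property("prop1", parts[1])
    v.property("prop2", parts[2])
    v.property("prop3", parts[3])
    return v
}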

The error only happens when I try to load a large number of vertices at a time.

I am running this on an EC2 c4.4xl instance (Ubuntu Xenial) with 8 cores and 30 GB RAM.
