Re: Managed memory leaks using the BulkLoaderVertexProgram


amark...@...
 

I guess my bigger question is:

What do I need to set up in the environment to get the bulk loader working?

I have installed a local instance of Hadoop on my EC2 instance and added this:

export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
# Put the Hadoop configuration on the classpath so HDFS doesn't resolve to the local filesystem
export CLASSPATH=$HADOOP_CONF_DIR
export HADOOP_GREMLIN_LIBS=/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib:/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib2
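
As a sanity check I can run this from the Gremlin Console to confirm those variables are actually visible to the JVM (my own sketch, not from any doc; a null here means the export didn't take effect):

// Sketch: print the Hadoop-related variables exported above as the console JVM sees them
['HADOOP_CONF_DIR', 'YARN_CONF_DIR', 'HADOOP_GREMLIN_LIBS', 'CLASSPATH'].each { name ->
    println "${name} = ${System.getenv(name)}"
}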


This command from the TinkerPop docs makes me think that I also need to install Spark:

bin/init-tp-spark.sh /usr/local/spark sp...@....0.1 sp...@....0.2 sp...@....0.3


I guess, what are the config settings for this? I am using SparkGraphComputer only for bulk loading data.
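
My understanding (an assumption, please correct me) is that with spark.master=local[*] the Spark executors run inside the Gremlin Console JVM, so a separate Spark installation and init-tp-spark.sh are only needed for a standalone spark:// master. A minimal console flow for a local-mode bulk load would then look roughly like this, reusing the property files from later in this thread:

// Rough sketch of a local-mode bulk load session; assumes the hadoop and spark
// plugins were already :install-ed, and the paths below exist.
:plugin use tinkerpop.hadoop
:plugin use tinkerpop.spark

graph = GraphFactory.open('/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties')
blvp = BulkLoaderVertexProgram.build().writeGraph('/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties').create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()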

Cheers
A

On Sunday, 8 October 2017 20:30:02 UTC-7, am...@... wrote:
Hi Jason,

Please see my comments below. 

I had created the schema beforehand; it's an empty graph when I start loading the data.

I had increased ids.block-size to 100K (the default is 10K). The JanusGraph docs say that, as a rule of thumb, you should increase it to roughly 10x the default if you are doing bulk loads. But I have since changed it back to the default 10K.
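
Since ids.block-size is a GLOBAL_OFFLINE setting (the warning in the log further down says the local value is being overridden by the globally managed 10000), my understanding is that it has to be changed through the ManagementSystem rather than the properties file, roughly like this (a sketch, run while no other JanusGraph instances are open):

// Sketch: change the globally managed ids.block-size via the ManagementSystem;
// the properties file alone cannot override a GLOBAL_OFFLINE setting once the graph exists.
graph = JanusGraphFactory.open('/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties')
mgmt = graph.openManagement()
mgmt.set('ids.block-size', 100000)   // the 100K value mentioned above
mgmt.commit()
graph.close()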

script_challenge.groovy

def parse(line, factory) {
    def (cId, cType, cUserId, cCreatedDate) = line.split(/,/).toList()
    def v1 = factory.vertex(cId, "Challenge")
    v1.property("challengeId", cId)
    v1.property("challengeType", Short.parseShort(cType))
    v1.property("creatorUserId", Long.parseLong(cUserId))
    // "MM" is the month pattern; a lowercase "mm" would be parsed as minutes
    v1.property("challengeCreatedDate", Date.parse("yyyy-MM-dd", cCreatedDate))
    return v1
}


Sample data: I have about 200 MM rows of such data. I am loading about 1.4 MM rows when I get the errors.

challenge-1,2,2,2016-04-04

challenge-2,1,1,2016-04-03
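
Just to rule out the parse step itself, a throwaway check like the following (my own sketch) can be run in a plain console against those two sample rows; note the month pattern has to be "yyyy-MM-dd", since a lowercase "mm" would be parsed as minutes:

// Sketch: exercise the same split and date parsing as script_challenge.groovy
// on the sample rows above, without Hadoop or Spark involved.
['challenge-1,2,2,2016-04-04', 'challenge-2,1,1,2016-04-03'].each { line ->
    def (cId, cType, cUserId, cCreatedDate) = line.split(/,/).toList()
    println([cId, Short.parseShort(cType), Long.parseLong(cUserId), Date.parse('yyyy-MM-dd', cCreatedDate)])
}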



In the error message, the FastNoSuchElementException happens after I get the executor memory error. I believe the executor memory error is triggering it.

It mostly does a partial commit, so out of my 1.4 MM rows I can load roughly 200K at a time before I get an error.

Steps taken to isolate the error source:

1) Increased the max and initial heap size on EC2 (checked with java -XX:+PrintFlagsFinal -Xms2g -Xmx21g -version | grep HeapSize)
2) Changed gremlin.sh to launch with more memory (last line in gremlin.sh): exec $JAVA -Xmx16g $JAVA_OPTIONS $MAIN_CLASS "$@"
3) Increased the memory for the executors (see the note after this config block):

spark.master=local[4]

spark.executor.memory=3g
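
One thing I'm not sure about (so treat this as an assumption): with spark.master=local[4] the executors are threads inside the Gremlin Console JVM, so the -Xmx passed through gremlin.sh should be the limit that actually matters, not spark.executor.memory. A quick way to confirm the heap the console really got:

// Sketch: print the maximum heap of the current (console/driver) JVM, which
// in local[*] mode is also the ceiling for the Spark executor threads.
println "max heap MB = ${Runtime.getRuntime().maxMemory().intdiv(1024 * 1024)}"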


My observations 

  • The graph.addVertex method is not supported for bulk loading here; TinkerPop 3.1.8 and 3.2.6 support it. I have to use the older factory.vertex method.
  • After trying all these combinations and some others, I was still getting the same executor out-of-memory error.
I think the problem is with using bulkLoader(OneTimeBulkLoader) in 
BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

If I go with the default (the incremental loader) by using this:
blvp = BulkLoaderVertexProgram.build().writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

I don't get the Spark executor memory error.

But using this, I get warning messages 

WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices


And the insert is very slow 
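
If I understand the incremental loader correctly (an assumption on my part), it looks up existing vertices by the bulkLoader.vertex.id property, and without an index on that key JanusGraph falls back to the full vertex scan that the warning above is complaining about. A rough sketch of the kind of index that is usually suggested, created on the empty graph before the load (the index name is just illustrative, and the key type matches my string ids):

// Sketch: composite index on the property the incremental bulk loader uses for
// its existence checks, so it does not have to iterate over all vertices.
graph = JanusGraphFactory.open('/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties')
mgmt = graph.openManagement()
blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(String.class).make()
mgmt.buildIndex('byBulkLoaderVertexId', Vertex.class).addKey(blid).buildCompositeIndex()
mgmt.commit()
graph.close()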

So, I think the problem is bulkLoader(OneTimeBulkLoader). Any thoughts? 
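
One more knob that might be worth a try (this is my assumption, I have not verified it against this data set): BulkLoaderVertexProgram only commits at the end of an iteration unless intermediateBatchSize is set, so with OneTimeBulkLoader a whole partition's worth of mutations can pile up in memory. Something like:

// Sketch: same program as above, but committing every 10,000 mutations so the
// transaction does not grow unbounded during the load.
blvp = BulkLoaderVertexProgram.build().
           bulkLoader(OneTimeBulkLoader).
           intermediateBatchSize(10000).
           writeGraph('/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties').
           create(graph)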

Thanks for looking into it. I appreciate it 


On Sunday, October 8, 2017 at 12:02:46 PM UTC-7, Jason Plurad wrote:
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException usually indicates you're doing an unchecked call to next() in your traversal.
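
For example (a hypothetical traversal, not something taken from your script), g.V().has('challengeId', someId).next() throws FastNoSuchElementException when nothing matches, whereas a guarded version does not:

// Hypothetical illustration: guard the lookup instead of calling next() blindly.
v = g.V().has('challengeId', 'challenge-1').tryNext().orElse(null)
if (v == null) {
    // handle the missing vertex instead of letting FastNoSuchElementException propagate
}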

Are you able to share script_challenge.groovy? Also it seems like you should sort out what's going on with ids.block-size. I'd recommend creating the empty graph before running the bulk loader.

Have you tried increasing the executor memory?


On Sunday, October 8, 2017 at 2:23:55 PM UTC-4, amarkanday wrote:
I keep running into this error.

Script 
start = new Date();

hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv", "data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy", "scripts/script_challenge.groovy")

graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date();


The input file is about 1.7 MM nodes with 3 properties each.

hadoop-script.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=data/ctest.csv
gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy
gremlin.hadoop.outputLocation=output

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer


Error

graph.compute(SparkGraphComputer).program(blvp).submit().get()

09:20:48 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:21:54 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 78887762 bytes, TID = 2

09:21:55 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:22:25 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 349007340 bytes, TID = 3

09:22:25 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)

org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException

09:22:25 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException


09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 1 times; aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException



Not sure why I am getting this. Is this a Spark configuration setting?

Also, I am only using factory.vertex, not the graph.addVertex method, to add vertices.

It only happens when I try to load a large number of vertices at a time.

I am running this on an EC2 C4.4XL instance (Ubuntu Xenial, 8 cores, 30 GB RAM).





