Managed memory leaks using the BulkLoaderVertexProgram


amark...@...
 

I keep running into the error below.

Script:

start = new Date();
hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv","data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy","scripts/script_challenge.groovy")
graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date();


The input file is about 1.7 MM nodes with 3 properties each.

hadoop-script.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph

gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat

gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat

gremlin.hadoop.jarsInDistributedCache=true


gremlin.hadoop.inputLocation=data/ctest.csv

gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy

gremlin.hadoop.outputLocation=output

####################################

# SparkGraphComputer Configuration #

####################################

spark.master=local[4]

spark.executor.memory=4g

spark.serializer=org.apache.spark.serializer.KryoSerializer


Error

graph.compute(SparkGraphComputer).program(blvp).submit().get()

09:20:48 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:21:54 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 78887762 bytes, TID = 2

09:21:55 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:22:25 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 349007340 bytes, TID = 3

09:22:25 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)

org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException

09:22:25 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException


09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 1 times; aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException



Not sure why I am getting this. Is this a Spark configuration setting?

Also, I am only using factory.vertex, not the graph.addVertex method, to add vertices.

This only happens when I try to load a large number of vertices at a time.

I am running this on an EC2 C4.4XL (Ubuntu Xenial, 8 cores, 30 GB RAM).







Jason Plurad <plu...@...>
 

org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException usually indicates you're doing an unchecked call to next() in your traversal.

Are you able to share script_challenge.groovy? Also it seems like you should sort out what's going on with ids.block-size. I'd recommend creating the empty graph before running the bulk loader.
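For reference, the warning means ids.block-size is a GLOBAL_OFFLINE setting, so the value in the local properties file is ignored; it has to be changed through the ManagementSystem with all other graph instances closed. A rough sketch (the value shown is just an example):

```groovy
// Hedged sketch: change a GLOBAL_OFFLINE setting via the ManagementSystem.
// Close all other open instances of the graph first.
graph = JanusGraphFactory.open('/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties')
mgmt = graph.openManagement()
mgmt.set('ids.block-size', 1000000)  // example value; size it for your load
mgmt.commit()
graph.close()
```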

Have you tried increasing the executor memory?


On Sunday, October 8, 2017 at 2:23:55 PM UTC-4, amarkanday wrote:


amark...@...
 

Hi Jason,

Please see my comments below. 

I had created the schema beforehand; it's an empty graph when I start loading the data.

I had increased ids.block-size to 100K (the default is 10K). The JanusGraph docs say that, as a rule of thumb, you should increase it about 10x if you are doing bulk uploads. But I have since changed it back to the default 10K.

script_challenge.groovy

def parse(line, factory) {

   def (cId, cType, cUserId, cCreatedDate) = line.split(/,/).toList()

   def v1 = factory.vertex(cId, "Challenge")

   v1.property("challengeId", cId)

   v1.property("challengeType", Short.parseShort(cType))

   v1.property("creatorUserId", Long.parseLong(cUserId))

   // "MM" is month; lowercase "mm" would parse minutes
   v1.property("challengeCreatedDate", Date.parse("yyyy-MM-dd", cCreatedDate))

   return v1

}


Sample data: I have about 200 MM rows of such data. I am loading about 1.4 MM when I get the errors.

challenge-1,2,2,2016-04-04

challenge-2,1,1,2016-04-03



In the error message, the FastNoSuchElementException happens after I get the executor memory error. I believe the executor memory error is triggering it.

It mostly does a partial commit, so out of my 1.4 MM rows I can load about ~200K at a time before I get an error.

Steps taken to isolate the error source:

1) Increased max and initial heap size on EC2: java -XX:+PrintFlagsFinal -Xms2g -Xmx21g -version | grep HeapSize
2) Changed gremlin.sh to launch with more memory (last line in gremlin.sh): exec $JAVA -Xmx16g $JAVA_OPTIONS $MAIN_CLASS "$@"
3) Increased the memory for the executors:

spark.master=local[4]

spark.executor.memory=3g


My observations 

  • The graph.addVertex method is not supported for batch upload (TinkerPop 3.1.8 and 3.2.6 support it); I have to use the older factory.vertex method.
  • After trying all these combinations and some others, I was still getting the same executor memory error.
I think the problem is with using bulkLoader(OneTimeBulkLoader) in 
BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

If I go with the default value (the incremental loader) by using this:
blvp = BulkLoaderVertexProgram.build().writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

I don't get the Spark executor memory error.

But using this, I get warning messages:

WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertice


And the insert is very slow.

So I think the problem is bulkLoader(OneTimeBulkLoader). Any thoughts?

Thanks for looking into it. I appreciate it.
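For what it's worth, the "Query requires iterating over all vertices" warning with the incremental loader usually means the property the loader uses to look up existing vertices has no index, so every lookup becomes a full scan. A hedged sketch of creating one, assuming the loader's default key name ("bulkLoader.vertex.id") and string ids like "challenge-1":

```groovy
// Hedged sketch: index the incremental bulk loader's lookup key so vertex
// lookups don't require a full graph scan. Key name and String type are
// assumptions based on the defaults and the sample data above.
mgmt = graph.openManagement()
blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(String.class).make()
mgmt.buildIndex('byBulkLoaderVertexId', Vertex.class).addKey(blid).buildCompositeIndex()
mgmt.commit()
```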


On Sunday, October 8, 2017 at 12:02:46 PM UTC-7, Jason Plurad wrote:



amark...@...
 

I guess my bigger question is:

What do I need to set up in the environment to get the bulk loader working?

I have installed a local instance of Hadoop on my EC2 instance and added this:

export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
# Put the Hadoop configuration on the classpath so HDFS doesn't resolve to the local filesystem
export CLASSPATH=$HADOOP_CONF_DIR
export HADOOP_GREMLIN_LIBS=/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib:/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib2


This command from the TinkerPop docs makes me think that I also need to install Spark:

bin/init-tp-spark.sh /usr/local/spark sp...@....0.1 sp...@....0.2 sp...@....0.3


I guess what I'm asking is: what are the config settings for this? I am using SparkGraphComputer only for bulk loading data.

Cheers
A

On Sunday, 8 October 2017 20:30:02 UTC-7, am...@... wrote:



Ted Wilmes <twi...@...>
 

You should be able to use Spark in standalone mode, just as it comes with JanusGraph, to load that amount of data. Here is one more thing you can try: I usually run bulk loads with this set in my script input properties file:

gremlin.spark.persistStorageLevel=DISK_ONLY

This tells Spark to persist intermediate RDD results to disk temporarily instead of keeping them in memory, which in my experience works well for bulk loading and should decrease the pressure on your executors' memory.
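In context, the Spark section of the hadoop-script.properties file posted above would then look something like this (the graphStorageLevel line is an optional extra, not something the original poster had):

```
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spill intermediate RDD results to disk instead of holding them in executor memory
gremlin.spark.persistStorageLevel=DISK_ONLY
# optionally keep the loaded graph RDD on disk as well
gremlin.spark.graphStorageLevel=DISK_ONLY
```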

--Ted


On Monday, October 9, 2017 at 4:35:30 AM UTC-5, amark...@... wrote:



amark...@...
 

Hey Ted,

I tried setting gremlin.spark.persistStorageLevel=DISK_ONLY, but I am still getting:

7:53:16 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 36796560 bytes, TID = 2

17:53:47 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 167934858 bytes, TID = 3

17:53:47 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)


Thanks 
Ashish
On Monday, October 9, 2017 at 9:09:01 AM UTC-7, Ted Wilmes wrote:
You should be able to use Spark in standalone mode, just as it comes with JanusGraph, to load that amount of data. Here is one more thing you can try, I usually run bulk loads with this set in my script input properties file:

gremlin.spark.persistStorageLevel=DISK_ONLY

This tells Spark to persist intermediate RDD results to disk temporarily instead of memory which in my experience works well for bulk loading and should decrease the pressure on your executor's memory.

--Ted

On Monday, October 9, 2017 at 4:35:30 AM UTC-5, am...@... wrote:
I guess my bigger question is

What do I need to setup in the env for getting the bulk loader to work

I have installed a local instance of hadoop on my EC2 instance, and added this 

export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
# Put the Hadoop configuration on the classpath so HDFS doesn't resolve to the local filesystem
export CLASSPATH=$HADOOP_CONF_DIR
export HADOOP_GREMLIN_LIBS=/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib:/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib2


This command from Tinkerpop docs makes me think that I also need to install Spark 

bin/init-tp-spark.sh /usr/local/spark s...@....0.1 s...@....0.2 s...@....0.3


I guess, what is the config settings for this. I am using SparkGraphComputer only for bulk loading of data 

Cheers
A

On Sunday, 8 October 2017 20:30:02 UTC-7, am...@... wrote:
Hi Jason,

Please see my comments below. 

I had created the schema before hand, its an empty graph when I start the load the data 

I had increase the ids.blocksize to 100K, default is 10K. Janusgraph docs say that as a rule of thumb, you should increase that to 10X, if you are doing bulk uploads. But I changed that to default 10K

script_challenge.groovy

def parse(line, factory) {
    def (cId, cType, cUserId, cCreatedDate) = line.split(/,/).toList()
    def v1 = factory.vertex(cId, "Challenge")
    v1.property("challengeId", cId)
    v1.property("challengeType", Short.parseShort(cType))
    v1.property("creatorUserId", Long.parseLong(cUserId))
    // Note: the pattern should be "yyyy-MM-dd" — capital MM is month; lowercase "mm" means minutes in SimpleDateFormat
    v1.property("challengeCreatedDate", Date.parse("yyyy-MM-dd", cCreatedDate))
    return v1
}


Sample data: I have about 200 MM rows of such data. I am loading about 1.4 MM rows when I get the errors.

challenge-1,2,2,2016-04-04

challenge-2,1,1,2016-04-03



In the error output, the FastNoSuchElementException happens after I get the executor memory error, so I believe the memory error is triggering it.

It mostly does a partial commit: out of my 1.4 MM rows, I can load about ~200K at a time before I get an error.

Steps taken to isolate the error source:

1) Increased max and initial heap size on EC2: java -XX:+PrintFlagsFinal -Xms2g -Xmx21g -version | grep HeapSize
2) Changed gremlin.sh to launch with more memory (last line in gremlin.sh): exec $JAVA -Xmx16g $JAVA_OPTIONS $MAIN_CLASS "$@"
3) Increased the memory for the executors:

spark.master=local[4]

spark.executor.memory=3g


My observations 

  • The graph.addVertex method is not supported for batch upload in my setup; TinkerPop 3.1.8 and 3.2.6 support it. I have to use the older factory.vertex method.
  • After trying all these combinations and some others, I was still getting the same executor memory error.
I think the problem is with using bulkLoader(OneTimeBulkLoader) in 
BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

If I go with the default (the incremental loader) by using this:
blvp = BulkLoaderVertexProgram.build().writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

I don't get the Spark executor memory error.

But using this, I get warning messages:

WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices


And the insert is very slow.

So I think the problem is bulkLoader(OneTimeBulkLoader). Any thoughts?

Thanks for looking into it. I appreciate it 


On Sunday, October 8, 2017 at 12:02:46 PM UTC-7, Jason Plurad wrote:
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException usually indicates you're doing an unchecked call to next() in your traversal.
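
To illustrate the unchecked-next() pattern Jason is describing (a hypothetical lookup, not code from the original script): next() on an empty traversal throws FastNoSuchElementException, while tryNext() returns an Optional that can be checked.

```groovy
// Hypothetical lookup against the challenge graph from this thread.
// next() throws FastNoSuchElementException if no vertex matches:
v = g.V().has('challengeId', 'challenge-1').next()

// tryNext() returns an Optional, so an empty result can be handled safely:
maybe = g.V().has('challengeId', 'challenge-1').tryNext()
if (maybe.isPresent()) {
    v = maybe.get()
}
```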

Are you able to share script_challenge.groovy? Also it seems like you should sort out what's going on with ids.block-size. I'd recommend creating the empty graph before running the bulk loader.

Have you tried increasing the executor memory?


On Sunday, October 8, 2017 at 2:23:55 PM UTC-4, amarkanday wrote:
I keep running into this error.

Script 
start = new Date();

hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv", "data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy", "scripts/script_challenge.groovy")

graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date();


The input file is about 1.7 MM nodes with 3 properties.

hadoop-script properties 

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true

gremlin.hadoop.inputLocation=data/ctest.csv
gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy
gremlin.hadoop.outputLocation=output

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=local[4]
spark.executor.memory=4g
spark.serializer=org.apache.spark.serializer.KryoSerializer


Error

graph.compute(SparkGraphComputer).program(blvp).submit().get()

09:20:48 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.
09:21:54 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 78887762 bytes, TID = 2
09:21:55 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.
09:22:25 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 349007340 bytes, TID = 3
09:22:25 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException
09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException



Not sure why I am getting this. Is this a Spark configuration setting?

Also, I am only using the factory.vertex method, not graph.addVertex, to add vertices.

This only happens when I try to load a large number of vertices at a time.

I am running this on EC2 C4.4XL, Ubuntu Xenial, 8 cores, 30 GB RAM.







Ted Wilmes <twi...@...>
 

Hi Ashish,
I read through your memory-specific problem too quickly; those scary-looking errors are actually OK. You can see an explanation here:


For the incremental load, you can get rid of "WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices" by explicitly adding the default bulk loader id property and an index on it before you start the load. The default name of the property is "bulkLoader.vertex.id". The incremental loader checks for the existence of the vertices you're loading before it makes any changes, so it needs to look them up, and to make that quick there needs to be an index.
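
A sketch of what pre-creating that property key and index could look like in the Gremlin console (the index name and the Long data type are assumptions, based on the TinkerPop/Titan bulk loading docs):

```groovy
mgmt = graph.openManagement()
blid = mgmt.makePropertyKey('bulkLoader.vertex.id').dataType(Long.class).make()
mgmt.buildIndex('byBulkLoaderVertexId', Vertex.class).addKey(blid).buildCompositeIndex()
mgmt.commit()
```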

Having said all that, the FastNoSuchElementException in stage 5 is usually due to a failed edge load, where one of the vertices you're connecting an edge to does not exist. I do not see any edge loading in your script, though. Can you reproduce the issue if you attempt a load with just a few lines of data?

--Ted

On Monday, October 9, 2017 at 12:57:52 PM UTC-5, amark...@... wrote:
Hey Ted,

Tried setting gremlin.spark.persistStorageLevel=DISK_ONLY, but I am still getting:

17:53:16 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 36796560 bytes, TID = 2

17:53:47 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 167934858 bytes, TID = 3

17:53:47 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)


Thanks 
Ashish
amark...@...
 

Thanks, Ted, for your suggestion of adding an index for the bulk loader; that makes sense. Also, thanks for pointing out that Spark JIRA ticket.

I did have to make an additional change to solve the FastNoSuchElementException:

I changed ids.flush=true in the .properties file and stopped getting the error.

@Jason: Thanks for your help.

Ted Wilmes <twi...@...>
 

Great! Thanks for reporting back; ids.flush=false was the missing variable. I've entered an issue [1] for that so we can fix it or at least document the incompatibility with the BulkLoaderVertexProgram. Also, a warning for when you get to loading edges: you'll want that index defined on bulkLoader.vertex.id as well, even if you're not running in incremental mode.

--Ted



On Tuesday, October 10, 2017 at 4:19:19 AM UTC-5, amark...@... wrote:
Thanks Ted. Thanks for your suggestions for adding an index to the bulk loader. That makes sense. Also, thanks for pointing that SPARK JIRA ticket. 

I did have to make an additional change to solve for FastNoSuchElement exception

Changed ids.flush=true in the .properties files and stopped getting the error 

@Jason: Thanks for ur help 












On Monday, October 9, 2017 at 11:44:30 AM UTC-7, Ted Wilmes wrote:
Hi Ashish,
Read through your memory specific problem too quickly, those scary looking errors are actually ok, you can see an explanation here:


For the incremental load, you can get rid of "WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices" by explicitly 
adding the default bulk loader id property and an index on it before you start the load. The default name of the property is "bulkLoader.vertex.id". The incremental loader
checks for the existence of the vertices you're loading before it makes any changes so it needs to look them up and to make that quick, there needs to be
an index.

Having said all that, the FastNoSuchElement exception in stage 5 is usually due to a failed edge load where one of the vertices you're connecting an edge to does not exist. I
do not see any edge loading in your script though. Can you reproduce the issue if you attempt to load a few with just a few lines of data?

--Ted

On Monday, October 9, 2017 at 12:57:52 PM UTC-5, am...@... wrote:
Hey Ted,

Tried setting gremlin.spark.persistStorageLevel=DISK_ONLY, but still getting 

7:53:16 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 36796560 bytes, TID = 2

17:53:47 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 167934858 bytes, TID = 3

17:53:47 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)


Thanks 
Ashish
On Monday, October 9, 2017 at 9:09:01 AM UTC-7, Ted Wilmes wrote:
You should be able to use Spark in standalone mode, just as it comes with JanusGraph, to load that amount of data. Here is one more thing you can try, I usually run bulk loads with this set in my script input properties file:

gremlin.spark.persistStorageLevel=DISK_ONLY

This tells Spark to persist intermediate RDD results to disk temporarily instead of memory which in my experience works well for bulk loading and should decrease the pressure on your executor's memory.

--Ted

On Monday, October 9, 2017 at 4:35:30 AM UTC-5, am...@... wrote:
I guess my bigger question is

What do I need to setup in the env for getting the bulk loader to work

I have installed a local instance of hadoop on my EC2 instance, and added this 

export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
# Put the Hadoop configuration on the classpath so HDFS doesn't resolve to the local filesystem
export CLASSPATH=$HADOOP_CONF_DIR
export HADOOP_GREMLIN_LIBS=/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib:/home/ubuntu/janusgraph-0.1.1-hadoop2/ext/spark-gremlin/lib2


This command from Tinkerpop docs makes me think that I also need to install Spark 

bin/init-tp-spark.sh /usr/local/spark s...@....0.1 s...@....0.2 s...@....0.3


I guess, what is the config settings for this. I am using SparkGraphComputer only for bulk loading of data 

Cheers
A

On Sunday, 8 October 2017 20:30:02 UTC-7, am...@... wrote:
Hi Jason,

Please see my comments below. 

I had created the schema before hand, its an empty graph when I start the load the data 

I had increase the ids.blocksize to 100K, default is 10K. Janusgraph docs say that as a rule of thumb, you should increase that to 10X, if you are doing bulk uploads. But I changed that to default 10K

script_challenge.groovy

def parse(line, factory) {


   def (cId, cType, cUserId,cCreatedDate) = line.split(/,/).toList()

   def v1 = factory.vertex(cId, "Challenge")

   v1.property("challengeId", cId) // first value is always the name

   v1.property("challengeType", Short.parseShort(cType)) // first value is always the name

   v1.property("creatorUserId", Long.parseLong(cUserId)) // first value is always the name

   v1.property("challengeCreatedDate", Date.parse("yyyy-mm-dd",cCreatedDate)) // first value is always the name

   return v1


}


sample data : I have about 200 MM rows d such data. I am loading about 1.4 MM when I am getting the errors

challenge-1,2,2,2016-04-04

challenge-2,1,1,2016-04-03



In the error message, .FastNoSuchElementException happens after the I get the executor memory error. I believe that the executor memory error is triggering it 

It mostly does part commit, so out of my 1.4MM rows, I can load about ~200K at a time before I get an error 

Steps taken for isolating the error source 

1) Increased max and initial heap size on EC2. java -XX:+PrintFlagsFinal -Xms2g -Xmx21g -version | grep HeapSize
2) Changes gremlin.sh to launch with more memory (Last time in gremlin.sh) exec $JAVA -Xmx16g $JAVA_OPTIONS $MAIN_CLASS "$@"
3) Increased the memory for executors 

spark.master=local[4]

spark.executor.memory=3g


My observations 

  • graph.addVertex method is not supported for batch upload. Tinkerpop 3.1.8 and 3.2.6 supports it. I have the use the older factory.vertex method 
  • After trying all these combinations and some others, I was still getting the same executor running out of memory error 
I think the problem is with using bulkLoader(OneTimeBulkLoader) in 
BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

If I go with the default value (incremental loader) by using this 
blvp = BulkLoaderVertexProgram.build().writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)

I don't get the Spark executor memory error.

But using this, I get warning messages 

WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices


And the insert is very slow 
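Reading the warning literally, the incremental loader's vertex lookups are doing full scans, which would explain the slowness. My assumption (the key name is TinkerPop's default bulkLoader.vertex.id, String-typed since my ids look like challenge-1) is that building a composite index on that key before loading would avoid the scans:

```groovy
// run in the Gremlin Console before bulk loading; Vertex is
// org.apache.tinkerpop.gremlin.structure.Vertex (pre-imported in the console)
graph = JanusGraphFactory.open("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties")
mgmt = graph.openManagement()
blid = mgmt.makePropertyKey("bulkLoader.vertex.id").dataType(String.class).make()
mgmt.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
mgmt.commit()
graph.close()
```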

So, I think the problem is bulkLoader(OneTimeBulkLoader). Any thoughts? 

Thanks for looking into it. I appreciate it 


On Sunday, October 8, 2017 at 12:02:46 PM UTC-7, Jason Plurad wrote:
org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException usually indicates you're doing an unchecked call to next() in your traversal.
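For example (illustrative values; g is a traversal source over your graph):

```groovy
// unchecked: throws FastNoSuchElementException when nothing matches
v = g.V().has("challengeId", "challenge-1").next()

// guarded: tryNext() returns an Optional, so a miss is handled explicitly
v = g.V().has("challengeId", "challenge-1").tryNext().orElse(null)
```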

Are you able to share script_challenge.groovy? Also it seems like you should sort out what's going on with ids.block-size. I'd recommend creating the empty graph before running the bulk loader.
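For reference, changing a GLOBAL_OFFLINE setting like ids.block-size goes through the ManagementSystem while no other JanusGraph instances are open; roughly (using your config path and the value your local config tried to set):

```groovy
graph = JanusGraphFactory.open("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties")
mgmt = graph.openManagement()
mgmt.set("ids.block-size", 1000000)
mgmt.commit()
graph.close()   // reopen afterwards so the new value takes effect
```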

Have you tried increasing the executor memory?


On Sunday, October 8, 2017 at 2:23:55 PM UTC-4, amarkanday wrote:
I keep running into error

Script 
start = new Date();

hdfs.copyFromLocal("/home/ubuntu/example/data/ctest.csv", "data/ctest.csv")
hdfs.copyFromLocal("/home/ubuntu/example/scripts/script_challenge.groovy", "scripts/script_challenge.groovy")

graph = GraphFactory.open("/home/ubuntu/janusgraphdocker/conf/hadoop-script.properties")
blvp = BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).writeGraph("/home/ubuntu/janusgraphdocker/conf/janusgraph-cassandra-es.properties").create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()
stop = new Date();


The input file is about 1.7 MM nodes with 3 properties.

hadoop-script properties 

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph

gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat

gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat

gremlin.hadoop.jarsInDistributedCache=true


gremlin.hadoop.inputLocation=data/ctest.csv

gremlin.hadoop.scriptInputFormat.script=scripts/script_challenge.groovy

gremlin.hadoop.outputLocation=output

####################################

# SparkGraphComputer Configuration #

####################################

spark.master=local[4]

spark.executor.memory=4g

spark.serializer=org.apache.spark.serializer.KryoSerializer


Error

graph.compute(SparkGraphComputer).program(blvp).submit().get()

09:20:48 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:21:54 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 78887762 bytes, TID = 2

09:21:55 WARN  org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration  - Local setting ids.block-size=1000000 (Type: GLOBAL_OFFLINE) is overridden by globally managed value (10000).  Use the ManagementSystem interface instead of the local configuration to control this setting.

09:22:25 ERROR org.apache.spark.executor.Executor  - Managed memory leak detected; size = 349007340 bytes, TID = 3

09:22:25 ERROR org.apache.spark.executor.Executor  - Exception in task 0.0 in stage 5.0 (TID 3)

org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException

09:22:25 WARN  org.apache.spark.scheduler.TaskSetManager  - Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException


09:22:25 ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 5.0 failed 1 times; aborting job

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 3, localhost): org.apache.tinkerpop.gremlin.process.traversal.util.FastNoSuchElementException



Not sure why I am getting this. Is this a Spark configuration setting?

Also, I only use factory.vertex, not the graph.addVertex method, to add vertices.

This only happens when I try to load a large number of vertices at a time.

I am running this on EC2 C4.4XL, Ubuntu Xenial, 8 cores, and 30 GB RAM.