ClassNotFoundException running Gremlin on Spark


borde...@...
 

Hello,

I'm attempting to transition from Titan to JanusGraph 0.1.0 and am having problems getting OLAP queries to work via Spark.  I've loaded a graph with about 2 million vertices and tried to execute a simple count:

gremlin> graph = GraphFactory.open('janusgraph-olap.properties')
gremlin> g = graph.traversal(computer(SparkGraphComputer))
gremlin> g.V().count()

The job soon fails with "java.lang.ClassNotFoundException: org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer", which I know is in spark-gremlin-3.2.3.jar.  This appears to happen before the Spark executor has a chance to start.  I tried adding this jar to spark.executor.extraClassPath, but it didn't help.  Does HADOOP_GREMLIN_LIBS come into play?  I've tried fiddling with it but to no avail.
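To be concrete about the HADOOP_GREMLIN_LIBS part, here's roughly how I've been setting it before launching the console (the install path below is just illustrative of my layout; jars in the listed directories should get copied to the distributed cache when gremlin.hadoop.jarsInDistributedCache=true):

```shell
# Illustrative paths -- adjust for the actual install directory.
# HADOOP_GREMLIN_LIBS lists jar directories that the Gremlin Console
# ships to the cluster's distributed cache when
# gremlin.hadoop.jarsInDistributedCache=true.
export JANUSGRAPH_HOME=/opt/janusgraph-0.1.0-hadoop2
export HADOOP_GREMLIN_LIBS=$JANUSGRAPH_HOME/lib
$JANUSGRAPH_HOME/bin/gremlin.sh
```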

I'm using HBase 1.1.2.2.5.3.0-37 and Spark 1.6 on HDP 2.5.3.0.

OLTP Gremlin queries work ok.

Here's my properties file:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph

gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=dummyoutput

janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.1.1.1,10.1.1.2,10.1.1.3
janusgraphmr.ioformat.conf.storage.port=2181

storage.backend=hbase
storage.hostname=10.1.1.1,10.1.1.2,10.1.1.3
storage.port=2181

cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

spark.master=yarn-client
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.5.3.0-37

This was working fine using Titan.

Thanks,
Jerrell


Jason Plurad <plu...@...>
 

A similar message came up on the gremlin-users mailing list. You might want to compare notes with that thread.
https://groups.google.com/d/msg/gremlin-users/LYv-cvZ66hU/vqZJD4OzBQAJ


On Wednesday, May 17, 2017 at 1:12:16 AM UTC-4, Jerrell Schivers wrote: