Re: Janus Graph performing OLAP with Spark/Yarn


Jason Plurad <plu...@...>
 

I posted an answer for SparkGraphComputer with YARN for TinkerPop 3.2.4 over on gremlin-users. The approach works similarly for JanusGraph 0.1.1.

I'll echo sjudeng's comments on matching the Spark version exactly. Spark is very picky about that.

# conf/hadoop/read-cassandra.properties

# TinkerPop Hadoop Graph for OLAP
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# Set the default OLAP computer for graph.traversal().withComputer()
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer

# I/O Formats
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

# JanusGraph-Cassandra InputFormat configuration
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=192.168.70.101
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph

# Apache Cassandra InputFormat configuration
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner

# Gremlin Console acts as the Spark Driver (YARN client)
spark.master=yarn-client
spark.executor.memory=512m

# When true, jars from HADOOP_GREMLIN_LIBS become added jars available via http to executors
# In Spark 1.6.1, jars are added but don't appear to be available...
gremlin.hadoop.jarsInDistributedCache=false

# Install JanusGraph on all worker nodes, then add jars with local fs path
spark.executor.extraClassPath=/home/vagrant/_/opt/janusgraph-0.1.1-hadoop2/lib/*

-- Jason


On Wednesday, May 31, 2017 at 10:09:28 PM UTC-4, sjudeng wrote:
I think there are many success stories/snippets out there on this but no consolidated how-to that I'm aware of. Marc, I'm pretty sure I've seen plenty of examples from you on this across various lists over the years. I can contribute a couple of examples as well if we can get some documentation started on this under JanusGraph. I've had success getting traversals and vertex programs working using Titan SparkGraphComputer with HBase using both TinkerPop-3.0.1/Spark-1.2 (Yarn/Cloudera) and TinkerPop-3.2.3/Spark-1.6 (Yarn/Cloudera and Mesos), but I haven't tested this with JanusGraph yet. Personally, I'd recommend you consider running Spark on Mesos instead of Yarn if possible. The configuration is easier in my opinion, and you can have apps running against different versions of Spark, making hardware and software updates much easier and less disruptive.

A few notes in case they're helpful. First, and probably obvious: I always match the server Spark version exactly to the Spark version in TinkerPop from the relevant JanusGraph/Titan distribution. Second, I've found the spark.executor.extraClassPath property to be crucial to getting things working with both Yarn and Mesos. Jars included there will be at the start of the classpath, which is important when the cluster may have conflicting versions of core/transitive dependencies. I'll usually create a single jar with all dependencies (excluding Spark), put it somewhere accessible on all cluster nodes, and then define spark.executor.extraClassPath pointing to it.
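
For illustration, the single-jar approach described in the note above might look roughly like the following properties fragment. This is only a sketch: the directory and jar name are hypothetical and do not come from this thread, and the second setting is simply carried over from Jason's configuration on the assumption that the jar is already installed on every worker node.

```properties
# Hypothetical location of the single jar holding all dependencies except Spark.
# The same local path must exist on every cluster node.
spark.executor.extraClassPath=/opt/graph-olap/graph-olap-all.jar

# Assumption: since the jar is pre-installed on the workers, do not also
# ship the HADOOP_GREMLIN_LIBS jars through the distributed cache.
gremlin.hadoop.jarsInDistributedCache=false
```

Because entries in spark.executor.extraClassPath come first on the executor classpath, the versions bundled in that jar would take precedence over any conflicting copies already present on the cluster.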
