JanusGraph performing OLAP with Spark/YARN
John Helmsen <john....@...>
Gentlemen and Ladies,
Currently our group is trying to stand up an instance of JanusGraph/Titan that performs OLAP operations using SparkGraphComputer in TinkerPop. To do OLAP, we wish to use Spark with YARN. So far, however, we have not been able to successfully launch any distributed queries, such as count(), using this approach. While we can post stack traces, etc., I'd like to ask a different question first.
Has anyone gotten the system to perform Spark operations using YARN?
If so, how?
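For reference, the kind of console session we are attempting looks roughly like this (a sketch; the properties file name is illustrative, with any of the usual storage backends behind it):

gremlin> graph = GraphFactory.open('conf/hadoop/read-graph.properties')
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
gremlin> g.V().count()

It is that last step, the distributed count, that we cannot get to launch.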
Joe Obernberger <joseph.o...@...>
Hi John - I'm also very interested in how to do this. We recently built a graph stored in HBase, and when we ran g.E().count() from the Gremlin shell it took some 5+ hours to complete (79 million edges). Is there any 'how to' or getting-started guide on using Spark+YARN with this?
Thank you!
-Joe
HadoopMarc <m.c.d...@...>
Hi John,
I have plans to try this too, so question seconded. I have TinkerPop-3.1.1 OLAP working on Spark/YARN (Hortonworks), but the JanusGraph HBase or Cassandra dependencies will make version conflicts harder to handle.
Basically, you need:
- your cluster configs on your application or console classpath
- solve version conflicts: get rid of the lower-version jars where there is a minor version difference, and report to this list if clashing versions differ by a major version number. I believe the current lib folder of the JanusGraph distribution already contains a few duplicate jars with minor version differences (sorry, I have not had time to report this). You will hate spark-assembly, because it is not easy to remove lower versions of dependencies bundled inside it... Spark has some config options to load user jars first, though (sketched below). I still wonder whether some Maven guru could spare us this manual work by adding the entire cluster as a dependency to the JanusGraph project, so that version conflicts surface at build time instead of at runtime.
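As far as I know, those options are the following (added in Spark 1.3 and marked experimental in the Spark docs):

# Prefer user-supplied jars over the versions bundled with Spark when loading classes
spark.driver.userClassPathFirst=true
spark.executor.userClassPathFirst=true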
Also, I might be mistaken in the above, and simple configs might solve the question. So the original question still stands (has anyone ....)
Cheers, Marc
sju...@...
I think there are many success stories and snippets out there on this, but no consolidated how-to that I'm aware of. Marc, I'm pretty sure I've seen plenty of examples from you on this across various lists over the years. I can contribute a couple of examples as well if we can get some documentation started on this under JanusGraph. I've had success getting traversals and vertex programs working using Titan's SparkGraphComputer with HBase, using both TinkerPop-3.0.1/Spark-1.2 (YARN/Cloudera) and TinkerPop-3.2.3/Spark-1.6 (YARN/Cloudera and Mesos), but I haven't tested this with JanusGraph yet. Personally, I'd recommend you consider running Spark on Mesos instead of YARN if possible. The configuration is easier in my opinion, and you can have apps running against different versions of Spark, which makes hardware and software updates much easier and less disruptive.
A few notes in case they're helpful: The first is probably obvious, but I always match the server's Spark version exactly to the Spark version in TinkerPop from the relevant JanusGraph/Titan distribution. Also, I've found the spark.executor.extraClassPath property to be crucial to getting things working with both YARN and Mesos. Jars included there are placed at the start of the classpath, which is important when the cluster has conflicting versions of core or transitive dependencies. I'll usually create a single jar with all dependencies (excluding Spark), put it somewhere accessible on all cluster nodes, and then set spark.executor.extraClassPath to point to it.
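In properties form, that setup looks roughly like this (the jar name and path are illustrative):

# One fat jar with all graph dependencies (Spark excluded), built with e.g. the
# maven-shade-plugin and copied to the same path on every worker node.
# Jars listed here are placed at the front of the executor classpath.
spark.executor.extraClassPath=/opt/janusgraph/janusgraph-deps-all.jar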
Jason Plurad <plu...@...>
I posted an answer for SparkGraphComputer with YARN for TinkerPop 3.2.4 over on gremlin-users. The approach works similarly for JanusGraph 0.1.1.
I'll echo sjudeng's comments on matching the Spark version exactly. Spark is very picky about that.
Here is the properties file from that setup:
# conf/hadoop/read-cassandra.properties
# TinkerPop Hadoop Graph for OLAP
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# Set the default OLAP computer for graph.traversal().withComputer()
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
# I/O Formats
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# JanusGraph-Cassandra InputFormat configuration
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=192.168.70.101
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
# Apache Cassandra InputFormat configuration
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
# Gremlin Console acts as the Spark Driver (YARN client)
spark.master=yarn-client
spark.executor.memory=512m
# When true, jars from HADOOP_GREMLIN_LIBS become added jars available via http to executors
# In Spark 1.6.1, jars are added but don't appear to be available...
gremlin.hadoop.jarsInDistributedCache=false
# Install JanusGraph on all worker nodes, then add jars with local fs path
spark.executor.extraClassPath=/home/vagrant/_/opt/janusgraph-0.1.1-hadoop2/lib/*
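With that file in place, the count from the original question can be run from the Gremlin Console roughly as follows (a minimal sketch; it assumes the Spark plugin is active in the console, as it is in the JanusGraph distribution):

gremlin> graph = GraphFactory.open('conf/hadoop/read-cassandra.properties')
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
gremlin> g.V().count()    // launches a Spark job on the YARN cluster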
-- Jason