Topics

Janusgraph - OLAP using Dataproc


bobo...@...
 

Hi,

We are using JanusGraph 0.5.2 with ScyllaDB as the backend. So far we have only used the OLTP capabilities, but we would now like to do some more advanced batch processing to create shortcut edges, for example for recommendations. To do that, I would like to use the OLAP features.

Reading the documentation, this sounds pretty straightforward, assuming one has a Hadoop cluster up and running. But here comes my problem: I would like to use Dataproc, Google's managed solution for Hadoop and Spark. Unfortunately, I couldn't find any further information on how to get those two things to play well together.

Does anyone have any experience, hints or documentation on how to properly configure JanusGraph with Dataproc?

As a very first step, I tried the following (a Java application with embedded JanusGraph):

GraphTraversalSource g = GraphFactory.open("graph.properties").traversal().withComputer(SparkGraphComputer.class);
long count = g.V().count().next();
...
g.close();

The graph.properties file looks like this:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

# Cassandra
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=myhost
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

# Spark
spark.master=local[*]
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator


If I just run the code like this, without specifying anything else, nothing happens, only endless log output like this:

18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:39:07.749 [Executor task launch worker for task 3] DEBUG o.j.g.t.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000

Additionally, I added the hdfs-site.xml extracted from Dataproc to my classpath, but that didn't help either.

The same works like a charm in the OLTP world (using a proper query, of course, not one that iterates over the whole graph... :D).

Any hints, ideas, experiences or links are greatly appreciated.

Looking forward to some answers,
Claire


SAURABH VERMA <saurabh...@...>
 

We've set up JanusGraph OLAP with Spark on YARN; is that something you are looking for?

Thanks


--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgra...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/7dc9a3f1-82bc-47d5-89a1-5f3d4e21e5cdo%40googlegroups.com.


--
Thanks & Regards,
Saurabh Verma,
India



Claire F <bobo...@...>
 

Hi Saurabh,

Thanks for your reply. 
I am specifically looking for a setup using Dataproc.

Regards
Claire



HadoopMarc <bi...@...>
 

Hi Claire,

As Saurabh also indicated, your current config runs Spark locally on your client node and does not use Dataproc at all.

What could possibly work (I have never used Dataproc myself):
Best wishes,   Marc


Claire F <bobo...@...>
 

Hi Marc,

Thanks a lot for your detailed answer. I will give that a try and see if I can get it to work.
Then I hope I'll find a way to marry all of that into my Java code once it works from the Gremlin console, but that shouldn't be an issue.

I am aware that my current config runs Spark locally. However, I seem to have misunderstood the documentation: I thought the Hadoop cluster was still needed for some temporary files, and that is why I assumed I'd need Dataproc's Hadoop component as well. Even better if I don't.

Regards and thanks again
Claire



bobo...@...
 

Hi

After resolving some version conflicts, I was able to run the SparkGraphComputer using Dataproc's managed Spark. The code is contained in a Java application (with embedded JanusGraph).

As this might be interesting to other people in the future, I wanted to share my setup here:

JanusGraph version: 0.5.2
TinkerPop version: 3.4.7
Dataproc version: 1.2.100-debian9 (because we need Spark 2.2.x)

A very basic example of the Java code:

Configuration configuration = new PropertiesConfiguration("graph.properties"); // Needed this way, otherwise GraphFactory requires an absolute path
GraphTraversalSource g = GraphFactory.open(configuration).traversal().withComputer(SparkGraphComputer.class);
long count = g.V().count().next();
...


graph.properties


# Hadoop Graph Configuration
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

# Scylla
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=scylla-host
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

# Spark
spark.master=yarn-client
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

Some additional notes. In our Maven setup we needed to:
  • exclude the jackson-databind dependency from spark-gremlin due to version conflicts with JanusGraph
  • use the Maven Shade plugin to build an uber-jar and relocate com.google.common, to resolve Guava conflicts between the version on Dataproc and the one in our application
  • use Java 8 (required by the Dataproc version we need for Spark 2.2.x)
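For reference, the relevant pom.xml fragments looked roughly like the following. This is a sketch rather than our exact build file: the Shade plugin version and the relocated package prefix (shaded.com.google.common) are illustrative, so adjust them to your own setup.

```xml
<!-- Sketch: exclude jackson-databind from spark-gremlin (version conflict with JanusGraph) -->
<dependency>
  <groupId>org.apache.tinkerpop</groupId>
  <artifactId>spark-gremlin</artifactId>
  <version>3.4.7</version>
  <exclusions>
    <exclusion>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<!-- Sketch: build an uber-jar and relocate Guava so it cannot clash with Dataproc's copy -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version> <!-- illustrative version -->
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```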


Finally, we simply build a JAR archive with all the (shaded) dependencies and upload it to Google Cloud Storage. We then submit the Spark job as follows:

gcloud dataproc jobs submit spark --cluster=<cluster> --class=<mainClass> --jars=gs://<bucket>/<folder>/<shaded-jar-with-dependencies>.jar  --region=<region>


Regards
Claire




HadoopMarc <bi...@...>
 

Great work!

Marc



kndoan94@...
 

Hi Claire! 

Would you mind sharing the pom.xml file for your build? I'm trying a similar build for AWS and am hitting a mess of dependency errors.

Thank you :)
Ben