
Streaming graph data

JZ <zamb...@...>
 

Hello,

Does anyone know if there is a way to stream a graph to a viewer such as Gephi from the JanusGraph Java client? When using the Gremlin Console you can use the tinkerpop.gephi plugin and redirect a graph to Gephi. Is there a way to do that from a Java application that has created a graph? I did not find any mention of this in the documentation.
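(For reference, the console workflow I mean looks roughly like this, going from the TinkerPop Gephi plugin docs, with Gephi's Streaming plugin listening on its default endpoint:

gremlin> :plugin use tinkerpop.gephi
gremlin> graph = TinkerFactory.createModern()
gremlin> :remote connect tinkerpop.gephi graph
gremlin> :> graph

I am looking for the equivalent from plain Java code.)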

Thanks

JGZ


Re: Bulk loading using JSON, Python, or Scala?

Yihang Yan <yanyi...@...>
 

I see... we need to use a GraphML file. The issue here is that our graph might contain billions of nodes; will the Python client be able to handle that?

Thanks!


On Friday, June 16, 2017 at 12:24:25 AM UTC-4, David Brown wrote:
You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.

Best of luck!

Dave


On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Time series modelling help needed...

Ravikumar Govindarajan <ravikumar....@...>
 

I need help with time series modelling, with Cassandra as the backend storage.

Consider a model like the one below:

                      Has many                                   Which Report
Domain         ----------------->   Sub-Domains     ----------------------->   Traffic_Stats  

Let's say that each Traffic_Stats vertex has a time component.


1. For domain="abc" and sub-domain="xyz", get all traffic stats ordered by time descending, limit 10
2. For domain="abc", get all traffic stats ordered by time descending, limit 10 

Will an edge index on time, linking Sub-Domain and Traffic_Stats, satisfy the first query?

What about the second query? Should I create duplicate edges from Domain to Traffic_Stats with the same edge index on time?

Also, is it the case that even if I use an edge index, Janus will pull all the data into memory and do the sort? In the fictitious case of a domain having 10k sub-domains, each reporting 50k Traffic_Stats, that could be prohibitively expensive.
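Something like the following vertex-centric (edge) index is what I have in mind; a minimal sketch, where the 'reports' edge label, the 'time' property key, and the query values are placeholders:

// schema: order 'reports' edges by time, descending
mgmt = graph.openManagement()
time = mgmt.makePropertyKey('time').dataType(Long.class).make()
reports = mgmt.makeEdgeLabel('reports').make()
mgmt.buildEdgeIndex(reports, 'reportsByTime', Direction.OUT, Order.decr, time)
mgmt.commit()

// query 1: latest 10 Traffic_Stats for one sub-domain
g.V().has('name', 'xyz').outE('reports').order().by('time', decr).limit(10).inV()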

Any help is much appreciated, as I am just beginning with JanusGraph.


Thanks and Regards,
Ravi


Re: Bulk loading using JSON, Python, or Scala?

Yihang Yan <yanyi...@...>
 

Thanks, Dave! I am wondering what the recommended format for the data is: XML, TXT, or CSV?


On Friday, June 16, 2017 at 12:24:25 AM UTC-4, David Brown wrote:
You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.

Best of luck!

Dave


On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Re: HBase unbalanced table regions after bulkload

marc.d...@...
 

Hi Ali,

OK, I overlooked your config line "storage.hbase.region-count=1024". That is far too large a number, since HBase likes regions with a size on the order of 10 GB, rather than the 130 MB you requested. This 10 GB region split threshold is probably an HBase global setting. It could be a similar number in your HBase cluster, so HBase just ignores the superfluous regions, unless you were to manually configure the region split threshold to a lower value (not advised).
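If you want to verify this on your cluster: the global setting I have in mind is, I believe, hbase.hregion.max.filesize (its usual default in HBase 1.x is about 10 GB), e.g. in hbase-site.xml:

<property>
  <!-- region split threshold in bytes; 10737418240 = 10 GB -->
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>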

Other comments from HBase experts are welcome (I do not consider myself one).

Cheers,    Marc

On Friday, June 16, 2017 at 09:57:12 UTC+2, Ali ÖZER wrote:

Hi Marc,

As far as I know, even if YARN schedules executors unevenly, that does not mean the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are working on 90% of my nodes (it is not that uneven); however, 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning ids in a distributed fashion, could you please keep me updated and elaborate on that mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 22:51:14 UTC+3, HadoopMarc wrote:

Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)



Normally, HBase does not want many regions, but the number of regions times the HDFS replication factor should be at least the number of active datanodes for maximum performance. I think some imbalance is inevitable, as YARN will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 16:04:28 UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BulkLoaderVertexProgram (blvp) is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to something even higher, like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, it seems that is not the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: HBase unbalanced table regions after bulkload

aoz...@...
 

Hi Marc,

As far as I know, even if YARN schedules executors unevenly, that does not mean the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are working on 90% of my nodes (it is not that uneven); however, 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning ids in a distributed fashion, could you please keep me updated and elaborate on that mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 22:51:14 UTC+3, HadoopMarc wrote:


Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)



Normally, HBase does not want many regions, but the number of regions times the HDFS replication factor should be at least the number of active datanodes for maximum performance. I think some imbalance is inevitable, as YARN will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 16:04:28 UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BulkLoaderVertexProgram (blvp) is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to something even higher, like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, it seems that is not the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: Bulk loading using JSON, Python, or Scala?

David Brown <dave...@...>
 

You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.
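For the JVM side, submitting such a script through the TinkerPop driver looks roughly like this (a sketch; the server address and the script contents are placeholders):

import org.apache.tinkerpop.gremlin.driver.Cluster

cluster = Cluster.build('localhost').port(8182).create()
client = cluster.connect()
// any raw Gremlin/Groovy script can be sent; the server evaluates it
client.submit("g.addV('person').property('name', 'x').iterate(); graph.tx().commit()").all().get()
cluster.close()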

Best of luck!

Dave

1. http://aiogremlin.readthedocs.io/en/latest/usage.html#using-the-driver-module

On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Re: HBase unbalanced table regions after bulkload

HadoopMarc <m.c.d...@...>
 


Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. (Integer; no default value; MASKABLE)



Normally, HBase does not want many regions, but the number of regions times the HDFS replication factor should be at least the number of active datanodes for maximum performance. I think some imbalance is inevitable, as YARN will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 16:04:28 UTC+2, Ali ÖZER wrote:

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BulkLoaderVertexProgram (blvp) is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to something even higher, like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, it seems that is not the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Bulk loading using JSON, Python, or Scala?

Yihang Yan <yanyi...@...>
 

Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


HBase unbalanced table regions after bulkload

aoz...@...
 

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BulkLoaderVertexProgram (blvp) is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to something even higher, like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, it seems that is not the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: How to load a CSV file into JanusGraph

HadoopMarc <m.c.d...@...>
 

Hi Elizabeth,

For JanusGraph you should also take into account the TinkerPop documentation. A relevant pointer for you is:

https://groups.google.com/forum/#!searchin/gremlin-users/csv%7Csort:relevance/gremlin-users/AetuGcLiBxo/KW966WAyAQAJ
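To give you a flavour, here is a minimal Groovy sketch along those lines (the file name, columns, and property keys are made up):

graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
g = graph.traversal()
// people.csv: a "name,age" header followed by data rows
new File('people.csv').eachLine { line, nr ->
    if (nr == 1) return                      // skip the header row
    def (name, age) = line.split(',')
    g.addV('person').property('name', name).property('age', age as Integer).next()
}
graph.tx().commit()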

Cheers,    Marc

On Wednesday, June 14, 2017 at 18:44:16 UTC+2, Elizabeth wrote:

Hi all,

I am new to JanusGraph; I have dived into the JanusGraph docs for almost two weeks and found nothing.
I could only gather scattered information, and most of the time it prompts some errors.
Could anyone supply a complete example of bulk loading, or of loading a CSV file into JanusGraph, please?
Any little help is appreciated!

Best regards,

Elis.


Re: Finding supernodes with insufficient frame size

Adam Holley <holl...@...>
 

That won't work if the frame size is not large enough.


On Wednesday, June 14, 2017 at 10:51:09 AM UTC-5, Robert Dale wrote:

This should give you the counts, highest first, by vertex id:

g.V().group().by(id()).by(outE().count()).order(local).by(values,decr)



Robert Dale


Re: Finding supernodes with insufficient frame size

Robert Dale <rob...@...>
 


This should give you the counts, highest first, by vertex id:

g.V().group().by(id()).by(outE().count()).order(local).by(values,decr)
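If the full map is too much output, the same traversal can be capped to, say, the top 10 entries (a quick, untested variation):

g.V().group().by(id()).by(outE().count()).order(local).by(values, decr).limit(local, 10)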



Robert Dale

On Wed, Jun 14, 2017 at 11:32 AM, Adam Holley <holl...@...> wrote:
Using Cassandra as the backend, I was trying to count edges (g.E().count()) and of course ran into the frame size problem because I had a supernode.  I found that I could identify which node was the supernode with:

g.V().toList().each { v -> println(v); g.V(v).outE().count().next() }

This will print out the supernode vertex right before the frame size exception.  Then I can gradually increase the frame size until g.V(supernodeID) doesn't cause an exception.

Is there a better way to find supernodes than just increasing the frame size to a large number, finding the supernode, and then working backward by decreasing the frame size?
As I'm not as familiar with Cassandra: once I know the supernode ID, is there a way to use CQL to determine the appropriate frame size, without just gradually increasing or decreasing it in the config?



Finding supernodes with insufficient frame size

Adam Holley <holl...@...>
 

Using Cassandra as the backend, I was trying to count edges (g.E().count()) and of course ran into the frame size problem because I had a supernode.  I found that I could identify which node was the supernode with:

g.V().toList().each { v -> println(v); g.V(v).outE().count().next() }

This will print out the supernode vertex right before the frame size exception.  Then I can gradually increase the frame size until g.V(supernodeID) doesn't cause an exception.

Is there a better way to find supernodes than just increasing the frame size to a large number, finding the supernode, and then working backward by decreasing the frame size?
As I'm not as familiar with Cassandra: once I know the supernode ID, is there a way to use CQL to determine the appropriate frame size, without just gradually increasing or decreasing it in the config?
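(For context, the setting I have been adjusting is, I believe, the Thrift frame size in the graph properties. A sketch with an arbitrary value:)

storage.backend=cassandrathrift
storage.cassandra.frame-size-mb=60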


How to load a CSV file into JanusGraph

Elizabeth <hlf...@...>
 

Hi all,

I am new to JanusGraph; I have dived into the JanusGraph docs for almost two weeks and found nothing.
I could only gather scattered information, and most of the time it prompts some errors.
Could anyone supply a complete example of bulk loading, or of loading a CSV file into JanusGraph, please?
Any little help is appreciated!

Best regards,

Elis.


Re: Index not being used with 'between' clause

Gene Fojtik <genef...@...>
 

Outstanding - thank you Jason.

-gene


On Thursday, June 8, 2017 at 11:47:53 PM UTC-5, Jason Plurad wrote:
Make sure you're using a mixed index for numeric range queries. Composite indexes are best for exact matching. The console session below shows the difference:

gremlin> graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')
==>standardjanusgraph[berkeleyje:/usr/lib/janusgraph-0.1.1-hadoop2/conf/../db/berkeley]
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@1c8f6a90
gremlin> lat = mgmt.makePropertyKey('lat').dataType(Integer.class).make()
==>lat
gremlin> latidx = mgmt.buildIndex('latidx', Vertex.class).addKey(lat).buildCompositeIndex()
==>latidx
gremlin> lon = mgmt.makePropertyKey('lon').dataType(Integer.class).make()
==>lon
gremlin> lonidx = mgmt.buildIndex('lonidx', Vertex.class).addKey(lon).buildMixedIndex('search')
==>lonidx
gremlin> mgmt.commit()
==>null
gremlin> v = graph.addVertex('code', 'rdu', 'lat', 35, 'lon', -78)
==>v[4184]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[berkeleyje:/usr/lib/janusgraph-0.1.1-hadoop2/conf/../db/berkeley], standard]
gremlin> g.V().has('lat', 35)
==>v[4184]
gremlin> g.V().has('lat', between(34, 36))
00:40:33 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [(lat >= 34 AND lat < 36)]. For better performance, use indexes
==>v[4184]
gremlin> g.V().has('lon', -78)
==>v[4184]
gremlin> g.V().has('lon', between(-79, -77))
==>v[4184]

-- Jason

On Wednesday, June 7, 2017 at 12:01:31 PM UTC-4, Gene Fojtik wrote:
Hello,

I have an index on a property "latitude"; when using it with the between clause, the index is not being utilized:

g.V().has("latitude", 33.333")  works well, however

g.V().has("latitude", between(33.889, 33.954)) does not use the index.

Any assistance would be appreciated..

-g


Call queue is full on /0.0.0.0:60020, too many items queued? (HBase)

aoz...@...
 

Here is my problem:

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I am trying to load 200 GB of graph data, and for that I run the following code in the Gremlin shell:

:load data/call-janusgraph-schema-groovy
writeGraphPath='conf/my-janusgraph-hbase.properties'
writeGraph=JanusGraphFactory.open(writeGraphPath)
defineCallSchema(writeGraph)
writeGraph.close()

readGraph=GraphFactory.open('conf/hadoop-graph/hadoop-call-script.properties')
gRead=readGraph.traversal()
gRead.V().valueMap()

//so far so good everything works perfectly

blvp=BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).intermediateBatchSize(10000).writeGraph(writeGraphPath).create(readGraph)
readGraph.compute(SparkGraphComputer).workers(512).program(blvp).submit().get()

It starts executing the spark job and Stage-0 runs smoothly however at Stage-1 I get an Exception:

org.hbase.async.CallQueueTooBigException: Call queue is full on /0.0.0.0:60020, too many items queued ?

However, Spark recovers the failed tasks and completes Stage-1, and then Stage-2 completes flawlessly. Since Spark persists the previous results in memory, Stage-3 and Stage-4 are skipped and Stage-5 is started; however, Stage-5 gets the same CallQueueTooBigException exceptions. Nevertheless, Spark recovers from the problem again.

My problem is that this stage (Stage-5) takes too long to execute. It actually took 14 hours on my last run, and I killed the Spark job. I think this is really odd for such a small amount of input data (200 GB). Normally my cluster is so fast that I am able to load 3 TB of data into HBase (with bulk loading via MapReduce) in 1 hour. I tried to increase the number of workers:

readGraph.compute(SparkGraphComputer).workers(1024).program(blvp).submit().get()

however, this time the number of CallQueueTooBigException exceptions was so high that they did not let the Spark job recover from them.

Is there any way that I can decrease the runtime of the job?


Below I am providing extra material that will hopefully lead you to the source of the problem:

Here is how I start the gremlin shell

#!/bin/bash

export JAVA_HOME=/mnt/hdfs/jdk.1.8.0_74
export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn
export YARN_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-yarn
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf


GREMLINHOME=/mnt/hdfs/janusgraph-0.1.1-hadoop2

export CLASSPATH=$YARN_HOME/*:$YARN_CONF_DIR:$SPARK_HOME/lib/*:$SPARK_CONF_DIR:$CLASSPATH

cd $GREMLINHOME
export GREMLIN_LOG_LEVEL=info
exec $GREMLINHOME/bin/gremlin.sh $*




and here is my conf/hadoop-graph/hadoop-call-script.properties file:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.GraphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.inputLocation=/user/hive/warehouse/tablex/000000_0
gremlin.hadoop.scriptInputFormat.script=/user/me/janus/script-input-call.groovy
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true

spark.driver.maxResultSize=8192
spark.yarn.executor.memoryOverhead=5000
spark.executor.cores=1
spark.executor.instances=1024
spark.master=yarn-client
spark.executor.memory=20g
spark.driver.memory=20g
spark.serializer=org.apache.spark.serializer.JavaSerializer


conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024
cluster.max-partitions=1024
cluster.partition=true

ids.block-size=10000
storage.buffer-size=10000
storage.transactions=false
ids.num-partitions=1024

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5



Thanks in advance,
Ali


Re: Production users of JanusGraph

anurag <anurag...@...>
 

Hi Misha,
Thanks a lot for your response and the useful information; much appreciated.
Thanks,
Anurag

On Fri, May 26, 2017 at 2:28 PM, Misha Brukman <mbru...@...> wrote:
Hi Anurag,

I started a list of companies using JanusGraph in production; you can see the current list here: https://github.com/JanusGraph/janusgraph#users (and the logos at the bottom of http://janusgraph.org) and more additions are on the way.

They appear to be happy with JanusGraph, but I'll let them chime in if they want to provide any additional details.

BTW, if anyone else is a production user of JanusGraph, please get in touch with me and let's get you added as well!

Misha

On Wed, Apr 5, 2017 at 12:28 PM, anurag <anurag...@...> wrote:
All,
Many thanks to the folks who were involved in setting up the JanusGraph project. We are using Titan as the graph DB for a beta feature; the reason our feature is in beta is that we were not sure where Titan was headed. Now that we have JanusGraph, we would like to move to it. Are there any users of JanusGraph in production? If so, can you please share your experiences with it?
Many thanks.
Best,
Anurag




Re: Another perspective on JanusGraph embedded versus server mode

Ted Wilmes <twi...@...>
 

Hi Jamie,
Good question, and I dig the ASCII art. To answer your question: they will describe the same graph, just as if you were running the Janus instances in their own JVMs. I've used both approaches. The embedded approach was attractive initially because I could write Gremlin traversals without passing strings to the driver. Undoubtedly there would be some performance benefit because you're cutting out a network hop, but whether that is appreciable depends on the latency targets you're trying to hit. My guess is that for most folks it won't make nearly as much of a difference as the latencies you're seeing between Janus and the Cassandra cluster. At this point, I prefer to deploy Janus like a standalone database, for a few reasons. First, with the introduction of TinkerPop's remote graph and Gremlin Language Variants, you can still get that embedded feel with the driver [1]. Second, I like to be able to scale and tune the Janus DB components separately from the API. Finally, and maybe less of an issue: dependency conflicts between the API and an embedded Janus can be a pain, not insurmountable, but that goes away if you throw a driver in between.
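To make the first point concrete, here is a rough sketch of the remote-traversal style (TinkerPop 3.2+; the server address and the traversal source name 'g' are assumptions about your setup):

import org.apache.tinkerpop.gremlin.driver.Cluster
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph

// a traversal source backed by a remote Gremlin Server
cluster = Cluster.build('localhost').port(8182).create()
g = EmptyGraph.instance().traversal().withRemote(DriverRemoteConnection.using(cluster, 'g'))

g.V().count().next()   // executes server-side; results stream back over the WebSocket
cluster.close()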

--Ted


On Friday, June 9, 2017 at 1:32:27 AM UTC-5, Jamie Lawson wrote:
We have a _domain specific_ REST API that is architecturally decoupled from JanusGraph. In other words, users of the REST API have no indication that their calls interact with JanusGraph, or even with a graph. These REST calls have a lot of interactions with the JanusGraph database which is currently embedded in the same JVM process. Here is a deployment view:


+-----------------------------------------+    +-----------------+
| JVM Process #1                          |    | JVM Process #2  |
|                                         |    |                 |
|  +-----------------+    +------------+  |    |  +-----------+  |
|  | Domain Specific |----| JanusGraph |--+----+--| Cassandra |  |
|  |    REST API     |    |  Embedded  |  |    |  |  Backend  |  |
|  +-----------------+    +------------+  |    |  +-----------+  |
|                                         |    |                 |
+-----------------------------------------+    +-----------------+


Now consider load balancing. The REST API is the only way we want to access the graph database. That's what keeps it "operationally consistent". If all updates are through the REST API, we will not get stuff in the database that doesn't make sense in the context of the domain. As we expand, is there a good reason to break out JVM Process #1 so that we have something that looks like this, with JanusGraph Server in a separate process:


+----------------------+    +-----------------+    +-----------------+
| JVM Process #1A      |    | JVM Process #1B |    | JVM Process #2  |
|                      |    |                 |    |                 |
|  +-----------------+ |    | +------------+  |    |  +-----------+  |
|  | Domain Specific |-+----+-| JanusGraph |--+----+--| Cassandra |  |
|  |    REST API     | |    | |   SERVER   |  |    |  |  Backend  |  |
|  +-----------------+ |    | +------------+  |    |  +-----------+  |
|                      |    |                 |    |                 |
+----------------------+    +-----------------+    +-----------------+

My expectation would be that connecting to JanusGraph through the embedded API would be much faster than connecting through a WebSocket API. Is that the case?

Now as we expand, is it reasonable to run our REST endpoint with an embedded JanusGraph in the same process and replicate that process with all of the embedded JanusGraphs talking to the same Cassandra backend, something like this:


+-----------------------------------------+
| JVM Process #1.1 on Node #1             |
|                                         |
|  +-----------------+    +------------+  |
|  | Domain Specific |----| JanusGraph |--+--------------+
|  | REST API endpt 1|    |  Embedded  |  |              |
|  +-----------------+    +------------+  |              |
|                                         |              |
+-----------------------------------------+              |
                                                         |
+-----------------------------------------+    +^^^^^^^^^|^^^^^^^+
| JVM Process #1.2 on Node #2             |    { Cluster Process }
|                                         |    {         |       }
|  +-----------------+    +------------+  |    {  +-----------+  }
|  | Domain Specific |----| JanusGraph |--+----+--| Cassandra |  }
|  | REST API endpt 2|    |  Embedded  |  |    {  |  Backend  |  }
|  +-----------------+    +------------+  |    {  +-----------+  }
|                                         |    {         |       }
+-----------------------------------------+    +^^^^^^^^^|^^^^^^^+
                                                         |
+-----------------------------------------+              |
| JVM Process #1.3 on Node #3             |              |
|                                         |              |
|  +-----------------+    +------------+  |              |
|  | Domain Specific |----| JanusGraph |--+--------------+
|  | REST API endpt 3|    |  Embedded  |  |
|  +-----------------+    +------------+  |
|                                         |
+-----------------------------------------+


The real question here is, if different embedded JanusGraphs have the same backend, do they describe the same graph (modulo eventual consistency)? I expect that they will have different stuff in cache, but will they describe the same graph?

And is there an expectation of a performance advantage if we break out the JanusGraph part and separate it from the REST API (running as JanusGraph Server), understanding that all interaction with the graph will be through the REST API, given that each REST call may make a number of sequential JanusGraph (Gremlin) calls?


Another perspective on JanusGraph embedded versus server mode

Jamie Lawson <jamier...@...>
 

We have a _domain specific_ REST API that is architecturally decoupled from JanusGraph. In other words, users of the REST API have no indication that their calls interact with JanusGraph, or even with a graph. These REST calls have a lot of interactions with the JanusGraph database which is currently embedded in the same JVM process. Here is a deployment view:


+-----------------------------------------+    +-----------------+
| JVM Process #1                          |    | JVM Process #2  |
|                                         |    |                 |
|  +-----------------+    +------------+  |    |  +-----------+  |
|  | Domain Specific |----| JanusGraph |--+----+--| Cassandra |  |
|  |    REST API     |    |  Embedded  |  |    |  |  Backend  |  |
|  +-----------------+    +------------+  |    |  +-----------+  |
|                                         |    |                 |
+-----------------------------------------+    +-----------------+


Now consider load balancing. The REST API is the only way we want to access the graph database. That's what keeps it "operationally consistent". If all updates are through the REST API, we will not get stuff in the database that doesn't make sense in the context of the domain. As we expand, is there a good reason to break out JVM Process #1 so that we have something that looks like this, with JanusGraph Server in a separate process:


+----------------------+    +-----------------+    +-----------------+
| JVM Process #1A      |    | JVM Process #1B |    | JVM Process #2  |
|                      |    |                 |    |                 |
|  +-----------------+ |    | +------------+  |    |  +-----------+  |
|  | Domain Specific |-+----+-| JanusGraph |--+----+--| Cassandra |  |
|  |    REST API     | |    | |   SERVER   |  |    |  |  Backend  |  |
|  +-----------------+ |    | +------------+  |    |  +-----------+  |
|                      |    |                 |    |                 |
+----------------------+    +-----------------+    +-----------------+

My expectation would be that connecting to JanusGraph through the embedded API would be much faster than connecting through a WebSocket API. Is that the case?

Now as we expand, is it reasonable to run our REST endpoint with an embedded JanusGraph in the same process and replicate that process with all of the embedded JanusGraphs talking to the same Cassandra backend, something like this:


+-----------------------------------------+
| JVM Process #1.1 on Node #1             |
|                                         |
|  +-----------------+    +------------+  |
|  | Domain Specific |----| JanusGraph |--+--------------+
|  | REST API endpt 1|    |  Embedded  |  |              |
|  +-----------------+    +------------+  |              |
|                                         |              |
+-----------------------------------------+              |
                                                         |
+-----------------------------------------+    +^^^^^^^^^|^^^^^^^+
| JVM Process #1.2 on Node #2             |    { Cluster Process }
|                                         |    {         |       }
|  +-----------------+    +------------+  |    {  +-----------+  }
|  | Domain Specific |----| JanusGraph |--+----+--| Cassandra |  }
|  | REST API endpt 2|    |  Embedded  |  |    {  |  Backend  |  }
|  +-----------------+    +------------+  |    {  +-----------+  }
|                                         |    {         |       }
+-----------------------------------------+    +^^^^^^^^^|^^^^^^^+
                                                         |
+-----------------------------------------+              |
| JVM Process #1.3 on Node #3             |              |
|                                         |              |
|  +-----------------+    +------------+  |              |
|  | Domain Specific |----| JanusGraph |--+--------------+
|  | REST API endpt 3|    |  Embedded  |  |
|  +-----------------+    +------------+  |
|                                         |
+-----------------------------------------+


The real question here is, if different embedded JanusGraphs have the same backend, do they describe the same graph (modulo eventual consistency)? I expect that they will have different stuff in cache, but will they describe the same graph?

And is there an expectation of a performance advantage if we break out the JanusGraph part and separate it from the REST API (running as JanusGraph Server), understanding that all interaction with the graph will be through the REST API, given that each REST call may make a number of sequential JanusGraph (Gremlin) calls?