
Low throughput on Janus vs Neo4j (Tuning issues?)

Carlos <512.qua...@...>
 

So I've been evaluating JanusGraph on a single machine that is also hosting a Cassandra instance. It seems that I am unable to achieve the throughput that other users here report. Currently we are using Neo4j, as we get much higher throughput with it, but at the cost of the ability to scale out.

I did some queries against both Janus and Neo4j through a WebSocket connection and timed the requests. Neo4j consistently performed much better than Janus. In our actual test setup we were able to get Neo4j to push 600 calls/second while Janus could only manage at most 50 calls/second.

With Janus I did set up indexing and expected improvements, which I did not see. Attached are the Janus configuration files I am using, as well as the timed queries from Janus and Neo4j.

I am using Janus 0.1.1 on a machine with 24 GB of RAM and a platter hard drive. I also attempted to move Cassandra's data store to a RAM disk, thinking that my platter drive was the bottleneck, but to no avail.
What exactly is going on with my setup that is causing this issue?
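
For anyone debugging a similar gap: one quick check is whether the traversals actually hit the configured indexes. A minimal sketch from the Gremlin Console, assuming a hypothetical 'userId' property (not taken from the attachments):

graph = JanusGraphFactory.open('conf/janusgraph-cassandra.properties')  // illustrative path
g = graph.traversal()
// profile() reports per-step metrics; if no index is used, JanusGraph also logs
// "Query requires iterating over all vertices"
g.V().has('userId', '12345').profile()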


Re: HBase unbalanced table regions after bulkload

HadoopMarc <m.c.d...@...>
 

Hi Ali,

Thanks for returning feedback about your experiments.

The docs have a section on graph partitioning that warns against too large a number of partitions and suggests using explicit partitioning if performance matters. They also suggest using twice the number of region server instances as the region count.

http://docs.janusgraph.org/latest/graph-partitioning.html

HTH,   Marc
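
For reference, the explicit partitioning that section describes is enabled through the graph properties file; a rough sketch only, with illustrative values rather than tuned advice:

# enable explicit graph partitioning (see the graph-partitioning docs)
cluster.partition=true
# keep this modest; the docs warn against a very large number of partitions,
# something on the order of a few times the number of storage instances
cluster.max-partitions=32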

On Friday, June 16, 2017 at 4:42:55 PM UTC+2, Ali ÖZER wrote:

Hi Marc,

I think it has nothing to do with the region count, and HBase does not ignore any region under any circumstance. Since my regions are imbalanced (only 100 regions have data in them), the data size per region is not outputsize/1024 but outputsize/100.

My output size is not equal to my input size; it is 330 GB, so the data size per region is not 130 MB but 3 GB. Nevertheless, as I said, I do not think the problem is the region count, because I managed to improve the balance of my regions by doing the following:

I learned that the whole thing is about the cluster.max-partitions parameter. Its default value is 32; I set it to 256, changed nothing else, re-ran the BulkLoaderVertexProgram, and saw that the non-empty region count increased from 100 to 256. (When the parameter was 32, the BLVP loaded data into only 32 regions; HBase then automatically split the oversized regions, and the number of non-empty regions became 100.) So I realized that in order to fill all 1024 of my regions I need to set cluster.max-partitions to 1024.

However, there is one problem: when I increased cluster.max-partitions from 32 to 256, the runtime of my BulkLoaderVertexProgram increased five-fold. I was able to load the whole data set in almost 2 hours; now it takes almost 10 hours. I think it is because each Spark executor is trying to write to 1024 regions all at once, and I have 1024 Spark executors, which means a lot of network connections (1024*1024).

Since I do not know the internals of the BLVP and JanusGraph ID assignment, I am not one hundred percent sure about all of this.

If there is somebody who knows the internals of JanusGraph, I would really appreciate their input; I am pretty sure that this knowledge will really help me solve the problem.

Best,
Ali

On Friday, June 16, 2017 at 2:55:00 PM UTC+3, mar...@... wrote:
Hi Ali,

OK, I overlooked your config line "storage.hbase.region-count=1024". That is far too large a number, since HBase likes regions with a size on the order of 10 GB, rather than the 130 MB you requested. This 10 GB region split size is probably an HBase global setting; it could be a similar number in your HBase cluster, so HBase just ignores the superfluous regions unless you configure the region split size to a lower number manually (not advised).

Other comments from HBase experts are welcomed (I do not consider myself one).

Cheers,    Marc

On Friday, June 16, 2017 at 9:57:12 AM UTC+2, Ali ÖZER wrote:
Hi Marc,

As far as I know, even if YARN schedules executors unevenly, it does not mean that the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are running on 90% of my nodes (that is not so uneven), yet 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning IDs in a distributed fashion, could you please keep me updated and elaborate on the mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 10:51:14 PM UTC+3, HadoopMarc wrote:

Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 4:04:28 PM UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: HBase unbalanced table regions after bulkload

aoz...@...
 

Hi Marc,

I think it has nothing to do with the region count, and HBase does not ignore any region under any circumstance. Since my regions are imbalanced (only 100 regions have data in them), the data size per region is not outputsize/1024 but outputsize/100.

My output size is not equal to my input size; it is 330 GB, so the data size per region is not 130 MB but 3 GB. Nevertheless, as I said, I do not think the problem is the region count, because I managed to improve the balance of my regions by doing the following:

I learned that the whole thing is about the cluster.max-partitions parameter. Its default value is 32; I set it to 256, changed nothing else, re-ran the BulkLoaderVertexProgram, and saw that the non-empty region count increased from 100 to 256. (When the parameter was 32, the BLVP loaded data into only 32 regions; HBase then automatically split the oversized regions, and the number of non-empty regions became 100.) So I realized that in order to fill all 1024 of my regions I need to set cluster.max-partitions to 1024.

However, there is one problem: when I increased cluster.max-partitions from 32 to 256, the runtime of my BulkLoaderVertexProgram increased five-fold. I was able to load the whole data set in almost 2 hours; now it takes almost 10 hours. I think it is because each Spark executor is trying to write to 1024 regions all at once, and I have 1024 Spark executors, which means a lot of network connections (1024*1024).

Since I do not know the internals of the BLVP and JanusGraph ID assignment, I am not one hundred percent sure about all of this.

If there is somebody who knows the internals of JanusGraph, I would really appreciate their input; I am pretty sure that this knowledge will really help me solve the problem.

Best,
Ali

On Friday, June 16, 2017 at 2:55:00 PM UTC+3, mar...@... wrote:

Hi Ali,

OK, I overlooked your config line "storage.hbase.region-count=1024". That is far too large a number, since HBase likes regions with a size on the order of 10 GB, rather than the 130 MB you requested. This 10 GB region split size is probably an HBase global setting; it could be a similar number in your HBase cluster, so HBase just ignores the superfluous regions unless you configure the region split size to a lower number manually (not advised).

Other comments from HBase experts are welcomed (I do not consider myself one).

Cheers,    Marc

On Friday, June 16, 2017 at 9:57:12 AM UTC+2, Ali ÖZER wrote:
Hi Marc,

As far as I know, even if YARN schedules executors unevenly, it does not mean that the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are running on 90% of my nodes (that is not so uneven), yet 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning IDs in a distributed fashion, could you please keep me updated and elaborate on the mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 10:51:14 PM UTC+3, HadoopMarc wrote:

Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 4:04:28 PM UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Streaming graph data

JZ <zamb...@...>
 

Hello,

Does anyone know if there is a way to stream a graph to a viewer such as Gephi from a JanusGraph Java client? When using the Gremlin Console you can use the TinkerPop Gephi plugin and redirect a graph to Gephi. Is there a way to do that from a Java application that has created a graph? I did not find any mention of this in the documentation.

Thanks

JGZ
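
As far as I know there is no direct Gephi streaming support outside the Console plugin; one workaround sketch is to write the graph out as GraphML from code and import the file into Gephi manually. Paths here are illustrative, and this is only practical for graphs that fit comfortably in a file Gephi can open:

import org.apache.tinkerpop.gremlin.structure.io.IoCore

graph = JanusGraphFactory.open('conf/janusgraph-cassandra.properties')  // illustrative path
graph.io(IoCore.graphml()).writeGraph('/tmp/janus-export.graphml')
graph.close()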


Re: Bulk loading using Json, python or Scala?

Yihang Yan <yanyi...@...>
 

I see, so we need to use a GraphML file. The issue here is that our graph might contain billions of nodes; will the Python client be able to handle that?

Thanks!


On Friday, June 16, 2017 at 12:24:25 AM UTC-4, David Brown wrote:
You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.

Best of luck!

Dave


On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Time series modelling help needed...

Ravikumar Govindarajan <ravikumar....@...>
 

I need help with time series modelling, with Cassandra as the backend storage.

Consider a model as below

                      Has many                                   Which Report
Domain         ----------------->   Sub-Domains     ----------------------->   Traffic_Stats  

Let's say that each Traffic_Stats vertex has a time component.


1. For domain="abc" and sub-domain="xyz", get all traffic stats ordered by time descending, limit 10
2. For domain="abc", get all traffic stats ordered by time descending, limit 10 

Will an Edge Index on time linking Sub_Domain and Traffic_Stats satisfy the first query?

What about the second query? Should I create duplicate edges from Domain to Traffic_Stats with the same Edge Index on time?

Also, is it the case that even if I use an Edge Index, Janus will pull all the data into memory and do the sort? In the fictitious case of a Domain having 10k Sub-Domains, each reporting 50k Traffic_Stats, that could be prohibitively expensive.

Any help is much appreciated, as I am just beginning with JanusGraph.


Thanks and Regards,
Ravi
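
For what it's worth, the usual tool for query 1 is a vertex-centric (edge) index on the edge's time property; a hedged sketch with made-up label and property names (not taken from the post):

mgmt = graph.openManagement()
time    = mgmt.makePropertyKey('time').dataType(Long.class).make()
reports = mgmt.makeEdgeLabel('reports').make()
// vertex-centric index: answers per-vertex order/limit on 'time' without loading all edges
mgmt.buildEdgeIndex(reports, 'reportsByTime', Direction.OUT, Order.decr, time)
mgmt.commit()

// query 1 shape: latest 10 stats for one sub-domain
g.V().has('subdomain', 'xyz').outE('reports').order().by('time', decr).limit(10).inV()

For query 2 at the Domain level, the options are indeed duplicate edges with the same kind of index, or traversing through the sub-domains; which is cheaper depends on the fan-out.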


Re: Bulk loading using Json, python or Scala?

Yihang Yan <yanyi...@...>
 

Thanks, Dave! I am wondering what the recommended format for the data is: XML, TXT, or CSV?


On Friday, June 16, 2017 at 12:24:25 AM UTC-4, David Brown wrote:
You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.

Best of luck!

Dave


On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Re: HBase unbalanced table regions after bulkload

marc.d...@...
 

Hi Ali,

OK, I overlooked your config line "storage.hbase.region-count=1024". That is far too large a number, since HBase likes regions with a size on the order of 10 GB, rather than the 130 MB you requested. This 10 GB region split size is probably an HBase global setting; it could be a similar number in your HBase cluster, so HBase just ignores the superfluous regions unless you configure the region split size to a lower number manually (not advised).

Other comments from HBase experts are welcomed (I do not consider myself one).

Cheers,    Marc

On Friday, June 16, 2017 at 9:57:12 AM UTC+2, Ali ÖZER wrote:

Hi Marc,

As far as I know, even if YARN schedules executors unevenly, it does not mean that the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are running on 90% of my nodes (that is not so uneven), yet 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning IDs in a distributed fashion, could you please keep me updated and elaborate on the mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 10:51:14 PM UTC+3, HadoopMarc wrote:

Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 4:04:28 PM UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: HBase unbalanced table regions after bulkload

aoz...@...
 

Hi Marc,

As far as I know, even if YARN schedules executors unevenly, it does not mean that the data written across HBase will be uneven.

The data is written to HBase according to the key of the datum and the key ranges of the regions; it has nothing to do with the node that the writer JVM is running on.

My executors are running on 90% of my nodes (that is not so uneven), yet 90% of my regions are empty (900 of 1024 regions). If you were right, the latter percentage would be 10% instead of 90%.

If there is some other mechanism at work when assigning IDs in a distributed fashion, could you please keep me updated and elaborate on the mechanism?

Best,
Ali

On Thursday, June 15, 2017 at 10:51:14 PM UTC+3, HadoopMarc wrote:


Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 4:04:28 PM UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: Bulk loading using Json, python or Scala?

David Brown <dave...@...>
 

You should be able to bulk load with any client that allows you to submit raw gremlin scripts to the server. For example, you can do it with ipython-gremlin, check out cell # 2: https://github.com/davebshow/ipython-gremlin/blob/master/example.ipynb. You could submit that same script with the new Python client that will be released with TinkerPop 3.2.5 in a couple days. You can also do it with the driver included with aiogremlin 3.2.4+ [1]. Maybe someone with more expertise could weigh in about the Scala client.

Best of luck!

Dave

1. http://aiogremlin.readthedocs.io/en/latest/usage.html#using-the-driver-module
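
As a concrete illustration of "submit raw gremlin scripts": from any JVM language the TinkerPop driver can push a parameterized Groovy script to Gremlin Server. A minimal sketch (the script body and bindings are just examples, and 'graph' must be a variable the server actually exposes):

import org.apache.tinkerpop.gremlin.driver.Cluster

cluster = Cluster.build('localhost').port(8182).create()
client  = cluster.connect()
// the submitted string is evaluated server-side with the supplied bindings
client.submit("graph.addVertex('name', n); graph.tx().commit()", [n: 'alice']).all().get()
cluster.close()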

On Thursday, June 15, 2017 at 12:03:03 PM UTC-4, Yihang Yan wrote:
Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


Re: HBase unbalanced table regions after bulkload

HadoopMarc <m.c.d...@...>
 


Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count: The number of initial regions set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.

storage.hbase.regions-per-server: The number of regions per regionserver to set when creating JanusGraph’s HBase table. Integer, no default value, MASKABLE.



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

On Thursday, June 15, 2017 at 4:04:28 PM UTC+2, Ali ÖZER wrote:

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Bulk loading using Json, python or Scala?

Yihang Yan <yanyi...@...>
 

Other than Groovy, is it possible to do bulk loading using JSON, Python, or Scala? Could any sample code be provided?

Thanks!


HBase unbalanced table regions after bulkload

aoz...@...
 

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of the BLVP is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the configured value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions? Should I use it?

As far as I know, the ids.num-partitions value determines the number of randomly gathered prefixes that will be used when assigning IDs to elements. I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali





Re: how to load a CSV file into janusgraph

HadoopMarc <m.c.d...@...>
 

Hi Elizabeth,

For JanusGraph you should also take into account the TinkerPop documentation. A relevant pointer for you is:

https://groups.google.com/forum/#!searchin/gremlin-users/csv%7Csort:relevance/gremlin-users/AetuGcLiBxo/KW966WAyAQAJ

Cheers,    Marc
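
To make that pointer a bit more concrete, a minimal Groovy sketch for the Console, assuming a hypothetical edge-list CSV with lines of the form from,to,label (this is the simple transactional route, not the BulkLoaderVertexProgram):

graph = JanusGraphFactory.open('conf/janusgraph-cassandra.properties')  // illustrative path
g = graph.traversal()
new File('/tmp/edges.csv').eachLine { line, lineNumber ->
    def (from, to, label) = line.tokenize(',')
    // look up each endpoint by 'name', creating it if it does not exist yet
    def v1 = g.V().has('name', from).tryNext().orElseGet { graph.addVertex('name', from) }
    def v2 = g.V().has('name', to).tryNext().orElseGet { graph.addVertex('name', to) }
    v1.addEdge(label, v2)
    if (lineNumber % 10000 == 0) graph.tx().commit()   // commit in batches
}
graph.tx().commit()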

On Wednesday, June 14, 2017 at 6:44:16 PM UTC+2, Elizabeth wrote:

Hi all,

I am new to JanusGraph; I have dived into the JanusGraph docs for almost two weeks and found nothing.
I could only gather scattered information, and most of the time what I try prompts some errors.
Could anyone supply a complete example of bulk loading, or of loading a CSV file into JanusGraph, please?
Any little help is appreciated!

Best regards,

Elis.


Re: Finding supernodes with insufficient frame size

Adam Holley <holl...@...>
 

That won't work if the framesize is not large enough.


On Wednesday, June 14, 2017 at 10:51:09 AM UTC-5, Robert Dale wrote:

This should give you the counts, highest first, by vertex id:

g.V().group().by(id()).by(outE().count()).order(local).by(values,decr)



Robert Dale


Re: Finding supernodes with insufficient frame size

Robert Dale <rob...@...>
 


This should give you the counts, highest first, by vertex id:

g.V().group().by(id()).by(outE().count()).order(local).by(values,decr)



Robert Dale

On Wed, Jun 14, 2017 at 11:32 AM, Adam Holley <holl...@...> wrote:
Using Cassandra as the backend, I was trying to count edges (g.E().count()) and of course ran into the frame size problem because I had a supernode. I found that I could identify which node was the supernode with:

g.V().toList().each{it.println(it);g.V(it).outE().count()}

This will print out the supernode vertex right before the framesize exception.  Then I can gradually increase the framesize until g.V(supernodeID) doesn't cause an exception.

Is there a better way to find supernodes without just increasing framesize to a large number, finding the supernode, and then working backward by decreasing the framesize?
As I'm not as familiar with Cassandra, once I know the supernode ID, is there a way to use CQL to determine the appropriate framesize without just gradually increasing or decreasing in the config?



Finding supernodes with insufficient frame size

Adam Holley <holl...@...>
 

Using Cassandra as the backend, I was trying to count edges (g.E().count()) and of course ran into the frame size problem because I had a supernode. I found that I could identify which node was the supernode with:

g.V().toList().each{it.println(it);g.V(it).outE().count()}

This will print out the supernode vertex right before the framesize exception.  Then I can gradually increase the framesize until g.V(supernodeID) doesn't cause an exception.

Is there a better way to find supernodes without just increasing framesize to a large number, finding the supernode, and then working backward by decreasing the framesize?
As I'm not as familiar with Cassandra, once I know the supernode ID, is there a way to use CQL to determine the appropriate framesize without just gradually increasing or decreasing in the config?
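
One hedged variation is to cap each per-vertex count with limit(), so the loop only flags vertices whose out-degree reaches some threshold instead of computing full counts; whether this also keeps each read under the frame size depends on how the backend slices the row, so treat it as an experiment:

threshold = 100000   // arbitrary cutoff
g.V().toList().each { v ->
    def c = g.V(v).outE().limit(threshold).count().next()
    if (c >= threshold) println "possible supernode: ${v}"
}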


how to load a CSV file into janusgraph

Elizabeth <hlf...@...>
 

Hi all,

I am new to JanusGraph; I have dived into the JanusGraph docs for almost two weeks and found nothing.
I could only gather scattered information, and most of the time what I try prompts some errors.
Could anyone supply a complete example of bulk loading, or of loading a CSV file into JanusGraph, please?
Any little help is appreciated!

Best regards,

Elis.


Re: Index not being used with 'Between" clause

Gene Fojtik <genef...@...>
 

Outstanding - thank you Jason.

-gene


On Thursday, June 8, 2017 at 11:47:53 PM UTC-5, Jason Plurad wrote:
Make sure you're using a mixed index for numeric range queries. Composite indexes are best for exact matching. The console session below shows the difference:

gremlin> graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje-lucene.properties')
==>standardjanusgraph[berkeleyje:/usr/lib/janusgraph-0.1.1-hadoop2/conf/../db/berkeley]
gremlin> mgmt = graph.openManagement()
==>org.janusgraph.graphdb.database.management.ManagementSystem@1c8f6a90
gremlin> lat = mgmt.makePropertyKey('lat').dataType(Integer.class).make()
==>lat
gremlin> latidx = mgmt.buildIndex('latidx', Vertex.class).addKey(lat).buildCompositeIndex()
==>latidx
gremlin> lon = mgmt.makePropertyKey('lon').dataType(Integer.class).make()
==>lon
gremlin> lonidx = mgmt.buildIndex('lonidx', Vertex.class).addKey(lon).buildMixedIndex('search')
==>lonidx
gremlin> mgmt.commit()
==>null
gremlin> v = graph.addVertex('code', 'rdu', 'lat', 35, 'lon', -78)
==>v[4184]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[berkeleyje:/usr/lib/janusgraph-0.1.1-hadoop2/conf/../db/berkeley], standard]
gremlin> g.V().has('lat', 35)
==>v[4184]
gremlin> g.V().has('lat', between(34, 36))
00:40:33 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [(lat >= 34 AND lat < 36)]. For better performance, use indexes
==>v[4184]
gremlin> g.V().has('lon', -78)
==>v[4184]
gremlin> g.V().has('lon', between(-79, -77))
==>v[4184]

-- Jason

On Wednesday, June 7, 2017 at 12:01:31 PM UTC-4, Gene Fojtik wrote:
Hello,

I have an index on a property "latitude"; when using it with the between clause, the index is not being utilized.

g.V().has("latitude", 33.333)  works well, however

g.V().has("latitude", between(33.889, 33.954))  does not use the indexes.

Any assistance would be appreciated..

-g


call queue is full on /0.0.0.0:60020, too many items queued? hbase

aoz...@...
 

Here is my problem:

We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1, and HBase 1.2.0.

I am trying to load 200 GB of graph data, and for that I run the following code in the Gremlin shell:

:load data/call-janusgraph-schema-groovy
writeGraphPath='conf/my-janusgraph-hbase.properties'
writeGraph=JanusGraphFactory.open(writeGraphPath)
defineCallSchema(writeGraph)
writeGraph.close()

readGraph=GraphFactory.open('conf/hadoop-graph/hadoop-call-script.properties')
gRead=readGraph.traversal()
gRead.V().valueMap()

//so far so good everything works perfectly

blvp=BulkLoaderVertexProgram.build().bulkLoader(OneTimeBulkLoader).intermediateBatchSize(10000).writeGraph(writeGraphPath).create(readGraph)
readGraph.compute(SparkGraphComputer).workers(512).program(blvp).submit().get()

It starts executing the Spark job, and Stage-0 runs smoothly; however, at Stage-1 I get an exception:

org.hbase.async.CallQueueTooBigException: Call queue is full on /0.0.0.0:60020, too many items queued ?

However, Spark recovers the failed tasks and completes Stage-1, and then Stage-2 completes flawlessly. Since Spark persists the previous results in memory, Stage-3 and Stage-4 are skipped and Stage-5 starts; Stage-5 gets the same CallQueueTooBigException exceptions, but Spark again recovers from them.

My problem is that this stage (Stage-5) takes too long to execute. It took 14 hours on my last run, and I killed the Spark job. I think this is really odd for such a small amount of input data (200 GB). Normally my cluster is so fast that I am able to load 3 TB of data into HBase (with bulk loading via MapReduce) in 1 hour. I tried to increase the number of workers:

readGraph.compute(SparkGraphComputer).workers(1024).program(blvp).submit().get()

however, this time the number of CallQueueTooBigException exceptions was so high that they did not let the Spark job recover from them.

Is there any way that I can decrease the runtime of the job?


Below I am giving extra material that may help you find the source of the problem.

Here is how I start the Gremlin shell:

#!/bin/bash

export JAVA_HOME=/mnt/hdfs/jdk.1.8.0_74
export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn
export YARN_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/hadoop-yarn
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf


GREMLINHOME=/mnt/hdfs/janusgraph-0.1.1-hadoop2

export CLASSPATH=$YARN_HOME/*:$YARN_CONF_DIR:$SPARK_HOME/lib/*:$SPARK_CONF_DIR:$CLASSPATH

cd $GREMLINHOME
export GREMLIN_LOG_LEVEL=info
exec $GREMLINHOME/bin/gremlin.sh $*




and here is my conf/hadoop-graph/hadoop-call-script.properties file:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.GraphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.inputLocation=/user/hive/warehouse/tablex/000000_0
gremlin.hadoop.scriptInputFormat.script=/user/me/janus/script-input-call.groovy
gremlin.hadoop.outputLocation=output
gremlin.hadoop.jarsInDistributedCache=true

spark.driver.maxResultSize=8192
spark.yarn.executor.memoryOverhead=5000
spark.executor.cores=1
spark.executor.instances=1024
spark.master=yarn-client
spark.executor.memory=20g
spark.driver.memory=20g
spark.serializer=org.apache.spark.serializer.JavaSerializer


conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024
cluster.max-partitions=1024
cluster.partition=true

ids.block-size=10000
storage.buffer-size=10000
storage.transactions=false
ids.num-partitions=1024

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5



Thx in advance,
Ali
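
Not a verified fix, but one knob that is sometimes worth experimenting with when region server call queues overflow is the number of concurrent writers, since 1024 single-core executors all mutate HBase at once (the value below is illustrative only):

# illustrative only: fewer concurrent writers means less simultaneous
# pressure on each region server's call queue, at the cost of a longer job
spark.executor.instances=256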
