Re: HBase unbalanced table regions after bulkload


HadoopMarc <m.c.d...@...>
 


Hi Ali,

I have never tried to optimize this myself, but maybe you should also look into the docs at

12.3.30. storage.hbase


...

storage.hbase.region-count


The number of initial regions set when creating JanusGraph’s HBase table

Integer

(no default value)

MASKABLE

storage.hbase.regions-per-server

The number of regions per regionserver to set when creating JanusGraph’s HBase table

Integer

(no default value)

MASKABLE



Normally, HBase does not want many regions, but the number of regions times hdfs replication factor should be at least the number of active datanodes for maximum performance. I think some unbalance is inevitable as yarn will schedule executors unevenly and each executor will try to have local data access.

Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.

HTH,    Marc

Op donderdag 15 juni 2017 16:04:28 UTC+2 schreef Ali ÖZER:

We are using cloudera 5.7.0 with java 1.8.0_74 and we have spark 1.6.0, janusgraph 0.1.1, hbase 1.2.0.

I managed to bulkload 130GB of data into 1024 region hbase table in 2 hours 30 minute with 1024 spark executors (1-core,20gb memory). Each stage of blvp is configured to run 10240 tasks:
readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get().

However I am unable to distribute the hbase data evenly across regions, they are pretty imbalanced. I suspect it is related to the conf value of ids.num-partitions.

Here is how I set the conf:

conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024

ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true

storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however the problem was not solved.

Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference cluster.max-partitions and ids.num-partitions. Is my problem related to cluster.max-partitions? Should I use it?

As far as I know ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. And I read somewhere that setting ids.num-partitions to 10 times of region count will be enough; however it seems like that is not the case. And I do not want to increase ids.num-partitions further. Since I could not find any document related to the internals of cluster.max-partitions, I am really ignorant about it and need some help.

Thanks in advance,
Best,
Ali




Join janusgraph-users@lists.lfaidata.foundation to automatically receive all group messages.