Re: HBase unbalanced table regions after bulkload
HadoopMarc <m.c.d...@...>
Hi Ali,
I have never tried to optimize this myself, but maybe you should also look into the docs at
12.3.30. storage.hbase
...
Name                             | Description                                                                           | Datatype | Default value      | Mutability
storage.hbase.region-count       | The number of initial regions set when creating JanusGraph’s HBase table             | Integer  | (no default value) | MASKABLE
storage.hbase.regions-per-server | The number of regions per regionserver to set when creating JanusGraph’s HBase table | Integer  | (no default value) | MASKABLE
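For example, a minimal sketch (not something I have tested against 0.1.1 myself) of setting these two options programmatically when the graph is opened for the very first time, i.e. before the HBase table exists, since both options only apply when JanusGraph creates the table; the hostname, the table name and the value 16 are placeholders:

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraphWithRegionCount {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "hbase")
                .set("storage.hostname", "x.x.x.x")           // placeholder HBase/ZooKeeper host
                .set("storage.hbase.table", "myjanus")
                // Only honoured when JanusGraph creates the table, so set them
                // before the first open; afterwards they have no effect.
                .set("storage.hbase.region-count", 1024)
                .set("storage.hbase.regions-per-server", 16)  // placeholder value
                .open();
        graph.close();
    }
}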
Normally, HBase does not want many regions, but for maximum performance the number of regions times the HDFS replication factor should be at least the number of active datanodes. I think some imbalance is inevitable, as YARN will schedule executors unevenly and each executor will try to have local data access.
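Just to make that rule concrete, a toy calculation (the 40 datanodes and replication factor 3 are made-up numbers):

public class MinRegionCount {
    public static void main(String[] args) {
        int dataNodes = 40;       // assumed number of active datanodes
        int replication = 3;      // common HDFS replication factor
        // smallest r with r * replication >= dataNodes, i.e. ceil(dataNodes / replication)
        int minRegions = (dataNodes + replication - 1) / replication;
        System.out.println("need at least " + minRegions + " regions");  // prints 14
    }
}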
Further, you can look into HBase's region load balancer configuration, which enables HBase to move regions automatically.
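For what it's worth, here is a rough sketch of kicking off a balance round from the HBase 1.2 Java client; the ZooKeeper quorum is a placeholder, and the same can be done from the hbase shell with balance_switch true followed by balancer:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RunBalancer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "x.x.x.x");  // placeholder quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.setBalancerRunning(true, false);  // make sure the balancer is switched on
            boolean ran = admin.balancer();         // ask the master for one balance round
            System.out.println("balance round started: " + ran);
        }
    }
}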
HTH, Marc
On Thursday, June 15, 2017 at 16:04:28 UTC+2, Ali ÖZER wrote:
We are using Cloudera 5.7.0 with Java 1.8.0_74, and we have Spark 1.6.0, JanusGraph 0.1.1 and HBase 1.2.0. I managed to bulk load 130 GB of data into a 1024-region HBase table in 2 hours 30 minutes with 1024 Spark executors (1 core, 20 GB memory each). Each stage of blvp is configured to run 10240 tasks:

readGraph.compute(SparkGraphComputer).workers(10240).program(blvp).submit().get()

However, I am unable to distribute the HBase data evenly across regions; they are pretty imbalanced. I suspect it is related to the value of ids.num-partitions. Here is how I set the configuration in conf/my-janusgraph-hbase.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
storage.hbase.region-count=1024
ids.block-size=2000000
ids.renew-timeout=3600000
storage.buffer-size=10000
ids.num-partitions=1024
ids.partition=true
storage.hbase.table=myjanus
storage.hostname=x.x.x.x
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

I even tried setting ids.num-partitions=10240; however, the problem was not solved. Should I increase ids.num-partitions to an even higher value like 102400? What is the difference between cluster.max-partitions and ids.num-partitions? Is my problem related to cluster.max-partitions, and should I use it? As far as I know, ids.num-partitions determines the number of randomly gathered prefixes used when assigning ids to elements, and I read somewhere that setting ids.num-partitions to 10 times the region count would be enough; however, that does not seem to be the case, and I do not want to increase ids.num-partitions further. Since I could not find any documentation on the internals of cluster.max-partitions, I know very little about it and need some help.

Thanks in advance,
Best,
Ali