We are using cloudera 5.7.0 with java 1.8.0_74 and we have spark 1.6.0, janusgraph 0.1.1, hbase 1.2.0.
I managed to bulkload 130GB of data into 1024 region hbase table in 2 hours 30 minute with 1024 spark executors (1-core,20gb memory). Each stage of blvp is configured to run 10240 tasks:
However I am unable to distribute the hbase data evenly across regions, they are pretty imbalanced. I suspect it is related to the conf value of ids.num-partitions.
Here is how I set the conf:
I even tried setting ids.num-partitions=10240; however the problem was not solved.
Should I still increase the ids.num-partitions value to an even higher value like 102400?
What is the difference cluster.max-partitions and ids.num-partitions. Is my problem related to cluster.max-partitions? Should I use it?
As far as I know ids.num-partitions value determines the number of randomly gathered prefixes that will be used in assigning ids to elements. And I read somewhere that setting ids.num-partitions to 10 times of region count will be enough; however it seems like that is not the case. And I do not want to increase ids.num-partitions further. Since I could not find any document related to the internals of cluster.max-partitions, I am really ignorant about it and need some help.
Thanks in advance,