Re: HBase unbalanced table regions after bulkload
I think it has nothing to do with the region count, and HBase does not ignore any region under any circumstances. Since my regions are imbalanced (only 100 regions have data in them), the data size per region is not outputsize/1024; it is outputsize/100.
My output size is not equal to my input size; it is 330 GB, so the data size per region is not 130 MB but about 3 GB. Nevertheless, as I said, I do not think the problem is with the region count, because I managed to improve the balance of my regions by doing the following:
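To make the per-region arithmetic concrete, here is a small sketch using only the figures quoted above. Decimal units (1 GB = 1000 MB) and the helper name size_per_region_gb are assumptions made for illustration.

```python
# Figures quoted in the post; decimal units (1 GB = 1000 MB) are an assumption.
OUTPUT_SIZE_GB = 330
TOTAL_REGIONS = 1024
NON_EMPTY_REGIONS = 100


def size_per_region_gb(total_gb, regions):
    """Average data size per region if the data is spread evenly over `regions`."""
    return total_gb / regions


# What an even spread over all regions would give.
balanced_gb = size_per_region_gb(OUTPUT_SIZE_GB, TOTAL_REGIONS)
# What the regions actually hold when only 100 of them contain data.
actual_gb = size_per_region_gb(OUTPUT_SIZE_GB, NON_EMPTY_REGIONS)

print(f"evenly spread over 1024 regions: {balanced_gb * 1000:.0f} MB per region")
print(f"spread over 100 non-empty regions: {actual_gb:.1f} GB per region")
```

With 330 GB of output, an even spread would be roughly 322 MB per region, while the 100 non-empty regions hold about 3.3 GB each on average.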
I learned that the whole thing comes down to the cluster.max-partitions parameter. Its default value is 32. I set it to 256, changed nothing else, re-ran the BulkLoaderVertexProgram (blvp), and saw that the non-empty region count increased from 100 to 256. (When the parameter was 32, blvp loaded data into only 32 regions; HBase then automatically split the oversized regions, and the number of non-empty regions became 100.) So to fill all 1024 of my regions, I need to set cluster.max-partitions to 1024.
However, there is one problem: when I increased cluster.max-partitions from 32 to 256, the run time of my BulkLoaderVertexProgram increased fivefold. I used to load the whole data set in almost 2 hours; now it takes almost 10 hours. I think it is because each Spark executor is trying to write to 1024 regions all at once, and I have 1024 Spark executors; that means a lot of network traffic, on the order of 1024*1024 executor-to-region connections.
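The guess about network fan-out can be put into numbers. This is only a sketch of the hypothesis in this post, not confirmed behaviour of BulkLoaderVertexProgram: it assumes every executor writes to every target partition or region, and the function name fan_out is made up for the example.

```python
EXECUTORS = 1024  # number of Spark executors, as stated in the post


def fan_out(executors, targets):
    """Upper bound on executor-to-target pairs if every executor writes to every target.

    Assumption: the all-to-all write pattern is a hypothesis, not verified behaviour.
    """
    return executors * targets


# Compare the default setting, the tested setting, and the planned setting.
for targets in (32, 256, 1024):
    print(f"{targets:>4} targets -> {fan_out(EXECUTORS, targets):>9,} executor-to-target pairs")

# Observed slowdown reported in the post: about 10 hours versus about 2 hours.
observed_slowdown = 10 / 2
print(f"observed slowdown: {observed_slowdown:.0f}x")
```

Under that assumption, going from 32 to 256 targets multiplies the number of pairs by 8, which is in the same range as the observed fivefold slowdown; going to 1024 would multiply it by 32.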
Since I do not know the internals of blvp and Janus ID assignment, I am not one hundred percent sure about any of this.
Does anybody here know the internals of Janus? I would really appreciate the help, and I am pretty sure that knowledge would help me solve the problem.
On Friday, 16 June 2017 at 14:55:00 UTC+3, mar...@... wrote: