Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits

Roy Yu <7604...@...>

Hi Marc, 

The parameter  hbase.mapreduce.tableinput.mappers.per.region  can be effective. I set it to 40, and there are 40 tasks processing every region. But here comes the new promblem--the data skew. I use g.E().count() to count all the edges of the graph. During counting one region, one spark task containing all 2.6GB data, while other 39 tasks containing 0 data. The task failed again.  I checked my data. There are some vertices which have more 1 million incident edges.  So I tried to solve this promblem using vertex cut(, my graph schema is something like  [mgmt.makeVertexLabel('product').partition().make() ]. But when I using MR to load data to the new graph, it consumed more than 10 times when the attemp without using partition(), from the hbase table detail page, I found the data loading process was busy reading data from  and writing data to the first region. The first region became the hot spot. I guess it relates to vertex ids. Could help me again?

On Tuesday, December 8, 2020 at 3:13:42 PM UTC+8 HadoopMarc wrote:
Hi Roy,

As I mentioned, I did not keep up with possibly new janusgraph-hbase features. From the HBase source, I see that HBase now has a "hbase.mapreduce.tableinput.mappers.per.region" config parameter.

It should not be too difficult to adapt the janusgraph HBaseInputFormat to leverage this feature (or maybe it even works without change???).

Best wishes,


Op dinsdag 8 december 2020 om 04:21:19 UTC+1 schreef Roy Yu:
you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see Fact of life. 
Sorry Marc I misled you. Error Message was generated when I set spark.executor.memory to 30G, when it failed, I increased spark.executor.memory  to 40G, it failed either. I felt desperate and come here to ask for help
On Tuesday, December 8, 2020 at 10:35:19 AM UTC+8 Roy Yu wrote:
Hi Marc

Thanks for your immediate response.
I've tried to set spark.yarn.executor.memoryOverhead=10G and re-run the task, and it stilled failed. From the spark task UI, I saw 80% of processing time is Full GC time. As you said, 2.6GB(GZ compressed) exploding is  my root cause. Now I'm trying to reduce my region size to 1GB, if that will still fail, I'm gonna config the hbase hfile not using compressed format.
This was my first time running janusgraph OLAP, and I think this is a common promblom, as HBase region size 2.6GB(compressed) is not large, 20GB is very common in our production. If the community dose not solve the promblem, the Janusgraph HBase based OLAP solution cannot be adopted by other companies either.

On Tuesday, December 8, 2020 at 12:40:40 AM UTC+8 HadoopMarc wrote:
Hi Roy,

There seem to be three things bothering you here:
  1. you did not specify spark.yarn.executor.memoryOverhead, as the exception message says. Easily solved.
  2. you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see Fact of life.
  3. the janusgraph HBaseInputFormat use sentire HBase regions as hadoop partitions, which are fed into spark tasks. The 2.6Gb region size is for compressed binary data which explodes when expanded into java objects. This is your real problem.
I did not follow the latest status of janusgraph-hbase features for the HBaseInputFormat, but you have to somehow use spark with smaller partitions than an entire HBase region.
A long time ago, I had success with skipping the HBaseInputFormat and have spark executors connect to JanusGraph themselves. That is not a quick solution, though.

Best wishes,


Op maandag 7 december 2020 om 14:10:55 UTC+1 schreef Roy Yu:
Error message:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 33.1 GB of 33 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. 

 graph conifg:
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log

Region info:
hdfs dfs -du -h /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc
67     134    /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.regioninfo
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.tmp
2.6 G  5.1 G  /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/e
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/f
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/g
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/h
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/i
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/l
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/m
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/recovered.edits
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/s
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/t

Anybody who can help me?

Join { to automatically receive all group messages.