Hi Joe,

Thanks for reporting back your results and confirming the recipe for CDH. Your job execution times also now seem consistent with the ones I posted above.

As to your question whether these figures make sense: I think the loading part of OLAP jobs with HBaseInputFormat is way too slow and needs attention. At this point you are better off storing the vertex ids on HDFS, doing an RDD mapPartitions on these ids, and having each Spark executor make its own connection to JanusGraph and fetch the vertices it needs with low delay once all HBase caches are warm (I used this approach with Titan and will probably keep it for a while with JanusGraph). A rough sketch of this approach follows below.

I do not know what plans the JanusGraph team has for the HBaseInputFormat, but I figure they will wait for the future HBase 2.0.0 release, which will hopefully cover a number of relevant features, such as: https://issues.apache.org/jira/browse/HBASE-14789
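The sketch (untested; it assumes Spark 1.6 as bundled with CDH, a JavaSparkContext available as sc, and an illustrative HDFS path holding one vertex id per line):

import org.apache.spark.api.java.function.FlatMapFunction
import org.janusgraph.core.JanusGraphFactory

def ids = sc.textFile('hdfs:///user/graph/vertex-ids')   // illustrative path; one vertex id per line

def vertexMaps = ids.mapPartitions(new FlatMapFunction<Iterator<String>, Map>() {
    Iterable<Map> call(Iterator<String> idIter) {
        // one JanusGraph connection per partition, reused for all of that partition's ids
        def graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
        def g = graph.traversal()
        def out = []
        while (idIter.hasNext()) {
            out << g.V(Long.valueOf(idIter.next())).valueMap(true).next()
        }
        graph.close()
        return out
    }
})
println vertexMaps.count()

Cheers, Marc

On Tuesday, August 22, 2017 at 5:04:03 PM UTC+2, Joseph Obernberger wrote: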
Hi All - I rebuilt JanusGraph from git with the CDH 5.10.0 libraries (just modified the poms) and, using that library, created a new graph with 159,103,508 vertices and 278,901,629 edges. I then manually moved regions around in HBase and split them across our 5-server cluster into 88 regions (the original size was 22 regions). The vertex count test (g.V().count()) with Spark took 1.2 hours to run, and the edge count took a similar amount of time. I don't have an exact number, but doing it without Spark appears to have taken about as long. Honestly, I don't know if this is good or bad!
I replaced the jar files in the lib directory
with jars from CDH and then rebuilt the lib.zip file. My
configuration follows:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
# NB: gremlin.hadoop.memoryOutputFormat is defined twice; the later
# definition (GryoOutputFormat) overrides the earlier SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=output
gremlin.hadoop.outputLocation=output
log4j.rootLogger=WARNING, STDOUT
log4j.logger.deng=WARNING
log4j.appender.STDOUT=org.apache.log4j.ConsoleAppender
org.slf4j.simpleLogger.defaultLogLevel=warn
#
# JanusGraph HBase InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.63:2181,10.22.5.64:2181,10.22.5.65:2181
janusgraphmr.ioformat.conf.storage.hbase.table=FullSpark
janusgraphmr.ioformat.conf.storage.hbase.region-count=44
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server=5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names=false
janusgraphmr.ioformat.conf.storage.cache.db-cache-size = 0.5
zookeeper.znode.parent=/hbase
#
# SparkGraphComputer with Yarn Configuration
#
spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.driver.extraJavaOptions=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.master=yarn-cluster
spark.executor.memory=10240m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar,/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.appMasterEnv.CLASSPATH=/etc/hadoop/conf:/etc/hbase/conf:./lib.zip/*
#spark.executor.extraClassPath=/etc/hadoop/conf:/etc/hbase/conf:/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
spark.akka.frameSize=1024
spark.kryoserializer.buffer.max=1600m
spark.network.timeout=90000
spark.executor.heartbeatInterval=100000
spark.cores.max=5
#
# Relevant configs from spark-defaults.conf
#
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.ui.killEnabled=true
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:\
/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:\
/etc/hbase/conf:
spark.eventLog.dir=hdfs://host001:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://host001:18088
#spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
# NB: this later definition overrides spark.master=yarn-cluster above
spark.master=yarn-client
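(For reference, the counts themselves can then be run from the Gremlin console along these lines; the properties file name below is just a placeholder for wherever the configuration above is saved:)

graph = GraphFactory.open('conf/read-hbase-spark.properties')   // illustrative file name
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
g.E().count()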
Hope that helps!
-Joe
Hey Joseph, did your test succeed? Can you share your experience with me? Thanks.
On Tuesday, August 15, 2017 at 6:17:12 AM UTC+8, Joseph Obernberger wrote:
Marc - thank you for this. I'm going to try getting the
latest version of JanusGraph, and compiling it with our
specific version of Cloudera CDH, then run some tests.
Will report back.
-Joe
On 8/13/2017 4:07 PM, HadoopMarc wrote:
Hi Joe,
To shed some more light on the running figures you
presented, I ran some tests on my own cluster:
1. I loaded the default janusgraph-hbase table with the
following simple script from the console:
graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties")
g = graph.traversal()
m = 1200L    // number of batches (transactions)
n = 10000L   // vertex pairs added per batch
(0L..<m).each{
    (0L..<n).each{
        // create two vertices and connect them with two parallel edges
        v1 = g.addV().id().next()
        v2 = g.addV().id().next()
        g.V(v1).addE('link1').to(g.V(v2)).next()
        g.V(v1).addE('link2').to(g.V(v2)).next()
    }
    // commit once per batch of n vertex pairs
    g.tx().commit()
}
This script runs for about 20(?) minutes and results in 24M vertices and 24M edges committed to the graph (1200 × 10000 iterations, each adding two vertices and two edges).
2. I did an OLTP g.V().count() on this graph from the
console: 11 minutes first time, 10 minutes second time
3. I ran OLAP jobs on this graph using janusgraph-hbase in two ways:
a) with g = graph.traversal().withComputer(SparkGraphComputer)
b) with g = graph.traversal().withComputer(new Computer().graphComputer(SparkGraphComputer).workers(10))
the properties file was as in the recipe, with the exception of:
spark.executor.memory=4096m    # smaller values might work, but the 512m from the recipe is definitely too small
spark.executor.instances=4
# spark.executor.cores not set, so default value 1
This resulted in the following running times:
a) stages 0,1,2 => 12min, 12min, 3s => 24min total
b) stages 0,1,2 => 18min, 1min, 86ms => 19min total
Discussion:
- HBase is not an easy source for OLAP: HBase wants large regions for efficiency (configurable, but typically 2-20GB), while MapReduce input formats (like JanusGraph's HBaseInputFormat) take regions as input splits by default. This means that only a few executors will read from HBase unless the HBaseInputFormat is extended to split a region's keyspace into multiple input splits. This mismatch between the number of regions and the number of Spark executors is a potential JanusGraph issue. Examples exist to improve on this, e.g. org.apache.hadoop.hbase.mapreduce.RowCounter; see also the sketch after this list.
- For Spark stages after stage 0 (reading from HBase), increasing the number of Spark tasks with the workers() setting helps optimize the parallelization. This means that for traversals larger than just a vertex count, the parallelization with Spark will really pay off.
- I did not try to repeat your settings with a large number of cores. Various sources discourage spark.executor.cores values larger than 5, e.g. https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ and https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory
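To make the first bullet concrete, here is a rough, untested sketch of how a region's keyspace might be cut into several input splits. Class and constant names are illustrative; it assumes that HBaseInputFormat.getSplits() hands back HBase TableSplit instances (as the underlying HBase 1.x TableInputFormat does), and it passes the first and last region through unchanged because Bytes.split() cannot interpolate over their empty boundary keys:

import org.apache.hadoop.hbase.mapreduce.TableSplit
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.InputSplit
import org.apache.hadoop.mapreduce.JobContext
import org.janusgraph.hadoop.formats.hbase.HBaseInputFormat

class SubSplitHBaseInputFormat extends HBaseInputFormat {
    static final int PIECES = 4   // sub-splits per region; tune towards your executor count

    @Override
    List<InputSplit> getSplits(JobContext context) {
        def out = []
        super.getSplits(context).each { split ->
            def ts = (TableSplit) split
            if (ts.startRow.length == 0 || ts.endRow.length == 0) {
                out << ts   // first/last region: no defined boundary key to interpolate over
            } else {
                // Bytes.split(a, b, n) returns n + 2 boundary keys, i.e. PIECES sub-ranges
                byte[][] keys = Bytes.split(ts.startRow, ts.endRow, PIECES - 1)
                for (int i = 0; i < keys.length - 1; i++) {
                    out << new TableSplit(ts.table, keys[i], keys[i + 1], ts.regionLocation)
                }
            }
        }
        out
    }
}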
Hopefully, these tests provide you and other readers
with some additional perspectives on the configuration
of janusgraph-hbase.
Cheers, Marc
On Thursday, August 10, 2017 at 3:40:21 PM UTC+2, Joseph Obernberger wrote:
Thank you Marc.
I did not set spark.executor.instances, but I do have spark.cores.max set to 64, and YARN is configured to allow as much RAM/cores as possible for our 5-server cluster. When I run a job on a table that has 61 regions, I see that 43 tasks are started, running on all 5 nodes in the Spark UI (confirmed by running top on each of the servers). If I lower the amount of RAM (heap) that each task has (currently set to 10G), they fail with OutOfMemory exceptions. It still hits one HBase node very hard and cycles through them. While that may be a reason for a performance issue, it doesn't explain the massive number of calls that HBase receives for a count job, and why using SparkGraphComputer takes so much more time.
Running with your command below appears to not alter the behavior. I did run a job last night with DEBUG turned on, but it produced too much logging, filling up the log directory on 3 of the 5 nodes before stopping.
Thanks again Marc!
-Joe
On 8/10/2017 7:33 AM, HadoopMarc wrote:
Hi Joe,
Another thing to try (only tested on TinkerPop, not on JanusGraph): create the traversal source as follows:

g = graph.traversal().withComputer(new Computer().graphComputer(SparkGraphComputer).workers(100))

With HadoopGraph this helps hdfs files with very large or no partitions to be split across tasks; I did not check the effect yet for HBaseInputFormat in JanusGraph. And did you add spark.executor.instances=10 (or some suitable number) to your config? And did you check in the RM UI or Spark history server whether these executors were really allocated and started?
More later,
Marc
On Thursday, August 10, 2017 at 12:13:09 AM UTC+2, Joseph Obernberger wrote:
Marc - thank you. I've updated the classpath and removed nearly all of the CDH jars; I had to keep chimera and some of the HBase libs in there. Apart from those and all the jars in lib.zip, it is working as it did before. The reason I turned DEBUG off was that it was producing 100+ GBytes of logs, nearly all of which are things like:
18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore - Generated HBase Filter ColumnRangeFilter [\x10\xC0, \x10\xC1)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000
18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore - Generated HBase Filter ColumnRangeFilter [\x10\xC2, \x10\xC3)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx - Guava vertex cache size: requested=20000 effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache - Created vertex cache with max size 20000
Do those mean anything to you? I've turned it back on for running with smaller graph sizes, but so far I don't see anything helpful there, apart from an exception about not setting HADOOP_HOME.
Here are the Spark properties; notice the nice and small extraClassPath! :)
Name | Value
gremlin.graph | org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.deriveMemory | false
gremlin.hadoop.graphReader | org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter | org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.graphWriter.hasEdges | false
gremlin.hadoop.inputLocation | none
gremlin.hadoop.jarsInDistributedCache | true
gremlin.hadoop.memoryOutputFormat | org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation | output
janusgraphmr.ioformat.conf.storage.backend | hbase
janusgraphmr.ioformat.conf.storage.hbase.region-count | 5
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server | 5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names | false
janusgraphmr.ioformat.conf.storage.hbase.table | TEST0.2.0
janusgraphmr.ioformat.conf.storage.hostname | 10.22.5.65:2181
log4j.appender.STDOUT | org.apache.log4j.ConsoleAppender
log4j.logger.deng | WARNING
log4j.rootLogger | STDOUT
org.slf4j.simpleLogger.defaultLogLevel | warn
spark.akka.frameSize | 1024
spark.app.id | application_1502118729859_0041
spark.app.name | Apache TinkerPop's Spark-Gremlin
spark.authenticate | false
spark.cores.max | 64
spark.driver.appUIAddress | http://10.22.5.61:4040
spark.driver.extraJavaOptons | -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.driver.extraLibraryPath | /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.driver.host | 10.22.5.61
spark.driver.port | 38529
spark.dynamicAllocation.enabled | true
spark.dynamicAllocation.executorIdleTimeout | 60
spark.dynamicAllocation.minExecutors | 0
spark.dynamicAllocation.schedulerBacklogTimeout | 1
spark.eventLog.dir | hdfs://host001:8020/user/spark/applicationHistory
spark.eventLog.enabled | true
spark.executor.extraClassPath | /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:/etc/hbase/conf:
spark.executor.extraJavaOptions | -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.executor.extraLibraryPath | /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.heartbeatInterval | 100000
spark.executor.id | driver
spark.executor.memory | 10240m
spark.externalBlockStore.folderName | spark-27dac3f3-dfbc-4f32-b52d-ececdbcae0db
spark.kyroserializer.buffer.max | 1600m
spark.master | yarn-client
spark.network.timeout | 90000
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS | host005
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES | http://host005:8088/proxy/application_1502118729859_0041
spark.scheduler.mode | FIFO
spark.serializer | org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled | true
spark.shuffle.service.port | 7337
spark.ui.filters | org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.ui.killEnabled | true
spark.yarn.am.extraLibraryPath | /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH | /etc/haddop/conf:/etc/hbase/conf:./lib.zip/*
spark.yarn.config.gatewayPath | /opt/cloudera/parcels
spark.yarn.config.replacementPath | {{HADOOP_COMMON_HOME}}/../../..
spark.yarn.dist.archives | /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files | /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars | /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.historyServer.address | http://host001:18088
zookeeper.znode.parent | /hbase
-Joe
On 8/9/2017 3:33 PM, HadoopMarc wrote:
Hi Gari and Joe,
Glad to see you testing the recipes for MapR and Cloudera respectively! I am sure you have realized by now that getting this to work is like walking through a minefield. If you deviate from the known path, the odds of getting through are slim, and no one wants to be in your vicinity. So, if you see a need to deviate (which there may be for the Hadoop distributions you use), you will need your mine sweeper, that is, set the logging level to DEBUG for the relevant Java packages.

This is where you deviated:
- for Gari: you put all kinds of MapR lib folders on the application master's classpath (other classpath configs are not visible from your post)
- for Joe: you put all kinds of Cloudera lib folders on the executors' classpath (worst of all the spark-assembly.jar)

Probably, you are experiencing all kinds of mismatches in netty libraries which slow down or even kill all comms between the YARN containers. The philosophy of the recipes really is to add only the minimum number of conf folders and jars to the TinkerPop/JanusGraph distribution and see from there if any libraries are missing.

On my side, it has become apparent that I should at least add to the recipes:
- proof of work for a medium-sized graph (say 10M vertices and edges)
- configs for the number of executors present in the OLAP job (instead of relying on the Spark default of 2)

So, still some work to do!
Cheers, Marc