Re: Configuring TTL in edges and vertices of graph
Jason Plurad <plu...@...>
It doesn't look like you committed the management transaction with mgmt.commit()
http://docs.janusgraph.org/latest/advanced-schema.html#_vertex_ttl
On Wednesday, August 23, 2017 at 8:56:43 AM UTC-4, Abhay Tibrewal wrote: I was able to set TTL for vertices and edges, but even after the time passed, the vertex was not removed from the database. My storage backend is Cassandra. Do we have to configure anything in Cassandra to activate TTL through JanusGraph?
I used the following code:
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-solr.properties')
mgmt = graph.openManagement()
tweet = mgmt.makeVertexLabel('tweet').setStatic().make()
mgmt.setTTL(tweet, Duration.ofMinutes(2))
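For reference, a complete version of the snippet with that commit added might look like this (a sketch; the properties file path is the one from the original post):
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-solr.properties')
mgmt = graph.openManagement()
// TTL can only be set on a static vertex label
tweet = mgmt.makeVertexLabel('tweet').setStatic().make()
mgmt.setTTL(tweet, Duration.ofMinutes(2))
// without this commit the TTL definition is never persisted to the schema
mgmt.commit()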
|
|
Re: hey guys, how to query a person's relational depth
Jason Plurad <plu...@...>
There's a recipe for this http://tinkerpop.apache.org/docs/current/recipes/#_maximum_depth
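A sketch along the lines of that recipe (not a verbatim copy; the 'person', 'name' and 'knows' labels are placeholders for whatever the graph actually uses):
// longest simple path reachable from the starting person;
// path().count(local) counts the vertices in each path, so subtract 1 for the hop count
g.V().has('person', 'name', 'marko').
  repeat(out('knows').simplePath()).emit().
  path().count(local).max()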
On Wednesday, August 23, 2017 at 3:52:14 AM UTC-4, 李平 wrote: I want to know, for one person in JanusGraph, his relational depth, using Gremlin.
|
|
Re: Can BulkLoaderVertexProgram also add mixed indexes
Jason Plurad <plu...@...>
The class org.janusgraph.diskstorage.es.ElasticSearchIndex is in janusgraph-es-0.1.1.jar. If you're getting a NoClassDefFoundError, there's really not much more we can tell you other than be completely certain that the jar is on the appropriate classpath. Did you add janusgraph-*.jar only or did you add all jars in the $JANUSGRAPH_HOME/lib directory?
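Note that "Could not initialize class" usually means the class was found but its static initializer failed, often because a transitive dependency (for example the Elasticsearch client jars shipped in the same lib directory) is missing on the executors. A quick way to check from a console on the node in question, as a sketch:
// throws ClassNotFoundException if the jar is absent, or ExceptionInInitializerError /
// NoClassDefFoundError if the class is present but one of its dependencies is not
Class.forName('org.janusgraph.diskstorage.es.ElasticSearchIndex')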
On Tuesday, August 22, 2017 at 1:28:18 PM UTC-4, mystic m wrote: Hi,
I am exploring JanusGraph bulk load via SparkGraphComputer. JanusGraph has been set up as a plugin to the TinkerPop server and console, with HBase as the underlying storage and Elasticsearch as the external index store. I am running this setup on a MapR cluster and had to recompile JanusGraph to resolve Guava-specific conflicts (shaded Guava with relocation).
Next I am trying out the example BulkLoaderVertexProgram code provided in Chapter 33. It works fine as long as I only have composite and vertex-centric indexes in my schema, but as soon as I define mixed indexes and execute the same code, I end up with the following exception in my Spark job, in stage 2 of job 1:
java.lang.NoClassDefFoundError: Could not initialize class org.janusgraph.diskstorage.es.ElasticSearchIndex
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:56)
at org.janusgraph.diskstorage.Backend.getImplementationClass(Backend.java:477)
at org.janusgraph.diskstorage.Backend.getIndexes(Backend.java:464)
at org.janusgraph.diskstorage.Backend.<init>(Backend.java:149)
at org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.getBackend(GraphDatabaseConfiguration.java:1850)
at org.janusgraph.graphdb.database.StandardJanusGraph.<init>(StandardJanusGraph.java:134)
I have verified that all JanusGraph-specific jars are on the Spark executor classpath, and mixed indexes work fine with the Graph of the Gods example.
First I want to understand: is it the right approach to use BulkLoaderVertexProgram to populate mixed indexes, or should I load the data first and build the indexes afterwards?
Let me know if any additional info is required to dig deeper.
~mbaxi
|
|
Configuring TTL in edges and vertices of graph
I was able to set TTL for vertices and edges, but even after the time passed, the vertex was not removed from the database. My storage backend is Cassandra. Do we have to configure anything in Cassandra to activate TTL through JanusGraph?
I used the following code:
graph = JanusGraphFactory.open('conf/janusgraph-cassandra-solr.properties')
mgmt = graph.openManagement()
tweet = mgmt.makeVertexLabel('tweet').setStatic().make()
mgmt.setTTL(tweet, Duration.ofMinutes(2))
|
|
hey guys, how to query a person's relational depth
I want to know, for one person in JanusGraph, his relational depth, using Gremlin.
|
|
Re: New committers: Robert Dale, Paul Kendall, Samant Maharaj
Robert, Paul and Samant - Thanks for the great work you've put into JanusGraph and welcome aboard!
On Tuesday, August 22, 2017 at 6:32:27 AM UTC-5, Jason Plurad wrote: On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome 3 new committers on the project! Here they are in alphabetical order by last name.
Robert Dale: Robert has been a solid contributor, and his contributions are across the board -- triaging issues, submitting/reviewing pull requests, and answering questions on the Google groups. He's also on the Apache TinkerPop PMC. Paul Kendall and Samant Maharaj: Paul and Samant contributed the CQL storage adapter. This is a pretty big achievement and
helps steer JanusGraph towards future compatibility with Cassandra 4.0. They are continuing work on cleaning up the Cassandra source code tree that will help make testing it easier and better.
|
|
Re: New committers: Robert Dale, Paul Kendall, Samant Maharaj
Congratulations and welcome!
On Tue, Aug 22, 2017 at 3:42 PM, sjudeng <sju...@...> wrote: Robert, Paul and Samant - Thanks for the great work you've put into JanusGraph and welcome aboard!
On Tuesday, August 22, 2017 at 6:32:26 AM UTC-5, Jason Plurad wrote:
On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome 3 new committers on the project! Here they are in alphabetical order by last name.
Robert Dale: Robert has been a solid contributor, and his contributions are across the board -- triaging issues, submitting/reviewing pull requests, and answering questions on the Google groups. He's also on the Apache TinkerPop PMC.
Paul Kendall and Samant Maharaj: Paul and Samant contributed the CQL storage adapter. This is a pretty big achievement and helps steer JanusGraph towards future compatibility with Cassandra 4.0. They are continuing work on cleaning up the Cassandra source code tree that will help make testing it easier and better.
Congratulations to all!
|
|
Can BulkLoaderVertexProgram also add mixed indexes
Hi,
I am exploring JanusGraph bulk load via SparkGraphComputer. JanusGraph has been set up as a plugin to the TinkerPop server and console, with HBase as the underlying storage and Elasticsearch as the external index store. I am running this setup on a MapR cluster and had to recompile JanusGraph to resolve Guava-specific conflicts (shaded Guava with relocation).
Next I am trying out the example BulkLoaderVertexProgram code provided in Chapter 33. It works fine as long as I only have composite and vertex-centric indexes in my schema, but as soon as I define mixed indexes and execute the same code, I end up with the following exception in my Spark job, in stage 2 of job 1:
java.lang.NoClassDefFoundError: Could not initialize class org.janusgraph.diskstorage.es.ElasticSearchIndex
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.janusgraph.util.system.ConfigurationUtil.instantiate(ConfigurationUtil.java:56)
at org.janusgraph.diskstorage.Backend.getImplementationClass(Backend.java:477)
at org.janusgraph.diskstorage.Backend.getIndexes(Backend.java:464)
at org.janusgraph.diskstorage.Backend.<init>(Backend.java:149)
at org.janusgraph.graphdb.configuration.GraphDatabaseConfiguration.getBackend(GraphDatabaseConfiguration.java:1850)
at org.janusgraph.graphdb.database.StandardJanusGraph.<init>(StandardJanusGraph.java:134)
I have verified that all JanusGraph-specific jars are on the Spark executor classpath, and mixed indexes work fine with the Graph of the Gods example.
First I want to understand: is it the right approach to use BulkLoaderVertexProgram to populate mixed indexes, or should I load the data first and build the indexes afterwards?
Let me know if any additional info is required to dig deeper.
~mbaxi
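As a side note on the mixed-index part of the question: mixed indexes are normally defined once through the JanusGraph management API against the target graph, independently of how the data is bulk loaded afterwards. A minimal sketch from the Gremlin console, assuming an index backend configured under the name "search" and a hypothetical "name" property key:
mgmt = graph.openManagement()
// reuse the property key if it already exists, otherwise create it
name = mgmt.getPropertyKey('name') ?: mgmt.makePropertyKey('name').dataType(String.class).make()
mgmt.buildIndex('nameMixed', Vertex.class).addKey(name).buildMixedIndex('search')
mgmt.commit()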
|
|
Re: New committers: Robert Dale, Paul Kendall, Samant Maharaj
Misha Brukman <mbru...@...>
Robert, Paul and Samant — thank you for the great work and welcome!
On Tue, Aug 22, 2017 at 7:32 AM, Jason Plurad <plu...@...> wrote: On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome 3 new committers on the project! Here they are in alphabetical order by last name.
Robert Dale: Robert has been a solid contributor, and his contributions are across the board -- triaging issues, submitting/reviewing pull requests, and answering questions on the Google groups. He's also on the Apache TinkerPop PMC. Paul Kendall and Samant Maharaj: Paul and Samant contributed the CQL storage adapter. This is a pretty big achievement and
helps steer JanusGraph towards future compatibility with Cassandra 4.0. They are continuing work on cleaning up the Cassandra source code tree that will help make testing it easier and better.
|
|
Re: [BLOG] Configuring JanusGraph for spark-yarn
Joe Obernberger <joseph.o...@...>
Hi All - I rebuilt JanusGraph from git with the CDH 5.10.0 libraries (just modified the poms) and, using that library, created a new graph with 159,103,508 vertices and 278,901,629 edges. I then manually moved regions around in HBase and did splits across our 5-server cluster into 88 regions. The original size was 22 regions. The test (g.V().count()) took 1.2 hours to run with Spark, and a similar amount of time for the edge count. I don't have an exact number, but it looks like doing it without Spark took a similar time. Honestly, I don't know if this is good or bad!
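For readers following along, the Spark-based count was presumably run roughly like this from the Gremlin console (a sketch; the hadoop-graph properties file name is hypothetical and would contain the HBaseInputFormat configuration shown below):
graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()   // vertex count as an OLAP job on YARN
g.E().count()   // edge count, run the same way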
I replaced the jar files in the lib directory with jars from CDH and then rebuilt the lib.zip file. My configuration follows:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=output
gremlin.hadoop.outputLocation=output
log4j.rootLogger=WARNING, STDOUT
log4j.logger.deng=WARNING
log4j.appender.STDOUT=org.apache.log4j.ConsoleAppender
org.slf4j.simpleLogger.defaultLogLevel=warn
#
# JanusGraph HBase InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.63:2181,10.22.5.64:2181,10.22.5.65:2181
janusgraphmr.ioformat.conf.storage.hbase.table=FullSpark
janusgraphmr.ioformat.conf.storage.hbase.region-count=44
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server=5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names=false
janusgraphmr.ioformat.conf.storage.cache.db-cache-size = 0.5
zookeeper.znode.parent=/hbase
#
# SparkGraphComputer with Yarn Configuration
#
spark.executor.extraJavaOptions=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.driver.extraJavaOptons=-XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.master=yarn-cluster
spark.executor.memory=10240m
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
spark.yarn.dist.archives=/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar,/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.appMasterEnv.CLASSPATH=/etc/haddop/conf:/etc/hbase/conf:./lib.zip/*
#spark.executor.extraClassPath=/etc/hadoop/conf:/etc/hbase/conf:/home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/native:/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib/native/Linux-amd64-64
spark.akka.frameSize=1024
spark.kyroserializer.buffer.max=1600m
spark.network.timeout=90000
spark.executor.heartbeatInterval=100000
spark.cores.max=5
#
# Relevant configs from spark-defaults.conf
#
spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.ui.killEnabled=true
spark.executor.extraClassPath=/opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:\
/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:\
/etc/hbase/conf:
spark.eventLog.dir=hdfs://host001:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://host001:18088
#spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client
Hope that helps!
-Joe
Hey Joseph, did your test succeed? Can you share your experience with me? Thanks.
On Tuesday, August 15, 2017 at 6:17:12 AM UTC+8, Joseph Obernberger wrote:
Marc - thank you for this. I'm going to try getting the
latest version of JanusGraph, and compiling it with our
specific version of Cloudera CDH, then run some tests.
Will report back.
-Joe
On 8/13/2017 4:07 PM, HadoopMarc wrote:
Hi Joe,
To shed some more light on the running figures you
presented, I ran some tests on my own cluster:
1. I loaded the default janusgraph-hbase table with the
following simple script from the console:
graph=JanusGraphFactory.open("conf/janusgraph-hbase.properties")
g = graph.traversal()
m = 1200L
n = 10000L
(0L..<m).each{
(0L..<n).each{
v1 = g.addV().id().next()
v2 = g.addV().id().next()
g.V(v1).addE('link1').to(g.V(v2)).next()
g.V(v1).addE('link2').to(g.V(v2)).next()
}
g.tx().commit()
}
This script runs for about 20(?) minutes and results in 24M
vertices and edges committed to the graph.
2. I did an OLTP g.V().count() on this graph from the
console: 11 minutes first time, 10 minutes second time
3. I ran OLAP jobs on this graph using janusgraph-hbase
in two ways:
a) with g =
graph.traversal().withComputer(SparkGraphComputer)
b) with g =
graph.traversal(). withComputer(new Computer().graphComputer( SparkGraphComputer).workers( 10))
the properties file was as in the recipe, with the
exception of:
spark.executor.memory=4096m # smaller
values might work, but the 512m from the recipe is
definitely too small
spark.executor.instances=4
#spark.executor.cores not set, so default value 1
This resulted in the following running times:
a) stage 0,1,2 => 12min, 12min, 3s => 24min
total
b) stage 0,1,2 => 18min, 1min, 86ms => 19 min
total
Discussion:
- HBase is not an easy source for OLAP: HBase wants
large regions for efficiency (configurable, but
typically 2-20GB), while mapreduce inputformats
(like janusgraph's HBaseInputFormat) take regions as
inputsplits by default. This means that only a few
executors will read from HBase unless the
HBaseInputFormat is extended to split a region's
keyspace into multiple inputsplits. This mismatch
between the numbers of regions and spark executors
is a potential JanusGraph issue. Examples exist to
improve on this, e.g.
org.apache.hadoop.hbase.mapreduce.RowCounter
- For spark stages after stage 0 (reading from
HBase), increasing the number of spark tasks with
the "workers()" setting helps optimizing the
parallelization. This means that for larger
traversals than just a vertex count, the
parallelization with spark will really pay off.
- I did not try to repeat your settings with a large
number of cores. Various sources discourage the use
of spark.executor.cores values larger than 5, e.g.
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/,
https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory
Hopefully, these tests provide you and other readers
with some additional perspectives on the configuration
of janusgraph-hbase.
Cheers, Marc
Op donderdag 10 augustus 2017 15:40:21 UTC+2 schreef
Joseph Obernberger:
Thank you Marc.
I did not set spark.executor.instances, but I do
have spark.cores.max set to 64 and within YARN, it
is configured to allow has much RAM/cores for our
5 server cluster. When I run a job on a table
that has 61 regions, I see that 43 tasks are
started and running on all 5 nodes in the Spark UI
(and running top on each of the servers). If I
lower the amount of RAM (heap) that each tasks has
(currently set to 10G), they fail with OutOfMemory
exceptions. It still hits one HBase node very
hard and cycles through them. While that may be a
reason for a performance issue, it doesn't explain
the massive number of calls that HBase receives
for a count job, and why using SparkGraphComputer
takes so much more time.
Running with your command below appears to not
alter the behavior. I did run a job last night
with DEBUG turned on, but it produced too much
logging filling up the log directory on 3 of the 5
nodes before stopping.
Thanks again Marc!
-Joe
On 8/10/2017 7:33 AM, HadoopMarc wrote:
Hi Joe,
Another thing to try (only tested on Tinkerpop,
not on JanusGraph): create the traversalsource
as follows:
g = graph.traversal(). withComputer(new
Computer().graphComputer( SparkGraphComputer).workers( 100))
With HadoopGraph this helps hdfs files with very
large or no partitions to be split across tasks;
I did not check the effect yet for
HBaseInputFormat in JanusGraph. And did you add
spark.executor.instances=10 (or some suitable
number) to your config? And did you check in the
RM ui or Spark history server whether these
executors were really allocated and started?
More later,
Marc
Op donderdag 10 augustus 2017 00:13:09 UTC+2
schreef Joseph Obernberger:
Marc - thank you. I've updated the
classpath and removed nearly all of the
CDH jars; had to keep chimera and some of
the HBase libs in there. Apart from those
and all the jars in lib.zip, it is working
as it did before. The reason I turned
DEBUG off was because it was producing
100+GBytes of logs. Nearly all of which
are things like:
18:04:29 DEBUG
org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore
- Generated HBase Filter ColumnRangeFilter
[\x10\xC0, \x10\xC1)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx
- Guava vertex cache size: requested=20000
effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created dirty vertex map with initial
size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created vertex cache with max size 20000
18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore
- Generated HBase Filter ColumnRangeFilter
[\x10\xC2, \x10\xC3)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx
- Guava vertex cache size: requested=20000
effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created dirty vertex map with initial
size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created vertex cache with max size 20000
Do those mean anything to you? I've turned
it back on for running with smaller graph
sizes, but so far I don't see anything
helpful there apart from an exception about
not setting HADOOP_HOME.
Here are the spark properties; notice the
nice and small extraClassPath! :)
gremlin.graph = org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.deriveMemory = false
gremlin.hadoop.graphReader = org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter = org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.graphWriter.hasEdges = false
gremlin.hadoop.inputLocation = none
gremlin.hadoop.jarsInDistributedCache = true
gremlin.hadoop.memoryOutputFormat = org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation = output
janusgraphmr.ioformat.conf.storage.backend = hbase
janusgraphmr.ioformat.conf.storage.hbase.region-count = 5
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server = 5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names = false
janusgraphmr.ioformat.conf.storage.hbase.table = TEST0.2.0
janusgraphmr.ioformat.conf.storage.hostname = 10.22.5.65:2181
log4j.appender.STDOUT = org.apache.log4j.ConsoleAppender
log4j.logger.deng = WARNING
log4j.rootLogger = STDOUT
org.slf4j.simpleLogger.defaultLogLevel = warn
spark.akka.frameSize = 1024
spark.app.id = application_1502118729859_0041
spark.app.name = Apache TinkerPop's Spark-Gremlin
spark.authenticate = false
spark.cores.max = 64
spark.driver.appUIAddress = http://10.22.5.61:4040
spark.driver.extraJavaOptons = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.driver.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.driver.host = 10.22.5.61
spark.driver.port = 38529
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.executorIdleTimeout = 60
spark.dynamicAllocation.minExecutors = 0
spark.dynamicAllocation.schedulerBacklogTimeout = 1
spark.eventLog.dir = hdfs://host001:8020/user/spark/applicationHistory
spark.eventLog.enabled = true
spark.executor.extraClassPath = /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:/etc/hbase/conf:
spark.executor.extraJavaOptions = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.executor.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.heartbeatInterval = 100000
spark.executor.id = driver
spark.executor.memory = 10240m
spark.externalBlockStore.folderName = spark-27dac3f3-dfbc-4f32-b52d-ececdbcae0db
spark.kyroserializer.buffer.max = 1600m
spark.master = yarn-client
spark.network.timeout = 90000
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS = host005
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES = http://host005:8088/proxy/application_1502118729859_0041
spark.scheduler.mode = FIFO
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled = true
spark.shuffle.service.port = 7337
spark.ui.filters = org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.ui.killEnabled = true
spark.yarn.am.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH = /etc/haddop/conf:/etc/hbase/conf:./lib.zip/*
spark.yarn.config.gatewayPath = /opt/cloudera/parcels
spark.yarn.config.replacementPath = {{HADOOP_COMMON_HOME}}/../../..
spark.yarn.dist.archives = /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files = /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars = /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.historyServer.address = http://host001:18088
zookeeper.znode.parent = /hbase
-Joe
On 8/9/2017 3:33 PM, HadoopMarc wrote:
Hi Gari and Joe,
Glad to see you testing the recipes for
MapR and Cloudera respectively! I am
sure that you realized by now that
getting this to work is like walking
through a minefield. If you deviate from
the known path, the odds for getting
through are dim, and no one wants to be
in your vicinity. So, if you see a need
to deviate (which there may be for the
hadoop distributions you use), you will
need your mine sweeper, that is, put the
logging level to DEBUG for relevant java
packages.
This is where you deviated:
- for Gari: you put all kinds of
MapR lib folders on the applications
master's classpath (other classpath
configs are not visible from your
post)
- for Joe: you put all kinds of
Cloudera lib folders on the
executors classpath (worst of all
the spark-assembly.jar)
Probably, you experience all kinds of
mismatches in netty libraries which
slows down or even kills all comms
between the yarn containers. The
philosophy of the recipes really is to
only add the minimum number of conf
folders and jars to the
Tinkerpop/Janusgraph distribution and
see from there if any libraries are
missing.
At my side, it has become apparent
that I should at least add to the
recipes:
- proof of work for a medium-sized
graph (say 10M vertices and edges)
- configs for the number of
executors present in the OLAP job
(instead of relying on spark default
number of 2)
So, still some work to do!
Cheers, Marc
|
|
Re: New committers: Robert Dale, Paul Kendall, Samant Maharaj
Welcome aboard Robert, Paul, and Samant! Thanks for the excellent contributions.
On Tuesday, August 22, 2017 at 6:32:27 AM UTC-5, Jason Plurad wrote: On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome 3 new committers on the project! Here they are in alphabetical order by last name.
Robert Dale: Robert has been a solid contributor, and his contributions are across the board -- triaging issues, submitting/reviewing pull requests, and answering questions on the Google groups. He's also on the Apache TinkerPop PMC. Paul Kendall and Samant Maharaj: Paul and Samant contributed the CQL storage adapter. This is a pretty big achievement and
helps steer JanusGraph towards future compatibility with Cassandra 4.0. They are continuing work on cleaning up the Cassandra source code tree that will help make testing it easier and better.
|
|
New committers: Robert Dale, Paul Kendall, Samant Maharaj
Jason Plurad <plu...@...>
On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome 3 new committers on the project! Here they are in alphabetical order by last name.
Robert Dale: Robert has been a solid contributor, and his contributions are across the board -- triaging issues, submitting/reviewing pull requests, and answering questions on the Google groups. He's also on the Apache TinkerPop PMC. Paul Kendall and Samant Maharaj: Paul and Samant contributed the CQL storage adapter. This is a pretty big achievement and
helps steer JanusGraph towards future compatibility with Cassandra 4.0. They are continuing work on cleaning up the Cassandra source code tree that will help make testing it easier and better.
|
|
Takao Magoori <ma...@...>
Hi Marc,
I finally understood what you mean. It would be theoretically possible, thanks! It feels difficult for me, since I am not familiar with Scala/Java, though I will try it. But it would still be nice if someone had a Spark connector that can be used from Python :(
Takao Magoori
On Saturday, August 19, 2017 at 4:14:57 AM UTC+9, HadoopMarc wrote:
Hi Takao, JanusGraph reads data from distributed backends into Hadoop using its HBaseInputFormat and CassandraInputFormat classes (which are descendants of org.apache.hadoop.mapreduce.InputFormat). Therefore, it seems possible to directly access graphs in these backends from Spark using sc.newAPIHadoopRDD. AFAIK, this particular use of the inputformats is nowhere documented or demonstrated, though. My earlier answer effectively came down to storing the graph to HDFS using the OutputRDD class for the gremlin.hadoop.graphWriter property and Spark serialization (my earlier suggestion of persisting the graphRDD using PersistedOutputRDD would not work for you because Python and gremlin-server would not share the same SparkContext). This may or may not be easier or more efficient than writing your own CSV input/output routines (in combination with the BulkDumperVertexProgram to parallelize the writing). Hope this helps, Marc
On Friday, August 18, 2017 at 4:19:33 AM UTC+2, Takao Magoori wrote: Hi Marc,
Thank you! But I don't understand what you mean, sorry. I feel SparkGraphComputer is "OLAP by gremlin on top of spark distributed power". But I want "OLAP by spark using janusGraph data".
So, I want to run "spark-submit", create pyspark sparkContext, load JanusGraph data into DataFrame. Then, I can use spark Dataframe, spark ML and python machine-learning packages. If there is no such solution, I guess I have to "dump whole graph into csv and read it from pyspark".
-------- spark_session = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
df_user = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).valueMap("user_id", "name" "age")', ).load().dropna().join( other=some_df, )
df_item = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).out("buy").hasLabel("item").valueMap("item_id", "name")', ).load().dropna()
df_sale = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).outE("buy").valueMap("timestamp")', ).load().select( col('item_id'), col('name'), ).dropna() --------
On Friday, August 18, 2017 at 4:08:02 AM UTC+9, HadoopMarc wrote: Hi Takao, Only some directions. If you combine http://yaaics.blogspot.nl/ (using CassandraInputFormat in your case) with http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark it should be possible to access the PersistedInputRDD alias graphRDD from the Spark object. Never done this myself, I would be interested to read if this works! Probably you will need to run an OLAP query with SparkGraphComputer anyway (e.g. g.V()) to have the PersistedInputRDD realized (RDDs are not realized until a Spark action is run on them). Cheers, Marc
On Thursday, August 17, 2017 at 4:25:42 PM UTC+2, Takao Magoori wrote: I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items and "WHEN, WHERE, WHO bought WHAT?" relations. To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!
So now I have to compute recommendations and forecast item sales. For data cleaning, data normalization, recommendation and forecasting, and because the graph is a little big, I want to use higher-level pyspark tools (e.g. DataFrame, ML) and Python machine learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used by pyspark to load data from JanusGraph, not SparkGraphComputer.
Could someone please tell me how to do it?
- Additional info: it seems OrientDB has some Spark connectors (though I don't know whether these can be used by pyspark). But I want one for JanusGraph.
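Building on Marc's suggestion above, dumping the whole graph to HDFS for pyspark to pick up might look roughly like this from the Gremlin console. Only a sketch: the properties file name is hypothetical, and it would set gremlin.hadoop.graphReader to JanusGraph's Cassandra input format and gremlin.hadoop.graphWriter to GryoOutputFormat.
// read the graph through the Cassandra input format and write it to HDFS as Gryo,
// which a separate pyspark job can then consume or convert
graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
graph.compute(SparkGraphComputer).program(BulkDumperVertexProgram.build().create(graph)).submit().get()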
|
|
Re: How can I load the GraphSON(JSON) to JanusGraph and how about update,delete vertices and edges?
Hi guys, did you figure out how to update and delete vertices and edges?
On Tuesday, August 8, 2017 at 12:01:35 AM UTC+8, hu junjie wrote:
I used two methods to import it; both failed:
gremlin> graph.io(graphson()).readGraph("import/test.json")
gremlin> graph.io(IoCore.graphson()).readGraph("import/test.json")
But the example GraphSON I can import:
gremlin> graph.io(graphson()).readGraph("data/tinkerpop-modern.json")
Another issue is how to update and delete vertices and edges.
Below is the failing GraphSON file example. This is the reference: https://github.com/tinkerpop/blueprints/wiki/GraphSON-Reader-and-Writer-Library
{ "graph": { "mode":"NORMAL", "vertices": [ { "name": "lop", "lang": "java", "_id": "3", "_type": "vertex" }, { "name": "vadas", "age": 27, "_id": "2", "_type": "vertex" }, { "name": "marko", "age": 29, "_id": "1", "_type": "vertex" }, { "name": "peter", "age": 35, "_id": "6", "_type": "vertex" }, { "name": "ripple", "lang": "java", "_id": "5", "_type": "vertex" }, { "name": "josh", "age": 32, "_id": "4", "_type": "vertex" } ], "edges": [ { "weight": 1, "_id": "10", "_type": "edge", "_outV": "4", "_inV": "5", "_label": "created" }, { "weight": 0.5, "_id": "7", "_type": "edge", "_outV": "1", "_inV": "2", "_label": "knows" }, { "weight": 0.4000000059604645, "_id": "9", "_type": "edge", "_outV": "1", "_inV": "3", "_label": "created" }, { "weight": 1, "_id": "8", "_type": "edge", "_outV": "1", "_inV": "4", "_label": "knows" }, { "weight": 0.4000000059604645, "_id": "11", "_type": "edge", "_outV": "4", "_inV": "3", "_label": "created" }, { "weight": 0.20000000298023224, "_id": "12", "_type": "edge", "_outV": "6", "_inV": "3", "_label": "created" } ] } }
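On the update/delete part of the question, which is not answered elsewhere in the thread: in TinkerPop 3 both are plain traversals. A minimal sketch (the property keys and values are made up):
g.V().has('name', 'marko').property('age', 30).iterate()   // update (overwrite) a property
g.V().has('name', 'marko').outE('knows').drop().iterate()  // delete matching edges
g.V().has('name', 'vadas').drop().iterate()                // delete a vertex and its incident edges
g.tx().commit()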
|
|
Re: How can I load the GraphSON(JSON) to JanusGraph and how about update,delete vertices and edges?
Hello Robert! I read the document at the link you sent. I want to know about the issue of updating and deleting vertices and edges. Where is the documentation for that? I want to use these operations from my Python application. Thanks a lot!
On Tuesday, August 8, 2017 at 12:44:12 AM UTC+8, Robert Dale wrote:
On Mon, Aug 7, 2017 at 1:17 AM, hu junjie <h...@...> wrote: I used 2 methods to import it all are failed. gremlin> graph.io(graphson()).readGraph("import/test.json") graph.io(IoCore.graphson()).readGraph("import/test.json"); But for the example graphson I can import it.
gremlin> graph.io(graphson()).readGraph("data/tinkerpop-modern.json") Another issue is about update and delete vertices and edges?
Below is the failed GraphSON file example: This is the reference : https://github.com/tinkerpop/blueprints/wiki/GraphSON-Reader-and-Writer-Library{ "graph": { "mode":"NORMAL", "vertices": [ { "name": "lop", "lang": "java", "_id": "3", "_type": "vertex" }, { "name": "vadas", "age": 27, "_id": "2", "_type": "vertex" }, { "name": "marko", "age": 29, "_id": "1", "_type": "vertex" }, { "name": "peter", "age": 35, "_id": "6", "_type": "vertex" }, { "name": "ripple", "lang": "java", "_id": "5", "_type": "vertex" }, { "name": "josh", "age": 32, "_id": "4", "_type": "vertex" } ], "edges": [ { "weight": 1, "_id": "10", "_type": "edge", "_outV": "4", "_inV": "5", "_label": "created" }, { "weight": 0.5, "_id": "7", "_type": "edge", "_outV": "1", "_inV": "2", "_label": "knows" }, { "weight": 0.4000000059604645, "_id": "9", "_type": "edge", "_outV": "1", "_inV": "3", "_label": "created" }, { "weight": 1, "_id": "8", "_type": "edge", "_outV": "1", "_inV": "4", "_label": "knows" }, { "weight": 0.4000000059604645, "_id": "11", "_type": "edge", "_outV": "4", "_inV": "3", "_label": "created" }, { "weight": 0.20000000298023224, "_id": "12", "_type": "edge", "_outV": "6", "_inV": "3", "_label": "created" } ] } }
|
|
Re: [BLOG] Configuring JanusGraph for spark-yarn
Hey Joseph, did your test succeed? Can you share your experience with me? Thanks.
On Tuesday, August 15, 2017 at 6:17:12 AM UTC+8, Joseph Obernberger wrote:
Marc - thank you for this. I'm going to try getting the latest
version of JanusGraph, and compiling it with our specific version
of Cloudera CDH, then run some tests. Will report back.
-Joe
On 8/13/2017 4:07 PM, HadoopMarc wrote:
Hi Joe,
To shed some more light on the running figures you presented, I
ran some tests on my own cluster:
1. I loaded the default janusgraph-hbase table with the
following simple script from the console:
graph=JanusGraphFactory.open("conf/janusgraph-hbase.properties")
g = graph.traversal()
m = 1200L
n = 10000L
(0L..<m).each{
(0L..<n).each{
v1 = g.addV().id().next()
v2 = g.addV().id().next()
g.V(v1).addE('link1').to(g.V(v2)).next()
g.V(v1).addE('link2').to(g.V(v2)).next()
}
g.tx().commit()
}
This script runs for about 20(?) minutes and results in 24M vertices
and edges committed to the graph.
2. I did an OLTP g.V().count() on this graph from the console:
11 minutes first time, 10 minutes second time
3. I ran OLAP jobs on this graph using janusgraph-hbase in two
ways:
a) with g =
graph.traversal().withComputer(SparkGraphComputer)
b) with g =
graph.traversal(). withComputer(new
Computer().graphComputer( SparkGraphComputer).workers( 10))
the properties file was as in the recipe, with the exception of:
spark.executor.memory=4096m # smaller values might
work, but the 512m from the recipe is definitely too small
spark.executor.instances=4
#spark.executor.cores not set, so default value 1
This resulted in the following running times:
a) stage 0,1,2 => 12min, 12min, 3s => 24min total
b) stage 0,1,2 => 18min, 1min, 86ms => 19 min total
Discussion:
- HBase is not an easy source for OLAP: HBase wants large
regions for efficiency (configurable, but typically 2-20GB),
while mapreduce inputformats (like janusgraph's
HBaseInputFormat) take regions as inputsplits by default.
This means that only a few executors will read from HBase
unless the HBaseInputFormat is extended to split a region's
keyspace into multiple inputsplits. This mismatch between
the numbers of regions and spark executors is a potential
JanusGraph issue. Examples exist to improve on this, e.g.
org.apache.hadoop.hbase.mapreduce.RowCounter
- For spark stages after stage 0 (reading from HBase),
increasing the number of spark tasks with the "workers()"
setting helps optimizing the parallelization. This means
that for larger traversals than just a vertex count, the
parallelization with spark will really pay off.
- I did not try to repeat your settings with a large number
of cores. Various sources discourage the use of
spark.executor.cores values larger than 5, e.g.
https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/,
https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory
Hopefully, these tests provide you and other readers with some
additional perspectives on the configuration of
janusgraph-hbase.
Cheers, Marc
Op donderdag 10 augustus 2017 15:40:21 UTC+2 schreef Joseph
Obernberger:
Thank you Marc.
I did not set spark.executor.instances, but I do have
spark.cores.max set to 64 and within YARN, it is
configured to allow has much RAM/cores for our 5 server
cluster. When I run a job on a table that has 61 regions,
I see that 43 tasks are started and running on all 5 nodes
in the Spark UI (and running top on each of the servers).
If I lower the amount of RAM (heap) that each tasks has
(currently set to 10G), they fail with OutOfMemory
exceptions. It still hits one HBase node very hard and
cycles through them. While that may be a reason for a
performance issue, it doesn't explain the massive number
of calls that HBase receives for a count job, and why
using SparkGraphComputer takes so much more time.
Running with your command below appears to not alter the
behavior. I did run a job last night with DEBUG turned
on, but it produced too much logging filling up the log
directory on 3 of the 5 nodes before stopping.
Thanks again Marc!
-Joe
On 8/10/2017 7:33 AM, HadoopMarc wrote:
Hi Joe,
Another thing to try (only tested on Tinkerpop, not on
JanusGraph): create the traversalsource as follows:
g = graph.traversal(). withComputer(new
Computer().graphComputer( SparkGraphComputer).workers( 100))
With HadoopGraph this helps hdfs files with very large
or no partitions to be split across tasks; I did not
check the effect yet for HBaseInputFormat in JanusGraph.
And did you add spark.executor.instances=10 (or some
suitable number) to your config? And did you check in
the RM ui or Spark history server whether these
executors were really allocated and started?
More later,
Marc
Op donderdag 10 augustus 2017 00:13:09 UTC+2 schreef
Joseph Obernberger:
Marc - thank you. I've updated the classpath and
removed nearly all of the CDH jars; had to keep
chimera and some of the HBase libs in there.
Apart from those and all the jars in lib.zip, it
is working as it did before. The reason I turned
DEBUG off was because it was producing 100+GBytes
of logs. Nearly all of which are things like:
18:04:29 DEBUG
org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore
- Generated HBase Filter ColumnRangeFilter
[\x10\xC0, \x10\xC1)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx
- Guava vertex cache size: requested=20000
effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created vertex cache with max size 20000
18:04:29 DEBUG org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore
- Generated HBase Filter ColumnRangeFilter
[\x10\xC2, \x10\xC3)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.StandardJanusGraphTx
- Guava vertex cache size: requested=20000
effective=20000 (min=100)
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created dirty vertex map with initial size 32
18:04:29 DEBUG org.janusgraph.graphdb.transaction.vertexcache.GuavaVertexCache
- Created vertex cache with max size 20000
Do those mean anything to you? I've turned it back
on for running with smaller graph sizes, but so far
I don't see anything helpful there apart from an
exception about not setting HADOOP_HOME.
Here are the spark properties; notice the nice and
small extraClassPath! :)
gremlin.graph = org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.deriveMemory = false
gremlin.hadoop.graphReader = org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter = org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.graphWriter.hasEdges = false
gremlin.hadoop.inputLocation = none
gremlin.hadoop.jarsInDistributedCache = true
gremlin.hadoop.memoryOutputFormat = org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation = output
janusgraphmr.ioformat.conf.storage.backend = hbase
janusgraphmr.ioformat.conf.storage.hbase.region-count = 5
janusgraphmr.ioformat.conf.storage.hbase.regions-per-server = 5
janusgraphmr.ioformat.conf.storage.hbase.short-cf-names = false
janusgraphmr.ioformat.conf.storage.hbase.table = TEST0.2.0
janusgraphmr.ioformat.conf.storage.hostname = 10.22.5.65:2181
log4j.appender.STDOUT = org.apache.log4j.ConsoleAppender
log4j.logger.deng = WARNING
log4j.rootLogger = STDOUT
org.slf4j.simpleLogger.defaultLogLevel = warn
spark.akka.frameSize = 1024
spark.app.id = application_1502118729859_0041
spark.app.name = Apache TinkerPop's Spark-Gremlin
spark.authenticate = false
spark.cores.max = 64
spark.driver.appUIAddress = http://10.22.5.61:4040
spark.driver.extraJavaOptons = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
spark.driver.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.driver.host = 10.22.5.61
spark.driver.port = 38529
spark.dynamicAllocation.enabled = true
spark.dynamicAllocation.executorIdleTimeout = 60
spark.dynamicAllocation.minExecutors = 0
spark.dynamicAllocation.schedulerBacklogTimeout = 1
spark.eventLog.dir = hdfs://host001:8020/user/spark/applicationHistory
spark.eventLog.enabled = true
spark.executor.extraClassPath = /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar:./lib.zip/*:/opt/cloudera/parcels/CDH/lib/hbase/bin/../lib/*:/etc/hbase/conf:
spark.executor.extraJavaOptions = -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m -Dlogback.configurationFile=logback.xml
spark.executor.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.executor.heartbeatInterval = 100000
spark.executor.id = driver
spark.executor.memory = 10240m
spark.externalBlockStore.folderName = spark-27dac3f3-dfbc-4f32-b52d-ececdbcae0db
spark.kyroserializer.buffer.max = 1600m
spark.master = yarn-client
spark.network.timeout = 90000
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS = host005
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES = http://host005:8088/proxy/application_1502118729859_0041
spark.scheduler.mode = FIFO
spark.serializer = org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled = true
spark.shuffle.service.port = 7337
spark.ui.filters = org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.ui.killEnabled = true
spark.yarn.am.extraLibraryPath = /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH = /etc/haddop/conf:/etc/hbase/conf:./lib.zip/*
spark.yarn.config.gatewayPath = /opt/cloudera/parcels
spark.yarn.config.replacementPath = {{HADOOP_COMMON_HOME}}/../../..
spark.yarn.dist.archives = /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/lib.zip
spark.yarn.dist.files = /home/graph/janusgraph-0.2.0-SNAPSHOT-hadoop2.JOE/conf/logback.xml
spark.yarn.dist.jars = /opt/cloudera/parcels/CDH/jars/janusgraph-hbase-0.2.0-SNAPSHOT.jar
spark.yarn.historyServer.address = http://host001:18088
zookeeper.znode.parent = /hbase
-Joe
On 8/9/2017 3:33 PM, HadoopMarc wrote:
Hi Gari and Joe,
Glad to see you testing the recipes for MapR and
Cloudera respectively! I am sure that you
realized by now that getting this to work is
like walking through a minefield. If you deviate
from the known path, the odds for getting
through are dim, and no one wants to be in your
vicinity. So, if you see a need to deviate
(which there may be for the hadoop distributions
you use), you will need your mine sweeper, that
is, put the logging level to DEBUG for relevant
java packages.
This is where you deviated:
- for Gari: you put all kinds of MapR lib
folders on the applications master's
classpath (other classpath configs are not
visible from your post)
- for Joe: you put all kinds of Cloudera lib
folders on the executors classpath (worst of
all the spark-assembly.jar)
Probably, you experience all kinds of
mismatches in netty libraries which slows down
or even kills all comms between the yarn
containers. The philosophy of the recipes
really is to only add the minimum number of
conf folders and jars to the
Tinkerpop/Janusgraph distribution and see from
there if any libraries are missing.
At my side, it has become apparent that I
should at least add to the recipes:
- proof of work for a medium-sized graph
(say 10M vertices and edges)
- configs for the number of executors
present in the OLAP job (instead of relying
on spark default number of 2)
So, still some work to do!
Cheers, Marc
|
|
Re: What's wrong with this code? It throws NoSuchElementException when I try to add an Edge?
Jason Plurad <plu...@...>
Double check your usage of "propId" for the "b" vertex: Vertex creation: g.addV().property(String.format("propId", cols[5]), cols[3])
Traversal: V().has("propId", cols[3])
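Beyond the key mismatch, the NoSuchElementException itself is thrown by calling next() on a traversal that matched nothing, so guarding the edge insertion also helps. A sketch in console style, reusing the column indices from the original code:
a = g.V().has('poiId',  cols[0]).tryNext()
b = g.V().has('propId', cols[3]).tryNext()
if (a.isPresent() && b.isPresent()) {
    // create the edge from 'a' to 'b' only when both vertices were found
    g.V(a.get().id()).as('a').V(b.get().id()).addE(cols[6]).from('a').next()
}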
On Sunday, August 20, 2017 at 2:41:59 PM UTC-4, 刑天 wrote:
package com.sankuai.kg;
import java.io.File; import java.util.Iterator;
import org.apache.commons.io.FileUtils; import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal; import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource; import org.apache.tinkerpop.gremlin.structure.Graph; import org.apache.tinkerpop.gremlin.structure.Transaction; import org.apache.tinkerpop.gremlin.structure.Vertex; import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph;
public class Loader {
public static void main(String[] args) throws Exception { Graph graph = EmptyGraph.instance(); GraphTraversalSource g = graph.traversal().withRemote("remote-graph.properties"); Iterator<String> lineIt = FileUtils.lineIterator(new File(args[0])); while (lineIt.hasNext()) { String line = lineIt.next(); String[] cols = line.split(","); GraphTraversal<Vertex, Vertex> t1 = g.V().has("poiId", cols[0]); GraphTraversal<Vertex, Vertex> t2 = g.V().has("poiId", cols[3]);
if (!t1.hasNext()) g.addV().property("poiId", cols[0]).property("name", cols[1]).property("type", cols[2]).next(); if (!t2.hasNext()) g.addV().property(String.format("propId", cols[5]), cols[3]).property("name", cols[4]) .property("type", cols[5]).next(); g.V().has("poiId", cols[0]).as("a").V().has("propId", cols[3]).as("b").addE(cols[6]) .from("a").to("b").next(); }
g.close(); }
}
|
|
What's wrong with this code? It throws NoSuchElementException when I try to add an Edge?
package com.sankuai.kg;

import java.io.File;
import java.util.Iterator;

import org.apache.commons.io.FileUtils;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Transaction;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.structure.util.empty.EmptyGraph;

public class Loader {

    public static void main(String[] args) throws Exception {
        Graph graph = EmptyGraph.instance();
        GraphTraversalSource g = graph.traversal().withRemote("remote-graph.properties");
        Iterator<String> lineIt = FileUtils.lineIterator(new File(args[0]));
        while (lineIt.hasNext()) {
            String line = lineIt.next();
            String[] cols = line.split(",");
            GraphTraversal<Vertex, Vertex> t1 = g.V().has("poiId", cols[0]);
            GraphTraversal<Vertex, Vertex> t2 = g.V().has("poiId", cols[3]);

            if (!t1.hasNext())
                g.addV().property("poiId", cols[0]).property("name", cols[1]).property("type", cols[2]).next();
            if (!t2.hasNext())
                g.addV().property(String.format("propId", cols[5]), cols[3]).property("name", cols[4])
                        .property("type", cols[5]).next();
            g.V().has("poiId", cols[0]).as("a").V().has("propId", cols[3]).as("b").addE(cols[6])
                    .from("a").to("b").next();
        }

        g.close();
    }
}
|
|
Re: Performance issues on a laptop.
I've just managed to connect to my standalone Cassandra 3.11 and can work from the Gremlin shell with no visible hit on the CPUs, so I'm happy with that. In the past, I think I had missed the notice about running "nodetool enablethrift" and would get an error regarding AstyanaxStoreManager.
This is the setup I need going forward anyway, as I plan to connect to Gremlin Server over WebSocket from microservices.
I still have no idea why running janusgraph.sh causes its Cassandra to go berserk.
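For anyone reproducing this, pointing the console at the already-running standalone Cassandra (instead of the one janusgraph.sh forks) can be done roughly like this. A sketch assuming Cassandra on localhost with thrift enabled:
graph = JanusGraphFactory.build().
    set('storage.backend', 'cassandrathrift').
    set('storage.hostname', '127.0.0.1').
    open()
g = graph.traversal()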
On Friday, 18 August 2017 18:56:56 UTC+1, Robert Dale wrote: Maybe search to see if there's a known issue with running Cassandra on a MacBook. You could upgrade Cassandra if that's what you need. I believe JanusGraph is known to work with all current versions.
On Fri, Aug 18, 2017 at 12:06 PM, 'Ray Scott' via JanusGraph users list <janusgra...@googlegroups.com> wrote: As soon as I killed the Cassandra process, the CPU usage plummeted. So at least I know who the culprit was. I'll just start Gremlin Server directly (configured to use BDB), instead of using janusgraph.sh.
On Thursday, 17 August 2017 19:22:51 UTC+1, Robert Dale wrote: For how long does the cpu remain high after you get the `gremlin>` prompt?
On Thu, Aug 17, 2017 at 2:20 PM, 'Ray Scott' via JanusGraph users list <janu...@...> wrote: Hi,
I'm trying to get JanusGraph running on my laptop (MacBook Air 2 Core Intel i7, 8GB) so that I can develop a small working prototype.
I've gone for the janusgraph.sh method of starting the server and everything is fine until I open a gremlin shell. Then I see the CPU usage for my terminal rocket up past 350%. Once I close the gremlin shell, the CPU usage remains at the same high level, indefinitely. I've tried launching with the Berkeley DB option and no Elastic Search, but I get the exact same behaviour.
Is there something I can do to stop this from using most of my CPU, or do I just have to live with it?
Is there a "lite" recommended setup that someone has had success with in the past?
Thanks.
|
|
Hi Takao, JanusGraph reads data from distributed backends into Hadoop using its HBaseInputFormat and CassandraInputFormat classes (which are descendants of org.apache.hadoop.mapreduce.InputFormat). Therefore, it seems possible to directly access graphs in these backends from Spark using sc.newAPIHadoopRDD. AFAIK, this particular use of the inputformats is nowhere documented or demonstrated, though. My earlier answer effectively came down to storing the graph to HDFS using the OutputRDD class for the gremlin.hadoop.graphWriter property and Spark serialization (my earlier suggestion of persisting the graphRDD using PersistedOutputRDD would not work for you because Python and gremlin-server would not share the same SparkContext). This may or may not be easier or more efficient than writing your own CSV input/output routines (in combination with the BulkDumperVertexProgram to parallelize the writing). Hope this helps, Marc
On Friday, August 18, 2017 at 4:19:33 AM UTC+2, Takao Magoori wrote:
Hi Marc,
Thank you! But I don't understand what you mean, sorry. I feel SparkGraphComputer is "OLAP by gremlin on top of spark distributed power". But I want "OLAP by spark using janusGraph data".
So, I want to run "spark-submit", create pyspark sparkContext, load JanusGraph data into DataFrame. Then, I can use spark Dataframe, spark ML and python machine-learning packages. If there is no such solution, I guess I have to "dump whole graph into csv and read it from pyspark".
-------- spark_session = SparkSession.builder.appName('test').enableHiveSupport().getOrCreate()
df_user = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).valueMap("user_id", "name" "age")', ).load().dropna().join( other=some_df, )
df_item = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).out("buy").hasLabel("item").valueMap("item_id", "name")', ).load().dropna()
df_sale = spark_session.read.format( 'org.apache.janusgraph.some_spark_gremlin_connector', ).options( url='url', query='g.V().hasLabel("user").has("age", gt(29)).outE("buy").valueMap("timestamp")', ).load().select( col('item_id'), col('name'), ).dropna() --------
On Friday, August 18, 2017 at 4:08:02 AM UTC+9, HadoopMarc wrote: Hi Takao, Only some directions. If you combine http://yaaics.blogspot.nl/ (using CassandraInputFormat in your case) with http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark it should be possible to access the PersistedInputRDD alias graphRDD from the Spark object. Never done this myself, I would be interested to read if this works! Probably you will need to run an OLAP query with SparkGraphComputer anyway (e.g. g.V()) to have the PersistedInputRDD realized (RDDs are not realized until a Spark action is run on them). Cheers, Marc
On Thursday, August 17, 2017 at 4:25:42 PM UTC+2, Takao Magoori wrote: I have a JanusGraph Server (github master, gremlin 3.2.5) on top of a Cassandra storage backend, to store users, items and "WHEN, WHERE, WHO bought WHAT?" relations. To get data from and modify data in the graph, I use the Python aiogremlin driver mode (== Groovy sessionless eval mode) and it works well for now. Thanks, developers!
So now I have to compute recommendations and forecast item sales. For data cleaning, data normalization, recommendation and forecasting, and because the graph is a little big, I want to use higher-level pyspark tools (e.g. DataFrame, ML) and Python machine learning packages (e.g. scikit-learn). But I cannot find a way to load graph data into Spark. What I want is a "connector" which can be used by pyspark to load data from JanusGraph, not SparkGraphComputer.
Could someone please tell me how to do it?
- Additional info: it seems OrientDB has some Spark connectors (though I don't know whether these can be used by pyspark). But I want one for JanusGraph.
|
|