Joe Obernberger <joseph.o...@...>
Hi All - I've been experimenting with SparkGraphComputer, and have it working, but I'm having performance issues. What is the best way to run PageRank against a very large graph stored inside of JanusGraph?
Thank you!
-Joe
Hi Joe,
This question reminds me of an earlier discussion we had on the performance of OLAP traversals for janusgraph-hbase. My conclusion there was that janusgraph-hbase needs a better HBaseInputFormat that delivers more partitions than one partition per HBase region. I guess PageRank suffers from that in the same way. Do you maybe have the option to use Cassandra, which has a configurable cassandra.input.split.size? I did not try this myself.
HTH, Marc
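For reference, the Cassandra knob Marc mentions lives in the Hadoop configuration passed to the HadoopGraph. A minimal, illustrative fragment (the value is a row count per split, commonly defaulting to 64k rows in Cassandra's Hadoop support, so a lower value yields more partitions):

```properties
# lower split size => more input splits => more Spark partitions
# (value is rows per split; 64k rows is the usual default)
cassandra.input.split.size=16384
```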
Joe Obernberger <joseph.o...@...>
It reminds me of that one too! At present, I'm locked in with HBase, so I can't make the switch to Cassandra very easily. I did try:
result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
It took a little over 8 hours to run, but did complete once I adjusted the hbase.client.scanner.timeout.period to something very long. Interestingly, I had to modify that in the included jar file, not in the file in /etc/hbase/conf.
Would really like to get this time to run way down, but not sure what other method to try.
-Joe
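For reference, the scanner timeout Joe raises is normally configured in hbase-site.xml; which copy of that file actually wins depends on what is first on the client classpath, which is the wrinkle he hit with the bundled jar. An illustrative fragment (the value here is arbitrary):

```xml
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <!-- milliseconds; the stock default is 60000 (one minute) -->
  <value>600000</value>
</property>
```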
--
You received this message because you are subscribed to the Google
Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to janusgra...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/1bf6c7c5-84b6-483e-982c-c299fca3e8ef%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi Joe,
Maybe a suggestion after all. I believe you ran the PageRankVertexProgram directly on the JanusGraph instance, but it should also be possible to run it on a HadoopGraph with compute(SparkGraphComputer) via JanusGraph's HBaseInputFormat. That would at least parallelize the table scan to the number of HBase regions. In my previous answer I assumed you did that!
Cheers, Marc
Joe Obernberger <joseph.o...@...>
Thank you Marc. I assume this would be Java code that would be executed via spark-submit?
-Joe
Hi Joe,
No, not exactly, because the TinkerPop recipe points at spark-submit as the source of most of the version conflicts. spark-submit is just a big wrapper around the Spark launch API that sets the environment, but it does not do so in an application-friendly way. I would first try from the Gremlin Console, for which the recipe was written. Doing the OLAP PageRank in a Java project without spark-submit will require some effort to get the classpath right.
HTH, Marc
Joe Obernberger <joseph.o...@...>
Hi Marc - not sure I understand. I tried this:
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[10.22.5.63:2181, 10.22.5.64:2181, 10.22.5.65:2181]], standard]
gremlin> result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
Is that what you mean? That does not work on very large graphs. Even on a small graph (about 9 million nodes), it took 8 hours to complete, and uses only one machine to do the work. I'm looking for methods to calculate values on very large graphs. Any ideas?
Thank you!
-Joe
Hi Joe,
My thoughts were more like:
graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
result = graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
along the lines of "Exporting with BulkDumperVertexProgram" in http://tinkerpop.apache.org/docs/3.2.3/reference/#sparkgraphcomputer
I am curious whether it works!
Marc
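For reference, a minimal sketch of what such a read-hbase-spark-yarn.properties could contain. Key names follow the TinkerPop and JanusGraph conventions of that era; the hostname is the ZooKeeper address from earlier in the thread, and the table name and memory settings are illustrative:

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# JanusGraph storage settings, passed through to the input format
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.63
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
# Spark on YARN
spark.master=yarn-client
spark.executor.memory=10g
spark.serializer=org.apache.spark.serializer.KryoSerializer
```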
Joe Obernberger <joseph.o...@...>
Thank you Marc. That runs on my cluster, but takes a very long time. If I try it on a larger graph, the YARN jobs run out of heap. Right now I'm giving them 10G each.
On a small graph, I can run it OK, and I can run the BulkDumperVertexProgram as well. What I can't do, when I run with SparkGraphComputer, is look at the results.
After running:
result = graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
I can do a result.memory().runtime, which returns a number (in my case 609821).
I then do:
g = result.graph().traversal(computer(SparkGraphComputer))
Unfortunately, any command with g gives the same error - for example, g.V().valueMap() returns:
java.io.IOException: No input paths specified in job
Since this is a small graph, if I run it without SparkGraphComputer, those commands on g work fine, such as:
g.V(id).valueMap('gremlin.pageRankVertexProgram.pageRank')
Trying to find any method to run PageRank on a very large graph that is stored in JanusGraph. Thanks! Anything you would like me to try?
-Joe
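For reference, traversing result.graph() after an OLAP job reads the computed graph back from the HadoopGraph's output location, so the "No input paths specified" error usually points at the writer side of the configuration. A sketch of the relevant TinkerPop properties (key names from the TinkerPop 3.2 reference; whether this resolves this particular setup is untested):

```properties
# write the PageRank-annotated graph somewhere result.graph() can read it
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation=output
# keep the Spark context (and its persisted RDDs) alive between jobs
gremlin.spark.persistContext=true
```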
Hi Joe,
Thanks for reporting back. So it indeed seems the same problem as for OLAP traversals: input splits of HBaseInputFormat have the size of a complete region, which is a bit too much for SparkGraphComputer. I think it should be fairly easy to adapt JanusGraph's HBaseInputFormat a bit, such that the splits coming from the parent HBase TableInputFormat are split into smaller parts, say smaller than some configurable janusgraph.hbase.mapreduce.maxinputsplitsize=128M. All the necessary variables and methods are present in HBase's TableInputFormat. I plan to do it some time in the future, but please do not rely on it. If someone else wants to take up the work sooner, please create a ticket first so that others know.
Cheers, Marc
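Marc's idea, cutting each region-sized split into pieces no larger than a configurable maximum, can be sketched in isolation. This is an editorial illustration, not code from the thread: the class and method names are made up, and a real implementation would wrap the splits returned by TableInputFormat.getSplits() rather than operate on raw byte ranges as done here.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: given one HBase region's row-key range and its
// estimated size in bytes, subdivide it into contiguous key ranges so that
// each sub-split is at most maxSplitSize bytes (assuming keys are spread
// roughly uniformly over the range). Key interpolation is done with
// BigInteger over zero-padded big-endian byte arrays.
public class SplitSubdivider {

    // Returns a list of {startRow, stopRow} pairs covering [startRow, stopRow).
    public static List<byte[][]> subdivide(byte[] startRow, byte[] stopRow,
                                           long regionSize, long maxSplitSize) {
        // number of sub-splits needed: ceil(regionSize / maxSplitSize), at least 1
        int n = (int) Math.max(1, (regionSize + maxSplitSize - 1) / maxSplitSize);
        int width = Math.max(startRow.length, stopRow.length) + 1; // padding for precision
        BigInteger start = new BigInteger(1, pad(startRow, width));
        BigInteger range = new BigInteger(1, pad(stopRow, width)).subtract(start);
        List<byte[][]> splits = new ArrayList<>();
        byte[] prev = startRow;
        for (int i = 1; i <= n; i++) {
            // i-th boundary at start + range * i / n; last boundary is stopRow exactly
            byte[] boundary = (i == n) ? stopRow
                : toBytes(start.add(range.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(n))), width);
            splits.add(new byte[][]{prev, boundary});
            prev = boundary;
        }
        return splits;
    }

    // Left-align the key in a fixed-width array (zero-padded on the right),
    // matching HBase's lexicographic row-key ordering.
    private static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Convert an unsigned value back to a fixed-width big-endian byte array.
    private static byte[] toBytes(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may carry a leading sign byte or be short
        byte[] out = new byte[width];
        int src = Math.max(0, raw.length - width);
        int dst = Math.max(0, width - raw.length);
        System.arraycopy(raw, src, out, dst, raw.length - src);
        return out;
    }
}
```

With a 1 GB region over keys [0x00, 0x80) and a 256 MB cap, this yields four contiguous sub-splits, which is the kind of fan-out the thread is after: more Spark partitions than HBase regions.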
Op woensdag 27 september 2017 22:30:33 UTC+2 schreef Joseph Obernberger:
toggle quoted message
Show quoted text
Thank you Marc. That runs on my cluster, but takes a very long
time. If I try it on a larger graph, the YARN jobs run out of
heap. Right now I'm giving them 10G each.
On a small graph, I can run it OK, and I can run the
BulkDumperVertexProgram as well. What I can't do, when I run with
SparkGraphComputer, is look at the results.
After running:
result =
graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
I can do a result.memory().runtime, which returns a number (in my
case 609821).
I then do:
g = result.graph().traversal(computer(SparkGraphComputer))
Unfortunately, any command with g, gives the same error - for
example:
g.V().valueMap() returns:
java.io.IOException: No input paths specified in job
Since this is a small graph, if I run it without
SparkGraphComputer, those commands on g work fine, such as:
g.V(id).valueMap('gremlin.pageRankVertexProgram.pageRank')
Trying to find any method to run PageRank on a very large graph
that is stored in JanusGraph. Thanks! Anything you would like me
to try?
-Joe
On 9/27/2017 12:04 PM, HadoopMarc
wrote:
Hi Joe,
My thoughts were more like:
graph =
GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
result=graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
along the lines of "Exporting with
BulkDumperVertexProgram" in http://tinkerpop.apache.org/docs/3.2.3/reference/#sparkgraphcomputer
I am curious whether it works!
Marc
Op woensdag 27 september 2017 15:06:19 UTC+2 schreef Joseph
Obernberger:
Hi Marc - not sure I understand. I tried this:
gremlin>
g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[10.22.5.63:2181,
10.22.5.64:2181,
10.22.5.65:2181]], standard]
gremlin>
result=graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
Is that what you mean? That does not work on very large
graphs. Even on a small graph (about 9 million nodes), it
took 8 hours to complete, and uses only one machine to do
the work. I'm looking for methods to calculate values on
very large graphs. Any ideas?
Thank you!
-Joe
On 9/26/2017 3:40 PM, HadoopMarc wrote:
Hi Joe,
No, not exactly, because the TinkerPop recipe points at
spark-submit as the source of most of the version
conflicts. Spark-submit is just a big wrapper around the
Spark launch API that sets the environment but does not
do that in an application-friendly way. I would first
try from the gremlin console for which the recipe was
written. Doing the OLAP pagerank in a java project
without spark-submit will require some effort to get the
classpath right.
HTH, Marc
Op dinsdag 26 september 2017 00:46:26 UTC+2 schreef
Joseph Obernberger:
Thank you Marc. I assume this would be java code
that would be executed via spark-submit?
-Joe
On 9/25/2017 3:21 PM, HadoopMarc wrote:
Hi Joe,
Maybe a suggestion after all. I believe you ran
the PageRankVertexProgram directly on the
JanusGraph instance, but it should also be
possible to run it on a HadoopGraph with
compute(SparkGraphComputer) via JanusGraph's
HBaseInputFormat. That would at least
parallelize the table scan to the number of
HBase regions. In my previous answer I assumed
you did that!
Cheers, Marc
Op maandag 25 september 2017 17:24:55 UTC+2
schreef Joseph Obernberger:
It reminds me of that one too! At
present, I'm locked in with HBase, so I
can't make the switch to Cassandra very
easily. I did try:
result = graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
It took a little over 8 hours to run, but
did complete once I adjusted the
hbase.client.scanner.timeout.period
to something very long. Interestingly, I
had to modify that in the included jar
file, not in the file in /etc/hbase/conf.
Would really like to get this time to run
way down, but not sure what other method
to try.
-Joe
On 9/22/2017 1:05 PM, HadoopMarc wrote:
Hi Joe,
This question reminds me of an
earlier discussion we had on the
performance of OLAP traversals for
janusgraph-hbase. My conclusion there
was that janusgraph-hbase needs a better
HBaseInputFormat that delivers more
partitions than one partition per HBase
region. I guess PageRank suffers from
that in the same way. Do you maybe
have the option to use
Cassandra, which has a
configurable cassandra.input.split.size?
I did not try this myself.
HTH, Marc
On Friday, September 22, 2017 at 15:41:12 UTC+2,
Joseph Obernberger wrote:
Hi All -
I've been experimenting with
SparkGraphComputer, and have it
working, but I'm having performance
issues. What is the best way to run
PageRank against a very large graph
stored inside of JanusGraph?
Thank you!
-Joe
--
You received this message because you are
subscribed to the Google Groups
"JanusGraph users" group.
To unsubscribe from this group and stop
receiving emails from it, send an email to
janusgraph-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/1bf6c7c5-84b6-483e-982c-c299fca3e8ef%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
|
|
Joe Obernberger <joseph.o...@...>
Thank you Marc. This seems to suggest that if I split the HBase
table up into many more regions, that would correct the
performance issue with running PageRank.
Any idea why I can't execute any commands on the graph once the
SparkGraphComputer job completes? They all return
java.io.IOException: No input paths specified in job
Thanks again!
-Joe
On 9/28/2017 4:09 PM, HadoopMarc wrote:
Hi Joe,
Thanks for reporting back. So, it indeed seems the same problem
as for OLAP traversals: input splits of HBaseInputFormat have the
size of a complete region, which is a bit too much for
SparkGraphComputer. I think it should be fairly easy to adapt
JanusGraph->HBaseInputFormat a bit, such that the splits
coming from parent HBase->TableInputFormat are split in
smaller parts, let's say smaller than some configurable
janusgraph.hbase.mapreduce.maxinputsplitsize=128M. All the
necessary variables and methods are present in
HBase->TableInputFormat. I plan to do it some time in the
future, but please do not rely on it. If someone else wants to
take up the work sooner, please create a ticket first so that
others know.
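To illustrate the idea with a quick sketch (plain Python, hypothetical names, not the actual HBaseInputFormat code): each region-sized split would be cut into ceil(regionSize / maxSplitSize) pieces, so SparkGraphComputer gets proportionally more partitions.

```python
import math

# Hypothetical illustration of the proposed sub-splitting: each HBase region
# currently yields exactly one input split; dividing it by a configurable max
# split size would hand Spark more, smaller partitions. Names are made up.
def sub_split_counts(region_sizes_bytes, max_split_bytes=128 * 1024 * 1024):
    """Return how many input splits each region would yield."""
    return [max(1, math.ceil(size / max_split_bytes)) for size in region_sizes_bytes]

# A 1 GiB region becomes 8 splits of <= 128 MiB; a 50 MiB region stays 1 split.
print(sub_split_counts([1 << 30, 50 * 1024 * 1024]))  # [8, 1]
```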
Cheers, Marc
On Wednesday, September 27, 2017 at 22:30:33 UTC+2,
Joseph Obernberger wrote:
Thank you Marc. That runs on my cluster, but takes a
very long time. If I try it on a larger graph, the YARN
jobs run out of heap. Right now I'm giving them 10G each.
On a small graph, I can run it OK, and I can run the
BulkDumperVertexProgram as well. What I can't do, when I
run with SparkGraphComputer, is look at the results.
After running:
result = graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
I can do a result.memory().runtime, which returns a number
(in my case 609821).
I then do:
g = result.graph().traversal(computer(SparkGraphComputer))
Unfortunately, any command with g, gives the same error -
for example:
g.V().valueMap() returns:
java.io.IOException: No input paths specified in job
Since this is a small graph, if I run it without
SparkGraphComputer, those commands on g work fine, such
as:
g.V(id).valueMap('gremlin.pageRankVertexProgram.pageRank')
Trying to find any method to run PageRank on a very large
graph that is stored in JanusGraph. Thanks! Anything you
would like me to try?
-Joe
On 9/27/2017 12:04 PM, HadoopMarc wrote:
Hi Joe,
My thoughts were more like:
graph = GraphFactory.open('conf/hadoop-graph/read-hbase-spark-yarn.properties')
result=graph.compute(SparkGraphComputer).program(PageRankVertexProgram.build().create()).submit().get()
along the lines of "Exporting with
BulkDumperVertexProgram" in http://tinkerpop.apache.org/docs/3.2.3/reference/#sparkgraphcomputer
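A minimal sketch of what that properties file might contain (key names follow the JanusGraph/TinkerPop docs; the hosts, table name, and Spark settings here are placeholders you must adapt, so treat this as an untested assumption):

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
# JanusGraph storage backend the input format reads from (placeholder hosts)
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=10.22.5.63,10.22.5.64,10.22.5.65
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
# Spark on YARN
spark.master=yarn-client
spark.executor.memory=10g
spark.serializer=org.apache.spark.serializer.KryoSerializer
```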
I am curious whether it works!
Marc
On Wednesday, September 27, 2017 at 15:06:19 UTC+2,
Joseph Obernberger wrote:
Hi Marc - not sure I understand. I tried this:
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[hbase:[10.22.5.63:2181, 10.22.5.64:2181, 10.22.5.65:2181]], standard]
gremlin> result=graph.compute().program(PageRankVertexProgram.build().create()).submit().get()
Is that what you mean? That does not work on very
large graphs. Even on a small graph (about 9
million nodes), it took 8 hours to complete, and
uses only one machine to do the work. I'm looking
for methods to calculate values on very large
graphs. Any ideas?
Thank you!
-Joe
|
|
Hi Joe,
Regarding not finding the OLAP output, did you try this section of the TinkerPop ref docs?
Cheers, Marc
On Thursday, September 28, 2017 at 23:51:18 UTC+2, Joseph Obernberger wrote:
|
|
Joe Obernberger <joseph.o...@...>
Hi Marc,
Ah - I see the output in /user/username/output/~g. This appears
to be Gryo format. Thank you! Do you know of a way to update the
actual JanusGraph with a new PageRank property on each vertex
instead of writing out an entire graph to HDFS? Would that be a
modification of the PageRank code?
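(Side note for anyone following along: per the TinkerPop reference docs, that Gryo output should be loadable back into a HadoopGraph for further OLAP queries. A sketch of the properties that would do that, with reader class and paths assumed from TinkerPop conventions rather than verified here:)

```properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
# point the reader at the PageRank job's output directory
gremlin.hadoop.inputLocation=output/~g
gremlin.hadoop.outputLocation=output2
```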
What appears to work, increasing performance and reducing memory
requirements, is splitting the table up into many regions. I have
a graph that is about 24.4 million vertices, uses 7.8G of space in
HBase, and I've split it into 462 regions. I can run PageRank on
that graph in 44 minutes on a 5-server cluster with 128G of RAM in
each server. In this case, I gave each task 10G of RAM with a max
memory per node of 96G. I think what may work is to set the max
file size in HBase to something very small, like 16M, to force
splits with:
alter 't1', MAX_FILESIZE => '16777216'
Interestingly, I lowered the spark.executor.memory from 10G to 4G
and the process completed, but it took almost twice as long. I was
thinking that since it could then run more executors per node
(96G/4G = 24 instead of 96G/10G = 9), it would run faster. Running
more tests. Thanks again
for the help on this!
-Joe
|
|