Re: Calling a SparkGraphComputer from within Spark


"Jun(Terry) Yang" <terr...@...>
 

Hi Rob,

I went through the TinkerPop code, and only PageRankMapReduce, ClusterCountMapReduce, and ClusterPopulationMapReduce have a memoryKey function.
(I also found this note in the Spark docs: "We still recommend users call persist on the resulting RDD if they plan to reuse it." See http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence)
I'm not sure whether this is by design.

After running the PeerPressureVertexProgram sample (http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark) I saw 2 persisted RDDs, and the result of the sample is the integer 2:
gremlin> spark.ls()
==>output/clusterCount [Memory Deserialized 1x Replicated]
==>output/~g [Memory Deserialized 1x Replicated]
gremlin> spark.head('output', 'clusterCount', PersistedInputRDD)
==>2
 
Then I tried to read these RDDs with gremlin.hadoop.graphReader=PersistedInputRDD.class.getCanonicalName():
a) Reading "output/clusterCount" failed with an exception: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.tinkerpop.gremlin.hadoop.structure.io.VertexWritable
    This RDD holds the integer result, which cannot be loaded as a graph structure, so I guess it is meant to be accessed directly by a Spark program as a persisted RDD.
b) Reading "output/~g" succeeded:
gremlin> graph2 = GraphFactory.open('conf/hadoop-graph/hadoop-gryo.properties')
==>hadoopgraph[gryoinputformat->gryooutputformat]
gremlin> graph2.configuration().setProperty('gremlin.hadoop.graphReader', PersistedInputRDD.class.getCanonicalName())
==>null
gremlin> graph2.configuration().setProperty('gremlin.hadoop.inputLocation', 'output/~g')
==>null
gremlin> 
gremlin> g2.V().valueMap()
==>[gremlin.peerPressureVertexProgram.cluster:[1], name:[josh], age:[32]]
==>[gremlin.peerPressureVertexProgram.cluster:[1], name:[marko], age:[29]]
==>[gremlin.peerPressureVertexProgram.cluster:[6], name:[peter], age:[35]]
==>[gremlin.peerPressureVertexProgram.cluster:[1], name:[lop], lang:[java]]
==>[gremlin.peerPressureVertexProgram.cluster:[1], name:[ripple], lang:[java]]
==>[gremlin.peerPressureVertexProgram.cluster:[1], name:[vadas], age:[27]]
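
The two setProperty calls above could probably also be written directly into the properties file. This is only a sketch, not a tested configuration: the fully qualified PersistedInputRDD class name is assumed to sit in the same package as the PersistedOutputRDD mentioned in Rob's message, and gremlin.spark.persistContext comes from the TinkerPop reference docs rather than from this session.

```properties
# Hypothetical overrides for conf/hadoop-graph/hadoop-gryo.properties
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# Read the graph from an RDD persisted in the running Spark context
# (package name assumed to match PersistedOutputRDD)
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD
# Name of the persisted graph RDD, as listed by spark.ls()
gremlin.hadoop.inputLocation=output/~g
# Keep the Spark context alive between jobs so persisted RDDs remain available
gremlin.spark.persistContext=true
```

Only "output/~g" is usable as an inputLocation here; pointing it at "output/clusterCount" would hit the ClassCastException from a) because that RDD does not contain vertices.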


Hope this will help you~


Thanks!
Terry



On Monday, March 20, 2017 at 2:49:41 AM UTC+8, Rob Keevil wrote:
Last battle before I think this is all done: I need to extract the output without collecting the results to the driver and exploding the memory there.

Gremlin has a page at http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark on how to retrieve the result as a persisted RDD. However, their calculation uses a vertex program, which can name the step using memoryKey('clusterCount'). A regular traversal doesn't seem to have this option, and Spark logs that it removes the RDD after the traversal. Do you know of any way to access this RDD?

(I've set the required gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD property).
