Re: Calling a SparkGraphComputer from within Spark


HadoopMarc <m.c.d...@...>
 

Hi Rob,

It sounds like your battling skills are OK! I have never used the PersistedOutputRDD option myself, but if you are stuck you could also try the
org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat class together with an output location. This just writes the query output to HDFS and at least keeps you going.
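
Untested sketch of what I mean, from the Gremlin Console (the properties file and the 'output' location are just examples):

graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
graph.configuration().setProperty('gremlin.hadoop.graphWriter', 'org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat')
graph.configuration().setProperty('gremlin.hadoop.outputLocation', 'output')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()    // job output is written under 'output' on HDFS rather than collected on the driver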

Btw, I assumed you did not miss the
graph.configuration().setProperty('gremlin.spark.persistContext', true)
part of the reference section you linked to. Did you also try the PersistedOutputRDD option from the Gremlin Console, or only from your Scala program?
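
If not, something like this would show whether the RDD really survives the job (again untested; Spark.getRDDs() and Spark.getRDD() are from the org.apache.tinkerpop.gremlin.spark.structure.Spark helper in spark-gremlin, and 'output/~g' is my guess at the RDD name):

graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
graph.configuration().setProperty('gremlin.hadoop.graphWriter', 'org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD')
graph.configuration().setProperty('gremlin.hadoop.outputLocation', 'output')
graph.configuration().setProperty('gremlin.spark.persistContext', true)
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()
Spark.getRDDs()            // list the RDDs that are still persisted
Spark.getRDD('output/~g')  // the persisted graph RDD, if it is still around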

Cheers,    Marc

On Sunday, March 19, 2017 at 19:49:41 UTC+1, Rob Keevil wrote:

Last battle before I think this is all done: I need to extract the output without collecting the results to the driver and exploding the memory there.

Gremlin has a page at http://tinkerpop.apache.org/docs/current/reference/#interacting-with-spark on how to retrieve the result as a persisted RDD. However, their calculation uses a vertex program, which can name the step using memoryKey('clusterCount'), as sketched below. A regular traversal doesn't seem to have this option, and Spark logs that it removes the RDD after the traversal. Do you know of any way to access this RDD?
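
For reference, the docs' example looks roughly like this (paraphrased from that page; the properties file is just an example):

graph = GraphFactory.open('conf/hadoop/hadoop-gryo.properties')
graph.configuration().setProperty('gremlin.hadoop.graphWriter', 'org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD')
graph.configuration().setProperty('gremlin.spark.persistContext', true)
graph.compute(SparkGraphComputer).
      program(PeerPressureVertexProgram.build().create(graph)).
      mapReduce(ClusterCountMapReduce.build().memoryKey('clusterCount').create()).
      submit().get()
// 'clusterCount' names the side-effect, so its persisted RDD can be located afterwards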

(I've set the required gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD property).
