Re: Create new node for each group of connected nodes


hadoopmarc@...
 

Hi Anjani,

Your use case comes down to an OLAP query. JanusGraph provides InputFormat classes for use with TinkerPop's SparkGraphComputer and HadoopGraph, but many users have experienced problems with them; see for example the latest thread:
https://lists.lfaidata.foundation/g/janusgraph-users/topic/issues_with_controlling/80107845?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,80107845

If you can get OLAP with SparkGraphComputer running on your graph with sufficient performance, an additional advantage is that you can apply TinkerPop's ConnectedComponentVertexProgram.
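For reference, the result of a connected-components computation is one component identifier per vertex. Below is a minimal in-memory sketch of that result in plain Python. It is not TinkerPop's implementation; the function name `component_ids` and the toy edge list are made up for illustration, and it uses a simple union-find that labels each component with its smallest vertex id.

```python
def component_ids(vertex_ids, edges):
    """Map every vertex id to the smallest vertex id in its component."""
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk to the root, halving the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller id as root, so a root is always its component's minimum.
            if ra < rb:
                parent[rb] = ra
            else:
                parent[ra] = rb

    return {v: find(v) for v in vertex_ids}


# Hypothetical toy graph: two groups plus one isolated vertex.
example = component_ids([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 4)])
```

This only works when the whole edge list fits in one process, which is exactly what an OLAP run on Spark avoids.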

A safer way to go, not depending on the JanusGraph InputFormats, would be:
  • Run an OLTP query that writes all vertex ids to a file. This may take days, but it also gives you a baseline for how long a full table scan takes and how much parallelism you need to get a reasonable running time. Be sure to iterate the traversal rather than keep all ids in memory.
  • Use the file of ids as input to a Spark job that runs, for each vertex, the Gremlin query that returns all connected vertices. If the starting vertex has the lowest id in its group, add the required new vertex and edges. This check ensures each group is handled only once; relying instead on detecting the new vertex is not safe on an eventually consistent backend. Each Spark executor can instantiate its own embedded JanusGraph instance, and the queries inside an executor are done in an OLTP way.
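The two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a working JanusGraph or Spark job: the `ADJACENCY` dict is a hypothetical stand-in for the graph, `neighbours` stands in for the Gremlin query run against an embedded JanusGraph instance, and all function names are invented for the example.

```python
import os
import tempfile
from collections import deque

# Hypothetical stand-in for the graph: vertex id -> neighbouring vertex ids.
ADJACENCY = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}


def dump_ids(id_iterator, path):
    """Step 1: write ids one per line while iterating, never holding them all."""
    with open(path, "w") as out:
        for vertex_id in id_iterator:
            out.write(f"{vertex_id}\n")


def connected_group(start, neighbours):
    """Breadth-first walk returning every vertex id reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours(current):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def groups_to_create(path, neighbours):
    """Step 2: yield a group only from its lowest-id member, so it is handled once."""
    with open(path) as ids:
        for line in ids:
            start = int(line)
            group = connected_group(start, neighbours)
            if start == min(group):
                # A real job would add the new vertex and its edges here.
                yield sorted(group)


with tempfile.TemporaryDirectory() as tmp:
    id_file = os.path.join(tmp, "vertex_ids.txt")
    dump_ids(iter(ADJACENCY), id_file)
    created = list(groups_to_create(id_file, ADJACENCY.__getitem__))
```

In a real run the lines of the id file would be partitioned over the Spark executors, and the lowest-id check is what keeps two executors from creating a node for the same group.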
Best wishes,
Marc
