Create new node for each group of connected nodes


On Sat, Jan 30, 2021 at 10:42 PM, <hadoopmarc@...> wrote:
Thanks, Marc, for the quick response.


Hi Anjani,

Your use case obviously comes down to an OLAP query. While JanusGraph provides InputFormat classes for use with TinkerPop's SparkGraphComputer and HadoopGraph, many users have experienced problems with them; see e.g. the latest thread on the topic.

If you do get OLAP with SparkGraphComputer running on your graph with sufficient performance, an additional advantage is that you can apply TinkerPop's ConnectedComponentVertexProgram.
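To illustrate the kind of result a connected-component computation gives you, here is a minimal plain-Python sketch of label propagation, where every vertex repeatedly adopts the smallest label among itself and its neighbours. This is only the general idea, not TinkerPop's actual implementation; the function name and the in-memory edge list are assumptions made for the example.

```python
from collections import defaultdict


def connected_components(edges, vertices):
    """Label each vertex with the smallest vertex id in its component."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Every vertex starts out as its own component.
    label = {v: v for v in vertices}

    changed = True
    while changed:
        # One pass roughly corresponds to one superstep. Labels are updated
        # in place, so this may converge in fewer passes than a strict
        # message-passing implementation would.
        changed = False
        for v in vertices:
            best = min([label[v]] + [label[n] for n in neighbours[v]])
            if best < label[v]:
                label[v] = best
                changed = True
    return label


# Toy graph: A-B form one group; C-D-E form another.
labels = connected_components(
    edges=[("A", "B"), ("C", "D"), ("D", "E")],
    vertices=["A", "B", "C", "D", "E"],
)
```

Vertices that end up with the same label belong to the same group, so one new node per distinct label would be needed.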

A safer way to go, not depending on the JanusGraph InputFormats, would be:
  • Run an OLTP query that writes all vertex ids to a file. This may take days, but it will also give you a baseline of how long a full table scan takes and how much parallelism you need to get a reasonable running time. Be sure to iterate the traversal rather than keep all ids in memory.
  • Use the file with ids as input to a Spark job that runs, for each vertex, the Gremlin query to get all connected vertices. If the starting vertex has the lowest id in its group, add the additional required vertex and edges (relying on detecting the new vertex is not safe on an eventually consistent backend). Each Spark executor can instantiate its own embedded JanusGraph instance, and queries inside an executor are done in an OLTP way.
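The two steps above can be sketched in plain Python as follows. This is only a sketch of the control flow under stated assumptions: the `GRAPH` dict stands in for the real graph, `component_of` stands in for the Gremlin query that collects all connected vertices, and `process_partition` stands in for the work one Spark executor would do. All of these names are hypothetical and none of them is JanusGraph or Spark API.

```python
import os
import tempfile
from collections import deque

# Hypothetical in-memory stand-in for the graph: vertex id -> neighbour ids.
GRAPH = {1: {2}, 2: {1}, 3: {4}, 4: {3, 5}, 5: {4}}


def dump_ids(id_iterator, path):
    """Step 1: stream ids to a file one at a time, never holding them all."""
    with open(path, "w") as out:
        for vertex_id in id_iterator:
            out.write(f"{vertex_id}\n")


def component_of(start):
    """Stand-in for the OLTP query that collects all connected vertices."""
    seen, queue = {start}, deque([start])
    while queue:
        for n in GRAPH[queue.popleft()]:
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return seen


def process_partition(ids):
    """Step 2: what one executor would do for its share of the id file."""
    created = []
    for start in ids:
        members = component_of(start)
        # Only the lowest id in a group creates the new vertex, so workers
        # need no coordination and never have to read back new vertices.
        if start == min(members):
            created.append((f"group-{start}", sorted(members)))
    return created


path = os.path.join(tempfile.mkdtemp(), "vertex_ids.txt")
dump_ids(iter(GRAPH), path)
with open(path) as f:
    groups = process_partition(int(line) for line in f)
```

Each entry in `groups` names one new vertex and the existing vertices it should get edges to. In this sketch every vertex walks its whole group; a real job could stop a walk as soon as it meets an id lower than the starting one, since that vertex is then not the one responsible for the group.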
Best wishes,
Marc


Hi All,

We are using JanusGraph 0.5.2 with Cassandra as the storage backend and Elasticsearch as the search engine. We have 700M+ nodes.
Nodes are already connected by edges.

We have a use case to add one more node for each group of connected nodes and then create edges between the newly created node and the existing nodes.
For example, say:

Nodes A and B are connected by an edge.
Nodes C, D and E are connected by edges.

Create one node for A and B, and create edges between the newly created node and the existing nodes.
Create one node for C, D and E, and create edges between the newly created node and the existing nodes.

I would appreciate suggestions on how to achieve this, considering our huge graph size.


Thanks in advance.