How to split graph in multiple graphml files and load them separately


Laura Morales <lauretas@...>
 

My colleagues and I are working on different "parts" of the same graph. Each of us creates one GraphML file, and then we'd like to load our files into the graph (we're using .readGraph("file.graphml")). My problems are:

1. If I load one file, then the other, Janus will not create the edges that have their "origin" in one file and their "target" in another, I guess because it does not find the target vertex in the same file. Janus assigns its own IDs, so it looks like we have to "merge" all the files into one before inserting data into the graph.
2. Because of 1., we cannot "update" only the part of the graph whose file has changed; instead I have to recreate the whole graph every time.
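
The "merge all the files into one" workaround can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a JanusGraph feature: the function name merge_graphml and the output graph id "G" are made up for illustration, and it assumes the files use the standard GraphML namespace and that a node id appearing in several files means the same vertex.

```python
import xml.etree.ElementTree as ET

NS = "http://graphml.graphdrawing.org/xmlns"
ET.register_namespace("", NS)  # write GraphML as the default namespace, without a prefix


def merge_graphml(paths, out_path):
    """Merge several GraphML files into one file (hypothetical helper).

    <key> and <node> elements are deduplicated by their id attribute
    (the first occurrence wins); edges are copied as they are, so edge
    ids must already be unique across the input files.
    """
    keys, nodes, edges = {}, {}, []
    for path in paths:
        doc = ET.parse(path).getroot()
        for key in doc.findall(f"{{{NS}}}key"):
            keys.setdefault(key.get("id"), key)
        for graph in doc.findall(f"{{{NS}}}graph"):
            for node in graph.findall(f"{{{NS}}}node"):
                nodes.setdefault(node.get("id"), node)
            edges.extend(graph.findall(f"{{{NS}}}edge"))

    root = ET.Element(f"{{{NS}}}graphml")
    root.extend(list(keys.values()))
    graph = ET.SubElement(root, f"{{{NS}}}graph", {"id": "G", "edgedefault": "directed"})
    graph.extend(list(nodes.values()))
    graph.extend(edges)
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    return len(nodes), len(edges)
```

The merged file can then be loaded with a single readGraph call as before. This only automates the merge; it does not remove the need to reload the whole graph after a change.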

I'd like to hear your comments on how we could organize a collaboration like this, i.e. people working on different parts of the same graph, merging them together, and updating only the parts that have changed. "readGraph" is very useful because a file can be loaded in one line without having to write any custom Groovy scripts for parsing all the files.
Thank you.


hadoopmarc@...
 

Hi Laura,

I do not see an easy solution. Although JanusGraph supports custom vertex IDs, I do not believe this is compatible with the Gremlin IO readers (at least, not out of the box; I tried...).

An alternative collaboration model would be to set up Gremlin Server. Then you have the Gremlin language variants available (e.g. Python) to write new and modified data directly to a shared graph (without using GraphML files for transport). Apparently, you have an external naming convention to recognize shared vertices, so you could add the external names as properties and define a JanusGraph index for that.
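
One way this could look, as a rough sketch rather than a tested recipe: turn each GraphML file into parameterized "get or create" Gremlin scripts keyed on the external name, and submit them to Gremlin Server. Everything named here is an assumption for illustration: the property key ext_id, the default labels "node" and "link", and the helper names. The scripts use the common fold/coalesce upsert pattern; the Python code only builds (script, bindings) pairs and does not talk to a server itself.

```python
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"

# Gremlin-Groovy scripts with bindings. 'ext_id' is a hypothetical property
# key that stores the GraphML node id (e.g. "data_source1:id0").
UPSERT_VERTEX = (
    "g.V().has('ext_id', ext_id).fold()"
    ".coalesce(unfold(), addV(vertex_label).property('ext_id', ext_id))"
)
UPSERT_EDGE = (
    "g.V().has('ext_id', src).as('s')"
    ".V().has('ext_id', dst)"
    ".coalesce(inE(edge_label).where(outV().as('s')),"
    " addE(edge_label).from('s'))"
)


def vertex_request(ext_id, vertex_label):
    """Build one get-or-create request for a vertex."""
    return UPSERT_VERTEX, {"ext_id": ext_id, "vertex_label": vertex_label}


def requests_for_file(path, vertex_label="node", edge_label="link"):
    """Yield (script, bindings) pairs for one GraphML file.

    Edge endpoints are upserted before the edge itself, so an edge whose
    target is declared in a colleague's file still finds a vertex.
    """
    root = ET.parse(path).getroot()
    for node in root.iter(NS + "node"):
        yield vertex_request(node.get("id"), vertex_label)
    for edge in root.iter(NS + "edge"):
        src, dst = edge.get("source"), edge.get("target")
        yield vertex_request(src, vertex_label)
        yield vertex_request(dst, vertex_label)
        yield UPSERT_EDGE, {"src": src, "dst": dst, "edge_label": edge_label}
```

Each pair could be sent with a script-based client, for example the gremlinpython driver's Client.submit(script, bindings). Because every request is idempotent, a changed file can be replayed on its own without rebuilding the graph, although removed vertices or edges would still need separate handling. Without an index on the external-name property, each lookup would likely scan the whole graph.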

Best wishes,     Marc


Laura Morales <lauretas@...>
 

> Apparently, you have an external naming convention to recognize shared vertices
The convention is simply to use custom IDs in graphml, like this

<node id="data_source1:id0"/>
<node id="data_source1:id1"/>
...

<node id="data_source2:id0"/>
<node id="data_source2:id1"/>
...

When I "merge" all the nodes/edges of the two GraphML files into a single file and load the new file into Janus, Janus replaces all the IDs with its own Long values. Otherwise all the vertices and edges are imported correctly; only the IDs have been changed from String to Long. For my particular use case I don't mind the IDs being changed, but having to "merge" and reinsert the whole graph every time is really inconvenient and doesn't scale beyond a small graph. I need to "merge" all the files because if I load them separately, Janus will not treat two vertices with the same ID from two separate files as the same vertex; it will create 2 nodes and give them 2 different IDs.
I feel like this problem probably wouldn't exist if the GraphML or GraphSON loaders used the user-defined IDs instead of replacing them with Longs.
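
To see in advance which parts of the files are affected by this, a small standard-library check can list the edges that point outside their own file and the node IDs declared in more than one file. This is only a diagnostic sketch; the helper names dangling_edges and shared_ids are made up here, and the files are assumed to use the standard GraphML namespace.

```python
import xml.etree.ElementTree as ET

NS = "{http://graphml.graphdrawing.org/xmlns}"


def dangling_edges(path):
    """Return (source, target) pairs for edges with an endpoint not declared in this file."""
    root = ET.parse(path).getroot()
    local_ids = {node.get("id") for node in root.iter(NS + "node")}
    return [
        (edge.get("source"), edge.get("target"))
        for edge in root.iter(NS + "edge")
        if edge.get("source") not in local_ids or edge.get("target") not in local_ids
    ]


def shared_ids(paths):
    """Return the node ids that are declared in more than one file."""
    seen, shared = set(), set()
    for path in paths:
        ids = {node.get("id") for node in ET.parse(path).getroot().iter(NS + "node")}
        shared |= seen & ids  # ids already seen in an earlier file
        seen |= ids
    return shared
```

Edges reported by dangling_edges are the ones that get lost when files are loaded one at a time, and IDs reported by shared_ids are the ones that would end up as two separate vertices.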


Laura Morales <lauretas@...>
 

I've also noticed that graphml files can specify an "id" for the <graph> node, but I guess this has no effect on Janus at all? Like, it's completely ignored? Am I right?



hadoopmarc@...
 

Hi Laura,

Without checking this in the code, it only seems logical that the graph id is ignored, because you have to supply the io readers with an existing Graph instance. Apparently it was chosen to make the user responsible for supplying the Graph that corresponds to the graph id in the xml file.

Marc