Traversal bindings of dynamically created graphs are not propagated in a multi-node cluster


Anton Eroshenko <erosh.anton@...>
 

Hi
We use dynamically created graphs in a multi-node JanusGraph cluster. With a single JanusGraph node it seems to work, but with more than one node the synchronization between JanusGraph nodes does not work: the Gremlin Server on some nodes does not recognize the traversal of a newly created graph.
The documentation says there is a maximum lag of 20 seconds for the binding to take effect on any node in the cluster, but in fact the new traversal is bound only on the node that received the request, not on the others, no matter how long you wait. So it looks like a bug.
We're creating a new graph with 
ConfiguredGraphFactory.create(graphName)
It is created successfully, but not propagated to other nodes. 

As a workaround I'm calling ConfiguredGraphFactory.open(graphName) on an unsynced instance, but this is not reliable, since from a Java application you don't know which instance the load balancer will route you to.

I attached a docker-compose file with which it can be reproduced. There are two JanusGraph instances, which expose different ports. Be aware that two JanusGraph instances starting up at the same time result in a concurrency error on one of the nodes, which is another issue with the multi-node configuration. So I simply stop one of the containers on start-up and restart it later.


hadoopmarc@...
 

Hi Anton,

If I do a $  docker run janusgraph/janusgraph:latest
the logs show it runs with the berkeleyje backend.

If I look at:
https://github.com/JanusGraph/janusgraph-docker/blob/master/0.5/Dockerfile
and your docker-compose file, I cannot see how you make your janusgraph containers use the scylla/cql backend. So, check the logs of your janusgraph containers to see what they are running.

And, if this was not clear, sharing configured graphs between janusgraph instances is only possible if they share a distributed storage backend. If berkeleyje is used, each janusgraph container has its private storage backend.

Best wishes,    Marc


Anton Eroshenko <erosh.anton@...>
 

Hi Marc,
The environment properties in docker-compose make it work with scylla as the storage backend and with ConfiguredGraphFactory for dynamically created graphs. It works as expected except for the sync issues I described above. I attached our start-up logs in case you'd like to look at them.





hadoopmarc@...
 

Hi Anton,

I did not feel like debugging your docker-compose file, but I could not find any test covering your scenario on github/janusgraph either, so I just replayed your scenario with the default janusgraph-full-0.5.3 distribution. These are the steps:
  1. start a cassandra-cql instance with bin/janusgraph.sh start   (ignore the gremlin server and elasticsearch that are started too)
  2. make two files conf/gremlin-server/gremlin-server-configuration8185.yaml and conf/gremlin-server/gremlin-server-configuration8186.yaml, using conf/gremlin-server/gremlin-server-configuration.yaml as a template but changing the port numbers,
  3. start two gremlin server instances with these yaml files, so serving at port 8185 and 8186
  4. make two files conf/remote8185.yaml and conf/remote8186.yaml
  5. start two gremlin console instances and play the following:
In the first console:
gremlin> :remote connect tinkerpop.server conf/remote8185.yaml session
==>Configured localhost/127.0.0.1:8185-[3aa66b8e-8468-4cd7-95aa-0e642bb8434c]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8185]-[3aa66b8e-8468-4cd7-95aa-0e642bb8434c] - type ':remote console' to return to local mode
gremlin> map = new HashMap<String, Object>();
gremlin> map.put("storage.backend", "cql");
==>null
gremlin> map.put("storage.hostname", "127.0.0.1");
==>null
gremlin> map.put("graph.graphname", "graph1");
==>null
gremlin> ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));
==>null
gremlin> graph1 = ConfiguredGraphFactory.open("graph1")
==>standardjanusgraph[cql:[127.0.0.1]]
gremlin> g1 = graph1.traversal()
==>graphtraversalsource[standardjanusgraph[cql:[127.0.0.1]], standard]
gremlin> g1.addV()
==>v[4136]
gremlin> g1.V()
==>v[4136]
gremlin> g1.tx().commit()
==>null
gremlin>

In the second console:
gremlin> :remote connect tinkerpop.server conf/remote8186.yaml session
==>Configured localhost/127.0.0.1:8186-[00729ace-48e0-4896-83e6-2aeb19abe84d]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8186]-[00729ace-48e0-4896-83e6-2aeb19abe84d] - type ':remote console' to return to local mode
gremlin> graph2 = ConfiguredGraphFactory.open("graph2")
Please create configuration for this graph using the ConfigurationManagementGraph#createConfiguration API.
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin> graph1 = ConfiguredGraphFactory.open("graph1")
==>standardjanusgraph[cql:[127.0.0.1]]
gremlin> g1=graph1.traversal()
==>graphtraversalsource[standardjanusgraph[cql:[127.0.0.1]], standard]
gremlin> g1.V()
==>v[4136]

The assignment to graph1 differs from what is shown in the ref docs at:
https://docs.janusgraph.org/basics/configured-graph-factory/#binding-example

But otherwise the scenario you are looking for works as expected. I trust you can use it as a reference for debugging your docker-compose file.
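
For steps 2 and 4, only the port has to differ from the template files. A minimal sketch of the relevant entries for the first server, assuming the stock conf/gremlin-server/gremlin-server-configuration.yaml and conf/remote.yaml from the distribution are used as templates and everything else in them (serializers, graphManager, graphs) is left untouched:

```yaml
# conf/gremlin-server/gremlin-server-configuration8185.yaml (changed entry only)
port: 8185
---
# conf/remote8185.yaml (relevant entries only)
hosts: [localhost]
port: 8185
```

The files for the second server are the same with 8186 as the port.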

Best wishes,    Marc


Anton Eroshenko <erosh.anton@...>
 

Marc, thanks for your help.
The way you test it is similar to how it works in my environment. I do ConfiguredGraphFactory.open("graph1") as a workaround for the second JanusGraph instance. 
But the question is about this statement in the documentation:
The JanusGraphManager rebinds every graph stored on the ConfigurationManagementGraph (or those for which you have created configurations) every 20 seconds. This means your graph and traversal bindings for graphs created using the ConfiguredGraphFactory will be available on all JanusGraph nodes with a maximum of a 20 second lag. It also means that a binding will still be available on a node after a server restart.
So I'm expecting that after 20 seconds the new graph traversal will be bound on all JanusGraph nodes without explicitly opening the graph with ConfiguredGraphFactory.open() on each node. I saw the code responsible for this dynamic rebinding in JanusGraphManager, but it doesn't seem to work.
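
To make the expectation concrete, here is a toy model in plain Java of the behavior that the quoted documentation describes. It is a sketch under assumptions, not JanusGraph code: the classes SharedConfigStore and Node and the method rebindAll() are hypothetical stand-ins for the shared ConfigurationManagementGraph, one server node, and one rebind cycle. The binding names graph1 and graph1_traversal follow the naming convention from the documentation.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model only: none of these classes exist in JanusGraph.
final class RebindSketch {

    // Hypothetical stand-in for the graph configurations kept in the shared storage backend.
    static final class SharedConfigStore {
        private final Set<String> graphNames = ConcurrentHashMap.newKeySet();

        void createConfiguration(String graphName) {
            graphNames.add(graphName);
        }

        Set<String> getGraphNames() {
            return new HashSet<>(graphNames);
        }
    }

    // Hypothetical stand-in for one server node with its own local bindings.
    static final class Node {
        private final SharedConfigStore store;
        private final Map<String, String> bindings = new ConcurrentHashMap<>();

        Node(SharedConfigStore store) {
            this.store = store;
        }

        // Explicit open on this node: binds the graph and its traversal locally.
        void open(String graphName) {
            if (!store.getGraphNames().contains(graphName)) {
                throw new IllegalStateException("No configuration for " + graphName);
            }
            bindings.put(graphName, "graph:" + graphName);
            bindings.put(graphName + "_traversal", "traversal:" + graphName);
        }

        // One rebind cycle: bind every graph that has a stored configuration.
        // The documentation describes such a cycle running every 20 seconds.
        void rebindAll() {
            for (String graphName : store.getGraphNames()) {
                open(graphName);
            }
        }

        boolean isBound(String name) {
            return bindings.containsKey(name);
        }
    }

    public static void main(String[] args) {
        SharedConfigStore store = new SharedConfigStore();
        Node nodeA = new Node(store);
        Node nodeB = new Node(store);

        // The graph is created through node A only.
        store.createConfiguration("graph1");
        nodeA.open("graph1");
        System.out.println("B before rebind: " + nodeB.isBound("graph1_traversal"));

        // Expected behavior: after its next rebind cycle, node B has the bindings too.
        nodeB.rebindAll();
        System.out.println("B after rebind:  " + nodeB.isBound("graph1_traversal"));
    }
}
```

In this model node B reports false before its rebind cycle and true after it. What I observe on the real cluster corresponds to the rebind cycle on node B never picking up graph1.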


hadoopmarc@...
 

Hi Anton,

OK, it took me some time to reach your level of understanding, but hopefully the
scenario below really starts adding to our common understanding. While the
issue hurts you in a setup with multiple gremlin servers, the issue already
appears in a setup with a single gremlin server.

The scenario comprises the following steps:
1. start Cassandra with:
   $ bin/janusgraph.sh start
   
2. start gremlin server:
   $ bin/gremlin-server.sh conf/gremlin-server/gremlin-server-configuration8185.yaml
   
3. connect with a gremlin console and run the following commands:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
==>Configured localhost/127.0.0.1:8185-[70e1320f-5c24-4804-9851-cc59db23e78e]
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8185]-[70e1320f-5c24-4804-9851-cc59db23e78e] - type ':remote console' to return to local mode
gremlin> map = new HashMap<String, Object>();
gremlin> map.put("storage.backend", "cql");
==>null
gremlin> map.put("storage.hostname", "127.0.0.1");
==>null
gremlin> map.put("graph.graphname", "graph6");
==>null
gremlin> ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));
==>null

... wait > 20 seconds
... new remote connection required for bindings to take effect

gremlin> :remote connect tinkerpop.server conf/remote8185.yaml session
==>Configured localhost/127.0.0.1:8185-[a1ddd2f3-9ab3-4eee-a415-1aa4ea57ca66]
gremlin> graph6
No such property: graph6 for class: Script8
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin> ConfiguredGraphFactory.getGraphNames()
==>graph5
==>graph4
==>graph3
==>graph2
==>graph1
==>graph6
gremlin>

If you now restart the gremlin server and reconnect in gremlin console,
graph6 is opened on the server and available as binding in the console.

So, indeed the automatic opening + binding of graphs as intended in line 105 of
https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-core/src/main/java/org/janusgraph/graphdb/management/JanusGraphManager.java
is somehow not functional.

Did we formulate the issue as succinctly as possible now?

Best wishes,     Marc


hadoopmarc@...
 

You could also check the scenario at line 65 of:

https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-server/src/test/java/org/janusgraph/graphdb/tinkerpop/ConfigurationManagementGraphServerTest.java

This is with the inmemory storage backend rather than cassandra.

Marc


Anton Eroshenko <erosh.anton@...>
 

Hi Marc,
I'm glad that you managed to reproduce it in the Gremlin Console. But I believe that you in fact did it with two JanusGraph servers, not with a single server as you assumed. As far as I understand, janusgraph.sh in step 1 and gremlin-server.sh in step 2 both start a JanusGraph instance. So I think your test scenario is close to a multi-node configuration. That is why the single-node test you mentioned could not catch this issue. For a single node it works fine.
So should I file an issue in the project's GitHub repository?


hadoopmarc@...
 

Hi Anton,

No, my last post only concerned the gremlin server on port 8185, although the
first line of step 3 should have been (this was a hand-editing error):
    :remote connect tinkerpop.server conf/remote8185.yaml session
The gremlin server on port 8182 from janusgraph.sh is ignored.

Anyway, the link to the successful test on GitHub actually held the key to some
more insight. It turns out that our issue (bindings are not automatically
generated after max 20 seconds) is absent if you use the sequence
createTemplateConfiguration() and create(). Unfortunately, this only holds on
the same server where the new configuration was created.
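
A minimal sketch of that sequence, as it could be typed into a remote console session like the ones earlier in this thread. It is not verified against this exact setup: the graph name graph7 is only an example, and the sketch assumes that no template configuration exists yet. Unlike the map passed to createConfiguration(), the template map carries no graph.graphname entry; the name is passed to create() instead.

```groovy
// Sent to the gremlin server through a remote console session, as in the transcripts above.
map = new HashMap<String, Object>()
map.put("storage.backend", "cql")
map.put("storage.hostname", "127.0.0.1")
// The template holds the shared settings; it has no graph.graphname entry.
ConfiguredGraphFactory.createTemplateConfiguration(new MapConfiguration(map))
// Creates a configuration for "graph7" from the template and opens the graph.
graph7 = ConfiguredGraphFactory.create("graph7")
```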

So, I will report this all as an issue and you can comment on it if necessary.

Best wishes,    Marc


hadoopmarc@...