Re: Bindings for graphs created using ConfiguredGraphFactory not working as expected

hadoopmarc@...
 

Hi Anya,

In v0.6.0 the bin/janusgraph-server.sh start script no longer starts Cassandra. Are you sure you started Cassandra ("cassandra/bin/cassandra") before starting JanusGraph?
Also check that you did not mix up the graph1 and graph1_config values of graph.graphname.

I guess you already found out that you need to run this before bin/janusgraph-server.sh:
export JANUSGRAPH_YAML=conf/gremlin-server/gremlin-server-configuration.yaml

Best wishes,      Marc


Re: Cleaning up old data in large graphs

hadoopmarc@...
 

Hi Mladen,

Indeed, there is still a load of open issues regarding TTL:

https://github.com/JanusGraph/janusgraph/issues?q=is%3Aissue+is%3Aopen+ttl

Your last remark about empty vertices sounds plausible, although it would be pretty bad if true. Searching GitHub for "new HashMap" gives too many results to inspect, so please keep an eye out for further hints about where this might happen.
I did not see open issues that report empty vertices after ghost vertex removal.

Best wishes,    Marc


Re: Cleaning up old data in large graphs

Mladen Marović
 

Hi Marc,

thanks for the response.

  1. As described in https://docs.janusgraph.org/schema/advschema/, TTL is already supported. However, there are two issues in my case:

    a) Changing the TTL is supported, but the new TTL is only applied on inserts and updates. In other words, if I have a TTL of 12 months and change it to 18 months, it will effectively take 12 months before that change takes effect, because all the old data will still have its TTL set to 12 months. A possible workaround would be to go over all objects in the database and update them in some way to force the new TTL, although that seems rather costly.

    b) I'm not sure how exactly the TTL setting is applied in JanusGraph. Is it set only on the data, or on the composite indexes as well? If it is set only on the data, the indexes will fill up with entries for non-existent elements over time. I can confirm this for mixed indexes: during testing, data was deleted in Cassandra, but the mixed index entries in Elasticsearch were not, which means that I would have to delete them manually as well. This would be OK if JanusGraph supported using multiple Elasticsearch indexes for a single JanusGraph index (which would be a really cool feature btw!), but I don't think that's the case. I tried to trick JanusGraph into using an alias, but things did not work as expected.

  2. I don't think the problem in the Spark jobs is with transactions. By default, Spark repeats a task that fails with an exception, and the job does eventually end, so all tasks finished successfully. Also, in my case there actually are no exceptions. I even managed to manually find the vertices that caused the issues via the Gremlin console, but their valueMap() is {}, where I would expect the 10-15 properties they usually have if they had not been deleted. Basically, JanusGraph acts as if it found a vertex (or some part of it), but during deletion nothing happens.

    If I remember correctly, I tried to analyze what was happening a while ago, and I seem to have found a place in the JanusGraph source where a dummy (empty) vertex is created if JanusGraph does not find the proper data. I guess that's what is happening when I get the {} result. Maybe the index entry wasn't cleaned up, so JanusGraph thinks there should be something, finds nothing, and returns the empty vertex. When I try to delete it, there is again nothing to delete, so the index entry isn't cleared. I don't know if that's actually possible, but it might explain my case.
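
To make point 1a concrete, here is a small sketch in plain Java. It is only a model of the behaviour I described, not JanusGraph code: the class TtlChange, the method expiry and the dates are made up for illustration, and it simply assumes that an element keeps the TTL that was active when it was written.

```java
import java.time.LocalDate;

public class TtlChange {

    /**
     * Expiry date of an element under the assumption that it keeps the TTL
     * that was configured on the day it was written.
     */
    public static LocalDate expiry(LocalDate written, LocalDate ttlChanged,
                                   int oldTtlMonths, int newTtlMonths) {
        int ttlMonths = written.isBefore(ttlChanged) ? oldTtlMonths : newTtlMonths;
        return written.plusMonths(ttlMonths);
    }

    public static void main(String[] args) {
        // Hypothetical date on which the TTL is raised from 12 to 18 months.
        LocalDate change = LocalDate.of(2022, 1, 1);

        // Written one day before the change: still expires after 12 months.
        System.out.println(expiry(change.minusDays(1), change, 12, 18)); // 2022-12-31

        // Written on the day of the change: expires after 18 months.
        System.out.println(expiry(change, change, 12, 18)); // 2023-07-01
    }
}
```

In this model nothing older than 12 months survives until 12 months after the change, which is the delay I meant above.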

Best regards,

Mladen Marović


Re: Cleaning up old data in large graphs

hadoopmarc@...
 

Hi Mladen,

Just two things that come to mind while reading your story:
  • The Cassandra TTL feature seems promising for your use case (see e.g. https://www.geeksforgeeks.org/time-to-live-ttl-for-a-column-in-cassandra/). I guess this would require code changes in janusgraph-cassandra.
  • How is transaction control handled in the Spark jobs? You want transactions of reasonable size (say 10,000 vertices or edges), and you want Spark tasks to fail if the transaction commit fails. That way Spark will repeat the task, which will hopefully succeed.
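
A minimal sketch of the pattern in the second point, in plain Java so that it runs without a graph. Tx and TxFactory are hypothetical stand-ins made up for illustration, not JanusGraph or Spark types, and the batch size is just the number mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedDelete {

    /** Hypothetical stand-in for a graph transaction; not a JanusGraph type. */
    public interface Tx {
        void delete(long elementId);
        void commit(); // expected to throw if the commit fails
    }

    /** Hypothetical factory that opens one fresh transaction per batch. */
    public interface TxFactory {
        Tx open();
    }

    /**
     * Deletes the ids in batches of batchSize, using one transaction per batch.
     * A failing commit is deliberately not caught, so the exception reaches the
     * caller (in a Spark job: the task fails and is retried).
     * Returns the number of commits performed.
     */
    public static int deleteInBatches(List<Long> ids, int batchSize, TxFactory factory) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        for (int start = 0; start < ids.size(); start += batchSize) {
            int end = Math.min(start + batchSize, ids.size());
            Tx tx = factory.open();
            for (long id : ids.subList(start, end)) {
                tx.delete(id);
            }
            tx.commit();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 25_000; i++) {
            ids.add(i);
        }

        // In-memory transaction that only counts the deletes it receives.
        int[] deleted = {0};
        TxFactory factory = () -> new Tx() {
            public void delete(long elementId) { deleted[0]++; }
            public void commit() { }
        };

        int commits = deleteInBatches(ids, 10_000, factory);
        System.out.println(commits + " commits, " + deleted[0] + " deletes");
    }
}
```

The point of the design is that the commit is never wrapped in a catch-all: a swallowed commit failure would let the task report success although part of the batch was not deleted.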

Best wishes,    Marc


Bindings for graphs created using ConfiguredGraphFactory not working as expected

anya.sharma@...
 

Hello,

I have a local setup of JanusGraph 0.6.0 with Cassandra 3.11.9. I am creating a graph using the ConfiguredGraphFactory. For this, I am using the bundled properties and yaml files and creating the graph by running the following commands from the Gremlin console (also bundled with the JanusGraph installation):

gremlin> :remote connect tinkerpop.server conf/remote.yaml session
gremlin> :remote console
gremlin> map = new HashMap<String, Object>();
gremlin> map.put('storage.backend', 'cql');
gremlin> map.put('storage.hostname', '127.0.0.1');
gremlin> map.put('graph.graphname', 'graph1');
gremlin> map.put('storage.username', 'myDBUsername');
gremlin> map.put('storage.password', 'myDBPassword');
gremlin> ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map));

Once I have created the configuration, I try to access the graph and the traversal variables bound to it, but I get the following response:

gremlin> ConfiguredGraphFactory.open('graph1')
gremlin> graph1
No such property: graph1 for class: Script7
 
gremlin> graph1_traversal
No such property: graph1_traversal for class: Script8
 
I am using the gremlin-server-configuration.yaml and janusgraph-cql-configuration.properties bundled with the JanusGraph installation package. The only changes I have made are adding the credentials and custom graph.graphname:

graph.graphname=graph1_config
storage.hostname=127.0.0.1
storage.username=myDBUsername
storage.password=myDBPassword

According to the documentation, I should be able to access the bound variables. I was able to do this in version 0.3.1 of JanusGraph. What could I be missing or doing wrong?

Thanks
Anya


Re: Duplicate vertex issue with Uniqueness constraints | Janusgraph CQL

Pawan Shriwas
 

Hi Marc, 

Adding some additional data.

Checking for duplicate data despite the uniqueness constraint on the name_cons field:

gremlin> g.V().has('gId',P.within('da209078-4a2f-4db2-b489-27da028df983','ba81f5d3-a29b-4a2c-88c3-c265ce3f68a5','9804b32d-31d9-409a-a441-a38fdbf998f7')).valueMap()
==>[gId:[da209078-4a2f-4db2-b489-27da028df983],entityGId:[9e51c70d-f148-401f-8eea-53b767d9bbb6],name_cons:[CGNAT_NS2]]
==>[gId:[ba81f5d3-a29b-4a2c-88c3-c265ce3f68a5],entityGId:[7e763ebc-b2e0-4d04-baaa-4463d04ca436],name_cons:[CGNAT_NS2]]
==>[gId:[9804b32d-31d9-409a-a441-a38fdbf998f7],entityGId:[23fd7efd-3688-4b58-aab6-173d25a8dd63],name_cons:[CGNAT_NS2]]

Reading the data through the uniquely indexed property (with lock consistency) returns only one record:

gremlin> g.V().has('name_cons','CGNAT_NS2').valueMap()
==>[gId:[290cc878-19e1-44f6-9f6c-62b7471e21bc],entityGId:[0b59889d-e725-46e5-9f42-d96daaeaa21d],name_cons:[CGNAT_NS2]]

Hope this clarifies!


--
Thanks & Regard

PAWAN SHRIWAS





Cleaning up old data in large graphs

Mladen Marović
 

Hello,

I have a graph (JanusGraph 0.5.3 running on a CQL backend with an Elasticsearch index) that is updated in near real-time. About 50M new vertices and 100M new edges are added every month. A large part of these (around 90%) should be deleted after 1 year, and the customer may want to change this retention period at a later date. The remaining 10% of the data has no fixed expiration period, but vertices are expected to be deleted when they have no more edges.

Currently, I have a daily Spark job that deletes vertices and their edges by checking their date field (a field denoting the date they were added to the graph). A second Spark job is used to delete vertices without edges. This sort of works, but is definitely not perfect for the following reasons:

  1. After running the first cleanup job for a specific date, there's always a small number of items (vertices or edges) left. The job reports the number of deleted items, and even after running the job several times, there's always a non-zero number of items being reported as deleted in that run. For example, in the first run it will report several million items as deleted, in the second about 5000, in the third about 4800, in the fourth about 4620 etc. This eventually converges to some small non-zero number, meaning the Spark job always sees some vertices that it repeatedly attempts to delete, but never actually does, even though no errors appear.

    I'm guessing this is caused by some consistency issues, but I could not resolve it completely. I tried running the GhostVertexRemover vertex program, which helps and further reduces the number of remaining items, but some still persist. Also, when running the cleanup job on a smaller scale (fewer workers and less data), the job seems to work without issues, so I don't think there are any major bugs in the code itself that would cause this.

  2. Once it starts, the cleaning job is quite performance-intensive and can sometimes interfere with the input job that loads the graph data, which is something I want to avoid.

  3. During the cleanup job, cassandra delete operations produce a lot of tombstones. If the tombstone threshold is too low and exceeded on a single node, the entire graph will no longer accept any changes until a cassandra compaction is run. A large number of tombstones also degrades search performance. Graph supernodes with an especially large edge count may require several "run the cleanup job -> cleanup fails -> run compaction" cycles before everything is properly cleaned up. An alternative is to configure the tombstone threshold to be some absurdly high number to prevent failures completely and schedule daily compaction on each cassandra node after each cleanup job, which is what I'm doing currently.

I was wondering if anyone has some suggestions or best practices on how to manage graph data with a retention period (that could change over time)?

Best regards,

Mladen Marović


Re: Duplicate vertex issue with Uniqueness constraints | Janusgraph CQL

Pawan Shriwas
 

Hi Marc;

Yes, we are committing the transaction after each operation.

You asked: how do you know about "duplicate vertex creation" when "it returns only 1 record"?
A vertex is ingested more than once with the same data, and the graph generates a different id each time. When we query the graph with these different ids, the returned list contains the same name multiple times, but when we retrieve the data by the name property (which has a unique index with lock consistency), the graph returns only one record.

Hope this helps.

Thanks,
Pawan

 

--
Thanks & Regard

PAWAN SHRIWAS


Re: Duplicate vertex issue with Uniqueness constraints | Janusgraph CQL

hadoopmarc@...
 

Hi Pawan,

Your code mirrors the example at https://docs.janusgraph.org/advanced-topics/eventual-consistency/#data-consistency for the most part. Are you sure the changes on graphMgmt get committed?

Also, how do you know about "duplicate vertex creation" when "it returns only 1 record"?

Best wishes,   Marc

PS. Most of the software community reserves names starting with a verb for functions and class methods. Violating this convention (e.g. PropertyKey makePropertyKey) makes your code almost unreadable to others.


Re: jvm.options broken

hadoopmarc@...
 

Hi Matthias,

Thanks for taking the trouble to report this. It took a while, but your report did not go unnoticed:

https://github.com/JanusGraph/janusgraph/issues/2857

Best wishes,    Marc


Duplicate vertex issue with Uniqueness constraints | Janusgraph CQL

Pawan Shriwas
 

Hi Everyone,

I am facing a duplicate vertex creation issue even though a unique index is present on that property, and when I retrieve the data through the same index it returns only 1 record.

Please see the information below.

Storage Backend - Cassandra CQL
Janusgraph version - 0.5.2
index - Composite 
Uniqueness -  True
Consistency - yes
Index Status - ENABLED

Below is the code snippet:

0-02-08-f4ca12e27990b7b27cd9a92fd2028024e13f5784cf7afa26f54da58cce631438_1c6da93e7293a7.png

Index Status : 

2021_11_18_0y5_Kleki.png
 
Thanks,
Pawan


Re: Diagnosing slow write speeds to BigTable

AC
 

I have a follow-up question in addition to my reply above: Is there any guide for understanding the JanusGraph metrics available? I have written a basic metrics integration but I'm finding it quite hard to interpret the metrics that are being produced.




Re: Diagnosing slow write speeds to BigTable

AC
 

Hey again Boxuan, thanks for your help in this thread!

1) Read speed is quite fast, at least as fast as I would expect for using a remote database like BigTable.
2) That is a good idea, I will try making some writes to BigTable outside of JanusGraph in this container. However, considering that the BigTable client stats and BigTable server stats both report low latencies from within the JanusGraph application, this is looking like a JanusGraph-related issue. I will report back with results today.



Re: Diagnosing slow write speeds to BigTable

Boxuan Li
 

I am not an expert on this and I've never used BigTable or GCP before, but here are my two cents:

1) Did you test the read speed? Is it also very slow compared to writing?

2) Did you try using an HBase/Bigtable client (in the same GCP container as your JanusGraph instance) to write to your BigTable cluster? If it's also very slow then the problem might be with your network or other setups.

Best,
Boxuan


Diagnosing slow write speeds to BigTable

AC
 

Hey there, folks. Firstly I want to say thanks for your help with the previous bug we uncovered.

I'm evaluating JanusGraph performance on BigTable and observing very slow write speeds when writing even a single vertex and committing a transaction. Starting a new transaction, writing a single vertex, and committing the transaction takes at minimum 5-6 seconds.
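
For reference, the three phases can be timed separately with something like the sketch below. It is plain Java with placeholder phases; PhaseTimer and timePhases are names made up for this example and not part of JanusGraph, and the empty lambdas would have to be replaced by the real transaction calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhaseTimer {

    /** Runs each named phase in insertion order and records its wall-clock duration in milliseconds. */
    public static Map<String, Long> timePhases(Map<String, Runnable> phases) {
        Map<String, Long> millis = new LinkedHashMap<>();
        for (Map.Entry<String, Runnable> phase : phases.entrySet()) {
            long start = System.nanoTime();
            phase.getValue().run();
            millis.put(phase.getKey(), (System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        // Placeholders only: put the real calls into the lambda bodies.
        Map<String, Runnable> phases = new LinkedHashMap<>();
        phases.put("open transaction", () -> { /* start a new transaction here */ });
        phases.put("add vertex", () -> { /* write a single vertex here */ });
        phases.put("commit", () -> { /* commit the transaction here */ });

        timePhases(phases).forEach((name, ms) -> System.out.println(name + ": " + ms + " ms"));
    }
}
```

Knowing which phase accounts for the 5-6 seconds would at least show whether the time is spent before or during the commit.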

BigTable metrics indicate that the backend is never taking more than 100ms (max) to perform a write. It's hard to imagine that any amount of overhead on the BigTable side would bring this up to 5-6 seconds. The basic BigTable stats inside our application also look reasonable.

Here is the current configuration:

"storage.backend": "hbase"
"metrics.enabled": true
"cache.db-cache": false
"query.batch": true
"storage.page-size": 1000
"storage.hbase.ext.hbase.client.connection.impl": "com.google.cloud.bigtable.hbase2_x.BigtableConnection"
"storage.hbase.ext.google.bigtable.grpc.retry.deadlineexceeded.enable": true
"storage.hbase.ext.google.bigtable.grpc.channel.count": 50
"storage.lock.retries": 5
"storage.lock.wait-time": 50.millis

This is running in a GCP container that is rather beefy and not doing anything else, and is located in the same region as the BigTable cluster. Other traffic to/from the container seems fine.

I'm currently using hbase-shaded-client rev 2.1.5 since that's aligned to JanusGraph 0.5.3 which we are currently using. I experimented with up to 2.4.8 and saw no difference. I'm also using bigtable-hbase-2.x-shaded 1.25.1, the latest stable revision.

I'm at a loss how to progress further with my diagnosis, as all evidence indicates that the latency is originating with JanusGraph's operation. How can I better find and eliminate the source of this latency?

Thanks!


Re: How to change GLOBAL_OFFLINE configuration when graph can't be instantiated

toom@...
 

Hi Marc,

Your solution works if the configuration hasn't been changed yet. If you change the index backend and set a wrong hostname, you cannot access your data anymore:
mgmt = graph.openManagement()
mgmt.set("index.search.backend", "elasticsearch")
mgmt.set("index.search.hostname", "non-existant.hostname")
mgmt.commit()

After that, the graph can no longer be opened.

Regards,

Toom.


Re: Potential transaction issue (JG 0.6.0)

Boxuan Li
 

I agree with Sergey that "this problem was just hidden in the previous version as resources were not released properly".

I tried to reproduce in Java (not remote graph) but failed. @Charles, are you able to release the complete recipe of your code, or spot anything that I am missing?

My code is as follows (you can put it in JanusGraphTest.java and run):

@Test
public void testTransactionIssue() {
    // Set up one company vertex that employs three approved workers.
    JanusGraphVertex v1 = tx.addVertex(T.label, "company", "companyId", 44507);
    JanusGraphVertex v2 = tx.addVertex("status", "APPROVED", "workerId", 123, "lastName", "A", "firstName", "aa");
    JanusGraphVertex v3 = tx.addVertex("status", "APPROVED", "workerId", 124, "lastName", "C", "firstName", "a");
    JanusGraphVertex v4 = tx.addVertex("status", "APPROVED", "workerId", 125, "lastName", "aa", "firstName", "C");
    v1.addEdge("EMPLOYS", v2);
    v1.addEdge("EMPLOYS", v3);
    v1.addEdge("EMPLOYS", v4);
    tx.commit();

    // Paged queries in a new transaction, starting from the labelled company vertex.
    newTx();
    List list = tx.traversal().V().has("company", "companyId", 44507).out("EMPLOYS").has("status", "APPROVED").skip(0).limit(1).elementMap("workerId").toList();
    tx.traversal().V().has("company", "companyId", 44507).out("EMPLOYS").has("status", "APPROVED").skip(1).limit(1).elementMap("workerId").toList();
    tx.traversal().V().has("company", "companyId", 44507).out("EMPLOYS").has("status", "APPROVED").skip(0).limit(2).elementMap("workerId").toList();
    tx.rollback();

    // Similar queries in another explicit transaction, without and with ordering.
    tx = graph.newTransaction();
    tx.traversal().V().has("companyId", 44507).out("EMPLOYS").has("status", "APPROVED").skip(0).limit(1).elementMap("workerId").toList();
    tx.traversal().V().has("companyId", 44507).out("EMPLOYS").has("status", "APPROVED").order().by("lastName").by("firstName").skip(1).limit(1).elementMap("workerId").toList();

    // The same two queries through graph.traversal() instead of the explicit transaction.
    graph.traversal().V().has("companyId", 44507).out("EMPLOYS").has("status", "APPROVED").skip(0).limit(1).elementMap("workerId").toList();
    graph.traversal().V().has("companyId", 44507).out("EMPLOYS").has("status", "APPROVED").order().by("lastName").by("firstName").skip(1).limit(1).elementMap("workerId").toList();
}


Re: Cassandra 4

hadoopmarc@...
 

Hi,

There is an issue tracking this, but no PRs yet, see: https://github.com/JanusGraph/janusgraph/issues/2325

Best wishes,     Marc


Cassandra 4

Kusnierz.Krzysztof@...
 

Hi, has anyone tried JG with Cassandra 4? Does it work?


Re: How to Merge Two Vertices in JanusGraph into single vertex

hadoopmarc@...
 

Hi Krishna,

Nope. However, you are not the first to ask, see:
https://stackoverflow.com/questions/46363737/tinkerpop-gremlin-merge-vertices-and-edges/46435070#46435070

Best wishes,   Marc
