Rapid deletion of vertices
Scott Friedman
Good afternoon,
We're running a docker-compose janusgraph:0.6.1 with cassandra:3 and elasticsearch:6.6.0. We're primarily utilizing JanusGraph within Python 3.8 via gremlinpython.
We frequently reset our graph store to run an experiment or demonstration. To date, we've either (1) dropped the graph and re-loaded our schema and re-defined our indices or (2) deleted all the vertices to maintain the schema and indices. Often #2 is faster (and less error-prone), but it's slower for large graphs. I hope somebody can lend some advice that will speed up our resetting-the-graph workflow with JanusGraph.
For deleting 6K nodes (and many incident edges), here's the timing data:
2022-05-05 16:40:44,261 - INFO - Deleting batch 1.
2022-05-05 16:41:09,961 - INFO - Deleting batch 2.
2022-05-05 16:41:27,689 - INFO - Deleting batch 3.
2022-05-05 16:41:43,678 - INFO - Deleting batch 4.
2022-05-05 16:41:45,561 - INFO - Deleted 6226 vertices over 4 batch(es).
...so it takes roughly 1 minute to delete 6K vertices in batches of 2000.
Here's our Python code for deleting the nodes:
# g, __, log, and batch_size are set up elsewhere in our module;
# batch_size was 2000 in the timing run above.
batches = 0
nodes = 0
while True:
    batches += 1
    log(f'Deleting batch {batches}.')
    # Drop up to batch_size vertices and return how many were matched.
    num_nodes = g.V().limit(batch_size).sideEffect(__.drop()).count().next()
    nodes += num_nodes
    if num_nodes < batch_size:
        break
log(f'Deleted {nodes} nodes over {batches} batch(es).')
This never fails, but it's obviously quite slow, especially for larger graphs. Is there a way to speed this up? We haven't tried running it async, since we're not sure how to do so safely.
Thanks in advance for any wisdom!
Scott
Boxuan Li
Hi Scott,
One idea that comes to mind is to first collect all vertex ids, and then delete them in batches in parallel using multi-threading.
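Roughly, that could look like the following with gremlinpython and a thread pool. This is only a sketch: the endpoint URL, batch size, and worker count are placeholders, and you would reuse your own connection setup rather than opening a connection per batch.

from concurrent.futures import ThreadPoolExecutor

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

ENDPOINT = 'ws://localhost:8182/gremlin'  # assumed JanusGraph Server address
BATCH_SIZE = 500                          # illustrative only
WORKERS = 8

def delete_batch(id_batch):
    # Each worker uses its own connection so traversals don't share state.
    conn = DriverRemoteConnection(ENDPOINT, 'g')
    try:
        g = traversal().withRemote(conn)
        # Drop just these vertices (and their incident edges).
        g.V(*id_batch).drop().iterate()
    finally:
        conn.close()

# Collect all vertex ids in one pass, then fan the batches out to a thread pool.
conn = DriverRemoteConnection(ENDPOINT, 'g')
g = traversal().withRemote(conn)
ids = g.V().id().toList()
conn.close()

batches = [ids[i:i + BATCH_SIZE] for i in range(0, len(ids), BATCH_SIZE)]
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    list(pool.map(delete_batch, batches))

print(f'Deleted {len(ids)} vertices in {len(batches)} batch(es).')

Depending on how busy Cassandra is, you may need to tune the batch size and number of workers.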
Best regards,
Boxuan
hadoopmarc@...
Hi Scott,
Another approach is to take snapshots of the Cassandra tables and Elasticsearch indices after creating the schema and indices.
Note that there are some subtleties when taking snapshots of non-empty graphs (not your present use case), see:
https://lists.lfaidata.foundation/g/janusgraph-users/topic/82475527#5867
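For illustration, taking the snapshots once the schema and indices exist could look roughly like this. The docker-compose service name, keyspace name, and repository path below are assumptions about a typical setup, not verified settings, and restoring is where the subtleties in the linked thread come in.

import subprocess
import requests  # assumes the requests package is available

SNAPSHOT_TAG = 'clean-schema'
KEYSPACE = 'janusgraph'            # default JanusGraph keyspace (assumption)
ES_URL = 'http://localhost:9200'   # Elasticsearch port mapped to the host
ES_REPO_PATH = '/snapshots'        # must be listed in path.repo in elasticsearch.yml

# 1. Snapshot the Cassandra keyspace inside its container.
subprocess.run(
    ['docker-compose', 'exec', '-T', 'cassandra',
     'nodetool', 'snapshot', '-t', SNAPSHOT_TAG, KEYSPACE],
    check=True)

# 2. Register a filesystem snapshot repository in Elasticsearch, then
#    snapshot the JanusGraph indices into it.
requests.put(f'{ES_URL}/_snapshot/janusgraph_backup',
             json={'type': 'fs', 'settings': {'location': ES_REPO_PATH}}
             ).raise_for_status()
requests.put(f'{ES_URL}/_snapshot/janusgraph_backup/{SNAPSHOT_TAG}',
             params={'wait_for_completion': 'true'}).raise_for_status()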
Best wishes, Marc
eric.neufeld@...
Hello Scott,
I had the same situation, but with much more data. The fastest way was to stop the server, clear everything, start it again, and re-create the schema:
bin/janusgraph.sh stop
bin/janusgraph.sh clear
bin/janusgraph.sh start
bin/gremlin.sh -i scripts/init-myGraph.groovy
Of course, these steps could be added to a shell script such as resetjanusgraph.sh.
In init-myGraph.groovy I added something like:
:remote connect tinkerpop.server conf/.....-remote.yaml
:remote console
:load data/gravity-my-schema.groovy
defineMySchema(graph)
:q
In data/gravity-my-schema.groovy I define the Groovy function defineMySchema(graph):
//#!/usr/bin/env groovy
def defineMySchema(graph) {
    // Create graph schema and indexes, if they haven't already been created
    m = graph.openManagement()
    println 'setup schema'
    if (m.getPropertyKey('name') == null) {
        la = m.makeVertexLabel("la").make()
        ....
Maybe this helps,
Eric
Scott Friedman
Thanks Boxuan, Marc, and Eric.
I implemented Boxuan's get-vertex-ids-and-delete-in-parallel suggestion with 8 gremlinpython workers, and it's roughly an order of magnitude faster. I imagine it could scale further with more parallelism. That's some great time savings, thank you!
Marc, good idea to snapshot and then do a full down-and-up. I assume we'd have to take down Cassandra and Elasticsearch as well, and then start their docker images back up with the substituted volumes. This would obviously outperform batched deletion for millions/billions of vertices.
Eric, it sounds like your approach may work while keeping the data store (e.g., Cassandra) and indexer (e.g., Elasticsearch) alive, which could improve efficiency over a full tear-down. We'll consider this as well, probably identifying docker-based analogs for some of the janusgraph shell commands. Thanks!