Good afternoon,
We're running janusgraph:0.6.1 with cassandra:3 and elasticsearch:6.6.0 via docker-compose. We access JanusGraph primarily from Python 3.8 via gremlinpython.
We frequently reset our graph store to run an experiment or demonstration. To date, we've either (1) dropped the graph, then re-loaded our schema and re-defined our indices, or (2) deleted all the vertices to keep the schema and indices intact. Option (2) is often faster (and less error-prone), but it becomes slow for large graphs. I hope somebody can lend some advice that will speed up our resetting-the-graph workflow with JanusGraph.
For deleting 6K nodes (and many incident edges), here's the timing data:
2022-05-05 16:40:44,261 - INFO - Deleting batch 1.
2022-05-05 16:41:09,961 - INFO - Deleting batch 2.
2022-05-05 16:41:27,689 - INFO - Deleting batch 3.
2022-05-05 16:41:43,678 - INFO - Deleting batch 4.
2022-05-05 16:41:45,561 - INFO - Deleted 6226 vertices over 4 batch(es).
...so it takes roughly 1 minute (about 100 vertices/second) to delete 6K vertices in batches of 2000.
Here's our Python code for deleting the nodes:
batch_size = 2000
batches = 0
nodes = 0
while True:
    batches += 1
    log(f'Deleting batch {batches}.')
    # Drop up to batch_size vertices; count() reports how many were dropped.
    num_nodes = g.V().limit(batch_size).sideEffect(__.drop()).count().next()
    nodes += num_nodes
    if num_nodes < batch_size:
        break
log(f'Deleted {nodes} nodes over {batches} batch(es).')
This never fails, but it's obviously quite slow, especially for larger graphs. Is there a way to speed this up? We haven't tried running it async, since we're not sure how to do so safely.
Thanks in advance for any wisdom!
Scott