Rapid deletion of vertices
Scott Friedman
Good afternoon,
We're running a docker-compose janusgraph:0.6.1 with cassandra:3 and elasticsearch:6.6.0. We're primarily utilizing JanusGraph from Python 3.8 via gremlinpython. We frequently reset our graph store to run an experiment or demonstration. To date, we've either (1) dropped the graph and re-loaded our schema and re-defined our indices, or (2) deleted all the vertices to maintain the schema and indices. Often #2 is faster (and less error-prone), but it's slower for large graphs. I hope somebody can lend some advice that will speed up our resetting-the-graph workflow with JanusGraph.

For deleting 6K nodes (and many incident edges), here's the timing data:

2022-05-05 16:40:44,261 - INFO - Deleting batch 1.
2022-05-05 16:41:09,961 - INFO - Deleting batch 2.
2022-05-05 16:41:27,689 - INFO - Deleting batch 3.
2022-05-05 16:41:43,678 - INFO - Deleting batch 4.
2022-05-05 16:41:45,561 - INFO - Deleted 6226 vertices over 4 batch(es).

...so it takes roughly 1 minute to delete 6K vertices in batches of 2000. Here's our Python code for deleting the nodes:

batches = 0
nodes = 0
while True:
    batches += 1
    com.log(f'Deleting batch {batches}.')
    num_nodes = g.V().limit(batch_size).sideEffect(__.drop()).count().next()
    nodes += num_nodes
    if num_nodes < batch_size:
        break
log(f'Deleted {nodes} nodes over {batches} batch(es).')
This never fails, but it's obviously quite slow, especially for larger graphs. Is there a way to speed this up? We haven't tried running it async, since we're not sure how to do so safely. Thanks in advance for any wisdom! Scott
Boxuan Li
Hi Scott,
One idea that first came to mind: collect all the vertex ids first, and then delete them in batches and in parallel using multi-threading.

Best regards,
Boxuan
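A minimal sketch of that idea follows. The helper names here (`make_g`, `chunks`, `parallel_delete`) are my own, not from the thread: `make_g` stands in for however you construct a gremlinpython traversal source, and each worker thread gets its own source, since sharing a single connection across threads is risky.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(ids, size):
    """Split a list of vertex ids into batches of at most `size` ids."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def drop_batches(g, batches):
    # Each call looks up a batch of vertices by id and drops them
    # in a single traversal.
    for batch in batches:
        g.V(*batch).drop().iterate()

def parallel_delete(make_g, batch_size=2000, workers=8):
    # make_g() should return a fresh traversal source, e.g.
    # traversal().withRemote(DriverRemoteConnection(url, 'g')).
    # Step 1: collect all vertex ids up front (the step is id_() in
    # gremlinpython 3.5+; older releases spell it id()).
    ids = make_g().V().id_().toList()
    batches = chunks(ids, batch_size)
    sources = [make_g() for _ in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Worker w handles batches w, w+workers, w+2*workers, ... so no
        # two threads ever share a traversal source.
        futures = [pool.submit(drop_batches, sources[w], batches[w::workers])
                   for w in range(workers)]
        for f in futures:
            f.result()  # re-raise any worker exception
    return len(ids)
```

The round-robin batch assignment keeps each connection single-threaded while still using all the workers.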
hadoopmarc@...
Hi Scott,
Another approach is to take snapshots of the Cassandra tables and Elasticsearch indices after creating the schema and indices. Note that there are some subtleties when taking snapshots of non-empty graphs (not your present use case); see: https://lists.lfaidata.foundation/g/janusgraph-users/topic/82475527#5867

Best wishes,
Marc
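For reference, a rough sketch of what taking those snapshots might look like. The keyspace, repository, and tag names below are assumptions, not from the thread, and the Elasticsearch snapshot repository must be registered beforehand.

```shell
# Cassandra: snapshot the JanusGraph keyspace right after schema creation
# ('janusgraph' is the default keyspace name; yours may differ)
nodetool snapshot -t schema_baseline janusgraph

# Elasticsearch 6.x: snapshot the indices into a pre-registered repository
curl -X PUT "localhost:9200/_snapshot/backup_repo/schema_baseline?wait_for_completion=true"

# Later, restore the Elasticsearch snapshot to reset the indices
# (the affected indices must be closed or deleted before restoring)
curl -X POST "localhost:9200/_snapshot/backup_repo/schema_baseline/_restore"
```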
eric.neufeld@...
Hello Scott,
I had the same situation, but with much more data. The fastest way was stopping the server, clearing everything, starting it again, and re-creating the schema:

bin/janusgraph.sh stop
bin/janusgraph.sh clear
bin/janusgraph.sh start
bin/gremlin.sh -i scripts/init-myGraph.groovy

Of course these steps could be added to a shell script like resetjanusgraph.sh. In init-myGraph.groovy I added something like:

:remote connect tinkerpop.server conf/.....-remote.yaml
:remote console
:load data/gravity-my-schema.groovy
defineMySchema(graph)
:q

In data/gravity-my-schema.groovy I define that Groovy function, defineMySchema(graph):

//#!/usr/bin/env groovy
def defineMySchema(graph) {
    // Create graph schema and indexes, if they haven't already been created
    m = graph.openManagement()
    println 'setup schema'
    if (m.getPropertyKey('name') == null) {
        la = m.makeVertexLabel("la").make()
        ...

Maybe this helps,
Eric
Scott Friedman
Thanks Boxuan, Marc, and Eric.
I implemented Boxuan's get-vertex-ids-and-delete-in-parallel suggestion with 8 gremlinpython workers, and it saves an order of magnitude of time. I imagine it could scale up further with more parallelism. That's some great time savings, thank you!

Marc, good idea to snapshot and then do a full down-and-up. I assume we'd have to take down Cassandra and Elasticsearch as well, and then start their docker images back up with substituted volumes. This would obviously outperform the other approaches for millions/billions of vertices.

Eric, it sounds like your approach may work while keeping the data store (e.g., Cassandra) and indexer (e.g., Elasticsearch) alive, which could improve efficiency over a full tear-down. We'll consider this as well, probably identifying some docker-based analogs for some of the janusgraph shell commands. Thanks!
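For what it's worth, a hypothetical docker-based analog of the clear-and-reload reset might look like the following. The service and volume names are made up and depend entirely on your compose file, and paths inside the janusgraph image may differ.

```shell
# Tear down the stack and delete the data volumes (full reset)
docker-compose down
docker volume rm myproject_cassandra_data myproject_es_data

# Bring everything back up and re-run the schema initialization script
docker-compose up -d
docker-compose exec janusgraph bin/gremlin.sh -i scripts/init-myGraph.groovy
```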