I have a graph (Janusgraph 0.5.3 running on a cql backend and an elasticsearch index) that is updated in near real-time. About 50M new vertices and 100M new edges are added every month. A large part of these (around 90%) should be deleted after 1 year, and the customer may require to change this at a later date. The remaining 10% of the data has no fixed expiration period, but vertices are expected to be deleted when they have no more edges.
Currently, I have a daily Spark job that deletes vertices and their edges by checking their
date field (a field denoting the date they were added to the graph). A second Spark job is used to delete vertices without edges. This sort of works, but is definitely not perfect for the following reasons:
After running the first cleanup job for a specific date, there's always a small amount of items (vertices or edges) left. The job reports the number of deleted items, and even after running the job for several times, there's always a non-zero number of items being reported as deleted in that run. For example, in the first run it will report several million items as deleted, in the second about 5000, in the third about 4800, in the fourth about 4620 etc. This converges to some non-zero small number eventually, meaning the Spark job always sees some vertices that it repeatedly attempts to delete, but never actually does, even though no errors appear.
I'm guessing this is caused by some consistency issues, but could not resolve it completely. I tried to run the
GhostVertexRemover vertex program which helps and further reduces the number of remaining items, but some still persist. Also, when running the cleanup job on a smaller scale (less workers and data), the job seems to work without issues, so I don't think there are any major bugs in the code itself that would cause this.
Once it starts, the cleaning job is quite performance-intensive and can sometimes interfere with the input job that loads the graph data, which is something I want to avoid.
During the cleanup job, cassandra delete operations produce a lot of tombstones. If the tombstone threshold is too low and exceeded on a single node, the entire graph will no longer accept any changes until a cassandra compaction is run. A large number of tombstones also degrades search performance. Graph supernodes with an especially large edge count may require several "run the cleanup job -> cleanup fails -> run compaction" cycles before everything is properly cleaned up. An alternative is to configure the tombstone threshold to be some absurdly high number to prevent failures completely and schedule daily compaction on each cassandra node after each cleanup job, which is what I'm doing currently.
I was wondering if anyone has some suggestions or best practices on how to manage graph data with a retention period (that could change over time)?