[DISCUSSION] JanusGraph db-cache in distributed environment


Oleksandr Porunov
 

Hello,

I would like to start a topic about JanusGraph db-cache we have today and ways to improve it for distributed environments.
I want to split this topic on several issues I see with this cache:
1) Invalidation
2) Performance
3) Sharding

Invalidation

The first main issue with this cache is that it doesn't have good invalidation mechanisms.
As for now there are 3 invalidation mechanisms:
  1. Enough time passed (cache.db-cache-time).
  2. Evicted due to cache size limitation (cache.db-cache-time).
  3. Evicted on current JanusGraph instance only due to being mutated on the current JanusGraph instance.
I believe that we should give users a possibility to either somehow invalidate this cache when they deemed necessary (using their own strategy). Thus, I created the next PR to give users a simple possibility to invalidate cache manually: https://github.com/JanusGraph/janusgraph/pull/3184
Moreover, I think it would be great if there would be some kind of pattern implemented in JanusGraph to be able to invalidate data on mutation on all JanusGraph nodes ( the issue to track this feature is here: https://github.com/JanusGraph/janusgraph/issues/3155 ). I don't know how it's best to implement the later feature, but I have following ideas:
- We can reuse JanusGraph messaging mechanism (i.e. the one which works on top of a storage backend). With this feature users don't need any external messaging tools and can enable global db-cache invalidation on mutation easily. The downside is most likely performance, because usually those storage backends are not the best tools for messaging.
- Using external tool for messaging (let's say Kafka, Redis, etc.). The advantage would be performance most likely but the disadvantage is that the user now needs to manage a separate external system.
- Providing an interface for mutated data invalidation. We can make a general interface which accepts a set of keys which need to be evicted in db-cache and then the implementation can be either developed by the user themselves or they can use existing JanusGraph solutions (let.s say we will have 3 options at the beggining: `storage-messaging`, `redis-messaging`, `kafka-messaging` and we will be able to add more systems if there is interest in them).

That's just a brainstorming, so if anyone has any thoughts about it. please share. Maybe this issue should be solved differently and maybe I should look at it from a different angle.

Performance

For some reason we didn't look at this side too much but as noted here and here the Guava cache we use isn't the best option. I didn't investigate what is the best option for those caches we have (and we have 8 caches as commented here).
As an obvious solution is to move from Guava to Caffeine cache. That said, if anyone thinks we need to try another cache or have any thoughts about it, please post them here.

Sharding

As for now db-cache caches all the data per JanusGraph instance. No any cache data is shared between multiple JanusGraph instances. In some use-cases this is an advantage but in some situations it's a disadvantage.
I think that it would makes sense to add several options for different db-cache implementation. Some implementations would be local only and some distributed against all participating JanusGraph nodes.
The are several implementations I could think of:
1) Default local only db-cache. This cache would use Caffeine cache implementation and there could be some invalidation strategies available to trigger invalidation on all JanusGraph nodes using some messaging tools (as described in `Invalidation` section).
2) Redis cache - this cache would use external Redis nodes to cache all the data. The advantage is that Redis may be shared and scaled separately. All the data in Redis will be distributed (depending on installation) which may improve cache usage in some cases. Maybe also using client-side caching could improve performance in some cases.
3) Hazelcast cache - this cache would require additional configuration and making sure that all JanusGraph nodes can reach other JanusGraph nodes. Hazelcast cache can be bound to JVM but being able to form a cluster on nodes for the cache and shard / replicate all the data between all the nodes in the cluster. With this cache users won't need to manage external cache nodes but users will need to make sure that their JanusGraph installation has direct network access to all nodes in the cluster.

Any thoughts about any of the above points (or maybe missing points / issues) are very welcome.

Best regards,
Oleksandr


Boxuan Li
 

Thanks for starting this thread! Just want to mention there is an open PR for Redis cache integration: https://github.com/JanusGraph/janusgraph/pull/3100