Hi JanusGraph team,
We would like to contribute an improved implementation of the in-memory backend for JanusGraph.
Background/Rationale
There are many possible applications for both embedded and standalone JanusGraph with a 100% in-memory backend, i.e. cases where the graph can potentially fit within the boundaries of one JVM, is built dynamically and in parallel transactions, and where performance during both build-out and querying of the graph is critical. A quick search of the issues and mailing list seems to confirm that there is general interest in this kind of use case, both for JanusGraph specifically and for an open-source in-memory graph in general (e.g. TinkerGraph may not fit the bill in all cases, as it is not transactional, etc.).
However, there seems to be a bit of a gap in JanusGraph’s offering in this space. The current implementation of JanusGraph’s in-memory backend has not changed since Titan times, and is still declared as “for testing only” and “not ready for production use”.
Contribution
Some time ago we did an analysis which showed that the main obstacle to production use of the current in-memory backend is the enormous overhead when storing key values (likely a trade-off in favour of simplicity of implementation, for something that was only intended for testing purposes).
Quite often the total overhead of wrapping data structures and references is bigger than the actual data being stored.
After a series of improvements, this overhead was significantly reduced (by 2x-5x, depending on the “shape” of the data stored and the size of individual data entries).
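To illustrate how wrapper overhead can exceed the data itself, here is a rough back-of-the-envelope sketch. It does not reproduce the actual data structures of either the current or the improved backend; both layouts are hypothetical, and the size constants are assumptions for a 64-bit HotSpot JVM with compressed references (12-byte object header, 16-byte array header, 4-byte reference, 8-byte alignment).

```java
import java.util.Arrays;

/** Rough heap-footprint estimate; every constant and layout here is an assumption for illustration. */
final class OverheadEstimate {
    static final int OBJECT_HEADER = 12; // assumed object header size, in bytes
    static final int ARRAY_HEADER = 16;  // assumed array header size (header + length field)
    static final int REFERENCE = 4;      // assumed compressed reference size
    static final int ALIGNMENT = 8;      // assumed object alignment

    /** Rounds a size up to the next multiple of the assumed alignment. */
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Hypothetical "wrapped" layout: one wrapper object and one byte[] per entry. */
    static long wrappedBytes(int[] payloadSizes) {
        long total = 0;
        for (int size : payloadSizes) {
            total += align(OBJECT_HEADER + REFERENCE); // wrapper holding a reference to its byte[]
            total += align(ARRAY_HEADER + size);       // the per-entry byte[] itself
            total += REFERENCE;                        // reference slot in the containing structure
        }
        return total;
    }

    /** Hypothetical "packed" layout: all payloads in one byte[], indexed by an int[] of offsets. */
    static long packedBytes(int[] payloadSizes) {
        long payload = 0;
        for (int size : payloadSizes) {
            payload += size;
        }
        return align(ARRAY_HEADER + payload) + align(ARRAY_HEADER + 4L * payloadSizes.length);
    }

    public static void main(String[] args) {
        int[] sizes = new int[1000];
        Arrays.fill(sizes, 16); // 1000 entries of 16 bytes each = 16000 bytes of raw data
        System.out.println("wrapped: " + wrappedBytes(sizes) + " bytes");
        System.out.println("packed:  " + packedBytes(sizes) + " bytes");
    }
}
```

Under these assumptions, 1000 entries of 16 bytes each (16,000 bytes of raw data) come to an estimated 52,000 bytes in the wrapped layout and 20,032 bytes in the packed one, so the wrapping alone costs more than twice the data it holds. Real figures depend on the JVM, its settings and the actual structures used, which is what the memory profiling is for.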
The modified version of the in-memory backend is used successfully in production, handling up to 70+ concurrent read/write transactions at any one moment, 10+ million vertices (of which quite a few have up to 60 properties) and about 3x more edges, within one JVM.
A more detailed analysis, with memory profiling of the current vs. improved implementation, is attached to the corresponding GitHub issue:
The initial PR is here:
NOTE: the PR currently proposes adding a new backend rather than modifying the existing one, as it seemed cleaner and made it easier to compare performance within one codebase branch.
However, having read the recent discussions about adding more backends to the main repository vs. maintenance costs, I am starting to think that modifying the existing in-memory backend could actually be a better idea.
This is because:
a) the modifications are fairly straightforward and do not change the general structure of the existing backend
b) they don’t add too much extra code or any external dependencies
c) they don’t require any additional documentation or setup instructions, at least initially
d) all existing (and a few extra) integrity tests pass, and using it in place of the current version is unlikely to create any issues for current (test-only) uses
e) the same backend will simply become fit for production use, with no new backends to maintain
Any thoughts on the contribution as a whole, and on adding a new backend vs. updating the existing one, would be much appreciated.
Many thanks,
Dmitry