I have a question about distributed query execution. I have JanusGraph set up with Cassandra as the distributed storage backend, and I am worried about the performance of complex queries.
1. Where does my query computation happen? Is it in the JanusGraph Gremlin server or on the distributed storage?
2. If execution happens on the Gremlin Server, do all the computations run on a single Gremlin Server instance, or is the query distributed?
3. What is the maximum number of vertices and edges that a traversal on the JanusGraph server can handle, e.g. for counts?
1. Computation happens inside JanusGraph, not in the storage backend, and JanusGraph runs as part of Gremlin Server.
2. Yes, a query and its computations run on a single Gremlin Server instance.
3. There is no hard limit. If you run g.V().count(), which is a full table scan, Gremlin Server will not run out of memory, but with billions of vertices the query will take days. For the kinds of workloads you are worried about, JanusGraph has some initial support for OLAP-type operations, in which all data is loaded by a Spark cluster and the computation results are returned to the JanusGraph instance.
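To put rough numbers on "will take days", here is a back-of-the-envelope sketch in Python. The scan rate of 20,000 vertices per second is purely an assumed figure for illustration, not a measured JanusGraph number, and the helper name full_scan_days is made up for this example. Measure the real rate on your own cluster and substitute it.

```python
SECONDS_PER_DAY = 86_400


def full_scan_days(vertex_count: int, vertices_per_second: float) -> float:
    """Estimate wall-clock days for a full scan running on a single server instance.

    vertices_per_second is an assumption supplied by the caller,
    not a value reported by JanusGraph.
    """
    if vertices_per_second <= 0:
        raise ValueError("vertices_per_second must be positive")
    return vertex_count / vertices_per_second / SECONDS_PER_DAY


# Hypothetical scan rate; replace with a rate measured on your own setup.
ASSUMED_RATE = 20_000

for n in (1_000_000, 1_000_000_000, 5_000_000_000):
    days = full_scan_days(n, ASSUMED_RATE)
    print(f"{n:>13,} vertices: about {days:.3f} days")
```

Because the whole scan runs on one instance, the estimate grows linearly with the vertex count, which is why the Spark-based OLAP route is suggested for graph-wide computations.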