Query performance with range
I have a performance issue when extracting, with pagination, the nodes attached to a given node.
I have a simple graph with CompositeIndex on property name (Please find schema definition in attachments).
The graph has 3 genre nodes:
Our goal is, given a genre, to extract the 20K movies attached to "action" that have value="a". This should be done iteratively, limiting the chunk of data extracted at each execution (i.e. we paginate the query using range).
I'm using JanusGraph 0.6.0 with Cassandra 3.11.6. Please find attached the Docker Compose file I used to create the JanusGraph + Cassandra environment, as well as the JanusGraph and Gremlin configurations.
This is the query that we use to extract a page:
g.V().has("name", "action").to(Direction.IN, "has_genre").has("value", "a").range(skip, limit).valueMap("id").next();
Here are the results of the extraction with different page sizes:
This is the profile of the query for the last chunk:
gremlin> g.V().has("name", "action").to(Direction.IN, "has_genre").has("value", "a").range(19900, 20000).valueMap("id").profile();
Step                                            Count  Traversers   Time (ms)   % Dur
JanusGraphStep(,[name.eq(action)])                  1           1       0.904    0.01
  \_condition=(name = action)
  backend-query                                     1                   0.460
JanusGraphMultiQueryStep                            1           1       0.059    0.00
JanusGraphVertexStep(IN,[has_genre],vertex)     40000       40000     240.729    2.31
  backend-query                                 40000                  86.565
HasStep([value.eq(a)])                          20000       20000   10157.815   97.51
RangeGlobalStep(19900,20000)                      100         100      15.166    0.15
PropertyMapStep([id],value)                       100         100       2.791    0.03
>TOTAL                                              -           -   10417.467       -
It seems that the condition has("value", "a") is evaluated by reading each of the attached nodes one by one and then applying the filter. Is this the expected behaviour and performance? Is there any possible optimization in the interaction between JanusGraph and Cassandra (for example, reading the attached nodes in bulk)?
We have verified that activating the db-cache (cache.db-cache=true) has a huge impact on performance, but this is not easily applicable to our real scenario: we run multiple JanusGraph nodes (to support scaling of the system), and with the cache active we risk reading stale data (the data is updated frequently, and changes must be visible to other services in our processing pipeline).
Paging with range can only work with a vertex-centric index; otherwise the vertex table is scanned for every page. If you just want all results, the alternative is to drop the range() step and iterate over the query result.
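To illustrate the difference between the two approaches, here is a minimal sketch in plain Python. It is only a model, not JanusGraph or TinkerPop code: the names CountingSource, page_with_range and page_by_iterating are hypothetical, and the "reads" counter simply stands in for vertices fetched from the backend. It assumes that, without a suitable index, every range(skip, skip + size) query has to walk the result from the start.

```python
from itertools import islice


class CountingSource:
    """Hypothetical stand-in for the traversal result; counts elements read."""

    def __init__(self, n):
        self.n = n
        self.reads = 0

    def traverse(self):
        # Each yielded element models one vertex fetched and filtered.
        for i in range(self.n):
            self.reads += 1
            yield i


def page_with_range(source, page_size):
    """Model of range() paging: a fresh traversal is started for every page."""
    pages, skip = [], 0
    while True:
        # Skipped elements are still read before the page itself is produced.
        page = list(islice(source.traverse(), skip, skip + page_size))
        if not page:
            break
        pages.append(page)
        skip += page_size
    return pages


def page_by_iterating(source, page_size):
    """Model of iterating one traversal and cutting it into chunks."""
    it = source.traverse()
    pages = []
    while True:
        page = list(islice(it, page_size))
        if not page:
            break
        pages.append(page)
    return pages


if __name__ == "__main__":
    for fn in (page_with_range, page_by_iterating):
        src = CountingSource(1000)
        pages = fn(src, 100)
        print(fn.__name__, "pages:", len(pages), "reads:", src.reads)
```

In this model the total work for range() paging grows roughly with the square of the number of pages, while the single-pass version reads each element once. In the Gremlin console, a comparable pattern would be to keep a reference to the traversal and repeatedly call next(n) on it, provided the client can hold the traversal open for the duration of the extraction.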