Query performance with range


Claudio Fumagalli
 

Hi,

I have a performance issue extracting the nodes attached to a node with pagination.
I have a simple graph with CompositeIndex on property name (Please find schema definition in attachments).
The graph has 3 genre nodes:
  • "action" node has 20K attached movies with value="a" and 20K attached movies with value="b".
  • "drama" node has 10K attached movies with value="b"
  • "comedy" node has 10K attached movies with value="c"
Genre nodes have id and name properties; movie nodes have id, name and value properties.

Our goal is, given a genre, to extract the 20K movies attached to "action" that have value="a". This should be done iteratively, limiting the chunk of data extracted at each execution (i.e. we paginate the query using range()).

I'm using JanusGraph 0.6.0 with Cassandra 3.11.6. Please find attached the docker-compose file I've used to create the JanusGraph+Cassandra environment, and also the JanusGraph and Gremlin configurations.

This is the query that we use to extract a page:
g.V().has("name", "action").to(Direction.IN, "has_genre").has("value", "a").range(skip, limit).valueMap("id").next();
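For completeness, the loop driving this query looks roughly like the following (a Gremlin Console sketch; pageSize and the per-page processing are simplified placeholders, not our real code):

```groovy
// Sketch of the pagination loop (Gremlin Console / Groovy).
pageSize = 100
skip = 0
while (true) {
    page = g.V().has("name", "action")
            .to(Direction.IN, "has_genre")
            .has("value", "a")
            .range(skip, skip + pageSize)
            .valueMap("id")
            .toList()
    if (page.isEmpty()) break
    // process the page here
    skip += pageSize
}
```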

Here are the results of the extraction with different page sizes:
  • page size 100 read 200 pages in 591453 ms - average elapsed per page 2957.265 ms - min 284 ms - max 11618 ms
  • page size 1000 read 20 pages in 62293 ms - average elapsed per page 3114.65 ms - min 632 ms - max 9712 ms
This is the profile of the query for the last chunk:

gremlin> g.V().has("name", "action").to(Direction.IN, "has_genre").has("value", "a").range(19900, 20000).valueMap("id").profile();
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[name.eq(action)])                                   1           1           0.904     0.01
  constructGraphCentricQuery                                                                   0.169
  GraphCentricQuery                                                                            0.591
    \_condition=(name = action)
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[1]
    \_index=ByName
    backend-query                                                      1                       0.460
    \_query=ByName:multiKSQ[1]
JanusGraphMultiQueryStep                                               1           1           0.059     0.00
JanusGraphVertexStep(IN,[has_genre],vertex)                        40000       40000         240.729     2.31
    \_condition=type[has_genre]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=has_genre:SliceQuery[0x71E1,0x71E2)
    \_multi=true
    \_vertices=1
  optimization                                                                                 0.019
  backend-query                                                    40000                      86.565
    \_query=has_genre:SliceQuery[0x71E1,0x71E2)
HasStep([value.eq(a)])                                             20000       20000       10157.815    97.51
RangeGlobalStep(19900,20000)                                         100         100          15.166     0.15
PropertyMapStep([id],value)                                          100         100           2.791     0.03
                                            >TOTAL                     -           -       10417.467        -
 

It seems that the condition has("value", "a") is evaluated by reading each of the attached nodes one by one and then applying the filter. Is this the expected behaviour and performance? Is there any possible optimization in the interaction between JanusGraph and Cassandra (for example, reading the attached nodes in bulk)?

We have verified that activating the db-cache (cache.db-cache=true) has a huge impact on performance, but this is not easily applicable in our real scenario: we have multiple JanusGraph nodes (to support the scaling of the system), and with the cache active we risk reading stale data (the data is updated frequently and changes must be read by other services in our processing pipeline).
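For reference, these are the cache settings we experimented with (a sketch of the relevant janusgraph.properties lines; the numeric values are examples, not what we used):

```
# Enable the database-level cache (shared across transactions
# within a single JanusGraph instance)
cache.db-cache = true
# Maximum age of a cached entry in ms (example value);
# 0 means entries never expire
cache.db-cache-time = 10000
# Fraction of the heap given to the cache (example value)
cache.db-cache-size = 0.25
```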

Thank you


hadoopmarc@...
 

Hi Claudio,

Paging with range() can only work with a vertex-centric index; otherwise the vertex table is scanned for every page. If you just want all results, the alternative is to drop the range() step and iterate over the query result.
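To illustrate both options, a console sketch (with the caveat that a vertex-centric index can only cover "value" if it is stored on the has_genre edge rather than on the movie vertex, so option 1 implies a schema change; the index name is made up):

```groovy
// Option 1: vertex-centric index on the has_genre edge,
// assuming "value" becomes an EDGE property.
mgmt = graph.openManagement()
value = mgmt.getPropertyKey("value")
hasGenre = mgmt.getEdgeLabel("has_genre")
mgmt.buildEdgeIndex(hasGenre, "hasGenreByValue", Direction.BOTH, value)
mgmt.commit()

// With the edge property indexed, the filter is answered from the index
// instead of fetching every movie vertex:
g.V().has("name", "action").inE("has_genre").has("value", "a")
     .range(skip, skip + pageSize).outV().valueMap("id")

// Option 2: forget range() and iterate the single query result once,
// batching client-side as needed:
it = g.V().has("name", "action").to(Direction.IN, "has_genre")
      .has("value", "a").valueMap("id")
while (it.hasNext()) {
    row = it.next()
    // process one result here
}
```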

Marc