Count Query Optimisation


Vinayak Bali
 

Hi All, 

The Data Model of the graph is as follows:

Nodes:

Label: Node1, count: 130K
Label: Node2, count: 183K
Label: Node3, count: 437K
Label: Node4, count: 156

Relations:

Node1 to Node2, Label: Edge1, count: 9K
Node2 to Node3, Label: Edge2, count: 200K
Node2 to Node4, Label: Edge3, count: 71K
Node4 to Node3, Label: Edge4, count: 15K
Node4 to Node1, Label: Edge5, count: 1K

The count query used to get the vertex and edge counts:

g2.V().has('title', 'Node2').aggregate('v').outE().has('title','Edge2').aggregate('e').inV().has('title', 'Node3').aggregate('v').select('v').dedup().as('vertexCount').select('e').dedup().as('edgeCount').select('vertexCount','edgeCount').by(unfold().count())

This query takes around 3.5 minutes to execute and returns the following output:
[{"vertexCount":383633,"edgeCount":200166}]

The problem is that traversing the edges takes much longer than counting the vertices:
g.V().has('title','Node3').dedup().count() takes 3 seconds to count 437K nodes.
g.E().has('title','Edge2').dedup().count() takes 1 minute to count 200K edges.

In some cases, subsequent calls are faster because of caching.
I also considered the in-memory backend, but the data is large and I don't think that will work. Is there any way to cache the result the first time the query is executed? Or is there any approach to load the graph from the CQL backend into memory to improve performance?

Please help me improve the performance; a count query should not take this long.

JanusGraph: 0.5.2
Storage: Cassandra (CQL)
The server is well specified, so hardware is not the issue.

Thanks & Regards,
Vinayak


hadoopmarc@...
 

Hi Vinayak,

For other readers, see also this other recent thread.

A couple of remarks:
  • In the separate edge count, does it make any difference if you select the edges by label rather than by property, i.e. g.E().hasLabel('Edge2').dedup().count()? You can see in the JanusGraph data model that the edge label is somewhat easier to access than its properties.
  • If you use an indexing backend, it is also possible to do some simple counts against the index, but this will not help you out for your original query.
  • You also asked about using Spark. Most of the time, OLAP performance is (still) disappointing. But if you need more details you will have to show what you have tried and what problems you encountered.
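
The first two remarks could be tried in the Gremlin console roughly as follows. This is only a sketch under assumptions: it presumes that 'Edge2' is really the edge label (the original queries match it via the 'title' property), and the index name 'edgesByTitle' is hypothetical; the last line only works if a mixed index on the edge property 'title' exists in an indexing backend.

```groovy
// Gremlin console sketch; assumes 'Edge2' is the actual edge label.
// g.E() emits every edge exactly once, so dedup() is left out here.
g.E().hasLabel('Edge2').count()

// The same traversal with profile() appended, to see which step
// accounts for most of the execution time.
g.E().hasLabel('Edge2').count().profile()

// Direct index query that only asks the indexing backend for the number of hits.
// 'edgesByTitle' is a hypothetical mixed index covering the edge property 'title'.
graph.indexQuery('edgesByTitle', 'e.title:Edge2').edgeTotals()
```

Comparing the profile() output of the label-based and the property-based variant should show whether the property lookup is what dominates the one-minute edge count.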

Best wishes,
Marc