Count Query Optimization


Vinayak Bali
 

Hi All,

The schema has vertices of type A and B and edges of type E, along with some other vertex and edge types. The counts per type are:
A: 183468
B: 437317
E: 186513

Query:  g.V().has('property1', 'A').as('v1').outE().has('property1','E').as('e').inV().has('property1', 'B').as('v2').select('v1','e','v2').dedup().count()
Output: 200166
Time Taken: 1min

Query: g.V().has('property1', 'A').aggregate('v').outE().has('property1','E').aggregate('e').inV().has('property1', 'B').aggregate('v').select('v').dedup().as('vetexCount').select('e').dedup().as('edgeCount').select('vetexCount','edgeCount').by(unfold().count())
Output: ==>[vetexCount:383633,edgeCount:200166]
Time: 3.5 mins
property1 is indexed.
How can I optimize these queries? Taking minutes for a count query is not acceptable. Please suggest different approaches.

Thanks & Regards,
Vinayak


AMIYA KUMAR SAHOO
 

Hi Vinayak,

For query 1:

What is the degree centrality of the vertices with property1 = A, and what percentage of their out-edges have property1 = E? If that percentage is small, a vertex-centric index (VCI) will help speed up this traversal.

You can also try the query below, though I am not sure it will be faster.

g.V().has('property1', 'A').
    outE().has('property1','E').
    inV().has('property1', 'B').
    dedup().by(path()).
    count()





hadoopmarc@...
 

Hi all,

I also thought about the vertex-centric index first, but I am afraid that the VCI can only help to filter the edges to follow; it does not help in counting the edges. A better way to investigate is to leave out the final inV() step. For example, you can count the number of distinct v2 ids with:
g.V().has('property1', 'A').outE().has('property1','E').id().map{it.get().getInVertexId()}.dedup().count()

Note that the id() of an edge is a RelationIdentifier object that contains the edge id, the inVertexId and the outVertexId. This should reduce the number of storage backend calls.
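
The idea can be illustrated with a toy model in plain Python (not JanusGraph code): if every edge identifier already carries both endpoint ids, the distinct endpoints can be counted from the identifiers alone, without loading any vertex. The `EdgeId` tuple and the sample data are hypothetical stand-ins for RelationIdentifier.

```python
from typing import NamedTuple


class EdgeId(NamedTuple):
    """Hypothetical stand-in for an edge identifier that carries both endpoint ids."""
    relation_id: int
    out_vertex_id: int  # source vertex of the edge
    in_vertex_id: int   # target vertex of the edge


def count_distinct_targets(edge_ids):
    """Count distinct target vertices using only the ids stored in the edge identifiers."""
    return len({e.in_vertex_id for e in edge_ids})


# Toy data: three edges, two of which point at the same target vertex (id 20).
edge_ids = [EdgeId(1, 10, 20), EdgeId(2, 11, 20), EdgeId(3, 10, 21)]
distinct_targets = count_distinct_targets(edge_ids)
print(distinct_targets)
```

Because only ids are inspected, this approach cannot apply a filter on a property of the target vertex.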

Best wishes,    Marc


AMIYA KUMAR SAHOO
 

Hi Marc,

Vinayak's query has a filter on the inV property (property1 = B), hence I did not stop at the edge itself.

If this kind of query is frequent, it may make sense to keep a duplicate of that value on both the vertex and the edge. That would eliminate the traversal to the target vertex (inV).

Regards,
Amiya


Boxuan Li
 

Apart from rewriting the query, there are some config options (https://docs.janusgraph.org/basics/configuration-reference/#query) worth trying:

1) Turn on query.batch
2) Turn off query.fast-property
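
As a sketch, the two options could be set in the JanusGraph properties file as shown below. The option names come from the configuration reference linked above; which file holds them depends on the deployment, and the values are the ones suggested here, not general recommendations.

```properties
# Batch traversal queries when they are executed against the storage backend
query.batch=true
# Do not pre-fetch all properties on the first property access of a vertex
query.fast-property=false
```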


Vinayak Bali
 

Hi All, 

Boxuan Li's suggestion to change the configuration worked for the following query:
g.V().has('property1', 'A').as('v1').outE().has('property1','E').as('e').inV().has('property1', 'B').as('v2').select('v1','e','v2').dedup().count()

But not for the following query:
g.V().has('property1', 'A').aggregate('v').outE().has('property1','E').aggregate('e').inV().has('property1', 'B').aggregate('v').select('v').dedup().as('vetexCount').select('e').dedup().as('edgeCount').select('vetexCount','edgeCount').by(unfold().count())

I need an optimized query that returns both the vertex count and the edge count. I would appreciate your feedback and help in achieving this.

Thanks & Regards,
Vinayak




hadoopmarc@...
 

Hi Vinayak,

Referring to your last post, what happens if you use aggregate(local, 'v') and aggregate(local, 'e')? The local modifier makes the aggregate() step lazy, which hopefully gives JanusGraph more opportunity to batch the storage backend requests.
https://tinkerpop.apache.org/docs/current/reference/#store-step

Best wishes,    Marc


Vinayak Bali
 

Hi Marc,

With local, the query returns a result after each count instead of a single final result. For example:

==>[vetexCount:184439,edgeCount:972]
==>[vetexCount:184440,edgeCount:973]
==>[vetexCount:184441,edgeCount:974]
==>[vetexCount:184442,edgeCount:975]
==>[vetexCount:184443,edgeCount:976]
==>[vetexCount:184444,edgeCount:977]
==>[vetexCount:184445,edgeCount:978]
==>[vetexCount:184446,edgeCount:979]
==>[vetexCount:184447,edgeCount:980]
==>[vetexCount:184448,edgeCount:981]
==>[vetexCount:184449,edgeCount:982]
==>[vetexCount:184450,edgeCount:983]
==>[vetexCount:184451,edgeCount:984]
==>[vetexCount:184452,edgeCount:985]
==>[vetexCount:184453,edgeCount:986]
==>[vetexCount:184454,edgeCount:987]
==>[vetexCount:184455,edgeCount:988]
==>[vetexCount:184456,edgeCount:989]
==>[vetexCount:184457,edgeCount:990]
==>[vetexCount:184458,edgeCount:991]
==>[vetexCount:184459,edgeCount:992]
==>[vetexCount:184460,edgeCount:993]
==>[vetexCount:184461,edgeCount:994]
==>[vetexCount:184462,edgeCount:995]
==>[vetexCount:184463,edgeCount:996]
==>[vetexCount:184464,edgeCount:997]
==>[vetexCount:184465,edgeCount:998]
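
A likely explanation for these rows, sketched as a toy Python model (not Gremlin, and not verified against JanusGraph): with an eager aggregate every traverser sees the same complete collection, so the later dedup() collapses them into one result; with a lazy aggregate each traverser sees a collection that has grown by one element, so every snapshot is distinct and survives dedup(). The function name and sample data are hypothetical.

```python
def distinct_snapshot_sizes(edges, lazy):
    """Model select('e').dedup() after an eager or a lazy aggregate('e').

    Returns the size of each distinct collection seen by downstream traversers.
    """
    collected = []
    seen = []
    if not lazy:
        # Eager: a barrier fills the whole collection before anything flows on.
        collected.extend(edges)
    for edge in edges:
        if lazy:
            # Lazy: the collection grows as each traverser passes through.
            collected.append(edge)
        snapshot = tuple(collected)  # what a select on the side effect sees now
        if snapshot not in seen:     # dedup on the selected collection
            seen.append(snapshot)
    return [len(s) for s in seen]


edges = ["e1", "e2", "e3"]
eager_result = distinct_snapshot_sizes(edges, lazy=False)
lazy_result = distinct_snapshot_sizes(edges, lazy=True)
print(eager_result, lazy_result)
```

In this model the eager run yields a single count, while the lazy run yields one increasing count per traverser, which is the same shape as the output above.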

Please suggest any other approach too; I really need this working.

Thanks & Regards,
Vinayak



Nicolas Trangosi <nicolas.trangosi@...>
 

Hi,
You may try denormalization: also store property1 of the inV on the edge (as inVproperty1 in the query below).
Once the edges are updated, the following query should work:

g.V().has('property1', 'A').aggregate('v').outE().has('property1','E').has('inVproperty1', 'B').aggregate('e').inV().aggregate('v').select('v').dedup().as('vetexCount').select('e').dedup().as('edgeCount').select('vetexCount','edgeCount').by(unfold().count())




hadoopmarc@...
 

Hi Vinayak,

Another attempt, this one is very similar to the one that works.

gremlin> graph = JanusGraphFactory.open('conf/janusgraph-inmemory.properties')
==>standardjanusgraph[inmemory:[127.0.0.1]]
gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[inmemory:[127.0.0.1]], standard]
gremlin> GraphOfTheGodsFactory.loadWithoutMixedIndex(graph,true)
==>null

gremlin> g.V().as('v1').outE().as('e').inV().as('v2').union(select('v1'), select('v2')).dedup().count()
16:12:39 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>12

gremlin> g.V().as('v1').outE().as('e').inV().as('v2').select('e').dedup().count()
16:15:30 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>17

gremlin> g.V().as('v1').outE().as('e').inV().as('v2').union(
......1>     union(select('v1'), select('v2')).dedup().count(),
......2>     select('e').dedup().count().as('ecount')
......3>     )
16:27:42 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>12
==>17
   
Best wishes,    Marc


AMIYA KUMAR SAHOO
 

Hi Vinayak,

Maybe try the query below.

g.V().has('property1', 'A').
   outE().has('property1','E').
       where(inV().has('property1', 'B')). fold().
   project('edgeCount', 'vertexCount').
            by(count(local)).
            by(unfold().bothV().dedup().count())    // I do not think dedup is required for your use case, can try both with and without dedup

Regards, Amiya


Vinayak Bali
 

Hi Amiya,

With dedup:
g.V().has('property1', 'A').
   outE().has('property1','E').
       where(inV().has('property1', 'B')). fold().
   project('edgeCount', 'vertexCount').
            by(count(local)).
            by(unfold().bothV().dedup().count())
Output: ==>[edgeCount:200166,vertexCount:34693]

Without dedup:
g.V().has('property1', 'A').
   outE().has('property1','E').
       where(inV().has('property1', 'B')). fold().
   project('edgeCount', 'vertexCount').
            by(count(local)).
            by(unfold().bothV().count())
Output: ==>[edgeCount:200166,vertexCount:400332]

Both queries take approximately 3 seconds to run.

Query: g.V().has('property1', 'A').aggregate('v').outE().has('property1','E').aggregate('e').inV().has('property1', 'B').aggregate('v').select('v').dedup().as('vetexCount').select('e').dedup().as('edgeCount').select('vetexCount','edgeCount').by(unfold().count())
Output: ==>[vetexCount:383633,edgeCount:200166]
Time: 3.5 mins

The edge count is the same for all the queries, but the vertex counts differ. Which one is the right vertex count?

Thanks & Regards,
Vinayak




AMIYA KUMAR SAHOO
 

Hi Vinayak,

The correct vertex count is 34693 unique (400332 non-unique).

In your second query, g.V().has('property1', 'A').aggregate('v') may be including every vertex with property1 = A in the count because of eager evaluation, regardless of whether the vertex has an outE with property1 = E.
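
The difference between the two ways of counting can be sketched with a toy model in plain Python (hypothetical data, not the real graph): collecting every A vertex up front counts A vertices that have no matching edge, whereas counting the endpoints of the matching edges does not.

```python
# Toy graph: vertex id -> value of property1 (hypothetical data).
vertex_type = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
# Edges of type E as (source, target) pairs; vertex 3 has no outgoing E edge.
edges = [(1, 4), (1, 5), (2, 4)]


def matching_edges():
    """Edges of type E that go from an A vertex to a B vertex."""
    return [(s, t) for s, t in edges
            if vertex_type[s] == "A" and vertex_type[t] == "B"]


def aggregate_style_vertex_count():
    """Collect all A vertices first, then add the B vertices that were reached."""
    collected = [v for v, kind in vertex_type.items() if kind == "A"]
    collected += [t for _, t in matching_edges()]
    return len(set(collected))


def both_v_counts():
    """Endpoints of the matching edges, without and with deduplication."""
    endpoints = [v for edge in matching_edges() for v in edge]
    return len(endpoints), len(set(endpoints))


agg_count = aggregate_style_vertex_count()
non_unique, unique = both_v_counts()
print(agg_count, non_unique, unique)
```

In this model the non-deduplicated count is always twice the edge count, which matches the reported figures (2 x 200166 = 400332).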

Regards,
Amiya


Vinayak Bali
 

Amiya - I need to check the data; there is some mismatch in the counts.

Suppose the count has to cover more than one relation. How can we modify the query?

For example:
 
A->E->B query is as follows:
g.V().has('property1', 'A').
   outE().has('property1','E').
       where(inV().has('property1', 'B')). fold().
   project('edgeCount', 'vertexCount').
            by(count(local)).
            by(unfold().bothV().dedup().count())

A->E->B->E1->C->E2->D

What changes should be made to the query for this longer pattern?

Thanks





AMIYA KUMAR SAHOO
 

Hi Vinayak,

Try the query below. If it works for you, you can add E2 and D similarly.

g.V().has('property1', 'A').
   outE().has('property1', 'E').as('e').
   inV().has('property1', 'B').
   outE().has('property1', 'E1').as('e').
       where(inV().has('property1', 'C')).
   select(all, 'e').unfold().dedup().fold().    // select(all, 'e') yields a list of edges per path, so flatten and dedup before folding
   project('edgeCount', 'vertexCount').
            by(count(local)).
            by(unfold().bothV().dedup().count())
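
The intended result for a multi-hop pattern can be modelled in plain Python (a toy sketch with hypothetical data and helper names, not Gremlin): walk every complete path, gather the edges from all paths, and count distinct edges and distinct endpoints. An edge shared by several paths is counted once.

```python
# Hypothetical toy data: vertex id -> value of property1.
vertex_type = {1: "A", 2: "B", 3: "C", 4: "C", 5: "B"}
# Edges as (edge id, label, source, target).
edges = [
    ("e1", "E", 1, 2),
    ("e2", "E1", 2, 3),
    ("e3", "E1", 2, 4),
    ("e4", "E", 1, 5),  # vertex 5 has no outgoing E1 edge, so this path is incomplete
]


def out_edges(vertex, label):
    """Outgoing edges of a vertex with the given label."""
    return [e for e in edges if e[2] == vertex and e[1] == label]


def complete_paths(start_type, hops):
    """Return every complete path as a list of edges.

    hops is a list of (edge label, target vertex type) pairs.
    """
    # Each partial path is (current vertex, edges walked so far).
    frontier = [(v, []) for v, kind in vertex_type.items() if kind == start_type]
    for label, target_type in hops:
        next_frontier = []
        for vertex, walked in frontier:
            for edge in out_edges(vertex, label):
                if vertex_type[edge[3]] == target_type:
                    next_frontier.append((edge[3], walked + [edge]))
        frontier = next_frontier
    return [walked for _, walked in frontier]


def count_pattern(start_type, hops):
    """Count distinct edges and distinct vertices on all complete paths."""
    unique_edges = {e for path in complete_paths(start_type, hops) for e in path}
    unique_vertices = {v for e in unique_edges for v in (e[2], e[3])}
    return len(unique_edges), len(unique_vertices)


hops = [("E", "B"), ("E1", "C")]
edge_count, vertex_count = count_pattern("A", hops)
print(edge_count, vertex_count)
```

A longer pattern such as A->E->B->E1->C->E2->D corresponds to appending one more pair, ("E2", "D"), to the hops list.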

Regards,
Amiya



Vinayak Bali
 

Hi All,

Adding these properties to the configuration file affects edge traversal; retrieving a single edge now takes 7 minutes:
1) Turn on query.batch
2) Turn off query.fast-property
The count query is faster, but edge traversal becomes more expensive.
Is there any other way to improve count performance without affecting other queries?

Thanks & Regards,
Vinayak



Boxuan Li
 

Have you tried keeping query.batch = true AND query.fast-property = true?

Regards,
Boxuan









Vinayak Bali
 

Hi All, 

Setting query.batch = true AND query.fast-property = true does not work; I am facing the same problem. Is there any other way?

Thanks & Regards,
Vinayak
