Degree-Centrality Filtering & Search – Scalable Strategies for OLTP


"zb...@gmail.com" <zblu...@...>
 

Hello all,

I'm curious about best practices for scalable degree-centrality search filters on large JanusGraph graphs (millions to billions of nodes), e.g. something like:

g.V()
   .has("someProperty", eq("someValue"))
   .where(outE().count().is(gt(10)));

Suppose the has-step still matches a large number of vertices (hundreds of thousands). Performing that kind of count on that many vertices results in timeouts and inefficiencies (at least in my experience). My workaround has been to pre-calculate centrality in another job and write it to a vertex property that can then be included in a mixed index, so we can do:

g.V()
   .has("someProperty", eq("someValue"))
   .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems more like a workaround than a great solution. I was hoping there was a better approach/strategy. Please let me know.
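For reference, a minimal sketch of the schema side of this workaround in the Gremlin Console. It assumes an open JanusGraph instance bound to `graph` and an already existing mixed index; the index name `vertexSearch` is hypothetical and not taken from the thread.

```groovy
// Sketch only: 'vertexSearch' is a hypothetical name for an existing mixed index.
mgmt = graph.openManagement()

// Integer property that the pipeline keeps up to date (default cardinality is SINGLE).
outDegree = mgmt.makePropertyKey('outDegree').dataType(Integer.class).make()

// Add the new key to the existing mixed index so that
// has("outDegree", gt(10)) can be answered by the index backend.
index = mgmt.getGraphIndex('vertexSearch')
mgmt.addIndexKey(index, outDegree)

mgmt.commit()
```

Because the key is created in the same management transaction in which it is added to the index, it should be usable without a reindex; a key that already holds data would need one.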

Thank you,

Zach


BO XUAN LI <libo...@...>
 

Hi Zach,

Personally, I think your workaround is the best option. JanusGraph does not store the number of edges as metadata on the vertex (there are pros and cons either way).

Btw, do you have to run a separate job for the centrality calculation? If your application is built on top of JanusGraph, you can probably maintain the “outDegree” property when inserting/deleting edges.
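Maintaining the property at write time could look roughly like the sketch below in the Gremlin Console. It assumes `graph` and `g` are already bound and that the application holds the two vertex ids; the edge label `linkedTo` and the variables `srcId` and `dstId` are placeholders.

```groovy
// Sketch only: 'linkedTo', srcId and dstId are placeholders.
src = g.V(srcId).next()
dst = g.V(dstId).next()

// Add the edge and bump the counter in the same transaction.
src.addEdge('linkedTo', dst)
current = src.property('outDegree').orElse(0)   // 0 if the property is not set yet
src.property('outDegree', current + 1)

graph.tx().commit()
```

Deleting an edge would decrement the counter in the same way. With this read-then-write pattern, concurrent writers touching the same vertex can lose an update, so some form of locking or periodic reconciliation would likely be needed.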

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zblu...@...> wrote:



--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgra...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/385af431-d723-4be6-95cb-43b2954f2e58n%40googlegroups.com.


"zb...@gmail.com" <zblu...@...>
 

Thank you Boxuan,

I was using the term “job” pretty loosely. Your suggestion of doing this within the ingest/deletion process makes sense.

I know there is a lot on the community’s plate right now, but if my solution above is truly the best available today, I wonder whether a JanusGraph feature could tackle this problem more consistently: something like a third index type (in addition to “graph” and “vertex-centric” indices), e.g. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, at minimum, a mechanism for choosing the vertex and edge label combinations for which to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 li...@... wrote:


BO XUAN LI <libo...@...>
 

Hi Zach,

I have some concerns about concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion at https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you could share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zblu...@...> wrote:



HadoopMarc <bi...@...>
 

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored with the vertex. So retrieving all outE() relationIdentifiers along with the vertex and counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()
   .has("someProperty", eq("someValue"))
   .where(outE().id().count().is(gt(10)));

If this does not help, it should be possible to configure/modify JanusGraph so that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

On Wednesday, December 30, 2020 at 04:15:46 UTC+1, li...@... wrote:



BO XUAN LI <libo...@...>
 

Hi Marc,

I think it will be just as slow as the initial query, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed) or only the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization were in place, I would not expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <bi...@...> wrote:



"zb...@gmail.com" <zblu...@...>
 

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when the .id() step is added.

My bigger concern for my use case is that this type of operation scales in a manner that seems roughly linear with sample size, i.e.

g.V().limit(10).where(inE().count().is(gt(6))).profile()      => ~30 ms
g.V().limit(100).where(inE().count().is(gt(6))).profile()     => ~147 ms
g.V().limit(1000).where(inE().count().is(gt(6))).profile()    => ~1284 ms
g.V().limit(10000).where(inE().count().is(gt(6))).profile()   => ~13779 ms
g.V().limit(100000).where(inE().count().is(gt(6))).profile()  => > 120000 ms (timeout)
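A quick back-of-the-envelope check of these figures, written as plain Groovy that can be pasted into the Gremlin Console or any Groovy shell. The numbers are copied from the measurements above; the extrapolation simply assumes the per-vertex cost of the largest completed run carries over.

```groovy
// Timings copied from the measurements above: sample size -> total ms.
timings = [10: 30, 100: 147, 1000: 1284, 10000: 13779]

// Cost per vertex for each run.
timings.each { n, ms -> println "${n} vertices: ${ms / n} ms per vertex" }

// Extrapolate the per-vertex cost of the largest completed run to 100000 vertices.
perVertex = 13779 / 10000
println "estimated total for 100000 vertices: ${perVertex * 100000} ms"
```

Past the smallest sample the cost settles at roughly 1.3 to 1.5 ms per vertex, and the extrapolation lands near 138,000 ms, which is above the 120,000 ms timeout.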

 

This behavior makes sense when I think about it and also when I inspect the profile (example profile of the limit(10) traversal below).

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems like JG needs to perform a sliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index, the query could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be tuned to make this operation more scalable without resorting to an additional indexed property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep(vertex,[])                                             10          10           8.684    28.71
    \_condition=()
    \_orders=[]
    \_limit=10
    \_isFitted=false
    \_isOrdered=true
    \_query=[]
  optimization                                                                                 0.005
  optimization                                                                                 0.001
  scan                                                                                         0.000
    \_query=[]
    \_fullscan=true
    \_condition=VERTEX
TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29
  JanusGraphVertexStep(IN,edge)                                       13          13          21.350
    \_condition=(EDGE AND visibility:normal)
    \_orders=[]
    \_limit=7
    \_isFitted=false
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_vertices=1
    optimization                                                                               0.003
    backend-query                                                      3                       4.434
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       1.291
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.311
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       2.483
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.310
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.313
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.192
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      4                       1.287
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      3                       1.231
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       3.546
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  RangeGlobalStep(0,7)                                                13          13           0.037
  CountGlobalStep                                                     10          10           0.041
  IsStep(gt(6))                                                                                0.022
                                            >TOTAL                     -           -          30.249        -
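One detail worth noting in this profile: the edge step runs with `_limit=7` and is followed by `RangeGlobalStep(0,7)`, which suggests the traversal was already rewritten to stop after seven incoming edges per vertex. The profile also lists ten backend-query entries, one per sampled vertex. Under that reading, an explicit form of what is executed would be the sketch below, and the cost would be dominated by the per-vertex backend query rather than by the number of edges.

```groovy
// Explicit form of what the profile appears to execute: read at most 7
// incoming edges per vertex, then keep the vertex only if all 7 were found
// (which is the same as count > 6).
g.V().limit(10).where(inE().limit(7).count().is(7))
```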


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 li...@... wrote:


BO XUAN LI <libo...@...>
 

Hi Zach,

If you want to run the query in a multi-threaded manner, try enabling “query.batch” (ref: https://docs.janusgraph.org/basics/configuration-reference/#query).

Since you are using Cassandra which does not support batch reading natively, JanusGraph will use a thread pool to fire the backend queries. This should reduce latency of this single query but might impact overall application performance if your application is already handling heavy workloads.
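A minimal sketch of enabling this option when opening the graph from the Gremlin Console. The hostname is a placeholder for the actual Cassandra contact point; in a properties file the equivalent line would be `query.batch=true`.

```groovy
// Sketch only: '127.0.0.1' is a placeholder for the real Cassandra contact point.
graph = JanusGraphFactory.build().
    set('storage.backend', 'cql').
    set('storage.hostname', '127.0.0.1').
    set('query.batch', true).
    open()
g = graph.traversal()

// Re-run one of the earlier measurements to compare against the sequential timings.
g.V().limit(1000).where(inE().count().is(gt(6))).profile()
```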

Best regards,
Boxuan

On Dec 31, 2020, at 8:51 AM, zblu...@gmail.com <zblu...@...> wrote:

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries including your id() suggesting Marc.  Along Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding .id() step.

My bigger concern for my use case is how this type of operation scales in a matter that seems relatively linear with sample size. i.e.

g.V().limit(10).where(InE().count().is(gt(6))).profile() => ~30 ms

g.V().limit(100).where(InE().count().is(gt(6))).profile() => ~147 ms

g.V().limit(1000).where(InE().count().is(gt(6))).profile()  => ~1284 ms

g.V().limit(10000).where(InE().count().is(gt(6))).profile()  => ~13779 ms

g.V().limit(100000).where(InE().count().is(gt(6))).profile()  => ? > 120000 ms (timeout)

 

This behavior makes sense when I think about it and also when I inspect the profile (example profile of limit(10) traversal below)

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems that JG performs a SliceQuery on each vertex sequentially, which is not well optimized for my use case. I know that if centrality properties were included in a mixed index, the query could be configured for scalable performance. However, going back to the original post, I am not sure that is the best or only way. Are there other configurations that could make this operation more scalable without resorting to an additional indexed property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()

==>Traversal Metrics

Step                                                               Count  Traversers       Time (ms)    % Dur

=============================================================================================================

JanusGraphStep(vertex,[])                                             10          10           8.684    28.71

    \_condition=()

    \_orders=[]

    \_limit=10

    \_isFitted=false

    \_isOrdered=true

    \_query=[]

  optimization                                                                                 0.005

  optimization                                                                                 0.001

  scan                                                                                         0.000

    \_query=[]

    \_fullscan=true

    \_condition=VERTEX

TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29

  JanusGraphVertexStep(IN,edge)                                       13          13          21.350

    \_condition=(EDGE AND visibility:normal)

    \_orders=[]

    \_limit=7

    \_isFitted=false

    \_isOrdered=true

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_vertices=1

    optimization                                                                               0.003

    backend-query                                                      3                       4.434

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       1.291

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.311

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      1                       2.483

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.310

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.313

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       1.192

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      4                       1.287

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      3                       1.231

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

    optimization                                                                               0.001

    backend-query                                                      2                       3.546

    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d

    \_limit=14

  RangeGlobalStep(0,7)                                                13          13           0.037

  CountGlobalStep                                                     10          10           0.041

  IsStep(gt(6))                                                                                0.022

                                            >TOTAL                     -           -          30.249        -


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed), or only the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <b...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So retrieving all outE() relationIdentifiers with the vertex in order to count them should not take long, even if there are 100 thousand of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure or modify JanusGraph so that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions. That would be a better place for brainstorming. It would be awesome if you can share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

Was using the term “job” pretty loosely.  Your inference about doing these things within ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition might help tackle this problem more consistently: something like a third index type (in addition to “graph” and “vertex-centric” indices), e.g. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, at a minimum, a mechanism to choose the vertex and edge label combinations for which to count IN, OUT, and/or BOTH degree centrality.

Not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search.  If JanusGraph has native/tested support for it, it would make JanusGraph even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the most optimal one. JanusGraph does not store number of edges as metadata in the vertex (there are both Pros & Cons for doing / not doing this).

Btw do you have to have another job doing centrality calculation separately? If your application is built on top of JanusGraph, then probably you can maintain the “outDegree” property when inserting/deleting edges.
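
A minimal sketch of that bookkeeping, written as a toy in-memory Python model and not against the JanusGraph API: the class and method names are hypothetical, and `out_degree` merely stands in for an indexed “outDegree” vertex property. The point is only that the counter changes in the same operation that adds or removes the edge.

```python
from collections import defaultdict


class DegreeTrackingGraph:
    """Toy in-memory graph (not JanusGraph) that keeps an out-degree
    counter in step with every edge insert and delete."""

    def __init__(self):
        self.out_edges = defaultdict(set)    # source vertex -> target vertices
        self.out_degree = defaultdict(int)   # stand-in for an indexed "outDegree" property

    def add_edge(self, src, dst):
        # Only count the edge if it is actually new.
        if dst not in self.out_edges[src]:
            self.out_edges[src].add(dst)
            self.out_degree[src] += 1

    def remove_edge(self, src, dst):
        # Only decrement if the edge actually existed.
        if dst in self.out_edges[src]:
            self.out_edges[src].remove(dst)
            self.out_degree[src] -= 1

    def with_out_degree_gt(self, threshold):
        # Reads the stored counters only; no edges are traversed.
        return sorted(v for v, d in self.out_degree.items() if d > threshold)


if __name__ == "__main__":
    g = DegreeTrackingGraph()
    for target in ("b", "c", "d"):
        g.add_edge("a", target)
    g.add_edge("b", "c")
    g.remove_edge("a", "d")
    print(g.with_out_degree_gt(1))
```

In a real deployment the edge write and the counter update would presumably need to happen in the same transaction, and concurrent writers to the same vertex would have to be handled. Both of those are outside what this sketch shows.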

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs.  i.e. something like :

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands), then performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience).  My workaround for this has been pre-calculating centrality in another job and writing to a Vertex Property that can subsequently be included in a mixed index. So we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has("outDegree",gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems like more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,

Zach


--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgr...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/385af431-d723-4be6-95cb-43b2954f2e58n%40googlegroups.com.


--
You received this message because you are subscribed to the Google Groups "JanusGraph users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgra...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-users/0ff0c37a-6a56-476c-8efb-c30416380ec1n%40googlegroups.com.