
Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

If you want to run the query in a multi-threaded manner, try enabling “query.batch” (ref: https://docs.janusgraph.org/basics/configuration-reference/#query).

Since you are using Cassandra, which does not support batch reading natively, JanusGraph will use a thread pool to fire the backend queries. This should reduce the latency of this single query, but it might impact overall application performance if your application is already handling heavy workloads.
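For reference, a minimal sketch of how this could be enabled when opening the graph from the console; the storage settings below are placeholders for your own, and the same keys can equally go into the graph's properties file:

// minimal sketch, assuming a CQL backend; the hostname value is a placeholder
graph = JanusGraphFactory.build().
    set('storage.backend', 'cql').
    set('storage.hostname', '127.0.0.1').
    set('query.batch', true).
    open()
g = graph.traversal()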

Best regards,
Boxuan



Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is that this type of operation scales in a manner that appears roughly linear with sample size, i.e.

g.V().limit(10).where(inE().count().is(gt(6))).profile()     => ~30 ms
g.V().limit(100).where(inE().count().is(gt(6))).profile()    => ~147 ms
g.V().limit(1000).where(inE().count().is(gt(6))).profile()   => ~1284 ms
g.V().limit(10000).where(inE().count().is(gt(6))).profile()  => ~13779 ms
g.V().limit(100000).where(inE().count().is(gt(6))).profile() => > 120000 ms (timeout)

 

This behavior makes sense when I think about it, and also when I inspect the profile (see the example profile of the limit(10) traversal below).

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems like JG needs to perform a SliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index, they could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be tuned to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v0.5.2 with the Cassandra CQL backend (Cassandra 3.11).

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep(vertex,[])                                             10          10           8.684    28.71
    \_condition=()
    \_orders=[]
    \_limit=10
    \_isFitted=false
    \_isOrdered=true
    \_query=[]
  optimization                                                                                 0.005
  optimization                                                                                 0.001
  scan                                                                                         0.000
    \_query=[]
    \_fullscan=true
    \_condition=VERTEX
TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29
  JanusGraphVertexStep(IN,edge)                                       13          13          21.350
    \_condition=(EDGE AND visibility:normal)
    \_orders=[]
    \_limit=7
    \_isFitted=false
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_vertices=1
    optimization                                                                               0.003
    backend-query                                                      3                       4.434
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       1.291
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.311
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       2.483
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.310
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.313
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.192
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      4                       1.287
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      3                       1.231
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       3.546
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  RangeGlobalStep(0,7)                                                13          13           0.037
  CountGlobalStep                                                     10          10           0.041
  IsStep(gt(6))                                                                                0.022
                                            >TOTAL                     -           -          30.249        -




Remote Traversal with Java

Peter Borissow <peter....@...>
 

Dear All,
    I have installed/configured a single-node JanusGraph Server with a BerkeleyDB backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server.

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertices, etc.

In Java code on my desktop, I can connect to the remote server and issue commands via the Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  
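// note (assumption): the string passed as the second argument to DriverRemoteConnection.using(client, name)
// is typically the remote traversal source name bound on the server, not the graph name itself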

More info here:
https://stackoverflow.com/questions/65486512/janusgraph-remote-traversal-with-java

Any help/guidance would be greatly appreciated!

Thanks,
Peter


Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither the column nor the value is really needed) or only the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization were in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan




Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

HadoopMarc <bi...@...>
 

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So, retrieving all outE() relationIdentifiers together with the vertex and counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph so that it does not start fetching edge properties that are not needed for the count.

Best wishes,   Marc



Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions; that would be a better place for brainstorming. It would be awesome if you could share more context on why you think this is a very common business requirement.

Cheers,
Boxuan



Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Thank you Boxuan,

I was using the term “job” pretty loosely. Your inference about doing these things within the ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition may help tackle this problem more consistently: something like an additional, third index type (in addition to “graph” and “vertex-centric” indices), i.e. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, minimally, a mechanism to choose vertex and edge label combinations for counting IN, OUT, and/or BOTH degree centrality.

I'm not sure what the level of effort or the implementation details would be, but this is a very common business requirement for graph-based search. If JanusGraph had native/tested support for it, it would be even easier to champion.

😊

Best,

Zach




Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

Personally, I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are both pros and cons to doing / not doing this).

Btw, do you have to have another job doing the centrality calculation separately? If your application is built on top of JanusGraph, you can probably maintain the “outDegree” property when inserting/deleting edges.
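For illustration, a minimal Gremlin sketch of that approach; the edge label and vertex ids below are placeholders, and it simply recounts the out-edges after the insert rather than incrementing a counter:

// minimal sketch, assuming placeholder ids v1/v2 and a placeholder edge label 'link'
g.V(v1).addE('link').to(__.V(v2)).iterate()
g.V(v1).property('outDegree', g.V(v1).outE().count().next()).iterate()
graph.tx().commit()

As noted above, concurrent updates to the same counter would still need care (locking, or accepting slightly stale values).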

Best regards,
Boxuan



Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Hello all,

I'm curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraph deployments, i.e. something like:

g.V()

   .has("someProperty",eq("someValue"))

   .where(outE().count().is(gt(10)));                            

Suppose the has-step narrows things down to a large number of vertices (hundreds of thousands); performing that form of count on that many vertices will result in timeouts and inefficiencies (at least in my experience). My workaround has been pre-calculating centrality in another job and writing it to a vertex property that can subsequently be included in a mixed index, so we can do:

g.V()

   .has("someProperty",eq("someValue"))

   .has(“outDegree”,gt(10))
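For context, a minimal sketch of the schema behind this workaround, assuming "someProperty" already exists and a mixed index backend is configured under the name "search"; the other names are placeholders:

// minimal sketch: an 'outDegree' key included in a mixed index together with 'someProperty'
mgmt = graph.openManagement()
outDegree = mgmt.makePropertyKey('outDegree').dataType(Integer.class).make()
mgmt.buildIndex('vertexByDegree', Vertex.class).
     addKey(mgmt.getPropertyKey('someProperty')).
     addKey(outDegree).
     buildMixedIndex('search')
mgmt.commit()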

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems more like a workaround than a great solution. I was hoping there is a more optimal approach/strategy. Please let me know.

Thank you,

Zach


Re: JanusGraph 0.5.2 and BigTable

Assaf Schwartz <schw...@...>
 

Hi Boxuan!

I perhaps wasn't clear. The composite indexing didn't solve the locking issue (it went away by itself 🙄, as if there were a cold-start issue).
However, my actual problem, the failing lookup, was indeed solved.

Again, many thanks for the information and the prompt replies.
Assaf

On Saturday, December 19, 2020 at 10:09:42 AM UTC+2 li...@... wrote:
> About the locking, what do you consider a JVM instance? An instance of the Gremlin server? JanusGraph iteslf? If I try and use Janus as a cluster (multiple dockers instead of one), will that translate to having more than 1 JVM?

Sorry I wasn’t very clear about this. By JVM instance I meant the JVM where JanusGraph itself runs. To be precise, you can see local lock contention when multiple threads under the same process contend for the same lock. This is due to JanusGraph’s locking mechanism:

Step 1: Local lock resolution (inter-thread synchronization), utilizing in-memory data structures (a concurrent hash map). If a conflict is detected, you typically see an error message like "Local lock contention".
Step 2: Inter-process synchronization, utilizing the data backend (e.g. HBase). If a conflict is detected, you typically see other error messages like “Lock write retry count exceeded”.


If you have multiple transactions contending for the same lock, then it’s better to have them running on the same JVM instance because local lock synchronization is faster and can let conflicting transactions fail early.

Glad to hear you don’t have the problem anymore. To be honest, I don’t know why switching to composite indexes helped you resolve the locking exception issues.

Cheers,
Boxuan

On Dec 19, 2020, at 3:41 PM, Assaf Schwartz <sc...@...> wrote:

Thanks a lot Boxuan!

For some reason I missed being notified on your response.
The indexes were indeed the issue (as I had begun to suspect); switching them to composite indexes (there was no real need for them to be mixed) solved the issue :)
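For reference, a minimal sketch of what that switch might look like via the management API; the index name is a placeholder, and an index built over keys that are already in use additionally has to be registered/enabled (and reindexed) before it is picked up:

// minimal sketch, assuming the 'r' and 'w' property keys already exist
mgmt = graph.openManagement()
mgmt.buildIndex('byRAndW', Vertex.class).
     addKey(mgmt.getPropertyKey('r')).
     addKey(mgmt.getPropertyKey('w')).
     buildCompositeIndex()
mgmt.commit()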

About the locking: what do you consider a JVM instance? An instance of the Gremlin server? JanusGraph itself? If I try to run Janus as a cluster (multiple Docker containers instead of one), will that translate to having more than one JVM?

Again thanks,
Assaf

On Thursday, December 17, 2020 at 12:39:48 PM UTC+2 libo...@connect.hku.hk wrote:
Hi Assaf,

I am not familiar with GKE but I can try to answer some of your questions:

> how does a traversal behave when looking up based on an index key when the key is not yet indexed

Assuming the index has been enabled: if a particular key is still in the indexing process (e.g. you are in the middle of committing) in one thread, then another thread will not be able to find the data because it finds nothing in the index key lookup. Note that when you are using a mixed index, the data is written to your primary backend (e.g. HBase) first, and then to the mixed index backend (e.g. Elasticsearch). If the data has already been written into HBase but not into Elasticsearch yet, the querying thread cannot find the data (if JanusGraph decides your query can be satisfied by a mixed index).

> org.janusgraph.diskstorage.locking.PermanentLockingException: Local lock contention at org.janusgraph.diskstorage.locking.AbstractLocker.writeLock(AbstractLocker.java:327) 

This usually happens when you have multiple local threads (running on the same JVM instance) contending for the same lock. You might want to check your application logic.

Best regards,
Boxuan

On Dec 17, 2020, at 6:16 PM, Assaf Schwartz <sc...@...> wrote:

Could this be related to delays in indexing? I don't know how to figure out whether such a delay exists, but assuming it happens:
how does a traversal behave when looking up an index key that is not yet indexed?

On Thursday, December 17, 2020 at 10:54:32 AM UTC+2 Assaf Schwartz wrote:
Hi All,

I'm experiencing an issue with running JanusGraph (on top of GKE) against BigTable.
This is the general setup description:
  • We are using a single node BigTable cluster (for development / integration purposes) with the vanilla 0.5.2 docker.
  • Indexing is configured to be done with ES (also running on GKE)
  • JanusGraph is configured through environment variables:
  • Interaction with JanusGraph are done only through a single gRPC server that is running gremlin-python, let's call it DB-SERVER.
  • Last time we've done testing against BT was with version 0.4.1 of JanusGraph, precompiled to support HBase1.
  • All of our components communicate via gRPC.
Description of the problem:
  1. The DB-SERVER creates a Vertex i, generates some XML to represent work to be done, and sends it to another service for processing; let's call it ORCHESTRATOR.
  2. The ORCHESTRATOR generates two properties, w and r (local identifiers), and sends them back to the DB-SERVER so they will be set as properties on Vertex i. These two properties are also covered by mixed String indexes.
  3. After setting the properties, DB-SERVER will ack ORCHESTRATOR, which will start processing. As part of the processing, ORCHESTRATOR will send updates back to the DB-SERVER using w and r.
  4. On getting these updates, DB-SERVER will try looking up Vertex i based on w and r, like so:
    g.V().has("r", <some_r>).has("w", <some_w>).next()
  5. At that point, a null / None is returned as the traversal fails to find Vertex i.
  6. Trying the same traversal in a separate console (Python and Gremlin) does fetch the vertex. Since it's a single-instance cluster, I ruled out any eventual consistency issues.
I'm not sure if it's a regression introduced after 0.4.1.
I've also validated that db-caching is turned off.

Help! :)
Many thanks in advance,
Assaf






Re: How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

Actually, I have around 70 fields. So my doubt is: is it possible to insert some data without bulk loading, so that JanusGraph creates its own schema, and then use bulk loading (batch-loading=true) for the remaining data?
Will this process give an error?



Re: How to upload rdf bulk data to janus graph

"alex...@gmail.com" <alexand...@...>
 

That's right



Re: How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

All these properties I need to set in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.


On Thu, 24 Dec, 2020, 4:05 pm alex...@..., <alexand...@...> wrote:
Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
I'm not sure whether your flow just adds data or upserts it. If it upserts, you may also set "query.batch=true".
That said, I haven't used rdf2gremlin, so I can't suggest much more. The configurations above are just the options that immediately come to mind; a proper investigation would be needed to recommend real performance improvements. You may additionally tune ScyllaDB for your use case.

Best regards,
Oleksandr
On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (Turtle) format, with around 6 million triples. Currently I am using the rdf2gremlin Python script for the conversion, but it is taking too much time: around 1 hour for 10k records. I am using ScyllaDB as the JanusGraph storage backend. Below is the Python code I am using.

import pathlib

import rdflib
from rdf2g import setup_graph

# Connect to the Gremlin Server endpoint exposed by JanusGraph
DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

# Parse the Turtle file into an in-memory rdflib graph
OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data takes only around 10 minutes to load into Neo4j, but I want to use JanusGraph.

Kindly suggest the best way to bulk-load RDF data into JanusGraph using Python or Java.



Re: How to upload rdf bulk data to janus graph

"alex...@gmail.com" <alexand...@...>
 

Hi,

Try to enable batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase ids block size: "ids.block-size=10000000".
I'm not sure whether your flow just adds data or upserts it. If it upserts, you may also set "query.batch=true".
That said, I haven't used rdf2gremlin, so I can't suggest much more. The configurations above are just the options that immediately come to mind; a proper investigation would be needed to recommend real performance improvements. You may additionally tune ScyllaDB for your use case.
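
For reference, a sketch of how these settings could look in the graph's .properties file; the storage.backend and storage.hostname lines below are placeholders for a ScyllaDB-via-CQL setup, not values taken from this thread:

# bulk-loading sketch; backend and hostname values are placeholders
storage.backend=cql
storage.hostname=127.0.0.1
storage.batch-loading=true
storage.buffer-size=20480
ids.block-size=10000000
query.batch=true

Note that storage.batch-loading disables some of JanusGraph's internal consistency checks and locking, so it is best reserved for loads where the incoming data is known to be consistent.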

Best regards,
Oleksandr

On Thursday, December 24, 2020 at 12:24:10 PM UTC+2 ar...@... wrote:
I have data in RDF (Turtle) format, with around 6 million triples. Currently I am using the rdf2gremlin Python script for the conversion, but it is taking too much time: around 1 hour for 10k records. I am using ScyllaDB as the JanusGraph storage backend. Below is the Python code I am using.

import pathlib

import rdflib
from rdf2g import setup_graph

# Connect to the Gremlin Server endpoint exposed by JanusGraph
DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

# Parse the Turtle file into an in-memory rdflib graph
OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data takes only around 10 minutes to load into Neo4j, but I want to use JanusGraph.

Kindly suggest the best way to bulk-load RDF data into JanusGraph using Python or Java.


How to upload rdf bulk data to janus graph

Arpan Jain <arpan...@...>
 

I have data in RDF (Turtle) format, with around 6 million triples. Currently I am using the rdf2gremlin Python script for the conversion, but it is taking too much time: around 1 hour for 10k records. I am using ScyllaDB as the JanusGraph storage backend. Below is the Python code I am using.

import pathlib

import rdflib
from rdf2g import setup_graph

# Connect to the Gremlin Server endpoint exposed by JanusGraph
DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

# Parse the Turtle file into an in-memory rdflib graph
OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")

The same RDF data takes only around 10 minutes to load into Neo4j, but I want to use JanusGraph.

Kindly suggest the best way to bulk-load RDF data into JanusGraph using Python or Java.


Re: Aggregating edges based on the source & target vertex attributes

vishnu gajendran <ggvis...@...>
 

Thank you Marc. As you mentioned, I might be able to execute the above aggregation query faster with other tools or datastores. However, I was exploring JanusGraph primarily for OLAP use cases, for example running graph algorithms like PageRank and BFS on the fly, and for graph visualization, where I would like to aggregate nodes and edges on the fly based on user-selected node/edge attributes. I was thinking that JanusGraph might be optimal for these use cases. Correct me if I am wrong.

I will also try the SparkGraphComputer as you suggested.

On Monday, December 21, 2020 at 5:19:56 PM UTC+5:30 HadoopMarc wrote:
Hi Vishnu,

The processing time does not really surprise me; JanusGraph has to do everything in Java. For the typical JanusGraph use case, the storage backend is the limiting factor and the Java processing does not really matter. If you want to do this query fast in memory with multiple cores, you are better off with Python dask or the like (and do the aggregation on a single dataframe with the edge id, inV label and outV label). I would not be surprised if pandas, using a single core, already does this within a second.

For the queries given above, I believe only a single core is used when they are run as an OLTP query. Because this N x N query is not easy for TinkerPop to parallelize, you have to take care how you run it as an OLAP query. I would guess that with(SparkGraphComputer) and a single Spark executor with 8 cores will work best, because then the Spark cores share memory. This is automatically true for spark.master=local[*].
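
As a minimal sketch of how that could be launched from the Gremlin console, assuming the read-cql.properties file shipped under conf/hadoop-graph/ in the JanusGraph distribution (the file name and its contents may differ in your install) with spark.master=local[*] set inside it:

// launch an OLAP traversal source backed by a single local Spark executor
graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().count()    // sanity check that the OLAP job runs end to end

The aggregation traversal itself may still need reworking for OLAP, since path elements referenced via select() are detached there and their properties may not be accessible.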

Best wishes,    Marc

PS: Thanks for introducing me to the Indian numbering system. Happily, you do not have 1.5 crore vertices!


On Monday, December 21, 2020 at 09:16:08 UTC+1, vishnu gajendran wrote:

Thank you Kevin and Marc for the quick responses. I tried both queries and they work as expected. My use case requires running such a query on a bigger dataset. I ran the query on 1 lakh (100,000) vertices and 5 million edges on my desktop using the in-memory backend (assuming that in-memory would be faster than external data stores), and it took roughly 2 minutes to execute. My desktop has 8 logical cores and 64 GB RAM. A few questions:

1. Is this the expected performance for such aggregation queries in JanusGraph?
2. Will increasing the number of cores (i.e. processing power) improve the performance of the query?

The dataset I am dealing with can be as big as 1.5 lakh (150,000) vertices and 20 million edges, and I would like to support the above aggregation query in near real time (a few seconds, not minutes). Can we achieve this with JanusGraph?
On Thursday, December 17, 2020 at 7:51:56 PM UTC+5:30 kt...@... wrote:
Thanks for improving it!  Always good to learn more.

On Thu, Dec 17, 2020 at 6:11 AM HadoopMarc <b...@...> wrote:
And here is a small variation without the keys and with some code formatting:

g.V().as('a').outE().as('e').inV().as('b').
    group().by(
        union(select('a').values('organization'), select('b').values('organization')).fold()
    ).by(
        select('e').by('collaborationHours').sum()
    ).unfold()
==>[marketing, engineering]=2
==>[sales, marketing]=2
==>[engineering, sales]=3
==>[engineering, marketing]=2
Marc



On Thursday, December 17, 2020 at 14:50:11 UTC+1, kt...@... wrote:
Vishu,

This may not be optimal, but seems to work:

g.E().hasLabel('collaboration').as('e').outV().values('organization').as('1').select('e').inV().values('organization').as('2').select('e').group().by(select('1', '2')).by(values('collaborationHours').sum()).unfold();

==>{1=engineering, 2=marketing}=2
==>{1=marketing, 2=engineering}=2
==>{1=engineering, 2=sales}=3
==>{1=sales, 2=marketing}=2

Note: you have some leading spaces in your Gremlin on 'collaborationHours' that I had to remove, and with the data you provided the engineering/sales total is 3, not 4.

Kevin 

On Wed, Dec 16, 2020 at 11:57 PM vishnu gajendran <gg...@...> wrote:
Hello,

I request your help with a JanusGraph query that I am trying to construct. Consider the following graph, where each vertex denotes a person and an edge between two vertices denotes collaboration between them.

Vertices:
p1 = graph.addVertex('person')
p1.property('personId', 1)
p1.property('organization', "engineering")

p2 = graph.addVertex('person')
p2.property('personId', 2)
p2.property('organization', "sales")

p3 = graph.addVertex('person')
p3.property('personId', 3)
p3.property('organization', "marketing")

p4 = graph.addVertex('person')
p4.property('personId', 4)
p4.property('organization', "engineering")

Edges:
p1.addEdge('collaboration', p2, 'collaborationHours', 1)
p1.addEdge('collaboration', p3, 'collaborationHours', 2)

p2.addEdge('collaboration', p3, 'collaborationHours', 2)

p3.addEdge('collaboration', p4, ' collaborationHours', 2)

p4.addEdge('collaboration', p2, ' collaborationHours', 2)

Expected result is the following table:

Organization1   Organization2   Total Collaboration Hours
Engineering     Sales           4
Engineering     Marketing       2
Sales           Marketing       2
Marketing       Engineering     2

Here, I am trying to aggregate the "person to person" graph into an "organization to organization" graph. Does JanusGraph support such aggregation queries? If yes, could you please help me with the query?

Thanks



Re: Janusgraph connect with MySQL Storage Backend

Liu <1854...@...>
 

Thank you so much! I will read them to get some useful information.

On Wednesday, December 23, 2020 at 9:51:02 PM UTC+8, ...@... wrote:

Hi Molong,

Did you have a chance to read https://docs.janusgraph.org/advanced-topics/data-model/ yet? JanusGraph needs a column-family type database which can efficiently sort the cells by column. For example, the extensive usage of “sliceStart” and “sliceEnd” in CQLKeyColumnValueStore is based on the assumption that Cassandra can store entries sorted by Clustering Column (https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/whereClustering.html).

As far as I know, MySQL cannot store data in a user-specified sort order. That being said, MySQL allows using an index to accelerate ORDER BY operations, so I guess you could use ORDER BY to achieve the sort order, but I am not sure how efficient that would be.

Some resources on creating a storage backend:


Best regards,
Boxuan


On Dec 22, 2020, at 11:39 AM, liu molong <18...@...> wrote:

Hello,

I request your help regarding connecting JanusGraph to MySQL or another relational DBMS, because I am trying to code a module which connects JanusGraph to MySQL.
Currently this module can connect to MySQL Server and perform some simple operations like add/delete/query and so on.
But there is something different when doing queries: I see sliceStart/sliceEnd being used in getSlice.

Question 1:
Is it necessary to use sliceStart/sliceEnd when querying? Because I am just trying to replace the storage with MySQL, I am not sure whether the slice is necessary for JanusGraph.

Question 2:
Why don't you plan to support MySQL? Is there a particular reason?

I would very much appreciate any suggestions regarding the MySQL plan.

Thanks



Re: Janusgraph connect with MySQL Storage Backend

BO XUAN LI <libo...@...>
 

Hi Molong,

Did you have a chance to read https://docs.janusgraph.org/advanced-topics/data-model/ yet? JanusGraph needs a column-family type database which can efficiently sort the cells by column. For example, the extensive usage of “sliceStart” and “sliceEnd” in CQLKeyColumnValueStore is based on the assumption that Cassandra can store entries sorted by Clustering Column (https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/whereClustering.html).

As far as I know, MySQL cannot store data in a user-specified sort order. That being said, MySQL allows using an index to accelerate ORDER BY operations, so I guess you could use ORDER BY to achieve the sort order, but I am not sure how efficient that would be.
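
To make the ORDER BY idea concrete, here is a rough sketch of how a key-slice read might be emulated on MySQL over JDBC. The table layout (store_key, col, val) and the method shape are assumptions for illustration, not JanusGraph's actual KeyColumnValueStore SPI:

import java.sql.Connection

// Assumed table:
//   CREATE TABLE edgestore (store_key VARBINARY(255), col VARBINARY(255),
//                           val BLOB, PRIMARY KEY (store_key, col))
// A slice is the half-open column range [sliceStart, sliceEnd) for one key, returned
// in ascending byte order, which is what ORDER BY col provides here.
List getSlice(Connection conn, byte[] key, byte[] sliceStart, byte[] sliceEnd, int limit) {
    def sql = 'SELECT col, val FROM edgestore ' +
              'WHERE store_key = ? AND col >= ? AND col < ? ' +
              'ORDER BY col ASC LIMIT ?'
    def stmt = conn.prepareStatement(sql)
    stmt.setBytes(1, key)
    stmt.setBytes(2, sliceStart)
    stmt.setBytes(3, sliceEnd)
    stmt.setInt(4, limit)
    def rs = stmt.executeQuery()
    def entries = []
    while (rs.next()) {
        entries << [rs.getBytes('col'), rs.getBytes('val')]
    }
    rs.close()
    stmt.close()
    return entries
}

With a composite primary key on (store_key, col), MySQL can serve both the range predicate and the ORDER BY from the index; whether that stays fast under JanusGraph's access patterns is exactly the open question above.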

Some resources on creating a storage backend:


Best regards,
Boxuan


On Dec 22, 2020, at 11:39 AM, liu molong <1854...@...> wrote:

Hello,

I request your help regarding connecting JanusGraph to MySQL or another relational DBMS, because I am trying to code a module which connects JanusGraph to MySQL.
Currently this module can connect to MySQL Server and perform some simple operations like add/delete/query and so on.
But there is something different when doing queries: I see sliceStart/sliceEnd being used in getSlice.

Question 1:
Is it necessary to use sliceStart/sliceEnd when querying? Because I am just trying to replace the storage with MySQL, I am not sure whether the slice is necessary for JanusGraph.

Question 2:
Why don't you plan to support MySQL? Is there a particular reason?

I would very much appreciate any suggestions regarding the MySQL plan.

Thanks



Re: RDF Import into JanusGraph

HadoopMarc <bi...@...>
 

Hi,

Yes, this is certainly possible, but it is not well documented and will require hand coding. Resources to start with:

https://pypi.org/project/rdf2gremlin/
https://tinkerpop.apache.org/docs/current/reference/#sparql-gremlin
https://docs.aws.amazon.com/neptune/latest/userguide/bulk-load-tutorial-format-rdf.html

Best wishes,     Marc

On Wednesday, December 23, 2020 at 10:32:31 UTC+1, lal...@... wrote:

Hi,

I want to know whether we can import RDF (in any format, like Turtle or RDF/JSON) into JanusGraph or not.
If yes, what are the steps to do that?
Any help will be highly appreciated.

Thanks.


RDF Import into JanusGraph

Ritu Lalwani <lalwani...@...>
 

Hi,

I want to know whether we can import RDF (in any format, like Turtle or RDF/JSON) into JanusGraph or not.
If yes, what are the steps to do that?
Any help will be highly appreciated.

Thanks.
