
Re: hasNext() slow for large number of incoming edges

Matthew Nguyen <nguyenm9@...>
 

Hi Boxuan,

Happy to put in a request on GitHub, but I'm still a little confused. Are we saying that g.E().has('index_key', 'large_number_of_edges').hasNext() isn't streaming but should be (note: g.E().hasNext() is fast)? Also, I think that to close the gap between RDF and property graphs, we need to see what can be done to allow natural modeling in RDF, which really means making liberal use of edges. The problem with properties and RDF is that RDF expects you to index virtually everything in order for queries to be quick. I'm not sure how we can model non-generic properties in that capacity.

BTW I'm using Joshua's Graphsail (https://github.com/joshsh/graphsail) implementation to see if I can get it to work and trying to work through some of the edge (no pun intended) cases.

thx, matt


Exception while creating vertex with custom vertex id

Umesh Gade
 

Hi all,
We hit the exception below while creating a vertex with a custom vertex id. The issue does not reproduce consistently. The setup is JG-0.6.0 + cassandra-4.0 in a 3-node, single-DC cluster.
java.lang.NullPointerException: null
        at org.janusgraph.graphdb.types.VertexLabelVertex.isPartitioned(VertexLabelVertex.java:41) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.types.VertexLabelVertex.hasDefaultConfiguration(VertexLabelVertex.java:67) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:579) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsTransaction.addVertex(JanusGraphBlueprintsTransaction.java:127) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph.addVertex(JanusGraphBlueprintsGraph.java:143) ~[janusgraph-core-0.6.0.jar:?]

Any pointers for debugging this further?



Re: hasNext() slow for large number of incoming edges

Boxuan Li
 

Hi Matt,

It would definitely be a valid and valuable feature if we could expose the streaming capability to end users. If I recall correctly, the low-level results are indeed streamed (it might vary depending on the storage backend), but the interface is not exposed to the upper-level APIs. Do you want to create a feature request on GitHub? Otherwise I can do it later.

Regarding your particular use case, you said you had triples like <microsoft> <rdfs:type> <company>, which you modelled as V('microsoft') -> E('rdfs:type') -> V('company'). I suggest you model “type” as a property rather than an edge. That way, you will not create a vertex called “company”. Rather, you create a vertex called “microsoft” with a property “type” whose value is “company”, e.g.

g.addV().property("value", "microsoft").property("type", "company")

Rule of thumb: when you anticipate a supernode, consider modeling it as a property rather than a vertex. Edges should be used to describe “relationships between nodes” rather than “properties attached to nodes”. This is the difference between an RDF graph and a property graph.
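A minimal sketch of this rule of thumb, written in plain Python rather than Gremlin so that it runs without a graph. The function name `load_triples` and the `property_predicates` parameter are made up for illustration and are not part of JanusGraph or GraphSail. Predicates in `property_predicates` become properties on the subject vertex, so no "company" vertex (and no supernode) is created; everything else stays an edge. For simplicity it assumes each property is single-valued.

```python
from collections import defaultdict

def load_triples(triples, property_predicates):
    """Split RDF-style triples into vertex properties and edges.

    Predicates listed in property_predicates become key/value properties on
    the subject vertex; all other predicates become edges to an object vertex.
    """
    properties = defaultdict(dict)   # vertex name -> {key: value}
    edges = []                       # (subject, predicate, object)
    for subject, predicate, obj in triples:
        if predicate in property_predicates:
            # Single-valued simplification: a repeated key overwrites the old value.
            properties[subject][predicate] = obj
        else:
            edges.append((subject, predicate, obj))
            properties[obj]          # make sure the object vertex exists
        properties[subject]          # make sure the subject vertex exists
    return dict(properties), edges

triples = [
    ("microsoft", "rdfs:type", "company"),
    ("apple", "rdfs:type", "company"),
    ("microsoft", "competesWith", "apple"),
]
vertices, edges = load_triples(triples, property_predicates={"rdfs:type"})
```

Here `vertices` holds only "microsoft" and "apple", each with an "rdfs:type" property, and `edges` holds only the "competesWith" relationship.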

Best,
Boxuan


Re: Potential transaction issue (JG 0.6.0)

Boxuan Li
 

Hi Umesh,

What you reported might be due to a different cause. Are you able to reproduce it steadily, and if so, could you please share the steps to reproduce the problem? It would be great if you could create a new thread for the problem you reported, thanks!

Best,
Boxuan


Re: Potential transaction issue (JG 0.6.0)

Umesh Gade
 

We hit the NPE below with JG 0.6.0 while creating a new vertex. I'm not sure whether this is related, but we never saw this exception with earlier JG versions.

java.lang.NullPointerException: null
        at org.janusgraph.graphdb.types.VertexLabelVertex.isPartitioned(VertexLabelVertex.java:41) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.types.VertexLabelVertex.hasDefaultConfiguration(VertexLabelVertex.java:67) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.transaction.StandardJanusGraphTx.addVertex(StandardJanusGraphTx.java:579) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsTransaction.addVertex(JanusGraphBlueprintsTransaction.java:127) ~[janusgraph-core-0.6.0.jar:?]
        at org.janusgraph.graphdb.tinkerpop.JanusGraphBlueprintsGraph.addVertex(JanusGraphBlueprintsGraph.java:143) ~[janusgraph-core-0.6.0.jar:?]


Re: Janusgraph embedded multi instance(JVM) data sync issue

Pawan Shriwas
 

Hi Marc,

All the code and property configuration were shared in the previous mail in this thread. I assume that, since we have not provided the cache properties, they default to false.

Thanks,
Pawan


Re: hasNext() slow for large number of incoming edges

Matthew Nguyen <nguyenm9@...>
 

Hi Boxuan, thanks for the response. Some background:  I'm trying to use JG as a triplestore and importing rdf.

The triple <microsoft> <rdfs:type> <company> can be modelled as V('microsoft') -> E('rdfs:type') -> V('company') such that:

g.V().has('value', 'microsoft').out().has('value', 'company').inE('rdfs:type').hasNext() = true

Certainly there can be millions of companies out there that can be modelled similarly. I understand the issue surrounding supernodes, so perhaps this question is more about trying to understand some internals of JG.

Note: again, my use case is not exactly like the above, where everything is known, but more like the SPARQL query: select ?company where { ?company rdfs:type <company> }, or "give me all companies of rdfs:type company", which translates to Gremlin as:
   g.V().has('value','company').inE(), followed by traversing the incoming edges. But g.V().has('value','company').inE().hasNext() takes a long time to initially run.

1) What is g.V(v).inE(e).hasNext() doing above such that a call on a supernode takes so long? If it's trying to load all incident edges, should the documentation be updated, or maybe the function be renamed, to reflect potential latency issues? Or maybe the implementation could be broken up into something like C++ iteration -> traversal.begin(); while (traversal.hasNext()) traversal.next()... or something like that. begin() and hasNext() could be implemented via the range(..) function you mentioned to better control perceived latency.

2) When you mention remodelling, I can think of 2 ways to do so off the top of my head (please advise on others).
a. Have multiple types of companies (TechCompany, FinancialCompany, etc.) to reduce the likelihood of a supernode.
b. Add a property to the vertex, i.e. V('microsoft').has('rdfs:type', 'company'). If I do this, and assuming 'rdfs:type' is indexed as a property, will V().has('rdfs:type', 'company').hasNext() be fast? If so, why?

I hope this doesn't come across negatively. I am very interested in trying to bridge the gap between LPG & RDF (3store) and I think I have some good use cases that can hopefully help to improve JG down the road.

thx, matt

 

 


Re: hasNext() slow for large number of incoming edges

Boxuan Li
 

Hi Matthew,

Unfortunately, that is not possible. You could do 

g.V(v).inE(e).range(from, to).hasNext()

to “page” the results yourself, but under the hood it will fetch the first “to” results and then drop the first “from” of them.
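The cost of this kind of paging can be illustrated with a small stand-alone Python model. This is only a sketch of the "fetch the first `to`, drop the first `from`" behaviour described above, not JanusGraph code; `CountingSource` and `page` are hypothetical names, and `itertools.islice` stands in for range(from, to).

```python
from itertools import islice

class CountingSource:
    """Iterator over range(n) that counts how many items were pulled from it."""
    def __init__(self, n):
        self._it = iter(range(n))
        self.pulled = 0

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)
        self.pulled += 1
        return value

def page(source, start, stop):
    """Return items [start, stop); the first `start` items are read and discarded."""
    return list(islice(source, start, stop))

source = CountingSource(7_000_000)
result = page(source, 1000, 1064)
```

The page contains 64 items, but `source.pulled` ends up at 1064: every page still pays for all the items before it, so later pages get progressively more expensive.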

Btw, 7 million incident edges sounds like too many to me. This could cause various problems, e.g. high memory usage or large partitions (depending on your storage backend). You might consider remodelling it.

Best,
Boxuan


dynamic graphs, limits and global index

Matthew Nguyen <nguyenm9@...>
 

Hi, if we're creating graphs dynamically in JG, are limits like 2^60 edges and the global V and E indexes scoped to each graph, or do they apply across all graphs in the set managed by ConfigurationManagementGraph?


Re: hasNext() slow for large number of incoming edges

Matthew Nguyen <nguyenm9@...>
 

Hi, I do need to traverse it but was hoping it would get chunked/streamed in.  Or is there a better way for streaming/lazy loads?


Re: hasNext() slow for large number of incoming edges

AMIYA KUMAR SAHOO
 

Hi Matthew,

I don't know what it does underneath.

But if you just want to check for edge existence with hasNext(), can you try it with limit(1)?


g.V(n).inE(e).limit(1).hasNext()
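Why a limit can help is easiest to see in a toy model. The Python sketch below contrasts an existence check that materialises every edge with one that stops after the first. It is only an illustration of the general idea: the thread does not establish whether JanusGraph's hasNext() behaves like the eager variant, and `edge_stream`, `exists_eager` and `exists_limited` are made-up names.

```python
from itertools import islice

def edge_stream(n, reads):
    """Yield n fake edges one at a time, recording each read in `reads`."""
    for i in range(n):
        reads.append(i)
        yield ("e", i)

def exists_eager(edges):
    """Materialise every edge, then check whether there were any."""
    return len(list(edges)) > 0

def exists_limited(edges):
    """Stop after the first edge, similar in spirit to limit(1) before the check."""
    return next(islice(edges, 1), None) is not None

eager_reads, lazy_reads = [], []
eager_found = exists_eager(edge_stream(100_000, eager_reads))
lazy_found = exists_limited(edge_stream(100_000, lazy_reads))
```

Both checks return True, but the eager one reads all 100,000 fake edges while the limited one reads a single edge.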

Let's see,
Amiya



hasNext() slow for large number of incoming edges

Matthew Nguyen <nguyenm9@...>
 


Hey folks, I have a vertex n that has 7m+ incoming edges e. The following query takes 30+ seconds on a local installation of Cassandra.

g.V(n).inE(e).hasNext()

whereas

g.V(n).inE(e).tryNext() 

returns immediately with an answer.

Any idea why hasNext() would be so much slower? I was under the impression that setting "resultIterationBatchSize: 64" would restrict any iteration to batches of 64 elements at a time, but it appears hasNext() is doing something else. Is that correct?


Re: JanusGraph 0.6.0 traversal change?

criminosis@...
 

Duly noted. I'll keep in mind to dig deeper when seeing closed issues mentioning Tinkerpop updates. Thanks everyone!


Re: JanusGraph 0.6.0 traversal change?

Florian Hockmann
 

Yes, exactly. This was a breaking change in TinkerPop 3.5.0 which is included in JanusGraph 0.6.0: https://tinkerpop.apache.org/docs/current/upgrade/#_anonymous_child_traversals

 

 


Re: JanusGraph 0.6.0 traversal change?

Clement de Groc
 

Yes. In JanusGraph 0.6.0, Apache TinkerPop was upgraded to 3.5.1 and requires using an anonymous traversal in such cases.
This is mentioned in their "Upgrade for users" guide here: https://tinkerpop.apache.org/docs/current/upgrade/#_anonymous_child_traversals


Re: JanusGraph 0.6.0 traversal change?

criminosis@...
 

Ahh I see, I guess this is the intended usage now?

gremlin> g.addV()
==>v[16496]
gremlin> g.addV()
==>v[4096]
gremlin> g.V(16496).addE('user_alias').to(__.V(4096))
==>e[4cu-cq8-xzp-35s][16496-user_alias->4096]
gremlin>
Basically doing "__.V" instead of "g.V" for the child traversal?


JanusGraph 0.6.0 traversal change?

criminosis@...
 

When running with 0.5.2 I was able to use this traversal to add an edge between two vertices:

gremlin> g.addV()
==>v[8200]
gremlin> g.addV()
==>v[4336]
gremlin> g.V(8200).addE('my_edge_label').to(g.V(4336))

But when doing it through 0.6.0 I get this now:

gremlin> g.V(8200).addE('my_edge_label').to(g.V(4336))
The child traversal of [GraphStep(vertex,[4336])] was not spawned anonymously - use the __ class rather than a TraversalSource to construct the child traversal
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.IllegalStateException: The child traversal of [GraphStep(vertex,[4336])] was not spawned anonymously - use the __ class rather than a TraversalSource to construct the child traversal
at org.apache.tinkerpop.gremlin.process.traversal.Bytecode.convertArgument(Bytecode.java:302)
at org.apache.tinkerpop.gremlin.process.traversal.Bytecode.flattenArguments(Bytecode.java:287)
at org.apache.tinkerpop.gremlin.process.traversal.Bytecode.addStep(Bytecode.java:94)
at org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal.to(GraphTraversal.java:1145)
at org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal$to$4.call(Unknown Source)
at Script148.run(Script148.groovy:1)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:676)
at org.apache.tinkerpop.gremlin.groovy.jsr223.GremlinGroovyScriptEngine.eval(GremlinGroovyScriptEngine.java:378)
at javax.script.AbstractScriptEngine.eval(AbstractScriptEngine.java:233)
at org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor.lambda$eval$0(GremlinExecutor.java:272)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
gremlin>

After fiddling with it though I noticed I was able to do this:

gremlin> g.V(4336).as('test').V(8200).addE('my_edge_label').to('test')
==>e[2dd-6bs-xzp-3cg][8200-my_edge_label->4336]

However just doing the vertex id is not permitted, which makes sense given it's just an integer with no context.

gremlin> g.V(8200).addE('my_edge_label').to(4336)
No signature of method: org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.DefaultGraphTraversal.to() is applicable for argument types: (Integer) values: [4336]
Possible solutions: is(java.lang.Object), take(int), tap(groovy.lang.Closure), by(groovy.lang.Closure), drop(int), any()
I've been looking through the 0.6.0 milestone and found this issue, but that seemed more about a documentation change in 0.6.0 than a code change. Environment wise I'm just running these in a gremlin session within a docker-compose environment with Cassandra and Elasticsearch as backends.

Just wondering if the change here is intentional? It also seemed odd that it was suggesting the use of the "__" class.


Re: Janusgraph embedded multi instance(JVM) data sync issue

hadoopmarc@...
 

Hi Pawan,

Interesting, I could not find a JanusGraph unit test for this basic scenario (there is one with two instances and an index, though). This needs more investigation.

Meanwhile, are you sure that you have no hidden caching configs in the Spring Framework REST service?

Best wishes,    Marc


Re: Indexing Strategies for RDF edges/predicates on Janusgraph

Matthew Nguyen <nguyenm9@...>
 

Thanks for the clarification. That makes sense. I suppose if there were multiple edges from V(h) to V('mother'), I would need to qualify the edge label to ensure the path exists?

eg V(h).out('mother').inE('isSon') vs V(h).out('mother').inE('battled') 



Re: Indexing Strategies for RDF edges/predicates on Janusgraph

AMIYA KUMAR SAHOO
 

Hi Matthew,

The two examples show two different types of default index.

g.V(h).out('mother')
- This is an example of the default vertex-centric index per edge label.
- It helps to quickly traverse one specific type of edge among the different types of edges.
- In your case, finding all employees employedBy a company will use this.

g.V(h).values('age')
- This is an example of the default vertex-centric index per property key.
- It helps to get the value of a single property among the several properties of a single vertex.


Now there can be a situation where you have 1k types of edges associated with a vertex (one company). Except for the employedBy edge, the other edge types have low cardinality (let's say < 10), but 2k employees are employedBy that company. You want to find out whether the company has an employee named John. In this case, if your traversal starts from the company and follows the employedBy edges, it has to traverse all 2k edges to find out whether John is an employee or not. This can be made faster if the employee name is available on the edge and a vertex-centric index (VCI) is enabled on it.

This might not be a very good example, as it can be optimised in different ways:
1) If the employee has a low degree for the employedBy edge, you can start the traversal from the employee vertex.
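The benefit of putting the employee name on the edge and indexing it can be sketched in plain Python. This is a toy model of the idea only, not how JanusGraph stores or indexes edges: the edge tuples, `find_by_scan` and `find_by_index` are all hypothetical, and a sorted list with binary search stands in for the vertex-centric index.

```python
from bisect import bisect_left

# Incident edges of one company vertex: (label, employee_name, other_vertex_id).
edges = [("employedBy", f"emp{i:04d}", i) for i in range(2000)]
edges.append(("employedBy", "John", 9001))
edges += [("locatedIn", "", 9100), ("ownedBy", "", 9200)]

def find_by_scan(edges, label, name):
    """Check incident edges one by one; return (edge or None, edges checked)."""
    checked = 0
    for edge in edges:
        checked += 1
        if edge[0] == label and edge[1] == name:
            return edge, checked
    return None, checked

# The stand-in "index": the same edges kept sorted by (label, name).
index = sorted(edges)
keys = [(e[0], e[1]) for e in index]

def find_by_index(label, name):
    """Binary-search the sorted keys instead of scanning every edge."""
    pos = bisect_left(keys, (label, name))
    if pos < len(keys) and keys[pos] == (label, name):
        return index[pos]
    return None

scan_hit, checked = find_by_scan(edges, "employedBy", "John")
index_hit = find_by_index("employedBy", "John")
```

In this model the scan checks 2001 edges before it reaches John, while the sorted lookup needs only about a dozen comparisons to return the same edge.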

Hope it helps,
Amiya


On Tue, 25 Jan 2022, 00:01 Matthew Nguyen via lists.lfaidata.foundation, <nguyenm9=aol.com@...> wrote:

Hi Amiya, I saw that but wasn't quite sure the intent given the example.  It talks about edge labels but the examples are vertices & values?

g.V(h).out('mother') -> returns a vertex traversal?
g.V(h).values('age') -> returns a Value?

Also, what do you mean by 'But if you have a high cardinality for a single edge type, then you have to manually create edge index on respective property.'?  

thx, matt