hasNext() slow for large number of incoming edges
Hey folks, I have a vertex v with 7M+ incoming edges e. The following query takes 30+ seconds on a local installation of Cassandra:
g.V(n).inE(e).hasNext()
whereas
g.V(n).inE(e).tryNext()
returns immediately with an answer.
Any idea why hasNext() would be so much slower? I was under the impression that setting "resultIterationBatchSize: 64" would limit any iteration to batches of 64 elements at a time, but it appears hasNext() is doing something else. Is that correct?
Hi, I do need to traverse it but was hoping it would get chunked/streamed in. Or is there a better way for streaming/lazy loads?
-----Original Message-----
From: AMIYA KUMAR SAHOO <amiyakr.sahoo91@...>
To: janusgraph-users@...
Sent: Tue, Jan 25, 2022 1:43 pm
Subject: Re: [janusgraph-users] hasNext() slow for large number of incoming edges
Hi Matthew,
I don't know what it does underneath. But if you just want to check for edge existence with hasNext(), can you try it with limit(1)?
g.V(n).inE(e).limit(1).hasNext()
Let's see,
Amiya
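To make the suggestion above concrete, here is a minimal Python sketch (not JanusGraph code) of why capping a traversal at one element can make an existence check cheap. CountingEdgeSource, exists_with_limit and exists_eager are hypothetical names invented for this illustration, and the thread does not establish whether hasNext() actually behaves like the eager variant.

```python
from itertools import islice


class CountingEdgeSource:
    """Hypothetical stand-in for a storage backend that counts the edges it hands out."""

    def __init__(self, total_edges):
        self.total_edges = total_edges
        self.fetched = 0

    def edges(self):
        # Lazily yield edge ids one at a time, counting each one produced.
        for edge_id in range(self.total_edges):
            self.fetched += 1
            yield edge_id


_MISSING = object()


def exists_with_limit(source):
    # Similar in spirit to limit(1).hasNext(): pull at most one element.
    return next(islice(source.edges(), 1), _MISSING) is not _MISSING


def exists_eager(source):
    # Worst case for comparison: materialize every edge, then test for emptiness.
    return len(list(source.edges())) > 0
```

In this toy model the limited check touches one edge no matter how many exist, while the eager check touches all of them. Comparing the two timings on the real graph would show which case applies there.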
Hi Boxuan, thanks for the response. Some background: I'm trying to use JG as a triplestore and am importing RDF.
The triple <microsoft> <rdfs:type> <company> can be modelled as V('microsoft') -> E('rdfs:type') -> V('company') such that:
g.V().has('value', 'microsoft').out().has('value', 'company').inE('rdfs:type').hasNext() = true
Certainly there can be millions of companies out there that can be modelled similarly. I understand the issues surrounding supernodes, so perhaps this question is more about trying to understand some internals of JG.
Note: again, my use case is not exactly like the one above, where everything is known, but is closer to the SPARQL query: select ?company where { ?company rdfs:type <company> }, i.e. give me all companies of rdfs:type company, which translates to Gremlin:
g.V().has('value','company').inE() and then traversing the incoming edges. But g.V().has('value','company').inE().hasNext() takes a long time to initially run.
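The modelling described above can be sketched in a few lines of Python. This is a toy in-memory structure, not JanusGraph, and TinyTripleGraph and its methods are hypothetical names. It stores each triple as a labelled edge and answers the "all ?company of rdfs:type company" question by walking the incoming edges of the object vertex lazily.

```python
from collections import defaultdict


class TinyTripleGraph:
    """Toy in-memory graph in which each RDF triple becomes one labelled edge."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # subject -> [(predicate, object)]
        self.in_edges = defaultdict(list)   # object  -> [(predicate, subject)]

    def add_triple(self, subject, predicate, obj):
        # One triple is stored once per direction so both lookups are direct.
        self.out_edges[subject].append((predicate, obj))
        self.in_edges[obj].append((predicate, subject))

    def has_triple(self, subject, predicate, obj):
        # Existence check that starts from the subject's (usually small) edge list.
        return (predicate, obj) in self.out_edges.get(subject, [])

    def subjects_of(self, predicate, obj):
        # Generator: callers can stop after the first match without scanning the rest.
        for edge_predicate, subject in self.in_edges.get(obj, []):
            if edge_predicate == predicate:
                yield subject
```

Here the 'company' vertex plays the role of the supernode: its in_edges list grows with every company added, which is why the way that list is consumed (all at once or incrementally) matters.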
1) What is g.V(v).inE(e).hasNext() doing above that makes a call on a supernode take so long? If it's trying to load all incident edges, should either the documentation be updated or the function be renamed to reflect the potential latency? Or maybe the implementation could be broken up into something like C++-style iteration: traversal.begin(); while (traversal.hasNext()) traversal.next()... or something like that. begin() and hasNext() could be implemented via the range(..) function you mentioned to better control perceived latency.
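The range-based idea in the question above could look roughly like the following Python sketch. It assumes a hypothetical callable fetch_range(low, high) that returns the edges in the window [low, high), loosely playing the role of a range(low, high) query; paged_edges and has_any are likewise illustrative names, not part of any JanusGraph or TinkerPop API.

```python
def paged_edges(fetch_range, page_size=64):
    """Yield edges lazily, requesting one [low, high) window at a time."""
    low = 0
    while True:
        page = fetch_range(low, low + page_size)
        if not page:
            return  # Nothing left to read.
        yield from page
        if len(page) < page_size:
            return  # A short page means the end was reached.
        low += page_size


def has_any(fetch_range):
    # Existence check that asks for a single element only.
    return len(fetch_range(0, 1)) > 0
```

With this shape, the first next() call costs one window of page_size elements rather than the whole edge list. Each window is a separate query, though, and the sketch says nothing about how cheaply a real backend can skip to a given offset.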
2) When you mention remodelling, I can think of 2 ways to do so off the top of my head (please advise on others).
a. Have multiple types of Companies (TechCompany, FinancialCompany, etc.) to reduce the likelihood of a supernode
b. Add a property to the vertex instead, e.g. V('microsoft').has('rdfs:type', 'company'). If I do this, and assuming 'rdfs:type' is indexed as a property, will V().has('rdfs:type', 'company').hasNext() be fast? If so, why?
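As a rough intuition for option (b), here is a toy exact-match index in Python. It is only a sketch under the assumption that an index maps a (key, value) pair directly to the matching vertices; PropertyIndex and has_next are hypothetical names, and nothing here describes how JanusGraph implements or streams its own indexes.

```python
from collections import defaultdict

_MISSING = object()


class PropertyIndex:
    """Toy exact-match index: (key, value) -> vertex ids in insertion order."""

    def __init__(self):
        self._entries = defaultdict(list)

    def put(self, vertex_id, key, value):
        # Register the vertex under the exact (key, value) pair.
        self._entries[(key, value)].append(vertex_id)

    def lookup(self, key, value):
        # Return an iterator so callers can stop after the first hit.
        return iter(self._entries.get((key, value), ()))


def has_next(iterator):
    # True if the iterator can produce at least one more element (consumes it).
    return next(iterator, _MISSING) is not _MISSING
```

In this model the existence check reads a single index entry regardless of how many vertices share the value. Whether a real index lookup with millions of matches is consumed this lazily is a separate question.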
I hope this doesn't come across negatively. I am very interested in trying to bridge the gap between LPG and RDF (triplestores), and I think I have some good use cases that can hopefully help to improve JG down the road.
thx, matt
Hi Boxuan,
Happy to put in a request on GitHub, but I'm still a little confused. Are we saying g.E().has('index_key', 'large_number_of_edges').hasNext() isn't streaming but should be (note: g.E().hasNext() is fast)? Also, I think that to close the gap between RDF and property graphs, we need to see what can be done to allow natural RDF modelling, which really means making liberal use of edges. The problem with properties and RDF is that RDF expects you to index virtually everything in order for queries to be quick. I'm not sure how we can model non-generic properties in that capacity.
BTW I'm using Joshua's Graphsail (https://github.com/joshsh/graphsail) implementation to see if I can get it to work, and I'm working through some of the edge (no pun intended) cases.
thx, matt