hasNext() slow for large number of incoming edges
Matthew Nguyen <nguyenm9@...>
Hey folks, I have a vertex v that has about 7m+ incoming edges e. The following query takes 30+ seconds on a local installation of Cassandra.
|
AMIYA KUMAR SAHOO
Hi Matthew,
I don't know what it does underneath. But if you just want to check for edge existence with hasNext(), can you try with limit(1)?
g.V(n).inE(e).limit(1).hasNext()
Let's see,
Amiya
On Tue, 25 Jan 2022, 23:05 Matthew Nguyen via lists.lfaidata.foundation, <nguyenm9=aol.com@...> wrote:
|
|
Matthew Nguyen <nguyenm9@...>
Hi, I do need to traverse it but was hoping it would get chunked/streamed in. Or is there a better way for streaming/lazy loads?
|
|
Boxuan Li
Hi Matthew,
Unfortunately, that is not possible. You could do
g.V(v).inE(e).range(from, to).hasNext()
to “page” the result yourself, but under the hood it will fetch all of the first “to” results and then drop the first “from” results.
Btw, 7 million incident edges sounds like too many to me. This could cause various problems, e.g. high memory usage or large partitions (depending on your storage). You might consider remodelling it.
Best,
Boxuan
On Wed, Jan 26, 2022 at 3:11 AM Matthew Nguyen via lists.lfaidata.foundation <nguyenm9=aol.com@...> wrote:
|
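To make the cost of paging by hand concrete, here is a back-of-the-envelope sketch in Python. It assumes, as described above, that every range(from, to) call reads the first `to` results and then discards the first `from`. The function name and the page size are made up for illustration, and nothing here talks to JanusGraph.

```python
def rows_fetched_by_paging(total_edges: int, page_size: int) -> int:
    """Total rows read when each range(from, to) call re-reads the first `to` rows.

    Toy cost model based on the behaviour described in this thread,
    not a measurement of JanusGraph.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    fetched = 0
    start = 0
    while start < total_edges:
        end = min(start + page_size, total_edges)
        fetched += end  # this call reads rows [0, end) and drops rows [0, start)
        start = end
    return fetched


if __name__ == "__main__":
    # 7 million edges, paged 10,000 at a time
    print(rows_fetched_by_paging(7_000_000, 10_000))
```

Under this model, paging through 7 million edges 10,000 at a time reads about 2.45 billion rows in total, roughly 350 times the number of edges, so larger pages trade memory for far fewer repeated reads.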
|
Matthew Nguyen <nguyenm9@...>
Hi Boxuan, thanks for the response.
Some background: I'm trying to use JG as a triplestore and importing RDF. The triple <microsoft> <rdfs:type> <company> can be modelled as V('microsoft') -> E('rdfs:type') -> V('company') such that:
g.V().has('value', 'microsoft').out().has('value', 'company').inE('rdfs:type').hasNext() = true
Certainly there can be millions of companies out there that can be modelled similarly. I understand the issues surrounding supernodes, so perhaps this question is more about trying to understand some internals of JG.
Note: again, my use case is not exactly like the above, where everything is known, but more around the SPARQL query:
select ?company where { ?company rdfs:type <company> }
or "give me all companies of rdfs:type company", which translates to Gremlin:
1) What is g.V(v).inE(e).hasNext() doing above that a call on a supernode takes so long? If it's trying to load all incident edges, should either the documentation be updated or maybe the function be renamed to reflect potential latency issues? Or maybe the implementation could be broken up into something like C++ iteration: traversal.begin(); while (traversal.hasNext()) traversal.next()... or something like that. begin() and hasNext() could be implemented via the range(..) function you mentioned to better control perceived latency.
2) When you mention remodelling, I can think of 2 ways to do so off the top of my head (please advise on others).
thx, matt
|
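The begin()/hasNext()/next() idea in point 1 could be sketched on the client side roughly as follows. This is only an illustration in Python: `fetch_range(lo, hi)` is a hypothetical stand-in for running g.V(v).inE(e).range(lo, hi) and collecting the results, and the fake edge list exists only so the sketch runs on its own.

```python
from typing import Callable, Iterator, List


def chunked(fetch_range: Callable[[int, int], List[str]], page_size: int) -> Iterator[str]:
    """Yield results one page at a time instead of loading everything up front.

    `fetch_range` is a hypothetical callable, not a JanusGraph or TinkerPop API.
    """
    lo = 0
    while True:
        page = fetch_range(lo, lo + page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means the results are exhausted
        lo += page_size


if __name__ == "__main__":
    edges = [f"e{i}" for i in range(7)]  # fake edge list standing in for the graph

    def fake_fetch(lo: int, hi: int) -> List[str]:
        return edges[lo:hi]

    it = chunked(fake_fetch, page_size=3)
    print(next(it, None) is not None)  # the "hasNext()" check costs one page, not all edges
    print(list(it))                    # the rest arrives page by page
```

The first result is available after a single page has been fetched, which bounds the perceived latency. Each later page would still pay whatever range(lo, hi) costs on the server.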
|
Boxuan Li
Hi Matt,
It would definitely be a valid and valuable feature if we could expose the streaming capability to end users. If I recall correctly, the low-level results are indeed streamed (it might vary depending on the storage backend), but the interface is not exposed to the upper-level APIs. Do you want to create a feature request on GitHub? Otherwise I can do it later.
Regarding your particular use case, you said you had triples like <microsoft> <rdfs:type> <company>, which you modelled as V('microsoft') -> E('rdfs:type') -> V('company'). I suggest you model “type” as a property rather than an edge. So, you will not create a vertex called “company”. Rather, you create a vertex called “microsoft” with a property “type” whose value is “company”, e.g.
g.addV().property("value", "microsoft").property("type", "company")
Rule of thumb: when you anticipate a supernode, consider modelling it as a property rather than a vertex. Edges should be used to describe “relationships between nodes” rather than “properties attached to nodes”. This is the difference between an RDF graph and a property graph.
Best,
Boxuan
On Thu, Jan 27, 2022 at 12:51 AM Matthew Nguyen via lists.lfaidata.foundation <nguyenm9=aol.com@...> wrote:
|
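The difference between the two models can be illustrated with a small in-memory sketch in Python. The triples, function names, and dictionary layout are invented for illustration and are not how JanusGraph stores data. The point is only that the edge model gives the "company" vertex one incoming edge per company, while the property model has no such vertex at all.

```python
from collections import defaultdict

# Three made-up triples in the (subject, predicate, object) shape used in this thread.
TRIPLES = [
    ("microsoft", "rdfs:type", "company"),
    ("apple", "rdfs:type", "company"),
    ("ibm", "rdfs:type", "company"),
]


def edge_model(triples):
    """Every triple becomes an edge; returns the incoming edges per object vertex."""
    incoming = defaultdict(list)
    for subject, predicate, obj in triples:
        incoming[obj].append((predicate, subject))
    return dict(incoming)


def property_model(triples):
    """The type is stored as a property on the subject; no 'company' vertex exists."""
    return {s: {"value": s, "type": o} for s, _p, o in triples}


def vertices_of_type(vertices, type_name):
    """Plain scan for illustration; a real graph would typically want an index here."""
    return sorted(v["value"] for v in vertices.values() if v["type"] == type_name)


if __name__ == "__main__":
    # In the edge model the in-degree of "company" grows with the number of companies.
    print(len(edge_model(TRIPLES)["company"]))
    # In the property model the same question becomes a lookup on the "type" property.
    print(vertices_of_type(property_model(TRIPLES), "company"))
```

The trade-off is that "all vertices of type company" turns into a property lookup, which stays fast only if that property is indexed.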
|
Matthew Nguyen <nguyenm9@...>
Hi Boxuan,
Happy to put in a request on GitHub, but I'm still a little confused. Are we saying g.E().has('index_key', 'large_number_of_edges').hasNext() isn't streaming but should be (note: g.E().hasNext() is fast)?
Also, I think to close the gap between RDF and property graphs, we do need to see what can be done about allowing for natural modelling in RDF, which really means making liberal use of edges. The problem with properties and RDF is that RDF expects you to index virtually everything in order for the queries to be quick. Not sure how we can model non-generic properties in that capacity.
BTW I'm using Joshua's Graphsail (https://github.com/joshsh/graphsail) implementation to see if I can get it to work, and I'm trying to work through some of the edge (no pun intended) cases.
|
Boxuan Li
Hi Matt,
No worries, let me create an issue. You are right, g.E().hasNext() is fast, and that’s because the results are streamed. On the other hand, g.V().has("id", "v0").outE().hasNext() is slow if vertex v0 has a huge number of incident edges, and that’s because the results, in this case, are not streamed. It definitely needs some investigation, but usually it’s not a big problem because people don’t expect a large number of incident edges attached to a node.
Good luck with your work! If it’s not JanusGraph-specific, you might also want to join the TinkerPop Discord server (https://discord.gg/ndMpKZcBEE) to interact with the wider graph community.
Best,
Boxuan
|
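The streamed versus non-streamed distinction above can be illustrated with a toy model in Python. The row source below is fake and simply counts how many rows were read before the existence check could answer; it says nothing about JanusGraph's actual code paths, and all function names are made up.

```python
from typing import Iterator, List, Tuple


def edge_rows(n: int, counter: List[int]) -> Iterator[int]:
    """Fake storage scan that counts how many rows were actually read."""
    for i in range(n):
        counter[0] += 1
        yield i


def has_next_streamed(n: int) -> Tuple[bool, int]:
    """Existence check on a lazy source: pulls at most one row."""
    counter = [0]
    rows = edge_rows(n, counter)
    found = next(rows, None) is not None
    return found, counter[0]


def has_next_materialized(n: int) -> Tuple[bool, int]:
    """Existence check that loads every row before answering."""
    counter = [0]
    rows = list(edge_rows(n, counter))
    return len(rows) > 0, counter[0]


if __name__ == "__main__":
    print(has_next_streamed(1_000_000))      # answers after reading a single row
    print(has_next_materialized(1_000_000))  # answers only after reading every row
```

Both functions return the same boolean. Only the number of rows read differs, which is the gap the streaming feature request is meant to close.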
|
Boxuan Li
Created https://github.com/JanusGraph/janusgraph/issues/2966 to track the streaming feature request.
|
|