Indexing Strategies for RDF edges/predicates on Janusgraph


Matthew Nguyen <nguyenm9@...>
 

Hi, I am trying to build a triplestore on top of JG. The general model is:

Vertex (subject or object) Properties:
  Label
  Value (IRI, Literal) - indexed

Edge (predicate) Properties:
  Label (predicate)
  hash - effectively a unique hash of predicate so I can globally index it

So effectively we can have Vertex(subject) -> Edge (predicate) -> Vertex(object)

Let's assume I insert the following triples into this model

<matt> <employedBy> <some_company>
<jane> <employedBy> <some_company>
<product1> <isSoldBy> <some_company>
<some_office> <isLeasedBy> <some_company>
etc

Let's say there are literally 1k different predicates that can be associated with <some_company>, and predicates like <employedBy> can have high cardinality if the company is large. What's a good way to index these edges/predicates so I can quickly query for all edges/predicates of a particular type on <some_company> (e.g. 'give me all the ?people <employedBy> <some_company>')?

I'm aware of the vertex-centric indexes on edges but it appears I would need to build an index for each of the possible edge labels of <some_company> if I understand the docs correctly (https://docs.janusgraph.org/schema/index-management/index-performance/#edge-indexes).  Please correct me if I'm wrong.  If not, is there another strategy I can use?

thx, matt


hadoopmarc@...
 

Hi Matthew,

It would be possible to replace the employedBy, isSoldBy and isLeasedBy relations with a single relatedToCompany relation that has employment, selling and lease properties. But I do not see any advantage compared to the original model: there is nothing wrong with having a lot of frequently used vertex-centric indices, and the original model is easier to use.

Cheers,     Marc


AMIYA KUMAR SAHOO
 

Hi Matthew,


As per the note below from the JanusGraph docs, even if a company has 1k different types of edges attached to it, traversing by edge label will be fast.

For example, finding the employees connected to a company by the employedBy edge label.

But if a single edge type has high cardinality, then you have to manually create an edge index on the respective property.

JanusGraph automatically builds vertex-centric indexes per edge label and property key. That means, even with thousands of incident battled edges, queries like g.V(h).out('mother') or g.V(h).values('age') are efficiently answered by the local index.


Thanks,
Amiya
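
To make the quoted note concrete, here is a toy Python sketch of why grouping a vertex's incident edges by label lets a label-restricted lookup skip every other label. It is not JanusGraph code and does not claim to mirror its storage layout; `ToyVertex`, `add_in_edge` and `in_` are hypothetical names made up for this illustration.

```python
from collections import defaultdict


class ToyVertex:
    """Toy stand-in for a vertex whose incident edges are grouped by label.

    Illustration only: a model of "one local index per edge label",
    not JanusGraph's actual implementation.
    """

    def __init__(self, name):
        self.name = name
        # label -> list of source vertex names (incoming edges only, for brevity)
        self._in_edges = defaultdict(list)

    def add_in_edge(self, label, source):
        self._in_edges[label].append(source)

    def in_(self, label):
        # Only the bucket for `label` is read; other labels are never scanned.
        return list(self._in_edges.get(label, []))


company = ToyVertex("some_company")
company.add_in_edge("employedBy", "matt")
company.add_in_edge("employedBy", "jane")
company.add_in_edge("isSoldBy", "product1")

employees = company.in_("employedBy")
print(employees)
```

With the subject -> predicate -> object model from the original question, the edges point at the company, so the matching Gremlin traversal would presumably use in('employedBy') from the company vertex rather than out('employedBy').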


Matthew Nguyen <nguyenm9@...>
 

Hi Amiya, I saw that but wasn't quite sure of the intent given the examples. The note talks about edge labels, but the examples return vertices and values?

g.V(h).out('mother') -> returns a vertex traversal?
g.V(h).values('age') -> returns a Value?

Also, what do you mean by 'But if you have a high cardinality for a single edge type, then you have to manually create edge index on respective property.'?  

thx, matt


AMIYA KUMAR SAHOO
 

Hi Matthew,

The two examples show two different types of default index.

g.V(h).out('mother')
- This is an example of the default vertex-centric index per edge label.
- It helps to quickly traverse one specific type of edge among many different types of edges.
- In your case, finding all employees employedBy a company will use this.

g.V(h).values('age')
- This is an example of the default vertex-centric index per property key.
- It helps to get the value of a single property among the several properties of a single vertex.


Now consider a situation where 1k types of edges are attached to one vertex (a company). Apart from employedBy, each edge type has low cardinality (let's say < 10), but 2k employees are employedBy that company. You want to find out whether the company has an employee named John. If your traversal starts from the company and follows the employedBy edges, it has to traverse all 2k edges to find out whether John is an employee or not. This can be made faster if the employee name is available on the edge and a vertex-centric index (VCI) is enabled on it.

This might not be a very good example, as it can be optimised in different ways:
1) If the employee vertex has a low degree for the employedBy edge, you can start the traversal from the employee vertex.

Hope it helps,
Amiya
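
The difference between scanning all employedBy edges and using an index on an edge property can be sketched with a toy Python model. This is an illustration under stated assumptions, not JanusGraph code: the edge dictionaries, the generated employee names, and the helpers `scan_for` and `indexed_lookup` are all hypothetical. In JanusGraph itself, such an index would be declared through the management API described on the docs page linked in the original question.

```python
import bisect

# Toy model: the employedBy edges of one company, each carrying the
# employee name as an edge property (made-up data, illustration only).
edges = [{"name": f"employee{i:04d}", "source": i} for i in range(2000)]
edges.append({"name": "John", "source": 2000})


def scan_for(edge_list, name):
    """No index on the property: inspect edges one by one."""
    checked = 0
    for edge in edge_list:
        checked += 1
        if edge["name"] == name:
            return edge, checked
    return None, checked


# Index on the property: keep the edges sorted by name so that a
# binary search can jump to the match instead of scanning.
sorted_edges = sorted(edges, key=lambda e: e["name"])
sorted_names = [e["name"] for e in sorted_edges]


def indexed_lookup(name):
    """Binary search over the name-sorted edges; returns None if absent."""
    pos = bisect.bisect_left(sorted_names, name)
    if pos < len(sorted_names) and sorted_names[pos] == name:
        return sorted_edges[pos]
    return None


scanned_edge, edges_checked = scan_for(edges, "John")
indexed_edge = indexed_lookup("John")
print(edges_checked, indexed_edge["source"])
```

Both lookups return the same edge; the scan touches every edge stored ahead of the match, while the sorted lookup needs only a logarithmic number of comparisons.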


Matthew Nguyen <nguyenm9@...>
 

Thanks for the clarification. That makes sense. I suppose if there were multiple edges from V(h) to V('mother') I would need to qualify it to ensure the path exists?

e.g. V(h).out('mother').inE('isSon') vs V(h).out('mother').inE('battled')


