Thanks @pluradj for helping out. The problem we were facing with JanusGraph is that our data was loaded once, as a batch job, quite early on. The dataset has around 1.8 billion nodes. The initial load was quite slow and made use of only a selective set of composite indexes at the time. As a result, when we built microservices around multiple queries, not all of them were fast enough, i.e. their response times were huge and unacceptable.
Upon profiling, we could see that the JanusGraph queries did not use the proper index. As we understand it, steps like `hasNot()` or `has(property, null)` cannot be answered from an index, so they fall back to scans and make responses slower. One way to avoid that is to materialize the property everywhere and default missing values to a sentinel. For example, instead of querying `has(property, null)`, you can run a one-time traversal that replaces `null` with a default value, e.g. `g.V().has("nodelabel", "vertex").as("a").property("property", coalesce(select("a").values("property"), constant(-999))).iterate()`, and then query with `g.V().has("property", -999)`. Such queries will make use of the index and optimize traversal time as well.
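A simpler variant of the same backfill idea, sketched below under assumed names (the label `"vertex"`, key `"property"`, and sentinel `-999` are placeholders from the example above, not a fixed schema): only vertices missing the property are touched, so existing values stay intact.

```groovy
// One-time backfill: write the sentinel only where the property is absent.
// Run in batches at large scale rather than one giant transaction.
g.V().has("nodelabel", "vertex").
  hasNot("property").
  property("property", -999).
  iterate()
g.tx().commit()

// Afterwards, "find vertices without a value" becomes an indexable lookup:
g.V().has("nodelabel", "vertex").has("property", -999)
```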
Now, if we follow this process, we may sometimes need to create additional properties that were not defined as part of the initial schema. For example, if we want to query vertices that do not have a particular type of edge, the query falls back to the issue mentioned above and will not be fast. The workaround is to create a property, say `edge_count_on_vertex`, and store the edge count in it. If it is 0, the vertex has no such edge, which is equivalent to filtering with `not(__.bothE("edgeLabel"))` in the traversal.
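The edge-count workaround can be sketched roughly as follows (the label `"vertex"` is a placeholder; `edge_count_on_vertex` and `edgeLabel` come from the text; at 1.8B-vertex scale this backfill should run as a batch/OLAP job, not a single OLTP traversal):

```groovy
// Backfill: store each vertex's count of "edgeLabel" edges as a property,
// so the "has no such edge" check becomes an indexable equality lookup.
g.V().hasLabel("vertex").
  property("edge_count_on_vertex", __.bothE("edgeLabel").count()).
  iterate()
g.tx().commit()

// Indexed replacement for not(__.bothE("edgeLabel")):
g.V().has("edge_count_on_vertex", 0)
```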
Since this requires creating a new property, that property needs to be indexed, and the existing data reindexed, as well. Ideally we want the schema to be defined a priori and to avoid reindexing as much as possible, but in our case that was not an option, so we had to reindex our data. The issue we faced was related to reindexing, and the above is the background to the problem.
The following is a snippet of the conversation between me and @pluradj. So, reindexing is an expensive job. Our data size is around 1.8 billion vertices (this is for the Lighthouse project as well). I know the traditional way of reindexing using the following steps:
```groovy
// CREATE INDEX: create an index
mgmt = graph.openManagement()
```

```groovy
// REINDEX
// Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byVideoCombo1Mixed_2').call()
// Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byVideoCombo1Mixed_2"), SchemaAction.REINDEX).get()
mgmt.commit()
```
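The CREATE INDEX step above was truncated after `openManagement()`; a typical definition for a mixed index like `byVideoCombo1Mixed_2` would look roughly like this (the property key `videoId` and the `search` backend name are assumptions for illustration, not from the original):

```groovy
// Hypothetical completion of the truncated CREATE INDEX step.
// "videoId" and the "search" indexing-backend name are placeholders.
mgmt = graph.openManagement()
videoId = mgmt.getPropertyKey('videoId') ?:
          mgmt.makePropertyKey('videoId').dataType(String.class).make()
mgmt.buildIndex('byVideoCombo1Mixed_2', Vertex.class).
     addKey(videoId).
     buildMixedIndex('search')
mgmt.commit()
```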
[1:24 PM] I think this makes use of `JanusGraphManagement` to do the reindexing on a single machine (per the docs, it "spawns a single-machine OLAP job"), so as expected this will be really slow for the scale of data we are talking about. [1:25 PM] I think there is also a way to reindex data using a MapReduce job, right? How do we do that? I think this was part of the newer versions. Per the docs (https://docs.janusgraph.org/index-management/index-reindexing/) we can do the following:
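Per the linked reindexing docs, the MapReduce path uses `MapReduceIndexManagement` instead of `JanusGraphManagement` so the repair runs as a distributed Hadoop job; a sketch following the documentation (assumes the Hadoop classpath is configured for the cluster):

```groovy
// MapReduce-based reindex of the same index, per the docs linked above.
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getGraphIndex("byVideoCombo1Mixed_2"), SchemaAction.REINDEX).get()
mgmt.commit()
```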
Well, using that did not optimize the reindexing time much either, although there was some reduction in the reindexing step. Also, as a workaround, we restricted index creation to a `vertexLabel` to reduce the scope of reindexing and make it finish in a reasonable time.
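Restricting an index to a single vertex label is done with `indexOnly()` at build time, which shrinks the set of vertices a reindex has to visit. A rough sketch (the label `video`, key `videoId`, and index name are placeholders for illustration):

```groovy
// Label-restricted index: only vertices with label "video" are indexed,
// so REINDEX only has to touch those vertices. Names are assumptions.
mgmt = graph.openManagement()
video   = mgmt.getVertexLabel('video') ?: mgmt.makeVertexLabel('video').make()
videoId = mgmt.getPropertyKey('videoId') ?:
          mgmt.makePropertyKey('videoId').dataType(String.class).make()
mgmt.buildIndex('byVideoRestricted', Vertex.class).
     addKey(videoId).
     indexOnly(video).
     buildCompositeIndex()
mgmt.commit()
```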