The following is a snippet of a conversation between me and @pluradj:
So, reindexing is an expensive job. Our data size is around 1.8 billion vertices (this is for the Lighthouse project as well). I know the traditional way of reindexing, using the following steps:
CREATE INDEX:
// Create an index
mgmt = graph.openManagement()
deletedOn = mgmt.getPropertyKey("deletedOn")
expirationDate = mgmt.getPropertyKey("expirationDate")
vertexLabel = mgmt.getPropertyKey("vertexLabel")
videoMixedIndex = mgmt.buildIndex('byVideoCombo1Mixed_2', Vertex.class).addKey(deletedOn).addKey(expirationDate).addKey(vertexLabel).buildMixedIndex("search")
mgmt.commit()
graph.tx().rollback()
REINDEX:
// Wait for the index to become available
ManagementSystem.awaitGraphIndexStatus(graph, 'byVideoCombo1Mixed_2').call()
// Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byVideoCombo1Mixed_2"), SchemaAction.REINDEX).get()
mgmt.commit()
[1:24 PM]
I think this uses JanusGraphManagement to do the reindexing on a single machine (per the docs, it spawns a single-machine OLAP job),
so as expected this will be really slow at the scale of data we are talking about.
[1:25 PM]
I think there is also a way to reindex data using a MapReduce job, right? How do we do that? I think this was part of newer versions. Per the docs (https://docs.janusgraph.org/index-management/index-reindexing/) we can do the following:
mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getRelationIndex(mgmt.getRelationType("battled"), "battlesByTime"), SchemaAction.REINDEX).get()
mgmt.commit()
[1:27 PM]
But the Gremlin Console throws an exception when I run `mr = new MapReduceIndexManagement(graph)`. I'm using `JanusGraph 0.3.2`.
```
gremlin> mr = new MapReduceIndexManagement(graph)
groovysh_evaluate: 3: unable to resolve class MapReduceIndexManagement
```