Options for Bulk Read/Bulk Export


subbu165@...
 

Hi there, we have JanusGraph with FoundationDB as the storage back-end and Elasticsearch as the index back-end. Please let me know the best way to export/read millions of records from JanusGraph, keeping performance in mind. We don't have the option of using Spark in our environment. I have seen hundreds of articles on bulk loading but not on bulk export/read. Any suggestion would be of great help here.


hadoopmarc@...
 

Hi,

There are three solution directions:
  1. if you have keys to your vertices available, either vertex ids or unique values of some vertex property, you can start as many Gremlin clients as your backends can handle and distribute the keys over the clients. This is the easy case, but often not applicable.
  2. if there are no keys available, Gremlin can only help you with a full table scan, g.V(). If you have a client machine with many cores, the withComputer() step, either with or without spark-local, will help you parallelize the scan.
  3. you can copy the vertex files from the storage backend and decode them offline. Decoding procedures are implicit in the JanusGraph source code, but I am not aware of any library that does this for you explicitly.
You decide, but I would suggest option 2 with spark-local as the option that works out of the box.
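
For illustration, a rough Java sketch of option 2 (`janusConfig` is just a placeholder; without extra configuration withComputer() should fall back to JanusGraph's in-memory OLAP computer, while a Spark setup would give you spark-local):

// Full scan of all vertices, parallelized by a GraphComputer.
JanusGraph graph = JanusGraphFactory.open(janusConfig);
GraphTraversalSource g = graph.traversal().withComputer();
g.V().valueMap(true)                          // property map plus id and label per vertex
     .forEachRemaining(System.out::println);  // replace with your export writer
graph.close();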

Best wishes,
Marc


Oleksandr Porunov
 

To add to Marc's suggestions, there is also a multiQuery option in janusgraph-core. Notice that it is an internal JanusGraph API, not Gremlin, so it might be unavailable to you if you cannot access the JanusGraph internal API for any reason.
If you work with multiQuery like `janusGraph.multiQuery().addAllVertices(yourVertices).properties()`, then make sure your transaction cache is at least the size of `yourVertices.size()`. Otherwise, additional calls might be executed against your backend, which may not be as efficient.
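
For example, something along these lines (only a sketch; it assumes you open the transaction yourself and that `graph` and `yourVertices` are already in scope) sizes the transaction-level cache to the preloaded batch:

// Size the transaction-level vertex cache to the batch you preload.
JanusGraphTransaction tx = graph.buildTransaction()
        .vertexCacheSize(yourVertices.size())   // at least as many entries as vertices passed to multiQuery
        .start();
tx.multiQuery()
  .addAllVertices(yourVertices)
  .properties()                                 // preloads the properties into the tx-level cache
  .forEach((vertex, props) -> {
      // read the properties here without extra backend calls
  });
tx.commit();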

Best regards,
Oleksandr


subbu165@...
 

So currently we have JanusGraph with FDB as the storage back-end and Elasticsearch for indexing.
 
First we get the vertex IDs from the Elasticsearch index back-end, and then below is what we do:
JanusGraph graph = JanusGraphFactory.open(janusConfig);
Vertex vertex = graph.vertices(vertexId).next(); 
 
All of the above, including getting the vertex IDs from Elasticsearch, happens within the Spark context, using a Spark RDD for partitioning and parallelization. If we remove Spark from the equation, what is the best way to do a bulk export?
Also @Oleksandr, you stated that "Otherwise, additional calls might be executed against your backend, which may not be as efficient." How should we make these additional calls and get the subsequent records? Let's say I'm exporting 10M records and our cache/memory size doesn't support that much, so first I retrieve records 1 to 1M, then 1M to 2M, then 2M to 3M, and so on. How can we iterate this way? How can this be achieved in Janus? Please shed some light.


Oleksandr Porunov
 

> Also @Oleksandr, you stated that "Otherwise, additional calls might be executed against your backend, which may not be as efficient." How should we make these additional calls and get the subsequent records? Let's say I'm exporting 10M records and our cache/memory size doesn't support that much, so first I retrieve records 1 to 1M, then 1M to 2M, then 2M to 3M, and so on. How can we iterate this way? How can this be achieved in Janus? Please shed some light.

Not sure I'm fully following, but I will try to add some more clarity.
- Vertex ids are not cleared from the `vertex` objects. So, when you return vertices you simply hold them in your heap, but all their edges / properties are managed by internal caches. By default, if you return vertices you don't return their properties / edges.
To return properties for vertices you can use the `valueMap`, `properties`, or `values` Gremlin steps.
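For example (a sketch; `g`, `vertexIds`, and the property key "name" are placeholders):

g.V(vertexIds.toArray()).valueMap(true).toList();     // property maps, plus id and label, per vertex
g.V(vertexIds.toArray()).properties("name").toList(); // property objects for the selected keys
g.V(vertexIds.toArray()).values("name").toList();     // just the property values
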
In the previous message I wasn't talking about using Gremlin but about `multiQuery`, which is a JanusGraph feature. `multiQuery` may store data in the tx-cache if you preload your properties.
To use multiQuery you must provide the vertices for which you want to preload properties (think of them as simple vertex ids rather than a collection of all vertex data). After you preload properties they are stored in the tx-level cache, and they may also be stored in the db-level cache if you have enabled it. After that you can access vertex properties without additional calls to the underlying database; instead those properties come from the tx-level cache.
There is a property `cache.tx-cache-size`, described as `Maximum size of the transaction-level cache of recently-used vertices.` By default it is 20000, but you can also configure it individually per transaction when you create the transaction.
As you said, you don't have the possibility to store 10M vertices in your cache, so you need to split your work into chunks.
Basically something like:
janusGraph.multiQuery().addAllVertices(yourFirstMillionVertices).properties().forEach((vertex, props) -> {
    // process your vertex properties
});
janusGraph.multiQuery().addAllVertices(yourSecondMillionVertices).properties().forEach((vertex, props) -> {
    // process your vertex properties.
    // As yourFirstMillionVertices have already been processed, they will be evicted from the tx-level cache,
    // because yourSecondMillionVertices are now the recently-used vertices.
});
janusGraph.multiQuery().addAllVertices(yourThirdMillionVertices).properties().forEach((vertex, props) -> {
    // process your vertex properties.
    // As yourSecondMillionVertices have already been processed, they will be evicted from the tx-level cache,
    // because yourThirdMillionVertices are now the recently-used vertices.
});
// ...

You may also simply close and reopen transactions after you have processed a chunk of your data.
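
For instance, a rough sketch of that chunked pattern with an explicit transaction per chunk (`graph`, `allVertexIds`, and the chunk size are only placeholders):

// Export in chunks, opening a fresh transaction per chunk so the
// tx-level cache never has to hold more than one chunk.
int chunkSize = 1_000_000;
for (int from = 0; from < allVertexIds.size(); from += chunkSize) {
    List<Object> chunkIds = allVertexIds.subList(from, Math.min(from + chunkSize, allVertexIds.size()));
    JanusGraphTransaction tx = graph.buildTransaction()
            .vertexCacheSize(chunkIds.size())   // cache sized to exactly one chunk
            .start();
    try {
        List<JanusGraphVertex> chunk = new ArrayList<>();
        tx.vertices(chunkIds.toArray()).forEachRemaining(v -> chunk.add((JanusGraphVertex) v));
        tx.multiQuery().addAllVertices(chunk).properties().forEach((vertex, props) -> {
            // write the vertex and its preloaded properties to your export target
        });
    } finally {
        tx.rollback();   // read-only work, nothing to commit; releases the tx-level cache
    }
}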

Under the hood multiQuery will either use your backend's batch feature or the parallel backend executor service: https://docs.janusgraph.org/configs/configuration-reference/#storageparallel-backend-executor-service

In case you are trying to find a good executor service, I would suggest looking at a scalable executor service like https://github.com/elastic/elasticsearch/blob/dfac67aff0ca126901d72ed7fe862a1e7adb19b0/server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java#L74-L81
or similar executor services. I wouldn't recommend using executor services without an upper bound, like a cached thread pool, because they are quite dangerous.
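
As a generic JDK illustration (not JanusGraph-specific, and the pool sizes are arbitrary), a bounded pool could look like the following, whereas Executors.newCachedThreadPool() has no upper bound on the number of threads it creates:

// A bounded thread pool with back pressure instead of an unbounded cached pool.
ExecutorService bounded = new ThreadPoolExecutor(
        4, 16,                                      // core and maximum number of threads
        30, TimeUnit.SECONDS,                       // idle threads above the core size are reclaimed
        new LinkedBlockingQueue<>(1_000),           // bounded work queue
        new ThreadPoolExecutor.CallerRunsPolicy()); // when saturated, the submitting thread runs the task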

Hope it helps somehow.

Best regards,
Oleksandr