Re: Options for Bulk Read/Bulk Export


Oleksandr Porunov
 

> Also @oleksandr, you have stated that "Otherwise, additional calls might be executed to your backend which could be not as efficient." how should we do these additional calls and get subsequent records. Lets say I'm exporting 10M records and our cache/memory size doesn't support that much, so first I retrieve 1 to 1M records and then 1M to 2M, then 2M to 3M and so on, how can we iterate this way? how can this be achieved in Janus, Please throw some light

I'm not sure I fully follow, but I'll try to clarify.
- Vertex ids are not cleared from `vertex`. So when you return vertices, you simply hold them in your heap, while all edges / properties are managed by internal caches. By default, returning vertices does not return their properties / edges.
To return properties for vertices you can use the `valueMap`, `properties`, or `values` Gremlin steps.
In my previous message I wasn't talking about Gremlin but about `multiQuery`, which is a JanusGraph feature. `multiQuery` may store data in the tx-level cache if you preload your properties.
To use multiQuery you must provide the vertices for which you want to preload properties (think of them as simple vertex ids rather than a collection of all vertex data). After you preload properties, they are stored in the tx-level cache and may also be stored in the db-level cache if you have enabled it. You can then access vertex properties from the tx-level cache without additional calls to the underlying database.
There is a property `cache.tx-cache-size`, described as `Maximum size of the transaction-level cache of recently-used vertices.` By default it's 20000, but you can configure it individually per transaction when you create the transaction.
As you said, you can't store 10M vertices in your cache, so you need to split your work into chunks.
Basically something like:
janusGraph.multiQuery().addAllVertices(yourFirstMillionVertices).properties().forEach((vertex, properties) -> {
    // process your vertex properties
});
janusGraph.multiQuery().addAllVertices(yourSecondMillionVertices).properties().forEach((vertex, properties) -> {
    // process your vertex properties
    // yourFirstMillionVertices are already processed, so they will be evicted from the
    // tx-level cache because yourSecondMillionVertices are now the recently-used vertices.
});
janusGraph.multiQuery().addAllVertices(yourThirdMillionVertices).properties().forEach((vertex, properties) -> {
    // process your vertex properties
    // yourSecondMillionVertices are already processed, so they will be evicted from the
    // tx-level cache because yourThirdMillionVertices are now the recently-used vertices.
});
// ...

You may also simply close and reopen transactions when you processed some chunk of your data.
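
The chunking itself is independent of JanusGraph, so here is a minimal sketch of it using only the JDK. The class name `ChunkedExport` and its methods are hypothetical names made up for illustration; the handler is where a `multiQuery` call for one chunk (or a close / reopen of the transaction) would go, and the chunk size is something you would tune against your tx-cache size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of JanusGraph: splits a list into fixed-size chunks.
final class ChunkedExport {

    // Returns consecutive sublists of `items`, each holding at most `chunkSize` elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        int start = 0;
        while (start < items.size()) {
            int remaining = items.size() - start;
            int end = remaining <= chunkSize ? items.size() : start + chunkSize;
            chunks.add(items.subList(start, end)); // a view, no copying
            start = end;
        }
        return chunks;
    }

    // Calls `handler` once per chunk, in order, and returns the number of chunks handled.
    static <T> int forEachChunk(List<T> items, int chunkSize, Consumer<List<T>> handler) {
        int processed = 0;
        for (List<T> chunk : partition(items, chunkSize)) {
            handler.accept(chunk);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Stand-in for vertex ids; in a real export these would be your vertices.
        List<Long> vertexIds = new ArrayList<>();
        for (long id = 1; id <= 10; id++) {
            vertexIds.add(id);
        }
        // Assumption: the body of this lambda is where one multiQuery call per chunk would go.
        forEachChunk(vertexIds, 4, chunk -> System.out.println("processing " + chunk));
    }
}
```

Because `subList` returns views, only one chunk's worth of loaded properties needs to be live at a time, which is the point of splitting the work.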

Under the hood, multiQuery will either use the batch db feature or the executor service described at https://docs.janusgraph.org/configs/configuration-reference/#storageparallel-backend-executor-service

If you are looking for a good executor service, I would suggest a scalable executor service like https://github.com/elastic/elasticsearch/blob/dfac67aff0ca126901d72ed7fe862a1e7adb19b0/server/src/main/java/org/elasticsearch/common/util/concurrent/EsExecutors.java#L74-L81
or similar executor services. I wouldn't recommend executor services without an upper bound, such as a cached thread pool, because they are quite dangerous: under load they can keep creating new threads without limit.
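
To illustrate what an upper bound means here, below is a minimal JDK-only sketch. It is not the Elasticsearch scaling executor linked above but a simpler stand-in: a `ThreadPoolExecutor` with a fixed maximum number of threads, a bounded queue, and caller-runs backpressure. The class name `BoundedExecutors` and the sizes used are assumptions for illustration, and how to plug such an executor into JanusGraph is not shown.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: builds an executor whose threads and queue are both bounded.
final class BoundedExecutors {

    static ThreadPoolExecutor newBounded(int maxThreads, int queueCapacity) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads,                                  // core pool size
                maxThreads,                                  // hard upper bound on threads
                60L, TimeUnit.SECONDS,                       // idle threads time out after 60s
                new ArrayBlockingQueue<>(queueCapacity),     // bounded work queue
                new ThreadPoolExecutor.CallerRunsPolicy());  // when full, the submitter runs the task
        executor.allowCoreThreadTimeOut(true);               // let the pool shrink when idle
        return executor;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = newBounded(4, 16);
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 100; i++) {
            executor.execute(done::incrementAndGet);
        }
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("tasks completed: " + done.get()
                + ", largest pool size: " + executor.getLargestPoolSize());
    }
}
```

With the caller-runs policy, a full queue slows the submitting thread down instead of growing the pool, so thread count and memory stay bounded no matter how much work is submitted.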

Hope it helps somehow.

Best regards,
Oleksandr
