Batching queries to the backend for faster performance
Debasish Kanhar <d.k...@...>
Well, the title may be misleading; I couldn't think of a better one. Let me give a brief overview of the issue we are facing and the possible solutions we are considering, and ask for your suggestions, and for help connecting with anyone in the community who can work with us on this problem :-)
So, we have a requirement to implement Snowflake as a backend for JanusGraph. (https://groups.google.com/forum/#!topic/janusgraph-dev/9JrMYF_01Cc) . We modeled Snowflake as a KeyValueStore and successfully created an interface layer, extending OrderedKeyValueStore, to interact with Snowflake. (https://gitlab.com/system-
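Since the modeling is an ordered key/value store, the core operation our interface implements is a key-range slice. Below is a simplified in-memory sketch of those getSlice semantics; it deliberately does not use the real JanusGraph types (the actual OrderedKeyValueStore works on StaticBuffer keys and a StoreTransaction), just plain Strings over a sorted map to show what each edgestore sub-query asks for:

```java
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified sketch of OrderedKeyValueStore-style slice semantics.
// Real JanusGraph uses StaticBuffer keys; plain Strings here for clarity.
public class SliceSketch {
    private final NavigableMap<String, String> table = new TreeMap<>();

    public void put(String key, String value) {
        table.put(key, value);
    }

    // getSlice: all entries with startKey <= key < endKey, in key order,
    // truncated to `limit` results (as the edgestore queries are).
    public SortedMap<String, String> getSlice(String startKey, String endKey, int limit) {
        SortedMap<String, String> range = table.subMap(startKey, true, endKey, false);
        SortedMap<String, String> out = new TreeMap<>();
        for (var e : range.entrySet()) {
            if (out.size() >= limit) break;
            out.put(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

In the real interface each such slice becomes one round trip to Snowflake (roughly `SELECT key, value FROM <store> WHERE key >= ? AND key < ? ORDER BY key`), which is exactly why the per-query latency adds up.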
For example, the attached file (query breakdown.txt) shows how a simple Gremlin query like g.V().has("node_label", "user").limit(5).valueMap(true) is broken down into a set of multiple edgestore queries. (I'm not including the queries to graphindex and janusgraph_ids, as those are low volume.) We have also captured the order in which the queries are executed (the 1st line is the 1st query, the 2nd line is the 2nd, and so on).
My problem is: is there some way we can batch these queries? Since Snowflake is a data warehouse, each single query takes hundreds of milliseconds to execute, so the ~100 sub-queries in the example file easily take 10 seconds minimum. We would like to optimize that by batching the queries together, so that they can be executed together and their responses reconciled together.
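One way such batching could work is to translate N accumulated (start, end) key ranges into a single SQL statement with OR'd range predicates, plus a computed column tagging each row with the range it belongs to, so the combined result set can be split back per sub-query on the client side. This is only a sketch of the idea; the table and column names (edgestore, key, value) are assumptions about our mapping, and real code would use bind parameters instead of string concatenation:

```java
import java.util.List;

public class BatchSql {
    public record Range(String start, String end) {}

    // Build one SELECT covering every accumulated key range. A CASE
    // expression tags each row with the index of the range it satisfies,
    // so the combined results can be demultiplexed per sub-query.
    // (Overlapping ranges would need extra care: CASE picks the first match.)
    public static String buildBatchQuery(String table, List<Range> ranges) {
        StringBuilder caseExpr = new StringBuilder("CASE");
        StringBuilder where = new StringBuilder();
        for (int i = 0; i < ranges.size(); i++) {
            Range r = ranges.get(i);
            String pred = "(key >= '" + r.start() + "' AND key < '" + r.end() + "')";
            caseExpr.append(" WHEN ").append(pred).append(" THEN ").append(i);
            if (i > 0) where.append(" OR ");
            where.append(pred);
        }
        caseExpr.append(" END AS range_id");
        return "SELECT key, value, " + caseExpr + " FROM " + table
             + " WHERE " + where + " ORDER BY range_id, key";
    }
}
```

With this shape, 100 sub-queries become one warehouse round trip, and the `range_id` column tells us which original getSlice each returned row answers.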
For example, if the flow is as follows:
Can we change the flow above, which is the generic flow for TinkerPop databases, into something like the flow below by introducing an Accumulator/Aggregator step?
1. Instead of our interface interacting directly with the Snowflake backend, we bring in an Aggregation step in between.
2. The Aggregation step accumulates all the getSlice queries (StartKey, EndKey, and store name) until every query that can be compartmentalized has been accumulated.
3. Once accumulated, it executes all of them together against the backend.
4. Once executed, all the queries' responses come back to the Aggregation step, which breaks them down according to the input queries and sends them back to GraphStep for reconciliation and for building the output of the Gremlin query.
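The accumulate → execute → reconcile loop described above could be sketched as a buffering layer: each getSlice call parks its range and receives a future, and a flush runs all pending ranges in one backend call (stood in for here by a lambda) and completes each future with its share of the results. This is purely illustrative, not JanusGraph's actual execution path, and the names (submitSlice, flush) are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class SliceAccumulator {
    public record Range(String start, String end) {}

    private final List<Range> pending = new ArrayList<>();
    private final List<CompletableFuture<Map<String, String>>> futures = new ArrayList<>();
    // Stand-in for the real backend round trip: takes all accumulated
    // ranges, returns one result map per range, in the same order.
    private final Function<List<Range>, List<Map<String, String>>> backend;

    public SliceAccumulator(Function<List<Range>, List<Map<String, String>>> backend) {
        this.backend = backend;
    }

    // Called wherever getSlice used to hit the backend directly:
    // the query is parked and a future is handed back immediately.
    public CompletableFuture<Map<String, String>> submitSlice(String start, String end) {
        pending.add(new Range(start, end));
        CompletableFuture<Map<String, String>> f = new CompletableFuture<>();
        futures.add(f);
        return f;
    }

    // One backend round trip for everything accumulated so far, then
    // each result is handed back to the future that asked for it.
    public void flush() {
        List<Map<String, String>> results = backend.apply(List.copyOf(pending));
        for (int i = 0; i < futures.size(); i++) {
            futures.get(i).complete(results.get(i));
        }
        pending.clear();
        futures.clear();
    }
}
```

The hard part, of course, is deciding when to flush: the traversal has to reach a point where no more compartmentalizable getSlice calls will arrive for the current step, which is exactly the TinkerPop/JanusGraph expertise we are asking for.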
As for what we have done so far: we edited the JanusGraph core classes so we can track the flow of information from class to class whenever a Gremlin query is executed. That tells us, for a given Gremlin query, which classes are called, iteratively, until execution reaches our interface's getSlice method, and lets us look for repetitive patterns in the query. For that we generated roughly 6000 lines of custom logs, which we are now analyzing.
After analyzing logs, we have been able to reach at following flow of classes:
My question is: is this possible from the TinkerPop perspective? From the JanusGraph perspective? Our project is ready to pay JanusGraph or TinkerPop experts part time as freelancers. We are looking for anyone with expertise in this domain who can help us achieve this. The potential impact of this use case is significant: it could improve performance on existing backends as well, and help a lot of memory-intensive queries execute much faster.