Batching queries to the backend for faster performance


Debasish Kanhar <d.k...@...>
 

Hi All,

Well, the title may be misleading, as I couldn't think of a better one. Let me give a brief overview of the issue, the possible solutions we are considering, and where we need your suggestions, as well as help connecting with anyone in the community who can assist us with the problem :-)

So, we have a requirement to implement Snowflake as a backend for JanusGraph (https://groups.google.com/forum/#!topic/janusgraph-dev/9JrMYF_01Cc). We were able to model Snowflake as a KeyValueStore and successfully created an interface layer that extends OrderedKeyValueStore to interact with Snowflake (https://gitlab.com/system-soft-technologies-opensource/janus-snowflake). The problem we now face was anticipated: slow response times, for both READ and WRITE. Every Gremlin query that TinkerPop/JanusGraph issues is broken down into multiple backend queries, which are executed one after another in sequential order to build the response to the Gremlin query.

For example, the attached file (query breakdown.txt) shows how a simple Gremlin query like g.V().has("node_label", "user").limit(5).valueMap(true) is broken down into a set of multiple edgestore queries. (I'm not including the queries to graphindex and janusgraph_ids, as those are low in volume.) We have also captured the order in which the queries are executed (the 1st line is the 1st query, the 2nd line is the second, and so on).

My question is: is there some way we can batch these queries? Since Snowflake is a data warehouse, each query takes 100s of milliseconds to execute, so the roughly 100 sub-queries in the example file easily take 10 seconds minimum. We would like to optimize that by batching the queries together, so that they can be executed together and their responses reconciled afterwards.

For example, the current flow is as follows:

[attached diagram: the generic TinkerPop query flow, from the Gremlin step down to individual getSlice calls against the backend]
Can we change this flow, which is the generic flow for TinkerPop databases, to something like the following by introducing an accumulator/aggregator step?

1. Instead of our interface interacting directly with the Snowflake backend, we bring in an aggregation step in between.
2. The aggregation step accumulates the getSlice queries (start key, end key, and store name) until all the queries that can be compartmentalized have been collected.
3. Once accumulated, it executes all of them together against the backend.
4. Once executed, all the queries' responses come back to the aggregation step (output), which breaks the combined output down according to the input queries and sends it back to the GraphStep for reconciliation and for building the output of the Gremlin query.
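To make the idea concrete, here is a minimal sketch of such an aggregation layer. This is not JanusGraph API; every name in it (SliceQueryAggregator, SnowflakeBatchClient, executeBatch) is hypothetical, and it assumes the batched call is realized on the Snowflake side as, say, a single UNION ALL statement over all the key ranges:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical aggregation layer: buffers individual getSlice range
// requests, flushes them as one batched backend call, then hands each
// caller its own share of the combined result.
final class SliceQueryAggregator {

    // One buffered range request: (store, startKey, endKey) plus a future
    // the caller can block on once the batch has been flushed.
    record PendingSlice(String store, byte[] startKey, byte[] endKey,
                        CompletableFuture<List<byte[]>> result) {}

    // Hypothetical client that turns many range scans into one Snowflake
    // round trip (e.g. a single UNION ALL over all the key ranges).
    interface SnowflakeBatchClient {
        Map<PendingSlice, List<byte[]>> executeBatch(List<PendingSlice> queries);
    }

    private final List<PendingSlice> pending = new ArrayList<>();
    private final SnowflakeBatchClient client;

    SliceQueryAggregator(SnowflakeBatchClient client) {
        this.client = client;
    }

    // Called wherever the interface would otherwise issue an individual
    // getSlice; returns immediately with a future instead of a result.
    synchronized CompletableFuture<List<byte[]>> submit(String store,
                                                        byte[] startKey,
                                                        byte[] endKey) {
        PendingSlice p = new PendingSlice(store, startKey, endKey,
                                          new CompletableFuture<>());
        pending.add(p);
        return p.result;
    }

    // Called once all compartmentalizable queries have been accumulated:
    // one round trip to the backend, then demultiplex per original request.
    synchronized void flush() {
        Map<PendingSlice, List<byte[]>> byQuery = client.executeBatch(pending);
        for (PendingSlice p : pending) {
            p.result.complete(byQuery.getOrDefault(p, List.of()));
        }
        pending.clear();
    }
}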

As for what we have been doing so far: we instrumented the JanusGraph core classes so that we can track the flow of information from one class to another whenever a Gremlin query is executed. That way we know, when a Gremlin query runs, which classes are called, iteratively, until execution reaches our interface's getSlice method, and we can look for repetitive patterns in the query flow. For this we have collected approximately 6000 lines of custom logs.
After analyzing the logs, we arrived at the following flow of classes:

[class-flow diagram attached]

My question is: is this possible from the TinkerPop perspective? From the JanusGraph perspective? Our project is ready to pay any JanusGraph or TinkerPop expert part time as a freelancer. We are looking for any domain expert who can help us achieve this. The potential impact of this use case is tremendous: it could also lead to performance improvements in existing backends, and could help us execute many memory-intensive queries much faster.

Thanks



Debasish Kanhar <d.k...@...>
 

For anyone following this thread: my primary question was how to implement a multi-get type mechanism for my backend. Do check Marko's comment (https://groups.google.com/d/msg/gremlin-users/QMVhLIPiGRE/Yf4ByrlrEQAJ) for clarification on what multi-get is.

The Snowflake backend interface I've written doesn't support multiQuery yet. Is implementing multiQuery as simple as the steps below?

Do we need to uncomment the following in the StoreManager class?

[attached code snippet]

And set:
features.supportMultiQuery = true;

And implement the following method in the KeyValueStore?

[attached code snippet]

Or are there any other changes that need to be made to implement the multiQuery feature for my backend?
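For concreteness, here is my rough understanding of those two changes, sketched against the JanusGraph diskstorage APIs. The builder-style feature declaration assumes StandardStoreFeatures (rather than the older field-style features mentioned above), and executeBatchedRangeQuery below is a hypothetical helper standing in for the actual batched Snowflake SQL, so treat this as a sketch under those assumptions, not a verified implementation.

In the StoreManager, advertise the capability:

import org.janusgraph.diskstorage.keycolumnvalue.StandardStoreFeatures;
import org.janusgraph.diskstorage.keycolumnvalue.StoreFeatures;

// Advertise multiQuery support so JanusGraph's query engine is allowed
// to hand the store batched slice queries.
@Override
public StoreFeatures getFeatures() {
    return new StandardStoreFeatures.Builder()
            .orderedScan(true)   // we already expose an ordered key-value store
            .multiQuery(true)    // the flag this thread is about
            .build();
}

And in the OrderedKeyValueStore implementation, the batched counterpart of getSlice:

import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.janusgraph.diskstorage.BackendException;
import org.janusgraph.diskstorage.keycolumnvalue.StoreTransaction;
import org.janusgraph.diskstorage.keycolumnvalue.keyvalue.KVQuery;
import org.janusgraph.diskstorage.keycolumnvalue.keyvalue.KeyValueEntry;
import org.janusgraph.diskstorage.util.RecordIterator;

// All KVQuery ranges go to Snowflake in one round trip; the combined
// result set is then split back out per query.
@Override
public Map<KVQuery, RecordIterator<KeyValueEntry>> getSlices(
        List<KVQuery> queries, StoreTransaction txh) throws BackendException {
    // Hypothetical helper: builds one SQL statement covering all the
    // (start, end) ranges (e.g. via UNION ALL) and groups rows by query.
    Map<KVQuery, List<KeyValueEntry>> grouped = executeBatchedRangeQuery(queries, txh);

    Map<KVQuery, RecordIterator<KeyValueEntry>> results = new HashMap<>(queries.size());
    for (KVQuery query : queries) {
        results.put(query, asRecordIterator(grouped.getOrDefault(query, List.of())));
    }
    return results;
}

// Wraps an in-memory list in JanusGraph's RecordIterator interface.
private static RecordIterator<KeyValueEntry> asRecordIterator(List<KeyValueEntry> entries) {
    Iterator<KeyValueEntry> it = entries.iterator();
    return new RecordIterator<KeyValueEntry>() {
        @Override public boolean hasNext() { return it.hasNext(); }
        @Override public KeyValueEntry next() { return it.next(); }
        @Override public void close() { /* nothing to release for a list */ }
    };
}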
The reason I feel multiQuery will help us is that, in our simple use case:
g.V(20520).as("root").bothE().as("e").barrier().otherV().as("oth").barrier().project("r","e","o").by(select("root").valueMap()).by(select("e").valueMap()).by(select("oth").valueMap()).barrier().dedup("r", "e", "o").profile()



Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
GraphStep(vertex,[20520])@[root]                                       1           1        1157.329     2.66
JanusGraphVertexStep(BOTH,edge)@[e]                                   12          12        3854.693     8.86
    \_condition=(EDGE AND visibility:normal)
    \_isFitted=true
    \_vertices=1
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@801a60ee
    \_orders=[]
    \_isOrdered=true
    \_multi=true
  optimization                                                                                     7.573
  backend-query                                                       12                         828.938
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@801a60ee
NoOpBarrierStep                                                       12          12           1.828     0.00
EdgeOtherVertexStep@[oth]                                             12          12           1.132     0.00
NoOpBarrierStep                                                       12          12           0.976     0.00
ProjectStep([r, e, o],[[SelectOneStep(last,root...                    12          12       38502.413    88.46
  SelectOneStep(last,root)                                            12          12           0.508
  PropertyMapStep(value)                                              12          12       12776.430
  SelectOneStep(last,e)                                               12          12           0.482
  PropertyMapStep(value)                                              12          12       15211.454
  SelectOneStep(last,oth)                                             12          12           0.376
  PropertyMapStep(value)                                              12          12       10508.737
NoOpBarrierStep                                                       12          12           5.692     0.01
DedupGlobalStep([r, e, o])                                            12          12           1.925     0.00
>TOTAL                                                                 -           -       43525.993        -

As you can see, retrieving the properties of graph elements (vertices and edges) is the most time-consuming step. On further analysis I realized this is because retrieving a single property from my backend is a single query to the backend. Thus, for N elements (vertices and edges), each with M properties, the total number of calls is N*M, which slows down the whole execution considerably. That is presumably why the property-retrieval steps are the slowest part of the profile for my backend; with batching, those N*M single lookups could collapse into a handful of calls.

So, will implementing multiQuery improve performance in this scenario, and is there anything else that needs to be implemented as well? If yes, I can implement it quickly, we should immediately see some performance improvements, and the new backend moves closer to the finish line :-)

Thanks in advance.




Pavel Ershov <owner...@...>
 


JG has three options to reduce the number of queries, see https://docs.janusgraph.org/basics/configuration-reference/#query:

PROPERTY_PREFETCHING -- enabled by default
USE_MULTIQUERY -- disabled by default
BATCH_PROPERTY_PREFETCHING -- disabled by default

The last two require implementing OrderedKeyValueStore.getSlices and enabling the multiQuery feature in the store.
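For readers wiring this up, a minimal sketch of how these options can be set when opening the graph. The key names (query.fast-property, query.batch, query.batch-property-prefetch) are my reading of the configuration reference linked above and should be verified against the JanusGraph version in use:

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class QueryBatchingConfig {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "inmemory")         // stand-in; point this at the real backend
                .set("query.fast-property", true)           // PROPERTY_PREFETCHING (default: true)
                .set("query.batch", true)                   // USE_MULTIQUERY (default: false)
                .set("query.batch-property-prefetch", true) // BATCH_PROPERTY_PREFETCHING (default: false)
                .open();
        graph.close();
    }
}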






Debasish Kanhar <d.k...@...>
 

Thanks Pavel.

I've tried all of those, and together with query optimizations they have helped me reduce the execution time by almost 50%. But it's still too slow for my use case. :-)

I'm looking at every batching option I can find.

You can check out my discussion on gremlin-users (https://groups.google.com/forum/#!topic/gremlin-users/RaIHVbDE5rk) for more clarity on the requirement and details; any ideas on how to implement this will be of great help :-)
