[DISCUSS] Don't use Scroll API for ElasticSearch requests


Oleksandr Porunov <alexand...@...>
 

Currently we use the Scroll API for real-time search requests when ElasticSearch is the index backend. In my experience this often creates more than 500 parallel cursors (sometimes more than 10000). Sure, we can decrease the keep-alive parameter "index.[X].elasticsearch.scroll-keep-alive" to keep cursors open for less than 60 seconds, but I don't think that is a wise solution.

Statements from the ElasticSearch documentation:
>> Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

>> The Scroll api is recommended for efficient deep scrolling but scroll contexts are costly and it is not recommended to use it for real time user requests.

In addition, ElasticSearch 7.0.0 (released on 10 April) limits the number of open cursors to 500 by default.
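
To make the cursor numbers above concrete, here is a rough back-of-the-envelope sketch. It assumes (this is an assumption, not something measured in JanusGraph) that every index query opens one scroll context and that the context stays open for the full keep-alive period instead of being cleared earlier. Under that assumption the average number of open contexts is simply the query rate multiplied by the keep-alive time. The function names are illustrative only.

```python
# Elasticsearch 7.0.0 default limit on open scroll contexts, as stated above.
DEFAULT_MAX_OPEN_SCROLL_CONTEXTS = 500


def estimated_open_scroll_contexts(queries_per_second: float,
                                   keep_alive_seconds: float) -> float:
    """Average number of open contexts = arrival rate * time each one stays open.

    Assumes one context per query and that no context is cleared before
    its keep-alive expires.
    """
    return queries_per_second * keep_alive_seconds


def exceeds_limit(queries_per_second: float,
                  keep_alive_seconds: float,
                  limit: int = DEFAULT_MAX_OPEN_SCROLL_CONTEXTS) -> bool:
    """True if the estimated number of open contexts is above the limit."""
    return estimated_open_scroll_contexts(queries_per_second,
                                          keep_alive_seconds) > limit


if __name__ == "__main__":
    # 10 index queries per second with a 60 second keep-alive.
    print(estimated_open_scroll_contexts(10, 60))  # 600
    print(exceeds_limit(10, 60))                   # True
    # Halving the keep-alive only halves the estimate.
    print(exceeds_limit(10, 30))                   # False
```

Under these assumptions, lowering the keep-alive only scales the problem down linearly with load, which is why it looks like a workaround rather than a fix.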

Pros which I see if we remove usage of Scroll API in JanusGraph:
- All real time queries will be faster
- Less overhead on the ElasticSearch side (we don't keep open contexts)

Cons which I see if we remove usage of Scroll API in JanusGraph:
- The user would need to be careful with queries like .limit(1000000) or queries without a limit, because they may hit a lot of results, which may cause problems on the ElasticSearch side.

Even considering this con, I think we should remove Scroll API usage, because it is much simpler to write a Gremlin query with `limit` than to deal with too many open contexts (just my opinion).

Possible solutions to deal with the con:
- Warn users about possible problems with queries which hit many entities in ElasticSearch. Suggest using "limit" and some data processing techniques.
- Imitate scroll usage with "search_after" (this could be hard to implement and is not applicable to queries that are not sorted by a unique parameter or set of parameters).
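
The "unique parameter" restriction on the second option can be illustrated with a small in-memory sketch. This is not Elasticsearch code. It only models the idea that each page returns the documents whose sort values are strictly greater than the last sort values of the previous page. The documents and the field names `name` and `uid` are made up for the illustration.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Doc = Dict[str, Any]

# Sample data: three documents share the same "name" value.
DOCS: List[Doc] = [
    {"uid": 1, "name": "a"},
    {"uid": 2, "name": "a"},
    {"uid": 3, "name": "a"},
    {"uid": 4, "name": "b"},
]


def search_after_page(docs: List[Doc],
                      sort_fields: Sequence[str],
                      size: int,
                      after: Optional[Tuple] = None
                      ) -> Tuple[List[Doc], Optional[Tuple]]:
    """Return one page plus the sort values of its last document."""
    def key(doc: Doc) -> Tuple:
        return tuple(doc[f] for f in sort_fields)

    ordered = sorted(docs, key=key)
    if after is not None:
        # Keep only documents that sort strictly after the cursor.
        ordered = [d for d in ordered if key(d) > after]
    page = ordered[:size]
    next_after = key(page[-1]) if page else None
    return page, next_after


def fetch_all(docs: List[Doc], sort_fields: Sequence[str], size: int) -> List[Doc]:
    """Page through all results without keeping any server-side context."""
    results: List[Doc] = []
    after: Optional[Tuple] = None
    while True:
        page, next_after = search_after_page(docs, sort_fields, size, after)
        if not page:
            return results
        results.extend(page)
        after = next_after


if __name__ == "__main__":
    # Non-unique sort key: the document with uid 3 is skipped.
    print([d["uid"] for d in fetch_all(DOCS, ("name",), 2)])        # [1, 2, 4]
    # Adding a unique tiebreaker makes the cursor unambiguous.
    print([d["uid"] for d in fetch_all(DOCS, ("name", "uid"), 2)])  # [1, 2, 3, 4]
```

With a non-unique sort key a page boundary that falls inside a group of equal values loses documents, which is why a unique tiebreaker is needed before this approach can replace scrolling.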

What do other community members think about it?
Do you see any other pros of using the Scroll API which I missed? Are you OK with removing usage of the Scroll API?


mike....@...
 

I'm no expert in how the Scroll API is currently used in JanusGraph, but given the guidance from the Elastic docs quoted below, it may be advantageous to keep using the Scroll API for OLAP-style queries. Since the choice between using the Scroll API and not using it is a difficult one, I would advocate for functionality that lets the user opt in to or out of the Scroll API, with whichever default the community prefers. My vote would be to keep the existing API and, if it's not already done, add an abstraction that allows various Elastic query APIs to be used depending on the user's preference/configuration.


"While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration."
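
One possible shape for such an opt-in/opt-out switch is sketched below. Everything here is hypothetical: the `paging-mode` key, the mode names, and the fallback rule are invented for illustration and are not existing JanusGraph options. The sketch keeps scroll as the default and falls back to it when search_after is requested but the query has no unique sort.

```python
from enum import Enum
from typing import Dict


class PagingMode(Enum):
    """Hypothetical ways of paging through index results."""
    SCROLL = "scroll"              # current behavior: server-side scroll context
    SEARCH_AFTER = "search_after"  # stateless paging, needs a unique sort
    SINGLE_PAGE = "single_page"    # one plain search request, no paging


def resolve_paging_mode(config: Dict[str, str], has_unique_sort: bool) -> PagingMode:
    """Pick a paging mode from a hypothetical "paging-mode" setting.

    Defaults to SCROLL so that existing behavior is unchanged unless the
    user opts out. Raises ValueError for an unknown mode name.
    """
    requested = PagingMode(config.get("paging-mode", PagingMode.SCROLL.value))
    if requested is PagingMode.SEARCH_AFTER and not has_unique_sort:
        # Assumed fallback: search_after is not usable without a unique sort.
        return PagingMode.SCROLL
    return requested


if __name__ == "__main__":
    print(resolve_paging_mode({}, has_unique_sort=True))
    print(resolve_paging_mode({"paging-mode": "search_after"}, has_unique_sort=True))
    print(resolve_paging_mode({"paging-mode": "search_after"}, has_unique_sort=False))
```

Whether the fallback should be silent, logged, or an error is a design decision for the community. The sketch only shows that the choice can live behind a single setting.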


On Monday, April 15, 2019 at 1:30:58 PM UTC-4, Oleksandr Porunov wrote:
> [original message quoted in full; see the first post in this thread]


remi.g...@...
 

Hi!
I started looking at this last week.
I haven't coded anything, but this topic interests me.
From what I found, the only problem with using "search_after", which ElasticSearch now recommends, is finding a good property to order by.

I was thinking about using the ids (vertex or edge), but that's not recommended.
In the worst case, we would have to add some kind of UUID to help us order the queries when no order is specified.

IMO, we should do what ES recommends ( https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html ) and switch to search_after.
When the queried index returns only a few results, using the Scroll API is disproportionate.
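
As a rough sketch of what such requests could look like, the helper below builds a search request body with a user sort field followed by a unique tiebreaker field, and adds "search_after" for every page after the first. The keys "size", "query", "sort" and "search_after", and the per-hit "sort" array in the response, are part of the Elasticsearch search API. The field names `name` and `order_uuid` are hypothetical, and nothing here reflects how JanusGraph actually builds its requests.

```python
from typing import Any, Dict, List, Optional


def build_search_after_body(query: Dict[str, Any],
                            sort_field: str,
                            tiebreaker_field: str,
                            size: int,
                            search_after: Optional[List[Any]] = None
                            ) -> Dict[str, Any]:
    """Build a search request body that pages with search_after.

    The tiebreaker field is assumed to hold a unique value per document
    (for example a dedicated UUID field), so the sort order is total.
    """
    body: Dict[str, Any] = {
        "size": size,
        "query": query,
        "sort": [{sort_field: "asc"}, {tiebreaker_field: "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = list(search_after)
    return body


def next_search_after(hits: List[Dict[str, Any]]) -> Optional[List[Any]]:
    """Return the "sort" values of the last hit, or None if the page is empty."""
    return hits[-1]["sort"] if hits else None


if __name__ == "__main__":
    first = build_search_after_body({"match_all": {}}, "name", "order_uuid", 100)
    print(first)
    # Hits as they might appear under response["hits"]["hits"] (made-up values).
    hits = [{"_id": "x", "sort": ["alice", "0001"]},
            {"_id": "y", "sort": ["bob", "0002"]}]
    second = build_search_after_body({"match_all": {}}, "name", "order_uuid", 100,
                                     next_search_after(hits))
    print(second["search_after"])  # ['bob', '0002']
```

No server-side context is kept between the two requests, so a query that returns only a few results costs a single plain search.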






On Monday, April 15, 2019 at 19:30:58 UTC+2, Oleksandr Porunov wrote:
> [original message quoted in full; see the first post in this thread]