Re: [DISCUSS] Don't use Scroll API for ElasticSearch requests


remi.g...@...
 

Hi!
I started looking into this last week.
I haven't written any code yet, but the topic interests me.
From what I have found, the only problem with using "search_after", which Elasticsearch now recommends, is finding a good property to order by.

I was thinking about using the ids (vertex or edge), but that is not recommended.
In the worst case, we could add some kind of UUID to help us order the queries when no order is specified.

IMO, we should do what ES recommends ( https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html ) and switch to search_after.
When the queried index returns only a few results, using the Scroll API is disproportionate.
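
To make the ordering concern concrete, here is a minimal Python sketch of how a "search_after" request body could be assembled. The helper `build_search_after_body` and the field names `name` and `doc_uuid` are hypothetical and not part of JanusGraph. The sketch only assumes what the ES documentation describes: the sort should end in a unique tiebreaker, and `search_after` carries the sort values of the last hit from the previous page.

```python
import json


def build_search_after_body(query, size, user_sort=None, last_sort_values=None,
                            tiebreaker_field="doc_uuid"):
    """Build a search request body that pages with search_after.

    tiebreaker_field is a hypothetical unique field (for example a UUID
    stored with each document); it is appended so the sort order is total.
    """
    sort = list(user_sort or [])
    # Always finish with the unique field so ties are broken deterministically.
    sort.append({tiebreaker_field: "asc"})
    body = {"size": size, "query": query, "sort": sort}
    if last_sort_values is not None:
        # One value per sort key, taken from the last hit of the previous page.
        if len(last_sort_values) != len(sort):
            raise ValueError("search_after needs one value per sort key")
        body["search_after"] = list(last_sort_values)
    return body


# First page: no search_after. Following pages: pass the last hit's sort values.
first_page = build_search_after_body({"match_all": {}}, 100, [{"name": "asc"}])
next_page = build_search_after_body({"match_all": {}}, 100, [{"name": "asc"}],
                                    last_sort_values=["marko", "6f1c-example"])
print(json.dumps(next_page, indent=2))
```

When the Gremlin query specifies no order, `user_sort` stays empty and the tiebreaker alone defines the page order, which is the "worst case" mentioned above.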






On Monday, 15 April 2019 at 19:30:58 UTC+2, Oleksandr Porunov wrote:
Currently we are using the Scroll API for real-time search requests when using ElasticSearch as an index backend. In my experience it often creates more than 500 parallel cursors (sometimes more than 10000 cursors). Sure, we can decrease the keep-alive parameter "index.[X].elasticsearch.scroll-keep-alive" to keep cursors for less than 60 seconds, but I don't think that is a wise solution.

Statements from the ElasticSearch documentation:
>> Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

>> The Scroll api is recommended for efficient deep scrolling but scroll contexts are costly and it is not recommended to use it for real time user requests.

In addition, ElasticSearch 7.0.0 (released on 10 April) limits the number of open cursors to 500 by default.

Pros which I see if we remove usage of Scroll API in JanusGraph:
- All real time queries will be faster
- Less overhead on the ElasticSearch side (we don't keep open contexts)

Cons which I see if we remove usage of Scroll API in JanusGraph:
- The user would need to be careful with queries like .limit(1000000) or queries without a limit, because they may hit a lot of results and cause problems on the ElasticSearch side.

Even considering this con, I think we should remove Scroll API usage, because it is much simpler to add `limit` to your Gremlin query than to deal with too many open contexts (just my opinion).

Possible solutions to deal with the con:
- Warn users about possible problems with queries that hit many entities in ElasticSearch. Suggest using "limit" and some data processing techniques.
- Imitate scroll usage with "search_after" (this one could be hard to implement and is not applicable to queries that are not sorted by a unique parameter or set of parameters).
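
The second option could look roughly like the following Python sketch. It is only an illustration of the paging loop, not JanusGraph code: `search_after_pages` and `make_fake_backend` are hypothetical names, and the backend is an in-memory stand-in that mimics how ES returns a `sort` array with each hit. The documents are sorted by `age` with the unique `id` as a tiebreaker.

```python
def search_after_pages(search, page_size):
    """Yield every hit by repeatedly calling search(size, after).

    `search` is any callable that returns hits sorted by a unique key,
    each hit carrying its sort values under "sort" (as ES does).
    """
    after = None
    while True:
        hits = search(page_size, after)
        if not hits:
            return
        for hit in hits:
            yield hit
        if len(hits) < page_size:
            return  # short page: nothing left to fetch
        after = hits[-1]["sort"]  # resume strictly after the last hit


def make_fake_backend(docs):
    """In-memory stand-in for ES, sorted by (age, id); id is unique."""
    ordered = sorted(docs, key=lambda d: (d["age"], d["id"]))

    def search(size, after):
        page = []
        for d in ordered:
            key = [d["age"], d["id"]]
            if after is None or key > after:
                page.append({"_source": d, "sort": key})
                if len(page) == size:
                    break
        return page

    return search


docs = [{"id": f"v{i}", "age": i % 3} for i in range(7)]
all_hits = list(search_after_pages(make_fake_backend(docs), page_size=3))
print([h["_source"]["id"] for h in all_hits])
```

No context is kept open between pages; the only state is the last hit's sort values. If the sort key were `age` alone, documents sharing an `age` value across a page boundary could be skipped, which is why the sort has to include a unique field.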

What do other community members think about it?
Do you see any other pros of using the Scroll API that I missed? Are you OK with removing usage of the Scroll API?
