JanusGraph and Indexing

Olivier Binda <olus....@...>

I have recently spent some time looking at the indexing-related code of JanusGraph (mostly the Lucene-related part) and I have a few questions:

1. A) Because of the way mutations currently work in JanusGraph, index backends MUST maintain a copy of fields that have Cardinality.LIST or Cardinality.SET, or those values get lost when a mutation happens.
This means that data gets duplicated, and I fear this hurts indexing performance: the mutation work effectively gets done twice (once by the graph and once by the index), and implementers of index backends must work like crazy to ensure data/index persistence.

This is safe for Cardinality.SINGLE.

B) Passing the Cardinality.SINGLE and Cardinality.LIST values to the index on mutation would have simplified this a lot: it would remove the need to do the work twice and to maintain a copy, and it would improve performance (in my maybe-not-knowledgeable-enough opinion).

So, my question is: is there a good reason why it was implemented that way (A) and not the other way (B)?
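To make concrete the kind of setup I am talking about, here is roughly the schema I have in mind (a sketch only: the key and index names are made up, and "search" stands for whatever mixed-index backend is configured):

```java
import org.janusgraph.core.Cardinality;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Hypothetical schema fragment: a SET-cardinality key backed by a mixed index.
// On mutation, the index backend only sees the delta, not the full set of
// values, hence (as I understand it) the need to keep its own copy.
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey tags = mgmt.makePropertyKey("tags")
        .dataType(String.class)
        .cardinality(Cardinality.SET)
        .make();
mgmt.buildIndex("tagsByValue", Vertex.class)
        .addKey(tags)
        .buildMixedIndex("search");
mgmt.commit();
```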

2. The documentation is a bit sparse regarding indexing. Notably, it is not easy to fully grasp what is expected for Mapping.STRING and Mapping.TEXT.
I got the feeling that Mapping.STRING is for fields that behave mostly like a keyword (a single value), while Mapping.TEXT is for fields that behave mostly like analyzed text (a bag of tokens).
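To illustrate the distinction I have in mind (this is just a toy simulation in plain Java, not how JanusGraph or Lucene actually implement it):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MappingDemo {
    // Mapping.STRING-like behavior: the whole value is one keyword token,
    // matched exactly and as-is.
    static boolean stringMatch(String indexedValue, String query) {
        return indexedValue.equals(query);
    }

    // Mapping.TEXT-like behavior: the value is analyzed into lower-cased tokens.
    static List<String> analyze(String value) {
        return Arrays.asList(value.toLowerCase(Locale.ROOT).split("\\W+"));
    }

    // A TEXT query matches individual tokens, not the whole value.
    static boolean textContains(String indexedValue, String token) {
        return analyze(indexedValue).contains(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String value = "Hello World";
        System.out.println(stringMatch(value, "Hello World")); // true: exact value
        System.out.println(stringMatch(value, "Hello"));       // false: partial value
        System.out.println(textContains(value, "hello"));      // true: one analyzed token
        System.out.println(textContains(value, "Hello World")); // false: not a single token
    }
}
```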

Regarding custom analyzers, this should hold for those too (right?).
Which means that using an EnglishAnalyzer for Mapping.STRING (as is done in the tests) feels really weird and contrary to what I was expecting.

But probably I don't fully understand how indexing works in JanusGraph (it would be nice to have more details in the documentation).

Could someone please explain more how indexing should work, regarding Mappings and custom analyzers ?

3. (Mapping.TEXT) Text search is supposed to be case-insensitive, which means that fields are basically lower-cased before they enter the index/query.

What is the use of that?
Isn't this crippling custom analyzers?
Isn't it the Analyzer's job to decide whether the value should be lower-cased or not?

Besides, there are a lot of non-English people out there who might use JanusGraph... what is the point of lowering the case for them? Does it make any sense to lower-case a Japanese field analyzed with a custom JapaneseAnalyzer
(which already does a great job of normalizing Japanese strings anyway)?

Also, changing the case of values makes some operations on them unreliable (like Cmp.EQ, of course, which makes it meaningless, as a far-fetched example, to use a KeywordAnalyzer with Mapping.TEXT... you then have to use it with Mapping.STRING instead, etc.).
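A toy illustration of the Cmp.EQ concern, assuming the index stores the lower-cased form (again a simulation, not JanusGraph's actual code):

```java
import java.util.Locale;

public class LowercaseEq {
    // Simulates what happens under Mapping.TEXT if values are lower-cased
    // before they reach the analyzer/index.
    static String indexForm(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String stored = "JanusGraph";
        // An exact, Cmp.EQ-style comparison against the indexed form fails
        // even though we query for exactly the stored value...
        System.out.println(indexForm(stored).equals("JanusGraph")); // false
        // ...and only the lower-cased spelling matches.
        System.out.println(indexForm(stored).equals("janusgraph")); // true
    }
}
```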

4. It looks like the tests don't cover some cases for the Lucene index (fuzzy search, ensuring that a Mapping.STRING field produces only one token...).
What if I try to implement unit tests for Lucene (by modifying JanusGraphIndexTest)?

This might make the build fail for the ES and Solr people... should I make a separate pull request? Will the PR be accepted by Travis if it succeeds for Lucene but fails for ES/Solr?
How should I proceed?

5. For Mapping.TEXT, it is a bit awkward to implement some operators like Text.Prefix. Say you want to know which words start with "w" and your EnglishAnalyzer removes words that are less than 2 characters long;
you then have to fall back to the non-analyzed String to produce a query (which is not normalized)...
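Here is the kind of situation I mean, as a toy simulation (a made-up "analyzer" with a length filter, not Lucene's actual EnglishAnalyzer):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PrefixDemo {
    // Toy "analyzer": lower-cases and drops tokens shorter than 2 characters,
    // roughly mimicking an analyzer chain with a length filter.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (t.length() >= 2) tokens.add(t);
        }
        return tokens;
    }

    // A prefix query evaluated over the analyzed tokens.
    static boolean prefixMatch(String text, String prefix) {
        for (String t : analyze(text)) {
            if (t.startsWith(prefix.toLowerCase(Locale.ROOT))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // The token "w" itself is dropped by the length filter, so a prefix
        // query over the analyzed tokens cannot find it...
        System.out.println(prefixMatch("plan w", "w")); // false
        // ...while longer words starting with "w" are still found.
        System.out.println(prefixMatch("wide world", "w")); // true
    }
}
```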

Best regards, 
Olivier Binda