
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/09/17 00:28:35 UTC

[GitHub] [pinot] siddharthteotia edited a comment on issue #7395: Support for Native Text Indexing in Pinot

siddharthteotia edited a comment on issue #7395:
URL: https://github.com/apache/pinot/issues/7395#issuecomment-921350642

I followed up with @atris in the Slack channel to clarify a few additional things. Copying here for reference and visibility.

Can we all confirm the following? I am sorry to have asked this a couple of times in different threads in the doc, but the doc still indicates some sort of migration (_Note that till completion of phase 4, we will be maintaining the existing text indices within Pinot_), so I just want to make sure.

- Existing Lucene text index functionality offered via TEXT_MATCH will continue to work as is and is essentially untouched by this work
- Both indexes can co-exist and we are not removing the Lucene dependency?
- Upon segment reload, an existing Lucene index can potentially be converted to the new format (if need be). However, if someone wishes to do this, how will the Lucene query syntax used in TEXT_MATCH remain compatible with the native FST index (which I believe will follow SQL LIKE semantics)? I am guessing users will have to change their queries if they wish to migrate?
- For the native FST index, the plan is to eventually support all kinds of searches -- phrase, term, regex, fuzzy etc. For example, phrase search needs position info, which I am not sure comes for free as part of FST. Regardless, all of that is the end state, and comprehensive text search functionality will be available through this native index?
  - This is important for us because eventually (and this is a big eventual for us :slightly_smiling_face: ) we might want to migrate our production Li users from the Lucene text index to the native FST index if performance is better. I can't promise that will happen, as it will certainly be a lot of work (hence seeking confirmation that we are not removing anything). Our production users use a lot of phrase queries.
- General question: are you planning to make this functionality available via both LIKE and TEXT_MATCH, or do you want to keep it separate and just use LIKE? TEXT_MATCH can also be overloaded, as long as the user docs clearly indicate that it can be used for both the native and the Lucene text index.
- Request on code: since FST is like a black box (for me, except for whatever I learned from the paper and online presentations), can you please make sure that the code is sufficiently documented and explains the algorithm as and when needed? Initially, we were just relying on Lucene committers, but now we will have to maintain this ourselves. This will also help with easy review.
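
To make the point about query syntax concrete: a LIKE pattern only knows two wildcards (`%` for any run of characters, `_` for exactly one character), so a Lucene-syntax TEXT_MATCH expression does not carry over unchanged. The sketch below shows standard SQL LIKE matching only; it is not Pinot's implementation, the function names `like_to_regex` and `like_match` are hypothetical, and escape handling is deliberately left out.

```python
import re


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into a regular expression.

    '%' matches any run of characters (including none), '_' matches exactly
    one character, and everything else is matched literally.
    Assumption: no ESCAPE clause support, to keep the sketch short.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            # Escape regex metacharacters so they are matched literally.
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like_match(value: str, pattern: str) -> bool:
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

With this, `like_match("realtime", "real%")` is true while `like_match("realtime", "real_")` is false, since `_` consumes exactly one character.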

@atris's response:

- Yes, Lucene Indices and TEXT_MATCH will be completely untouched and unaffected by this effort.
- No, we are not removing the dependency and both indices can coexist, oblivious of each other.
- Here is the interesting one. Native FST can support all queries that Lucene does. However, since our indices do not store some metadata (such as positional index) that Lucene Indices do, we will have to implement custom operators on top of native FST. However, syntactically, native FST shall pose no challenges in that implementation. If there are specific operators outside of the four planned currently (regexp, like, phrase and fuzzy) that will be needed for users to migrate, I will be more than happy to support.
- Yes, in the end state, comprehensive text search will be natively available.
- I was actually not planning to overload TEXT_MATCH since it basically supports Lucene syntax, but rather have custom functions for phrase, fuzzy and regexp, and let the LIKE operator deal with the rest. However, there is no reason why we can't go down that route.
- I completely agree. I have tried to document the code as elaborately as possible and have also written supporting documents (e.g. on the Regexp compilation process). If more is needed on specific areas, I will gladly write more :)
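
As an illustration of why phrase search needs positional metadata, here is a minimal sketch of a positional inverted index with a phrase lookup on top of it. It is a generic textbook structure, not the native FST design discussed above; `build_positional_index` and `phrase_match` are hypothetical names, and tokenization is assumed to be a plain lowercase whitespace split.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [token positions in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return the sorted doc ids in which the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if some start position s has term i at s + i.
        starts = index[terms[0]][doc_id]
        if any(
            all((s + i) in index[t][doc_id] for i, t in enumerate(terms))
            for s in starts
        ):
            hits.append(doc_id)
    return hits
```

A document such as "text native index" contains both terms of the phrase "native text", but only the positions reveal that they are not adjacent in that order. A term-only index cannot make that distinction, which is why a custom operator would need extra information.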

---------------------------------------------------------------------

Based on the above clarifications, I am OK with proceeding.

@amrishlal , @jackjlli please feel free to add any additional discussion notes

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
