Posted to commits@pinot.apache.org by "kotharironak (via GitHub)" <gi...@apache.org> on 2023/03/03 05:43:14 UTC

[GitHub] [pinot] kotharironak opened a new issue, #10374: Feature Request: Support for hooking different tokenizer or configuring existing for enhance text search capabilities

kotharironak opened a new issue, #10374:
URL: https://github.com/apache/pinot/issues/10374

   In the latest release, there is a way to use the text search index: https://docs.pinot.apache.org/basics/indexing/text-search-support#text-parsing-and-tokenization
   
   However, it currently provides only Lucene's standard English text tokenizer, plus configuration options for including or excluding stop words.
   
   There are certain domain-specific use cases where the standard tokenizer does not suffice.
   For example:
   - For the text `abc.pqr.xyz`, we would like to split on `.` in addition to the existing space and tab separators. The expectation is to get three tokens: `abc`, `pqr`, `xyz`.
   - For the text `GET /api/v1/customer`, we would like to split on `/` as well, and expect `GET`, `api`, `v1`, `customer`.
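
The expected tokens in the two examples above can be illustrated with a minimal plain-Java sketch. `SeparatorTokenizer` and its `extraSeparators` parameter are hypothetical names used only for illustration; they are not part of Pinot or Lucene, and the sketch does no lowercasing or stop-word handling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Pinot or Lucene class.
final class SeparatorTokenizer {
    private SeparatorTokenizer() {}

    /**
     * Splits text on whitespace and on every character listed in
     * extraSeparators. Empty tokens are dropped; case is preserved.
     */
    static List<String> tokenize(String text, String extraSeparators) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isSeparator =
                Character.isWhitespace(c) || extraSeparators.indexOf(c) >= 0;
            if (isSeparator) {
                // Close the token in progress, if there is one.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        // Flush the trailing token.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}
```

Under these assumptions, `SeparatorTokenizer.tokenize("abc.pqr.xyz", ".")` yields `abc`, `pqr`, `xyz`, and `SeparatorTokenizer.tokenize("GET /api/v1/customer", "/")` yields `GET`, `api`, `v1`, `customer`.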
   
   However, there is currently no way to add extra split characters to the existing tokenizer, or to use a different tokenizer.
   
   
   As part of this ticket:
   - Can we provide a way to extend the existing tokenizer?
   - Can you also consider providing a way to configure a different tokenizer, or to hook in a custom tokenizer?
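
One possible shape for the second request, selecting a tokenizer by name, is sketched below in plain Java. This is only an assumption about what such a hook might look like: `TokenizerRegistry`, `register`, and the tokenizer names are hypothetical and do not correspond to any existing Pinot configuration or API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a pluggable tokenizer hook; not a Pinot API.
final class TokenizerRegistry {
    // Maps a configured tokenizer name to a function from text to tokens.
    private final Map<String, Function<String, List<String>>> tokenizers =
        new HashMap<>();

    /** Registers (or replaces) a tokenizer under the given name. */
    void register(String name, Function<String, List<String>> tokenizer) {
        tokenizers.put(name, tokenizer);
    }

    /** Tokenizes text with the named tokenizer, failing fast on unknown names. */
    List<String> tokenize(String name, String text) {
        Function<String, List<String>> tokenizer = tokenizers.get(name);
        if (tokenizer == null) {
            throw new IllegalArgumentException("Unknown tokenizer: " + name);
        }
        return tokenizer.apply(text);
    }
}
```

A caller could then register, for example, a `"path"` tokenizer as `text -> Arrays.asList(text.split("[\\s/]+"))` and select it by name, while the default behaviour stays registered under another name.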
   
   Some discussion: https://apache-pinot.slack.com/archives/CDRCA57FC/p1677766557802739


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on issue #10374: Feature Request: Support for hooking different tokenizer or configuring existing for enhance text search capabilities

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #10374:
URL: https://github.com/apache/pinot/issues/10374#issuecomment-1454075212

   cc @siddharthteotia @atris
   From reading the code, it seems the tokenizer is hardcoded to `StandardAnalyzer`.




[GitHub] [pinot] rohityadav1993 commented on issue #10374: Feature Request: Support for hooking different tokenizer or configuring existing for enhance text search capabilities

Posted by "rohityadav1993 (via GitHub)" <gi...@apache.org>.
rohityadav1993 commented on issue #10374:
URL: https://github.com/apache/pinot/issues/10374#issuecomment-1471721832

   I can take this task up. I connected offline with @atris, who can help me implement it.




[GitHub] [pinot] rohityadav1993 commented on issue #10374: Feature Request: Support for hooking different tokenizer or configuring existing for enhance text search capabilities

Posted by "rohityadav1993 (via GitHub)" <gi...@apache.org>.
rohityadav1993 commented on issue #10374:
URL: https://github.com/apache/pinot/issues/10374#issuecomment-1586120672

   Hi @Jackie-Jiang, I wasn't able to pick this up. If this is a priority, it can be assigned to someone who is available to implement it.




Re: [I] Feature Request: Support for hooking different tokenizer or configuring existing for enhance text search capabilities [pinot]

Posted by "hpvd (via GitHub)" <gi...@apache.org>.
hpvd commented on issue #10374:
URL: https://github.com/apache/pinot/issues/10374#issuecomment-1943314168

   @rohityadav1993 since this is pretty interesting, is there any news on this?

