Posted to issues@lucene.apache.org by "Dishant Sharma (Jira)" <ji...@apache.org> on 2022/04/19 04:41:00 UTC

[jira] [Updated] (LUCENE-10522) issue with pattern capture group token filter

     [ https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dishant Sharma updated LUCENE-10522:
------------------------------------
    Description: 
The default pattern capture token filter in Elasticsearch gives every generated token the same start and end offsets: those of the input string. Is there any way to set the start and end offsets of each generated token to the position at which it is found in the input string? The problem I'm currently facing is that highlighting marks the entire string instead of just the match.

I am getting all the tokens using the regexes that I have created; the only issue is that all the tokens have the same start and end offsets as the input string.
I am using the pattern capture token filter along with the whitespace tokenizer. Suppose I have the text:
"Website url is https://www.google.com/"
Then the desired tokens are:
Website, url, is, https://www.google.com/, https, www, google, com, https:, https:/, https://, /www, .google, .com, www., google., com/, www google.com etc.
I am getting all these tokens through my regexes; the only issue is with the offsets. Suppose the start and end offsets of the entire URL "https://www.google.com/" are 0 and 23; the filter then reports 0 and 23 for every generated token.
My use case relies on highlighting, which I need to highlight each generated token inside the text. Instead of highlighting only the match inside the text, it highlights the entire input text.
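
To make the requested behaviour concrete, here is a minimal Python sketch of the offset arithmetic only. It is not Lucene or Elasticsearch code: the function name capture_group_tokens, the single pattern (\w+), and the tuple layout are assumptions made up for illustration. The idea is that each capture group's offsets are the input token's start offset plus the group's position inside that token.

```python
import re

def capture_group_tokens(token_text, token_start, patterns):
    """Hypothetical helper: return (term, start, end) for each capture group,
    with offsets shifted by the input token's start offset in the full text."""
    tokens = []
    for pattern in patterns:
        for match in re.finditer(pattern, token_text):
            for group in range(1, match.re.groups + 1):
                term = match.group(group)
                if term:  # skip groups that did not participate or matched ""
                    tokens.append((term,
                                   token_start + match.start(group),
                                   token_start + match.end(group)))
    return tokens

text = "Website url is https://www.google.com/"
url = "https://www.google.com/"
url_start = text.index(url)  # offset of the URL token in the full text

# Illustrative pattern only; the reporter's actual regexes are not shown.
tokens = capture_group_tokens(url, url_start, [r"(\w+)"])
for term, start, end in tokens:
    print(term, start, end)
```

With offsets computed this way, slicing the original text with a token's start and end yields exactly that token, which is what a highlighter would need in order to mark only the match.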

> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical


