Posted to issues@lucene.apache.org by "Dishant Sharma (Jira)" <ji...@apache.org> on 2022/04/19 05:05:00 UTC

[jira] [Commented] (LUCENE-10522) issue with pattern capture group token filter

    [ https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524054#comment-17524054 ] 

Dishant Sharma commented on LUCENE-10522:
-----------------------------------------

Can I use the OffsetAttribute somewhere in the code of PatternCaptureGroupTokenFilter.java and set the start and end offsets to those of the match found in the input string?
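
For context, here is a minimal sketch of the kind of change I have in mind. This is not the stock PatternCaptureGroupTokenFilter; the class name and the approach (wrapping the stream and narrowing offsets per match via OffsetAttribute) are only my assumptions about how it could be done:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

// Hypothetical filter: emits the original token, then one extra token per regex
// match inside it, with offsets narrowed to the match boundaries so that
// highlighters mark only the matched substring.
public final class OffsetAwareCaptureFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

    private final Pattern pattern;
    private Matcher matcher;                    // matcher over the current parent token
    private String currentTerm;                 // text of the current parent token
    private int currentStart;                   // start offset of the current parent token
    private AttributeSource.State parentState;  // captured attributes of the parent token

    public OffsetAwareCaptureFilter(TokenStream input, Pattern pattern) {
        super(input);
        this.pattern = pattern;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // First try to emit another match from the current parent token.
        if (matcher != null && matcher.find()) {
            restoreState(parentState);
            termAtt.setEmpty().append(currentTerm, matcher.start(), matcher.end());
            // Narrow the offsets to the match, relative to the parent token's start.
            offsetAtt.setOffset(currentStart + matcher.start(), currentStart + matcher.end());
            posIncAtt.setPositionIncrement(0); // stack on the parent token's position
            return true;
        }
        // Otherwise advance to the next parent token and pass it through unchanged.
        if (!input.incrementToken()) {
            return false;
        }
        currentTerm = termAtt.toString();
        currentStart = offsetAtt.startOffset();
        parentState = captureState();
        matcher = pattern.matcher(currentTerm);
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        matcher = null;
        parentState = null;
    }
}

If offsets are narrowed like this, a highlighter that relies on token offsets should mark only the matched substring instead of the whole token. Whether an equivalent OffsetAttribute call could simply be added inside PatternCaptureGroupTokenFilter itself is what I would like to confirm.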

> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical
>
> The default pattern capture token filter in Elasticsearch gives every generated token the same start and end offsets: those of the input string. Is there any way to change the start and end offsets of each generated token to the positions at which it is found in the input string? The issue I'm currently facing is that, when highlighting, it highlights the entire string instead of the match.
> The code inside my token filter factory file is:
>  
> package pl.allegro.tech.elasticsearch.index.analysis.pl;
>
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;
> import org.elasticsearch.common.settings.Settings;
> import org.elasticsearch.env.Environment;
> import org.elasticsearch.index.IndexSettings;
> import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
>
> import java.util.regex.Pattern;
>
> public class PuAlPuTokenFilterFactory extends AbstractTokenFilterFactory {
>
>     public PuAlPuTokenFilterFactory(IndexSettings indexSettings, Environment environment, String name, Settings settings) {
>         super(indexSettings, name, settings);
>     }
>
>     @Override
>     public TokenStream create(TokenStream tokenStream) {
>         return new PatternCaptureGroupTokenFilter(tokenStream, true,
>                 Pattern.compile("(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})"));
>     }
> }
>  
> I have multiple such token filter factories in my code, each containing the same code as above but passing a different pattern to the PatternCaptureGroupTokenFilter constructor. Each pattern produces a different set of tokens, as per my requirements.
> I am using Lucene's default PatternCaptureGroupTokenFilter.
> I am not using any mapping, but I am using the index settings below as per my use case:
> "settings" : \{
>       "analysis" : {
>          "analyzer" : {
>             "special_analyzer" : {
>                "tokenizer" : "whitespace",
>                "filter" : [ "url-filter-1", "url-filter-2", "url-filter-3", "url-filter-4", "url-filter-5", "url-filter-6", "url-filter-7", "url-filter-8", "url-filter-9", "url-filter-10", "url-filter-11", "unique" ]
>             }
>          }
>       }
>    }
>  
> I am getting all the tokens using the regexes that I have created, but all of the tokens have the same start and end offsets as the input string.
> I am using the pattern token filter along with the whitespace tokenizer. Suppose I have the text: "Website url is https://www.google.com/"
> Then, the desired tokens are:
> Website, url, is, https://www.google.com/, https, www, google, com, https:, https:/, https://, /www, .google, .com, www., google., com/, www.google.com, etc.
> I am getting all of these tokens through my regexes; the only issue is with the offsets. Suppose the start and end offsets of the entire URL "https://www.google.com/" are 0 and 23; then it gives 0 and 23 for every generated token.
> As per my use case, I am using the highlighting functionality to highlight all the generated tokens inside the text, but instead of highlighting only the match inside the text, it highlights the entire input text.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org