Posted to java-user@lucene.apache.org by Michael Sokolov <ms...@gmail.com> on 2018/04/20 13:13:35 UTC

WordDelimiterGraphFilter does not respect KeywordAttribute

I have a use case that generates some tokens containing punctuation
(fractions and other numerical constructs). I handle most punctuation
with WordDelimiterGraphFilter, which decomposes those tokens into parts
and then recombines them, so that, e.g., 1/2 becomes {1, 2, 12}. At first
I thought I could mark those tokens as keywords to prevent any further
analysis, but I discovered that WDGF ignores the keyword flag.

As a workaround, I use Arabic-Indic digits as separators instead of
punctuation (1/2 -> 1١2). These are classified as digits, so WDGF does
not split on them. Someday, though, I would like to support numbers
written in Arabic (or Hindi) as well, and then this hack will bite me.
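
To see why the workaround holds, here is a minimal sketch using only the
JDK. It assumes that WDGF's default character classification follows the
Unicode general category for characters outside Latin-1; that is an
assumption about Lucene internals, not something confirmed in this thread.
The class and method names are hypothetical.

```java
// Hypothetical helper, not part of Lucene: compares how the JDK's Unicode
// tables classify '/' and the Arabic-Indic digit used as a separator.
final class DigitSeparatorCheck {

    /** True if the code point has Unicode general category Nd (decimal digit). */
    static boolean isDecimalDigit(int codePoint) {
        return Character.getType(codePoint) == Character.DECIMAL_DIGIT_NUMBER;
    }

    public static void main(String[] args) {
        int slash = '/';
        int arabicIndicOne = 0x0661; // ARABIC-INDIC DIGIT ONE

        System.out.println(isDecimalDigit(slash));              // false
        System.out.println(isDecimalDigit(arabicIndicOne));     // true
        System.out.println(Character.digit(arabicIndicOne, 10)); // 1
    }
}
```

The last line also shows the weakness: the separator has a numeric value
of its own, so it cannot be told apart from a real digit once text in
that script has to be indexed.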

Does it seem reasonable to update WDGF (and its cousin WDF) to respect
KeywordAttribute? I think it can be done with a very small change.
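
The shape of such a change can be sketched with a toy model. This is not
Lucene code and not the eventual patch: the class, the method, and the
boolean parameter standing in for KeywordAttribute.isKeyword() are all
hypothetical, and the splitting rule is a deliberate simplification of
what WDGF does.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a word-delimiter step that honors a keyword flag.
final class KeywordAwareSplitSketch {

    /**
     * Returns the terms produced for one token. Keyword tokens pass through
     * untouched; other tokens are split on non-letter, non-digit characters,
     * and the concatenation of the parts is appended when there is more
     * than one part.
     */
    static List<String> process(String text, boolean isKeyword) {
        List<String> out = new ArrayList<>();
        if (isKeyword) {
            out.add(text); // the proposed early exit
            return out;
        }
        StringBuilder joined = new StringBuilder();
        for (String part : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!part.isEmpty()) {
                out.add(part);
                joined.append(part);
            }
        }
        if (out.size() > 1) {
            out.add(joined.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(process("1/2", false)); // [1, 2, 12]
        System.out.println(process("1/2", true));  // [1/2]
    }
}
```

In a real TokenFilter the early exit would presumably sit at the top of
incrementToken(), returning the current token unchanged when its keyword
attribute is set.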

Re: WordDelimiterGraphFilter does not respect KeywordAttribute

Posted by Michael Sokolov <ms...@gmail.com>.
OK, I opened https://issues.apache.org/jira/browse/LUCENE-8265 and will
submit a PR soon.

On Sat, Apr 21, 2018 at 3:56 PM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> +1
>
> Mike

Re: WordDelimiterGraphFilter does not respect KeywordAttribute

Posted by Michael McCandless <lu...@mikemccandless.com>.
+1

Mike