Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2017/01/07 10:38:58 UTC

[jira] [Updated] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

     [ https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-7622:
---------------------------------------
    Attachment: LUCENE-7622.patch

Here's a simple patch ... but I don't plan on pursuing this further now ... I think it's maybe too anal to insist on this from all analyzers ... so I'm posting the patch here in case anyone else gets itchy!

> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-7622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple: it catches any case where the same term text starts at the same position with the same position length. Such duplicate tokens are silly to add to the index, or to search for at search time.
> Yet this change produced many failures. I looked briefly at them, and some are cases that I think are actually OK, e.g. {{PatternCaptureGroupTokenFilter}} capturing {{(..)(..)}} on the string {{ktkt}} will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
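
For anyone picking this up, here is a rough, self-contained sketch of the kind of check described in the issue. It is not the attached patch and does not use any Lucene classes: {{Token}}, {{findDuplicates}} and {{captureGroups}} are hypothetical names made up for illustration, and the sketch assumes that the captured groups are stacked at the same position (position increment 0 after the first) with position length 1.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateTokenSketch {

    /** Hypothetical stand-in for one emitted token; NOT a Lucene class. */
    public static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /**
     * Returns a "position:positionLength:term" key for every token that repeats
     * an earlier token with the same term, start position and position length.
     */
    public static List<String> findDuplicates(List<Token> tokens) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        int position = -1; // absolute position, accumulated from the increments
        for (Token t : tokens) {
            position += t.positionIncrement;
            String key = position + ":" + t.positionLength + ":" + t.term;
            if (!seen.add(key)) {
                duplicates.add(key);
            }
        }
        return duplicates;
    }

    /**
     * Rough imitation of a capture-group filter (an assumption, not the real
     * filter): each captured group becomes a token stacked at one position.
     */
    public static List<Token> captureGroups(String regex, String input) {
        List<Token> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                int posInc = out.isEmpty() ? 1 : 0; // first token advances, rest stack
                out.add(new Token(m.group(g), posInc, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Both groups of (..)(..) capture "kt" from "ktkt", so one duplicate is reported.
        System.out.println(findDuplicates(captureGroups("(..)(..)", "ktkt"))); // prints [0:1:kt]
    }
}
```

Under those assumptions the {{ktkt}} case is flagged even though the filter is arguably doing what it was asked to do, which is the tension described in the issue.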



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
