Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2014/02/19 17:49:23 UTC

[jira] [Reopened] (LUCENE-5447) StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet

     [ https://issues.apache.org/jira/browse/LUCENE-5447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Rowe reopened LUCENE-5447:
--------------------------------

    Lucene Fields: New,Patch Available  (was: New)

Looking at the committed diffs (JIRA was down last night and earlier today, so the lucene_solr_4_7 commit didn't put a comment on this issue, which sucks), I see that I didn't fully patch StandardTokenizerImpl.jflex, although I *did* correctly patch UAX29URLEmailTokenizerImpl.jflex, which is basically a superset of StandardTokenizerImpl.jflex.

I've added some more tests to show the problem (the existing tests didn't fail); patch forthcoming.  Here's an example that should be split by StandardTokenizer but currently isn't - the issue is triggered by a preceding char matching {{Word_Break = ExtendNumLet}}, e.g. the underscore character:

{{A:B_A::B}} <- left intact, but should output "{{A:B_A}}", "{{B}}"

By contrast, the current UAX29URLEmailTokenizer gets the above right.
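For reference, here's a minimal standalone sketch (not part of the patch) showing how a case like this can be checked by hand: it runs StandardTokenizer over the input and prints the emitted terms.  The class name and the use of {{Version.LUCENE_47}} are just illustrative assumptions, not anything from the committed code.

{noformat}
// Illustrative sketch only: tokenize "A:B_A::B" with StandardTokenizer and
// print each emitted term.  After the fix this should print "A:B_A" then "B".
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WordBreakCheck {
  public static void main(String[] args) throws Exception {
    StandardTokenizer tokenizer =
        new StandardTokenizer(Version.LUCENE_47, new StringReader("A:B_A::B"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{noformat}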

In the JFlex 1.5.0 release, I added the ability to include external files into the rules section of the scanner specification, and I want to take advantage of this to refactor StandardTokenizer and UAX29URLEmailTokenizer so that there is only one definition of the shared rules.  (That would have prevented the problem for which I'm reopening this issue.)  I'll make a separate issue for that.

> StandardTokenizer should break at consecutive chars matching Word_Break = MidLetter, MidNum and/or MidNumLet
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-5447
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5447
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.6.1
>            Reporter: Steve Rowe
>            Assignee: Steve Rowe
>             Fix For: 4.7, 5.0
>
>         Attachments: LUCENE-5447-test.patch, LUCENE-5447.patch, LUCENE-5447.patch
>
>
> StandardTokenizer should split all of the following sequences into two tokens each, but they are all instead kept intact and output as single tokens:
> {noformat}
> "A::B"           (':' is in \p{Word_Break = MidLetter})
> "1..2", "A..B"   ('.' is in \p{Word_Break = MidNumLet})
> "A.:B"
> "A:.B"
> "1,,2"           (',' is in \p{Word_Break = MidNum})
> "1,.2"
> "1.,2"
> {noformat}
> Unfortunately, the word break test data released with Unicode, e.g. for Unicode 6.3 [http://www.unicode.org/Public/6.3.0/ucd/auxiliary/WordBreakTest.txt], and incorporated into a versioned Lucene test, e.g. {{WordBreakTestUnicode_6_3_0}}, doesn't cover these cases.


