Posted to issues@lucene.apache.org by "Markus Jelsma (Jira)" <ji...@apache.org> on 2019/12/30 17:27:00 UTC

[jira] [Created] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

Markus Jelsma created LUCENE-9112:
-------------------------------------

             Summary: OpenNLP tokenizer is fooled by text containing spurious punctuation
                 Key: LUCENE-9112
                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: master (9.0)
            Reporter: Markus Jelsma
             Fix For: master (9.0)


The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as three dots trailing a sentence...

# the first dot becomes part of the token, so 'sentence.' (including the dot) becomes the token
# much further down the text, a seemingly unrelated token is then suddenly split up; in my example the name 'Baron' is split into 'Baro' and 'n'. This is the real problem.

The problem never seems to occur with small texts in unit tests, but it certainly does with real-world examples. Depending on the number of 'spurious' dots, a completely different term gets split, or the same term at a different location.

I am not sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test that demonstrates it.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org