Posted to issues@lucene.apache.org by "Markus Jelsma (Jira)" <ji...@apache.org> on 2019/12/31 13:23:00 UTC

[jira] [Comment Edited] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation

    [ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17006089#comment-17006089 ] 

Markus Jelsma edited comment on LUCENE-9112 at 12/31/19 1:22 PM:
-----------------------------------------------------------------

I now believe it is a problem in the Lucene code, namely -it being fooled by a punctuation mark and then- something being mishandled in the internal buffer of SegmentingTokenizerBase. The buffer size is 1024 and the point where my term is split is exactly the 1024th character of the String.

Simply increasing BUFFERMAX 'solves' the problem I have, but I don't know where the underlying problem really lies.

edit: I adjusted the text so it no longer needs spurious punctuation marks for a term to be split. It always splits at the 1024th character; it is just that in some cases that character already is a whitespace.
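For context, here is a minimal sketch of how an input can be built so that a known token straddles the 1024-character boundary (the BUFFERMAX mentioned above). The filler sentence, the probe token and the boundary value are illustration choices, not the text from the attached test.

{code:java}
// Sketch: build an input where a known token straddles the 1024-character
// boundary. The filler sentence and probe token are arbitrary choices.
public class BoundaryInput {

  static String straddling(String token, int boundary) {
    StringBuilder sb = new StringBuilder();
    while (sb.length() < boundary) {
      sb.append("Some ordinary filler text. ");
    }
    sb.setLength(boundary - 3);                 // may cut a filler word, which is fine here
    sb.append(' ').append(token).append(" follows the boundary.");
    return sb.toString();
  }

  public static void main(String[] args) {
    String text = straddling("Baron", 1024);
    int start = text.indexOf("Baron");
    // 'Baron' spans indices 1022..1026, so the 1024th character (index 1023)
    // falls inside the token.
    System.out.println("token starts at index " + start + ", text length " + text.length());
  }
}
{code}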


was (Author: markus17):
I now believe it is a problem in the Lucene code, namely it being fooled by a punctuation mark and then something being mishandled in the internal buffer of SegmentingTokenizerBase. The buffer size is 1024 and the point where my term is split is exactly the 1024th character of the String.

Simply increasing BUFFERMAX 'solves' the problem I have, but I don't know where the underlying problem really lies.

> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
>                 Key: LUCENE-9112
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9112
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: master (9.0)
>            Reporter: Markus Jelsma
>            Priority: Major
>              Labels: opennlp
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9112-unittest.patch, LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the token
> # much further down the text, a seemingly unrelated token is suddenly split up; in my example (see attached unit test) the name 'Baron' is split into 'Baro' and 'n', and this is the real problem
> The problem never seems to occur with small texts in unit tests, but it certainly does in real-world examples. Depending on how many 'spurious' dots there are, a completely different term can become split, or the same term at just a different location.
> I am not too sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test proving it.
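To make the symptom described in the quoted report visible without asserting exact model output, a small diagnostic sketch can dump every produced term together with its offsets; a term ending right at offset 1024 followed by a one-character remainder is the split in question. The analyzer, the field name and the input text are placeholders for whatever chain wires in the OpenNLP tokenizer, not the attached LUCENE-9112-unittest.patch.

{code:java}
// Diagnostic sketch: print each term with its start/end offsets so a split
// such as 'Baro' + 'n' around offset 1024 is easy to spot. The 'analyzer'
// argument is assumed to be a chain using the OpenNLP tokenizer.
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

final class DumpTokens {
  static void dump(Analyzer analyzer, String text) throws IOException {
    try (TokenStream ts = analyzer.tokenStream("body", text)) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term + "\t[" + offsets.startOffset() + "," + offsets.endOffset() + ")");
      }
      ts.end();
    }
  }
}
{code}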


