Posted to issues@lucene.apache.org by "Steven Rowe (Jira)" <ji...@apache.org> on 2019/12/30 20:05:00 UTC
[jira] [Commented] (LUCENE-9112) OpenNLP tokenizer is fooled by text containing spurious punctuation
[ https://issues.apache.org/jira/browse/LUCENE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17005798#comment-17005798 ]
Steven Rowe commented on LUCENE-9112:
-------------------------------------
Your unit test depends on a test model created with very little training data (< 100 sentences; see {{opennlp/src/tools/test-model-data/tokenizer.txt}}), so it's not at all surprising that you see weird behavior. I would not consider this indicative of a bug in Lucene's OpenNLP support.
I think you should open an OPENNLP issue for this problem, but it's likely that the most you'll get from them is a pointer to the training data they used to create the model they publish. The most likely outcome is that you will have to create a training set that performs better against data you see, and then create a model from that. If you can do that in a way that is shareable with other OpenNLP users, I'm sure they would be interested in your contribution.
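As a rough illustration of what building such a training set could involve: OpenNLP's tokenizer trainer reads one sentence per line, with tokens separated either by whitespace or by a {{<SPLIT>}} marker where two tokens touch. The Python sketch below is a hypothetical helper (the name {{to_training_line}} and the choice to treat a trailing '...' as a single token are assumptions, not anything taken from Lucene or OpenNLP) that turns a sentence plus its intended tokens into one such line.

```python
SPLIT = "<SPLIT>"  # marker OpenNLP's tokenizer training format uses between touching tokens


def to_training_line(text, tokens):
    """Render one sentence as a tokenizer training line (hypothetical helper).

    Tokens separated by whitespace in `text` stay separated by one space;
    tokens that touch in `text` are joined with the <SPLIT> marker.
    """
    parts = []
    pos = 0
    for i, tok in enumerate(tokens):
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found after offset {pos}")
        gap = text[pos:start]
        if gap.strip():
            # Non-whitespace characters not covered by any token.
            raise ValueError(f"untokenized text {gap!r} before {tok!r}")
        if i > 0:
            parts.append(" " if gap else SPLIT)
        parts.append(tok)
        pos = start + len(tok)
    if text[pos:].strip():
        raise ValueError(f"untokenized text {text[pos:]!r} at end of sentence")
    return "".join(parts)


if __name__ == "__main__":
    # Assumed gold tokenization: the trailing dots form one token.
    print(to_training_line("This is a sentence...",
                           ["This", "is", "a", "sentence", "..."]))
```

Lines produced this way, covering the punctuation patterns that show up in real data, could then be added to the file used to train the tokenizer model. Deciding what the correct tokens are for each sentence remains a manual step.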
> OpenNLP tokenizer is fooled by text containing spurious punctuation
> -------------------------------------------------------------------
>
> Key: LUCENE-9112
> URL: https://issues.apache.org/jira/browse/LUCENE-9112
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: master (9.0)
> Reporter: Markus Jelsma
> Priority: Major
> Labels: opennlp
> Fix For: master (9.0)
>
> Attachments: LUCENE-9112-unittest.patch
>
>
> The OpenNLP tokenizer shows weird behaviour when text contains spurious punctuation, such as triple dots trailing a sentence...
> # the first dot becomes part of the token, so 'sentence.' becomes the token
> # much further down the text, a seemingly unrelated token is then suddenly split up; in my example (see attached unit test) the name 'Baron' is split into 'Baro' and 'n'. This is the real problem.
> The problems never seem to occur when using small texts in unit tests, but they certainly do in real-world examples. Depending on the number of 'spurious' dots, a completely different term can be split, or the same term in a different location.
> I am not too sure whether this is actually a problem in the Lucene code, but it is a problem, and I have a Lucene unit test demonstrating it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)