Posted to dev@lucene.apache.org by "Cao Manh Dat (JIRA)" <ji...@apache.org> on 2015/07/08 16:42:05 UTC
[jira] [Updated] (LUCENE-6595) CharFilter offsets correction is wonky
[ https://issues.apache.org/jira/browse/LUCENE-6595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cao Manh Dat updated LUCENE-6595:
---------------------------------
Attachment: LUCENE-6595.patch
Refactored some code inside BaseCharFilter to make it cleaner. I think this patch is final.
[~mikemccand] I changed
{code}
addOffCorrectMap(off, cumulativeDiff, 0);
{code}
to
{code}
addOffCorrectMap(off, cumulativeDiff, off);
{code}
But it fails some of the tests in HTMLStripCharFilterTest. I'm not sure what is going on in HTMLStripCharFilter.
> CharFilter offsets correction is wonky
> --------------------------------------
>
> Key: LUCENE-6595
> URL: https://issues.apache.org/jira/browse/LUCENE-6595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Attachments: LUCENE-6595.patch, LUCENE-6595.patch, LUCENE-6595.patch
>
>
> Spinoff from this original Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/11726
> If I make a MappingCharFilter with these mappings:
> {noformat}
> ( ->
> ) ->
> {noformat}
> i.e., just erase left and right paren, then tokenizing the string
> "(F31)" with e.g. WhitespaceTokenizer, produces a single token F31,
> with start offset 1 (good).
> But for its end offset, I would expect/want 4, but it produces 5
> today.
> This can be easily explained given how the mapping works: each time a
> mapping rule matches, we update the cumulative offset difference,
> conceptually as an array like this (it's encoded more compactly):
> {noformat}
> Output offset: 0 1 2 3
> Input offset:  1 2 3 5
> {noformat}
> When the tokenizer produces F31, it assigns it startOffset=0 and
> endOffset=3 based on the characters it sees (F, 3, 1). It then asks
> the CharFilter to correct those offsets, mapping them backwards
> through the above arrays, which creates startOffset=1 (good) and
> endOffset=5 (bad).
> At first, to fix this, I thought this is an "off-by-1" and when
> correcting the endOffset we really should return
> 1+correct(outputEndOffset-1), which would return the correct value (4)
> here.
> But that's too naive, e.g. here's another example:
> {noformat}
> cccc -> cc
> {noformat}
> If I then tokenize cccc, today we produce the correct offsets (0, 4)
> but if we do this "off-by-1" fix for endOffset, we would get the wrong
> endOffset (2).
> I'm not sure what to do here...
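For anyone following along, the two examples in the description can be reproduced without Lucene. Below is a minimal sketch, not the BaseCharFilter code: it assumes the correction data can be modeled as a sorted map from output offset to cumulative difference (the description notes the real encoding is more compact). The names OffsetCorrectionSketch, addCorrection, naiveCorrectEnd, parenExample and ccccExample are made up for illustration, and the correction points are chosen to reproduce the offsets reported in the description.

```java
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model of CharFilter offset correction; not Lucene code.
class OffsetCorrectionSketch {
    // Output offset -> cumulative (input minus output) difference that applies
    // from that output offset onward.
    private final TreeMap<Integer, Integer> corrections = new TreeMap<>();

    public void addCorrection(int outputOffset, int cumulativeDiff) {
        corrections.put(outputOffset, cumulativeDiff);
    }

    // Map an output (filtered) offset back to an input (original) offset.
    public int correct(int outputOffset) {
        Map.Entry<Integer, Integer> entry = corrections.floorEntry(outputOffset);
        return entry == null ? outputOffset : outputOffset + entry.getValue();
    }

    // The "off-by-1" idea from the description: 1 + correct(outputEndOffset - 1).
    public int naiveCorrectEnd(int outputEndOffset) {
        return 1 + correct(outputEndOffset - 1);
    }

    // "(F31)" with "(" and ")" erased: the tokenizer sees "F31".
    public static OffsetCorrectionSketch parenExample() {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addCorrection(0, 1); // "(" removed before output offset 0
        s.addCorrection(3, 2); // ")" removed before output offset 3
        return s;
    }

    // "cccc" mapped to "cc": the tokenizer sees "cc".
    public static OffsetCorrectionSketch ccccExample() {
        OffsetCorrectionSketch s = new OffsetCorrectionSketch();
        s.addCorrection(2, 2); // two characters dropped by the end of the match
        return s;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch paren = parenExample();
        // Token F31 has output offsets (0, 3).
        System.out.println(paren.correct(0));         // 1 (good)
        System.out.println(paren.correct(3));         // 5 (bad, 4 is wanted)
        System.out.println(paren.naiveCorrectEnd(3)); // 4 (the naive fix helps here)

        OffsetCorrectionSketch cccc = ccccExample();
        // Token cc has output offsets (0, 2).
        System.out.println(cccc.correct(0));          // 0
        System.out.println(cccc.correct(2));          // 4 (correct today)
        System.out.println(cccc.naiveCorrectEnd(2));  // 2 (the naive fix breaks this)
    }
}
```

In this model a single backwards lookup cannot satisfy both cases: the paren example wants the end offset resolved from the last character of the token, while the cccc example wants it resolved from the position just past the token.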