Posted to java-user@lucene.apache.org by Alexey Makeev <ma...@mail.ru.INVALID> on 2016/11/14 08:49:20 UTC

Re[2]: Too long token is not handled properly?

Hello Steve,

Thanks for the detailed answer and the pointers to the issues. I appreciate the great amount of work you put into making the tokenizer handle corner cases, and I think disabling buffer auto-expansion is the right thing to do.
But, please correct me if I'm wrong, this change of semantics (which has implications from the user's point of view) was a workaround for a performance problem? If there weren't the performance problem, it would be better to keep the original semantics?
E.g. suppose we're indexing the text <255 random letters>bob <255 random letters>ed; with the current implementation we'll have the tokens bob and ed in the index. But from the user's point of view that's unexpected: neither Bob nor Ed was mentioned in the text.
A higher maxTokenLength + LengthFilter could solve this, but I think it's a workaround too. What value of maxTokenLength should I set? 1M? But what if there is a 2M token in the text?
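For concreteness, the kind of workaround I mean would look roughly like the hypothetical sketch below (untested; the class name, the 1M buffer size and the 255-character keep limit are just things I picked for illustration, not recommendations):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Rough, hypothetical sketch of the "higher maxTokenLength + LengthFilter" workaround.
public class BigBufferAnalyzer extends Analyzer
{
    private static final int BUFFER_LIMIT = 1024 * 1024; // 1M -- but what about a 2M token?
    private static final int KEEP_MAX = 255;             // drop anything longer than this

    @Override
    protected TokenStreamComponents createComponents(String fieldName)
    {
        final StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setMaxTokenLength(BUFFER_LIMIT);        // tokens shorter than 1M are no longer split
        final TokenStream filtered = new LengthFilter(tokenizer, 1, KEEP_MAX); // then discard the long ones
        return new TokenStreamComponents(tokenizer, filtered);
    }
}

This restores something close to the old behaviour for tokens shorter than 1M, but it breaks down again as soon as a token exceeds whatever buffer size I pick.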
I agree it's a difficult task to make the JFlex code able to silently skip too-long tokens. I've scheduled an attempt to fix it for myself some months from now, with the following approach. When we encounter the situation where the buffer is full and there could still be a bigger match, we enter a "skipping" mode. In skipping mode the full buffer is emptied, the corresponding indexes (zzEndRead and others) are corrected, and matching continues. When we hit the maximal match, skipping mode is finished without returning a token, and after yet another index correction we enter normal mode. This approach to JFlex matching won't work in general, but I suppose it'll work for the tokenizer, because I didn't see any backtracking in the code (zzCurrentPos never backtracks over non-processed characters).
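To make the idea a bit more concrete, here is a standalone toy sketch of the buffer handling I have in mind. It is not real JFlex-generated code and I haven't tried to wire it into StandardTokenizerImpl yet; the zz* names only mimic the generated ones, and the real scanner has more indexes to correct (zzStartRead, zzMarkedPos and so on):

import java.io.IOException;
import java.io.Reader;

// Toy model of the proposed "skipping" mode, not real generated scanner code;
// the zz* field names only mimic JFlex conventions.
class SkippingBufferSketch
{
    private final Reader zzReader;
    private final char[] zzBuffer;
    private int zzCurrentPos;   // next character to examine
    private int zzEndRead;      // number of valid characters in the buffer
    private boolean skipping;   // true while an over-long match is being discarded

    SkippingBufferSketch(Reader reader, int bufferSize)
    {
        this.zzReader = reader;
        this.zzBuffer = new char[bufferSize];
    }

    // Refill without growing the buffer: if it is full and a longer match is still
    // possible, empty it, correct the indexes, and remember that we are skipping.
    boolean refill() throws IOException
    {
        if (zzEndRead == zzBuffer.length)
        {
            skipping = true;
            zzCurrentPos = 0;   // indexes corrected as if the discarded chars were never read
            zzEndRead = 0;
        }
        final int numRead = zzReader.read(zzBuffer, zzEndRead, zzBuffer.length - zzEndRead);
        if (numRead <= 0)
        {
            return true;        // EOF
        }
        zzEndRead += numRead;
        return false;           // matching can continue
    }

    // Called when a maximal match is found: in skipping mode the match is swallowed
    // and we return to normal mode without emitting a token.
    boolean shouldEmitToken()
    {
        if (skipping)
        {
            skipping = false;
            return false;
        }
        return true;
    }
}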
It would be great to hear your thoughts on this idea.

Best regards,
Alexey Makeev
makeev_1c@mail.ru
>Friday, November 11, 2016 6:06 PM +03:00 from Steve Rowe <sa...@gmail.com>:
>
>Hi Alexey,
>
>The behavior you mention is an intentional change from the behavior in Lucene 4.9.0 and earlier, when tokens longer than maxTokenLength were silently ignored: see LUCENE-5897[1] and LUCENE-5400[2].
>
>The new behavior is as follows: Token matching rules are no longer allowed to match against input char sequences longer than maxTokenLength.  If a rule would match a sequence longer than maxTokenLength, but also matches at maxTokenLength chars or fewer, has the highest priority among all rules matching at that length, and no other rule matches more chars, then a token will be emitted for that rule at the matched length.  The rule-matching iteration then simply continues from that point as normal.  If the same rule matches against the remainder of the sequence that the first rule would have matched if maxTokenLength were longer, then another token of the matched length will be emitted, and so on.
>
>Note that this can result in effectively splitting the sequence at maxTokenLength intervals as you noted.
>
>You can fix the problem by setting maxTokenLength higher - this has the side effect of growing the buffer and not causing unwanted token splitting.  If this results in tokens larger than you would like, you can remove them with LengthFilter.
>
>FYI there is discussion on LUCENE-5897 about separating buffer size from maxTokenLength, starting here: < https://issues.apache.org/jira/browse/LUCENE-5897?focusedCommentId=14105729&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14105729 > - ultimately I decided that few people would benefit from the increased configuration complexity.
>
>[1]  https://issues.apache.org/jira/browse/LUCENE-5897
>[2]  https://issues.apache.org/jira/browse/LUCENE-5400
>
>--
>Steve
>www.lucidworks.com
>
>> On Nov 11, 2016, at 6:23 AM, Alexey Makeev < makeev_1c@mail.ru.INVALID > wrote:
>> 
>> Hello,
>> 
>> I'm using lucene 6.2.0 and expecting the following test to pass:
>> 
>> import org.apache.lucene.analysis.BaseTokenStreamTestCase;
>> import org.apache.lucene.analysis.standard.StandardTokenizer;
>> 
>> import java.io.IOException;
>> import java.io.StringReader;
>> 
>> public class TestStandardTokenizer extends BaseTokenStreamTestCase
>> {
>>     public void testLongToken() throws IOException
>>     {
>>         final StandardTokenizer tokenizer = new StandardTokenizer();
>>         final int maxTokenLength = tokenizer.getMaxTokenLength();
>> 
>>         // string contents: "a" repeated maxTokenLength+5 times, followed by " abc"
>>         final String longToken = new String(new char[maxTokenLength + 5]).replace("\0", "a") + " abc";
>> 
>>         tokenizer.setReader(new StringReader(longToken));
>> 
>>         assertTokenStreamContents(tokenizer, new String[]{"abc"});
>>         // actual stream contents: a 255-char token of "a"s, then "aaaaa", then "abc"
>>     }
>> }
>> 
>> It seems like StandardTokenizer considers a completely filled buffer to be a successfully extracted token (1), and also emits the tail of the too-long token as a separate token (2). Maybe (1) is disputable (I think it is a bug), but I think (2) is a bug.
>> 
>> Best regards,
>> Alexey Makeev
>>  makeev_1c@mail.ru
>
>



Re: Too long token is not handled properly?

Posted by Steve Rowe <sa...@gmail.com>.
Hi Alexey,

> On Nov 14, 2016, at 3:49 AM, Alexey Makeev <ma...@mail.ru.INVALID> wrote:
> 
> But, please correct me if I'm wrong, this change of semantics (which has implications from the user's point of view) was a workaround for a performance problem? If there weren't the performance problem, it would be better to keep the original semantics?

Yes, I think so too.

> E.g. suppose we're indexing the text <255 random letters>bob <255 random letters>ed; with the current implementation we'll have the tokens bob and ed in the index. But from the user's point of view that's unexpected: neither Bob nor Ed was mentioned in the text.
> A higher maxTokenLength + LengthFilter could solve this, but I think it's a workaround too. What value of maxTokenLength should I set? 1M? But what if there is a 2M token in the text?

Yes, that is a problem.  I suspect, though, that for people who have such data and are negatively impacted by split tokens (actually only the shorter trailing tokens from a split long sequence are problematic, since the leading max-length tokens can be stripped by LengthFilter), a CharFilter that removes such character sequences before tokenization, likely regex-based, is probably the best way to go for now.
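Just to sketch what I mean (untested, and the class name and the 256-character cutoff are arbitrary/hypothetical), something like a PatternReplaceCharFilter wired in via Analyzer.initReader() would drop over-long runs before the tokenizer ever sees them:

import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical sketch: strip runs of 256+ non-whitespace chars before tokenization.
public class StripLongRunsAnalyzer extends Analyzer {
  private static final Pattern LONG_RUN = Pattern.compile("\\S{256,}"); // arbitrary cutoff

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    return new TokenStreamComponents(new StandardTokenizer());
  }

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // Replace each over-long run with a space so there is nothing left to split.
    return new PatternReplaceCharFilter(LONG_RUN, " ", reader);
  }
}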

> I agree it's a difficult task to make the JFlex code able to silently skip too-long tokens. I've scheduled an attempt to fix it for myself some months from now, with the following approach. When we encounter the situation where the buffer is full and there could still be a bigger match, we enter a "skipping" mode. In skipping mode the full buffer is emptied, the corresponding indexes (zzEndRead and others) are corrected, and matching continues. When we hit the maximal match, skipping mode is finished without returning a token, and after yet another index correction we enter normal mode. This approach to JFlex matching won't work in general, but I suppose it'll work for the tokenizer, because I didn't see any backtracking in the code (zzCurrentPos never backtracks over non-processed characters).
> It would be great to hear your thoughts on this idea.

Patches welcome!  I’m not quite sure how you’ll be able to do this for arbitrary match points within arbitrary rules, but I think it’s worth exploring.

--
Steve
www.lucidworks.com

