You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2009/10/29 10:04:59 UTC

[jira] Issue Comment Edited: (LUCENE-2014) position increment bug: smartcn

    [ https://issues.apache.org/jira/browse/LUCENE-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771341#action_12771341 ] 

Robert Muir edited comment on LUCENE-2014 at 10/29/09 9:03 AM:
---------------------------------------------------------------

bq. But the question is, why does IndexWriter fail (how does it fail?). Normally it should not be affected, as the posIncr stays 1.

Oh, the IndexWriter fails because of integer overflow with any large document (lots of posIncr's get added up, overflow and create a negative posIncr)
so the negative posIncr creates an exception.

<edit>

Uwe I think this really happens especially because of the way smartcn works. 
smartcn creates individual tokens for each piece of punctuation (including things like whitespace), and puts these in the stopword list.
so if you have a chinese document with lots of space ... you can imagine how it can add up and overflow.

      was (Author: rcmuir):
    bq. But the question is, why does IndexWriter fail (how does it fail?). Normally it should not be affected, as the posIncr stays 1.

Oh, the IndexWriter fails because of integer overflow with any large document (lots of posIncr's get added up, overflow and create a negative posIncr)
so the negative posIncr creates an exception.
  
> position increment bug: smartcn
> -------------------------------
>
>                 Key: LUCENE-2014
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2014
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>         Attachments: LUCENE-2014.patch, LUCENE-2014.patch
>
>
> If i use LUCENE_VERSION >= 2.9 with smart chinese analyzer, it will crash indexwriter with any reasonable amount of chinese text.
> its especially annoying because it happens in 2.9.1 RC as well.
> this is because the position increments for tokens after stopwords are bogus:
> Here's an example (from test case), where the position increment should be 2, but is instead 91975314!
> {code}
>   public void testChineseStopWords2() throws Exception {
>     Analyzer ca = new SmartChineseAnalyzer(Version.LUCENE_CURRENT); /* will load stopwords */
>     String sentence = "Title:San"; // : is a stopword
>     String result[] = { "titl", "san"};
>     int startOffsets[] = { 0, 6 };
>     int endOffsets[] = { 5, 9 };
>     int posIncr[] = { 1, 2 };
>     assertAnalyzesTo(ca, sentence, result, startOffsets, endOffsets, posIncr);
>   }
> {code}
> junit.framework.AssertionFailedError: posIncrement 1 expected:<2> but was:<91975314>
> 	at junit.framework.Assert.fail(Assert.java:47)
> 	at junit.framework.Assert.failNotEquals(Assert.java:280)
> 	at junit.framework.Assert.assertEquals(Assert.java:64)
> 	at junit.framework.Assert.assertEquals(Assert.java:198)
> 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:83)
> 	...

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org