You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2012/03/24 15:30:24 UTC

[jira] [Created] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

HTMLStripCharFilter produces invalid final offset
-------------------------------------------------

                 Key: LUCENE-3913
                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 3.6, 4.0


Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237567#comment-13237567 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

Minimal test failure trigger appears to be "x</br>", where "x" can be any non-whitespace character, and "</br>" must be "</br>".  (No problems with "x<br>", "x</a>", etc.)

<br> and </br> are handled specially, so this should narrow it down.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe updated LUCENE-3913:
--------------------------------

    Attachment: LUCENE-3913.patch

Patch, a superset of Mike's:

* fixes the identified problem: {{</br>}} offset was improperly calculated.  (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
* adds a new case emitting {{<\s*(/\s*)?(br|script|style)>?}} to {{_TestUtil.randomHtmlishString()}}, because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
* adds a new method {{_TestUtil.randomlyRecaseCodePoints()}}, used by the above-mentioned new {{randomHtmlishString()}} case, to produce things like {{<Br>}}, {{</sCriPT>}}, etc.
* switches {{HTMLStripCharFilterTest.testRandomBrokenHTML()}} to use Mike's new {{BaseTokenStreamTestCase.checkAnalysisConsistency()}}.
* fixes the Jenkins test failure of {{HTMLStripCharFilterTest.testRandomHugeStrings()}} at [https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/]: {{ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"
}}

Committing shortly.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237550#comment-13237550 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

I can reproduce - I'm digging.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237696#comment-13237696 ] 

Robert Muir commented on LUCENE-3913:
-------------------------------------

thanks! let's just hope mike doesn't notice :)

-reproducibility policeman
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237571#comment-13237571 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

bq. Since we know those are special, a good idea for the future could be to add both those elements to randomHtmlishString

+1
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237568#comment-13237568 ] 

Robert Muir commented on LUCENE-3913:
-------------------------------------

Since we know those are special, a good idea for the future could be to add both
those elements to randomHtmlishString
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237545#comment-13237545 ] 

Michael McCandless commented on LUCENE-3913:
--------------------------------------------

I forgot to say: patch is against 3.x.

                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237689#comment-13237689 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

bq. I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales

Hmm, I didn't think of the reproducibility angle.  I'll fix.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237654#comment-13237654 ] 

Robert Muir commented on LUCENE-3913:
-------------------------------------

Thank you!

I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?

Otherwise you will find this gives interesting things for <script> in the Turkish Locale :)
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237548#comment-13237548 ] 

Michael McCandless commented on LUCENE-3913:
--------------------------------------------

Good idea!  I'll fix that test case.

Here's the failure output:

{noformat}
    [junit] ------------- Standard Error -----------------
    [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
    [junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
    [junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk
    [junit] NOTE: all tests run in this JVM:
    [junit] [HTMLStripCharFilterTest]
    [junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864
    [junit] ------------- ---------------- ---------------
    [junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest):	FAILED
    [junit] finalOffset  expected:<20> but was:<19>
    [junit] junit.framework.AssertionFailedError: finalOffset  expected:<20> but was:<19>
    [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975)
    [junit] 	at junit.framework.TestResult.addError(TestResult.java:38)
    [junit] 	at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
    [junit] 	at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100)
    [junit] 	at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41)
    [junit] 	at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97)
    [junit] 	at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26)
    [junit] 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267)
    [junit] 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146)
    [junit] 	at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
    [junit] 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    [junit] 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    [junit] 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    [junit] 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    [junit] 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    [junit] 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    [junit] 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    [junit] 	at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
    [junit] 	at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36)
    [junit] 	at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
    [junit] 	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    [junit] 	at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    [junit] 	at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
    [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
    [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
    [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
    [junit] Caused by: java.lang.AssertionError: finalOffset  expected:<20> but was:<19>
    [junit] 	at org.junit.Assert.fail(Assert.java:93)
    [junit] 	at org.junit.Assert.failNotEquals(Assert.java:647)
    [junit] 	at org.junit.Assert.assertEquals(Assert.java:128)
    [junit] 	at org.junit.Assert.assertEquals(Assert.java:472)
    [junit] 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182)
    [junit] 	at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574)
    [junit] 	at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550)
    [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    [junit] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit] 	at java.lang.reflect.Method.invoke(Method.java:597)
    [junit] 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    [junit] 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    [junit] 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    [junit] 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    [junit] 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    [junit] 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    [junit] 	at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:636)
    [junit] 	at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:542)
    [junit] 	at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
    [junit] 	at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:458)
    [junit] 	at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
    [junit] 	at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:516)
    [junit] 	at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    [junit] 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    [junit] 	... 19 more
    [junit] 
{noformat}

Note that instead of -Dtests.testmethod=null you should pass -Dtests.testmethod=testOddHTMLString.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237546#comment-13237546 ] 

Robert Muir commented on LUCENE-3913:
-------------------------------------

I like the refactored method: though I think we should also call it from
HTMLStripCharFilterTest.testRandomBrokenHTML?
Currently this only does:
{code}
while (reader.read() != -1);
{code}
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Issue Comment Edited] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237648#comment-13237648 ] 

Steven Rowe edited comment on LUCENE-3913 at 3/24/12 8:03 PM:
--------------------------------------------------------------

Patch, a superset of Mike's:

* fixes the identified problem: {{</br>}} offset was improperly calculated.  (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
* adds a new case emitting {{<\s*(/\s*)?(br|script|style)>?}} to {{_TestUtil.randomHtmlishString()}}, because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
* adds a new method {{_TestUtil.randomlyRecaseCodePoints()}}, used by the above-mentioned new {{randomHtmlishString()}} case, to produce things like {{<Br>}}, {{</sCriPT>}}, etc.
* switches {{HTMLStripCharFilterTest.testRandomBrokenHTML()}} to use Mike's new {{BaseTokenStreamTestCase.checkAnalysisConsistency()}}.
* fixes the Jenkins test failure of {{HTMLStripCharFilterTest.testRandomHugeStrings()}} at [https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/]:

{noformat}
ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"
{noformat}

Committing shortly.
                
      was (Author: steve_rowe):
    Patch, a superset of Mike's:

* fixes the identified problem: {{</br>}} offset was improperly calculated.  (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
* adds a new case emitting {{<\s*(/\s*)?(br|script|style)>?}} to {{_TestUtil.randomHtmlishString()}}, because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
* adds a new method {{_TestUtil.randomlyRecaseCodePoints()}}, used by the above-mentioned new {{randomHtmlishString()}} case, to produce things like {{<Br>}}, {{</sCriPT>}}, etc.
* switches {{HTMLStripCharFilterTest.testRandomBrokenHTML()}} to use Mike's new {{BaseTokenStreamTestCase.checkAnalysisConsistency()}}.
* fixes the Jenkins test failure of {{HTMLStripCharFilterTest.testRandomHugeStrings()}} at [https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/]: {{ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"
}}

Committing shortly.
                  
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Resolved] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe resolved LUCENE-3913.
---------------------------------

       Resolution: Fixed
    Lucene Fields: New,Patch Available  (was: New)

Committed to branch_3x and trunk.

Thanks Mike and Robert!
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Assigned] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Rowe reassigned LUCENE-3913:
-----------------------------------

    Assignee: Steven Rowe
    
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237656#comment-13237656 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

bq. I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?

Yeah, I had that in at first, but then I thought it might be useful to use the randomized locale to trigger "interesting things".  That's what we want, right?
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237695#comment-13237695 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

{quote}
bq. I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales

Hmm, I didn't think of the reproducibility angle. I'll fix.
{quote}

Committed to trunk and branch_3x.

Thanks again Robert!
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3913:
---------------------------------------

    Attachment: LUCENE-3913.patch

Patch showing the failure....

It happens on input " Secretary)</br> [[M", in case anyone can see something obviously interesting :)

I have no idea where the bug is...
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237685#comment-13237685 ] 

Robert Muir commented on LUCENE-3913:
-------------------------------------

I don't think it will be interesting,
instead just make seeds less reproducible 
across java 6 and 7 or other jre impls
with different # of locales
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237652#comment-13237652 ] 

Michael McCandless commented on LUCENE-3913:
--------------------------------------------

Awesome, thanks Steve!
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces invalid final offset

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237650#comment-13237650 ] 

Steven Rowe commented on LUCENE-3913:
-------------------------------------

bq. ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null ...

Actually, I ran it without {{-Dtestmethod=null}}.
                
> HTMLStripCharFilter produces invalid final offset
> -------------------------------------------------
>
>                 Key: LUCENE-3913
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3913
>             Project: Lucene - Java
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Steven Rowe
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3913.patch, LUCENE-3913.patch
>
>
> Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org