Posted to dev@lucene.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2012/03/24 15:30:24 UTC
[jira] [Created] (LUCENE-3913) HTMLStripCharFilter produces invalid
final offset
HTMLStripCharFilter produces invalid final offset
-------------------------------------------------
Key: LUCENE-3913
URL: https://issues.apache.org/jira/browse/LUCENE-3913
Project: Lucene - Java
Issue Type: Bug
Reporter: Michael McCandless
Fix For: 3.6, 4.0
Nightly build found this... I boiled it down to a small test case that doesn't require the big line file docs.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237567#comment-13237567 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
Minimal test failure trigger appears to be "x</br>", where "x" can be any non-whitespace character and the tag must be exactly "</br>". (No problems with "x<br>", "x</a>", etc.)
<br> and </br> are handled specially, so this should narrow it down.
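To make the offset arithmetic concrete, here is a minimal sketch in plain Java. It is not HTMLStripCharFilter or its JFlex grammar: `ToyBrStripper` and `correctOffset` are hypothetical names, and the toy assumes only that `<br>` and `</br>` are each replaced by a single newline. It records how many input characters each replacement swallowed, so that an offset in the stripped output can be mapped back to the original input.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy stand-in for a char filter (hypothetical; not Lucene's implementation). */
final class ToyBrStripper {
  final StringBuilder output = new StringBuilder();
  // Parallel lists: from output offset outOffsets.get(i) onward,
  // add cumulativeDiffs.get(i) to obtain the matching input offset.
  private final List<Integer> outOffsets = new ArrayList<>();
  private final List<Integer> cumulativeDiffs = new ArrayList<>();

  ToyBrStripper(String input) {
    int in = 0;
    int diff = 0;
    while (in < input.length()) {
      boolean close = input.startsWith("</br>", in);
      if (close || input.startsWith("<br>", in)) {
        int tagLength = close ? 5 : 4;
        output.append('\n');      // the whole tag becomes one output char
        in += tagLength;
        diff += tagLength - 1;    // input chars consumed minus output chars emitted
        outOffsets.add(output.length());
        cumulativeDiffs.add(diff);
      } else {
        output.append(input.charAt(in++));
      }
    }
  }

  /** Maps an offset in the stripped output back to an offset in the input. */
  int correctOffset(int outputOffset) {
    int diff = 0;
    for (int i = 0; i < outOffsets.size(); i++) {
      if (outOffsets.get(i) <= outputOffset) {
        diff = cumulativeDiffs.get(i);
      }
    }
    return outputOffset + diff;
  }

  public static void main(String[] args) {
    ToyBrStripper s = new ToyBrStripper("x</br>");
    // Output is "x\n" (2 chars); its end must map back to the input length.
    System.out.println(s.correctOffset(s.output.length())); // 6
  }
}
```

In this toy, the final offset of the stripped text always maps back to the length of the original input, which is the invariant the failing assertion checks.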
[jira] [Updated] (LUCENE-3913) HTMLStripCharFilter produces invalid
final offset
Posted by "Steven Rowe (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe updated LUCENE-3913:
--------------------------------
Attachment: LUCENE-3913.patch
Patch, a superset of Mike's:
* fixes the identified problem: {{</br>}} offset was improperly calculated. (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
* adds a new case emitting {{<\s*(/\s*)?(br|script|style)>?}} to {{_TestUtil.randomHtmlishString()}}, because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
* adds a new method {{_TestUtil.randomlyRecaseCodePoints()}}, used by the above-mentioned new {{randomHtmlishString()}} case, to produce things like {{<Br>}}, {{</sCriPT>}}, etc.
* switches {{HTMLStripCharFilterTest.testRandomBrokenHTML()}} to use Mike's new {{BaseTokenStreamTestCase.checkAnalysisConsistency()}}.
* fixes the Jenkins test failure of {{HTMLStripCharFilterTest.testRandomHugeStrings()}} at [https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/]: {{ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"}}
Committing shortly.
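The random recasing idea can be sketched roughly as follows. This is not the code from the patch: `RecaseSketch.randomlyRecase` is a hypothetical helper, and its signature, the three-way choice per code point, and the use of a fixed locale (so that the result depends only on the seed) are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Random;

/** Hypothetical sketch of per-code-point random recasing (not the patch's code). */
final class RecaseSketch {
  static String randomlyRecase(Random random, String s) {
    StringBuilder sb = new StringBuilder();
    int pos = 0;
    while (pos < s.length()) {
      // Step by code point so surrogate pairs are never split.
      int codePoint = s.codePointAt(pos);
      pos += Character.charCount(codePoint);
      String one = new String(Character.toChars(codePoint));
      switch (random.nextInt(3)) {
        case 0:
          sb.append(one.toUpperCase(Locale.ENGLISH));
          break;
        case 1:
          sb.append(one.toLowerCase(Locale.ENGLISH));
          break;
        default:
          sb.append(one); // leave this code point as it was
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Same seed, same input: the two lines printed are identical.
    System.out.println(randomlyRecase(new Random(42L), "</script>"));
    System.out.println(randomlyRecase(new Random(42L), "</script>"));
  }
}
```

Because the only source of variation is the `Random` passed in, rerunning with the same seed yields the same recased tag.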
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237550#comment-13237550 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
I can reproduce - I'm digging.
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237696#comment-13237696 ]
Robert Muir commented on LUCENE-3913:
-------------------------------------
thanks! let's just hope mike doesn't notice :)
-reproducibility policeman
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237571#comment-13237571 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
bq. Since we know those are special, a good idea for the future could be to add both those elements to randomHtmlishString
+1
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237568#comment-13237568 ]
Robert Muir commented on LUCENE-3913:
-------------------------------------
Since we know those are special, a good idea for the future could be to add both
those elements to randomHtmlishString
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237545#comment-13237545 ]
Michael McCandless commented on LUCENE-3913:
--------------------------------------------
I forgot to say: patch is against 3.x.
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237689#comment-13237689 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
bq. I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales
Hmm, I didn't think of the reproducibility angle. I'll fix.
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237654#comment-13237654 ]
Robert Muir commented on LUCENE-3913:
-------------------------------------
Thank you!
I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?
Otherwise you will find this gives interesting things for <script> in the Turkish Locale :)
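The Turkish case mappings behind this remark can be checked with the JDK alone. The sketch below is only an illustration (the class `TurkishCaseDemo` and its helpers are hypothetical names): in the Turkish locale, lowercase "i" uppercases to dotted capital I (U+0130) and uppercase "I" lowercases to dotless i (U+0131), so a recased "script" no longer matches the ASCII tag name.

```java
import java.util.Locale;

/** Illustration of locale-sensitive case mapping (hypothetical demo class). */
final class TurkishCaseDemo {
  static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

  static String upper(String s, Locale locale) {
    return s.toUpperCase(locale);
  }

  static String lower(String s, Locale locale) {
    return s.toLowerCase(locale);
  }

  public static void main(String[] args) {
    // With a fixed English locale the ASCII tag name round-trips.
    System.out.println("SCRIPT".equals(upper("script", Locale.ENGLISH))); // true
    // In Turkish, 'i' becomes U+0130, so the result is not "SCRIPT".
    System.out.println("SCRIPT".equals(upper("script", TURKISH)));        // false
    // Likewise 'I' becomes U+0131, so the result is not "script".
    System.out.println("script".equals(lower("SCRIPT", TURKISH)));        // false
  }
}
```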
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237548#comment-13237548 ]
Michael McCandless commented on LUCENE-3913:
--------------------------------------------
Good idea! I'll fix that test case.
Here's the failure output:
{noformat}
[junit] ------------- Standard Error -----------------
[junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=testOddHTMLString -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
[junit] NOTE: reproduce with: ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=-fe5cdb1aeca4e37:583f6a844412e138:70dc861e8567bea3 -Dargs="-Dfile.encoding=UTF-8"
[junit] NOTE: test params are: locale=zh_SG, timezone=Europe/Minsk
[junit] NOTE: all tests run in this JVM:
[junit] [HTMLStripCharFilterTest]
[junit] NOTE: Linux 2.6.33.6-147.fc13.x86_64 amd64/Sun Microsystems Inc. 1.6.0_21 (64-bit)/cpus=24,threads=1,free=163214064,total=189988864
[junit] ------------- ---------------- ---------------
[junit] Testcase: testOddHTMLString(org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest): FAILED
[junit] finalOffset expected:<20> but was:<19>
[junit] junit.framework.AssertionFailedError: finalOffset expected:<20> but was:<19>
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner$3.addError(JUnitTestRunner.java:975)
[junit] at junit.framework.TestResult.addError(TestResult.java:38)
[junit] at junit.framework.JUnit4TestAdapterCache$1.testFailure(JUnit4TestAdapterCache.java:51)
[junit] at org.junit.runner.notification.RunNotifier$4.notifyListener(RunNotifier.java:100)
[junit] at org.junit.runner.notification.RunNotifier$SafeNotifier.run(RunNotifier.java:41)
[junit] at org.junit.runner.notification.RunNotifier.fireTestFailure(RunNotifier.java:97)
[junit] at org.junit.internal.runners.model.EachTestNotifier.addFailure(EachTestNotifier.java:26)
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:267)
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:146)
[junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:50)
[junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
[junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
[junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
[junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
[junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
[junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
[junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
[junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
[junit] at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:36)
[junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
[junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
[junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
[junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
[junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
[junit] Caused by: java.lang.AssertionError: finalOffset expected:<20> but was:<19>
[junit] at org.junit.Assert.fail(Assert.java:93)
[junit] at org.junit.Assert.failNotEquals(Assert.java:647)
[junit] at org.junit.Assert.assertEquals(Assert.java:128)
[junit] at org.junit.Assert.assertEquals(Assert.java:472)
[junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:182)
[junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:574)
[junit] at org.apache.lucene.analysis.charfilter.HTMLStripCharFilterTest.testOddHTMLString(HTMLStripCharFilterTest.java:550)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
[junit] at java.lang.reflect.Method.invoke(Method.java:597)
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
[junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
[junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
[junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
[junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:636)
[junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:542)
[junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:67)
[junit] at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:458)
[junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:74)
[junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:516)
[junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
[junit] ... 19 more
[junit]
{noformat}
Note that instead of -Dtestmethod=null you should pass -Dtestmethod=testOddHTMLString, as in the first "reproduce with" line above.
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237546#comment-13237546 ]
Robert Muir commented on LUCENE-3913:
-------------------------------------
I like the refactored method: though I think we should also call it from
HTMLStripCharFilterTest.testRandomBrokenHTML?
Currently this only does:
{code}
while (reader.read() != -1);
{code}
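The difference between draining a reader and checking what it produced can be sketched with a plain `java.io.StringReader` standing in for the char filter. Both helper names here (`drain`, `readAll`) are hypothetical, and this is not the body of `checkAnalysisConsistency()`: a drain-only loop fails only if `read()` throws, whereas collecting the output gives the test something to assert on.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical helpers contrasting a drain-only loop with a checked read. */
final class DrainVersusCheck {
  /** Drain-only: succeeds as long as read() never throws. */
  static void drain(Reader reader) throws IOException {
    while (reader.read() != -1) {
      // discard every character
    }
  }

  /** Collects everything the reader yields so a test can inspect it. */
  static String readAll(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    int ch;
    while ((ch = reader.read()) != -1) {
      sb.append((char) ch);
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    drain(new StringReader("x\n"));                // verifies nothing about content
    String text = readAll(new StringReader("x\n"));
    System.out.println(text.length());             // 2
  }
}
```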
[jira] [Issue Comment Edited] (LUCENE-3913) HTMLStripCharFilter
produces invalid final offset
Posted by "Steven Rowe (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237648#comment-13237648 ]
Steven Rowe edited comment on LUCENE-3913 at 3/24/12 8:03 PM:
--------------------------------------------------------------
Patch, a superset of Mike's:
* fixes the identified problem: {{</br>}} offset was improperly calculated. (Added comments describing the offset calculations everywhere they're performed in the .jflex source.)
* adds a new case emitting {{<\s*(/\s*)?(br|script|style)>?}} to {{_TestUtil.randomHtmlishString()}}, because <br>, <script>, and <style> are handled specially in HTMLStripCharFilter.
* adds a new method {{_TestUtil.randomlyRecaseCodePoints()}}, used by the above-mentioned new {{randomHtmlishString()}} case, to produce things like {{<Br>}}, {{</sCriPT>}}, etc.
* switches {{HTMLStripCharFilterTest.testRandomBrokenHTML()}} to use Mike's new {{BaseTokenStreamTestCase.checkAnalysisConsistency()}}.
* fixes the Jenkins test failure of {{HTMLStripCharFilterTest.testRandomHugeStrings()}} at [https://builds.apache.org/job/Lucene-Solr-tests-only-3.x/12863/]:
{noformat}
ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null -Dtests.seed=48bbf57c15b7aa2d:5bb640584c81078d:-7e916259eafd7e54 -Dtests.multiplier=5 -Dargs="-Dfile.encoding=ISO8859-1"
{noformat}
Committing shortly.
[jira] [Resolved] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe resolved LUCENE-3913.
---------------------------------
Resolution: Fixed
Lucene Fields: New,Patch Available (was: New)
Committed to branch_3x and trunk.
Thanks Mike and Robert!
[jira] [Assigned] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Assigned) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Rowe reassigned LUCENE-3913:
-----------------------------------
Assignee: Steven Rowe
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237656#comment-13237656 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
bq. I think the toUpperCase/toLowerCase in recaseCodePoints should take Locale.ENGLISH?
Yeah, I had that in at first, but then I thought it might be useful to use the randomized locale to trigger "interesting things". That's what we want, right?
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237695#comment-13237695 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
{quote}
bq. I don't think it will be interesting, instead just make seeds less reproducible across java 6 and 7 or other jre impls with different # of locales
Hmm, I didn't think of the reproducibility angle. I'll fix.
{quote}
Committed to trunk and branch_3x.
Thanks again Robert!
[jira] [Updated] (LUCENE-3913) HTMLStripCharFilter produces invalid
final offset
Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-3913:
---------------------------------------
Attachment: LUCENE-3913.patch
Patch showing the failure....
It happens on input " Secretary)</br> [[M", in case anyone can see something obviously interesting :)
I have no idea where the bug is...
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237685#comment-13237685 ]
Robert Muir commented on LUCENE-3913:
-------------------------------------
I don't think it will be interesting; instead it will just make seeds less reproducible across Java 6 and 7, or other JRE implementations with different numbers of locales.
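
The reproducibility concern can be sketched as follows. This is an illustration under stated assumptions, not the Lucene test framework's actual code: SeededPickDemo and pick are hypothetical names, and the two lists stand in for the locale sets that two different JREs might report. A seeded pick is an index into a list, so the same seed is only guaranteed to select the same element when the list itself is the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical demo class; the lists below are made-up stand-ins for
// the available locales of two different JRE implementations.
class SeededPickDemo {
    // Picks one element by drawing an index from a Random created with the seed.
    public static String pick(long seed, List<String> choices) {
        return choices.get(new Random(seed).nextInt(choices.size()));
    }

    public static void main(String[] args) {
        List<String> jreA = Arrays.asList("en", "de", "tr");
        List<String> jreB = Arrays.asList("en", "de", "fr", "ja", "tr");
        long seed = 42L;

        // Same seed and same list: the pick repeats on every run.
        System.out.println(pick(seed, jreA).equals(pick(seed, jreA)));

        // Same seed but a different list: nothing guarantees the same locale,
        // so a failure seen on one JRE may not replay on another.
        System.out.println(pick(seed, jreA) + " vs " + pick(seed, jreB));
    }
}
```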
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237652#comment-13237652 ]
Michael McCandless commented on LUCENE-3913:
--------------------------------------------
Awesome, thanks Steve!
[jira] [Commented] (LUCENE-3913) HTMLStripCharFilter produces
invalid final offset
Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237650#comment-13237650 ]
Steven Rowe commented on LUCENE-3913:
-------------------------------------
bq. ant test -Dtestcase=HTMLStripCharFilterTest -Dtestmethod=null ...
Actually, I ran it without {{-Dtestmethod=null}}.