You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/08/28 06:03:17 UTC

[GitHub] [lucene] elliotzlin opened a new pull request, #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

elliotzlin opened a new pull request, #11724:
URL: https://github.com/apache/lucene/pull/11724

   ### Description
   "<" and ">" characters are acceptable in attribute values per the HTML5 spec. See #11556 for more details.
   
   <!--
   If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on a diff in pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
dweiss commented on code in PR #11724:
URL: https://github.com/apache/lucene/pull/11724#discussion_r972866335


##########
lucene/analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java:
##########
@@ -632,6 +635,22 @@ public void testUTF16Surrogates() throws Exception {
     analyzer.close();
   }
 
+  /**
+   * Test that attributes with '>' or '<' characters are parsed correctly.

Review Comment:
   ```suggestion
      * Test that attributes with {@code '>'} or {@code '<'} characters are parsed correctly.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by "dweiss (via GitHub)" <gi...@apache.org>.
dweiss commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1742195597

   I will take a look tomorrow - was out of office last week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1232192192

   I don't think you did anything wrong. The changes look fine, it is just a real puzzler.
   
   I even cloned your repo and inspected the enormous diff to the generated code and I only see smaller DFAs, not any "logical changes". The test suite does pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1244935841

   @rmuir @dweiss I wanted to follow up on this. Is there anything I should work on here or is this good to go?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1232037743

   I hadn't even noticed that! I hope it's not some JFlex side-effect that I unwittingly triggered with my change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on a diff in pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
dweiss commented on code in PR #11724:
URL: https://github.com/apache/lucene/pull/11724#discussion_r969212841


##########
lucene/analysis/common/src/test/org/apache/lucene/analysis/charfilter/TestHTMLStripCharFilter.java:
##########
@@ -632,6 +635,16 @@ public void testUTF16Surrogates() throws Exception {
     analyzer.close();
   }
 
+  public void testForIssue10520() throws IOException {

Review Comment:
   Maybe add an explicit javadoc with a see tag pointing at the github issue here (full URL)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1323161181

   @dweiss @rmuir bumping this old thread again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by "elliotzlin (via GitHub)" <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1732391579

   @dweiss @rmuir apologies for reviving this ancient thread, but I was wondering if one of you can help run the PR checks again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] LUCENE-10520 / #11556 HTMLStripCharFilter bugfix [lucene]

Posted by "mjustice3 (via GitHub)" <gi...@apache.org>.
mjustice3 commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1977531543

   I believe I've found a regression with this bugfix. A test case that exposes:
   
   ```
     public void testForIssue10520Regression() throws IOException {
       String test =
           "<!DOCTYPE html><html lang=\"en\"><head><title>Test</title></head><a href=\"https://www.somewhere.com?data=\">a link</a> some text <a href=\"https://www.elsewhere.com\">another link</a></html>";
       Reader reader = new StringReader(test);
       HTMLStripCharFilter filter = new HTMLStripCharFilter(reader);
       StringWriter result = new StringWriter();
       filter.transferTo(result);
       assertEquals("Test\n\na link some text another link", result.toString().trim());
     }
   ```
   
   The problem is with the empty `data=` parameter at the end of the first url. We see a few of those in our document set and that is how I noticed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1230736718

   This change looks good to me! I just want to clone the branch and try to figure how you removed 25,000 lines of code from the generated jflex :)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
dweiss commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1244970221

   Sorry this slipped, @elliotzlin . I think it's fine. Please add a lucene/CHANGES.txt entry (it'll go towards 9.5.0). I'll handle the commit and backport.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss merged pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by "dweiss (via GitHub)" <gi...@apache.org>.
dweiss merged PR #11724:
URL: https://github.com/apache/lucene/pull/11724


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] dweiss commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
dweiss commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1231221646

   What?! How?! :) Seems like the automaton got a whole lot smaller:
   ```
      private static int [] zzUnpackAttribute() {
   -    int [] result = new int[14794];
   +    int [] result = new int[3082]; 
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1257315679

   @dweiss can you help run the workflows again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] elliotzlin commented on pull request #11724: LUCENE-10520 / #11556 HTMLStripCharFilter bugfix

Posted by GitBox <gi...@apache.org>.
elliotzlin commented on PR #11724:
URL: https://github.com/apache/lucene/pull/11724#issuecomment-1259830640

   @rmuir would you be able to help with running the workflows?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org