Posted to issues@lucene.apache.org by "Lucene/Solr QA (Jira)" <ji...@apache.org> on 2019/12/13 09:29:00 UTC

[jira] [Commented] (LUCENE-9091) UnifiedHighlighter HTML escaping should only escape essentials

    [ https://issues.apache.org/jira/browse/LUCENE-9091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995503#comment-16995503 ] 

Lucene/Solr QA commented on LUCENE-9091:
----------------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green}  0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green}  0m 40s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  2m 12s{color} | {color:green} highlighter in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 89m 14s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}102m 14s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9091 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12988720/LUCENE-9091.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 3ba0054 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/244/testReport/ |
| modules | C: lucene lucene/highlighter solr/core U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/244/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> UnifiedHighlighter HTML escaping should only escape essentials
> --------------------------------------------------------------
>
>                 Key: LUCENE-9091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9091
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Nándor Mátravölgyi
>            Priority: Minor
>         Attachments: LUCENE-9091.patch
>
>
> The unified highlighter does not use the *org.apache.lucene.search.highlight.SimpleHTMLEncoder* through *org.apache.solr.highlight.HtmlEncoder*. It has the HTML escaping feature re-implemented and embedded in the *org.apache.lucene.search.uhighlight.DefaultPassageFormatter*.
> The HTML escaping done by the unified highlighter escapes characters that do not need it. This makes the result payload more than 50% larger with no benefit.
> Here is a highlight snippet using the original highlighter:
> {noformat}
> A <em>filter</em> that stems words using a Snowball-generated stemmer. Available stemmers &amp; x are listed in org.tartarus.snowball.ext. Note: This <em>filter</em> is aware of the KeywordAttribute.
> {noformat}
> Here is the same highlight snippet using the unified highlighter:
> {noformat}
> A&#32;<em>filter</em>&#32;that&#32;stems&#32;words&#32;using&#32;a&#32;Snowball&#45;generated&#32;stemmer&#46;&#32;Available&#32;stemmers&#32;&amp;&#32;x&#32;are&#32;listed&#32;in&#32;org&#46;tartarus&#46;snowball&#46;ext&#46;&#32;Note&#58;&#32;This&#32;<em>filter</em>&#32;is&#32;aware&#32;of&#32;the&#32;KeywordAttribute&#46;
> {noformat}
> Maybe I'm missing the reason why it is done this way. If this behaviour is desired for some use case, it should be a separate encoder, and the HTML encoder should only escape the necessary characters.
> Affects all versions of Lucene-Solr since the addition of the UnifiedHighlighter. Here are the lines where the escaping is implemented differently:
>  * [Escaping by the unified highlighter|https://github.com/apache/lucene-solr/blob/2387bb9d60ae44eeeb4fbcb2f2877f46be5303a0/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/DefaultPassageFormatter.java#L132]
>  * [Escaping by the other highlighters|https://github.com/apache/lucene-solr/blob/2387bb9d60ae44eeeb4fbcb2f2877f46be5303a0/lucene/highlighter/src/java/org/apache/lucene/search/highlight/SimpleHTMLEncoder.java#L69]
>
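
The difference between the two snippets quoted above can be reproduced with a small, self-contained sketch. This is not the Lucene code: the class EscapeSketch and both method names are hypothetical, and the rules are only inferred from the two sample outputs in the description (the first escapes just the characters that are significant in HTML, the second also turns spaces and punctuation into numeric entities).

```java
// Hypothetical illustration only; not taken from SimpleHTMLEncoder or DefaultPassageFormatter.
final class EscapeSketch {

    // Escapes only the characters that carry meaning in HTML text and attribute values.
    static String escapeEssentials(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            switch (ch) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default: out.append(ch);
            }
        }
        return out.toString();
    }

    // Assumed "aggressive" rule: every non-alphanumeric character below 0xFF
    // becomes an entity, which is what inflates the unified highlighter sample.
    static String escapeNonAlphanumeric(String text) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            boolean alnum = (ch >= '0' && ch <= '9')
                    || (ch >= 'A' && ch <= 'Z')
                    || (ch >= 'a' && ch <= 'z');
            if (ch == '&') {
                out.append("&amp;");
            } else if (ch == '<') {
                out.append("&lt;");
            } else if (ch == '>') {
                out.append("&gt;");
            } else if (ch == '"') {
                out.append("&quot;");
            } else if (alnum || ch >= 0xFF) {
                out.append(ch);
            } else {
                // Numeric character reference, e.g. a space becomes &#32;
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "Snowball-generated stemmer. Available stemmers & x";
        String minimal = escapeEssentials(sample);
        String aggressive = escapeNonAlphanumeric(sample);
        System.out.println(minimal);
        System.out.println(aggressive);
        // Print the size difference for this one sample string.
        System.out.println(minimal.length() + " vs " + aggressive.length() + " chars");
    }
}
```

Running main prints both encodings of the same fragment along with their lengths, which gives a rough feel for the payload growth the reporter describes. The exact ratio depends on how much whitespace and punctuation the highlighted text contains.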



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org