Posted to dev@lucene.apache.org by bu...@apache.org on 2004/08/12 17:39:17 UTC
DO NOT REPLY [Bug 30621] New: - HTML Parser doesn't decode character references in attributes

http://issues.apache.org/bugzilla/show_bug.cgi?id=30621

           Summary: HTML Parser doesn't decode character references in
                    attributes
           Product: Lucene
           Version: 1.4
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Examples
        AssignedTo: lucene-dev@jakarta.apache.org
        ReportedBy: Dave.Sparks@teamware.co.uk


The HTML Parser includes the values of certain attributes in the summary, the
metaTags and the output stream.  Character references in these attribute values
are not decoded, so a value such as "Tom &amp; Jerry" is passed through
verbatim.  Specifically:

1. The value of the alt= attribute of an <img ...> tag is included in the
summary and the output stream.  This value is case-sensitive, and may include
character references.  The character references are not decoded.

2. The value of the content= attribute of a <meta ...> tag is included in the
metaTags if the tag also has a name= or http-equiv= attribute.  This value is
case-sensitive, and may include character references.  The character
references are not decoded, and the value is converted to lower case (since the
fix to bug #27423), which loses the original case.

I've patched our version of the parser to decode the character references.  The
patch adds a decodeAll method to Entities, which scans a String for character
references and returns a String in which each reference has been replaced by the
corresponding character (or the original String, if no change is needed).  This
method is called to decode alt= and content= attribute values.  I've also
removed the .toLowerCase() call on the content= value.  I'm not really happy
with this fix: it seems wrong to re-parse a value that was already parsed as a
single token, and there ought to be a way to get it right the first time.
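
For illustration, a decodeAll-style method might look roughly like the sketch
below.  This is not the actual patch: the class name EntityDecoder, the helper
decodeOne and the five-entry name table are all hypothetical, and a real
implementation would need the full HTML entity set.  It handles named, decimal
and hexadecimal references, leaves anything it doesn't recognise untouched, and
returns the original String when nothing was replaced.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityDecoder {
    // Tiny illustrative table (an assumption); the real entity set is much larger.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("eacute", "\u00e9");
    }

    /** Returns s with recognised references replaced, or s itself if nothing changed. */
    public static String decodeAll(String s) {
        int amp = s.indexOf('&');
        if (amp < 0) {
            return s; // fast path: no '&', so no references to decode
        }
        StringBuilder out = new StringBuilder(s.length());
        int pos = 0;
        boolean changed = false;
        while (amp >= 0) {
            out.append(s, pos, amp); // copy the text before the '&'
            int semi = s.indexOf(';', amp + 1);
            String decoded = (semi > amp + 1) ? decodeOne(s.substring(amp + 1, semi)) : null;
            if (decoded != null) {
                out.append(decoded);
                pos = semi + 1;
                changed = true;
            } else {
                out.append('&'); // not a reference we recognise: keep it literally
                pos = amp + 1;
            }
            amp = s.indexOf('&', pos);
        }
        out.append(s, pos, s.length());
        return changed ? out.toString() : s;
    }

    // Decodes the text between '&' and ';', or returns null if it is not recognised.
    private static String decodeOne(String body) {
        if (body.startsWith("#")) {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int cp = hex ? Integer.parseInt(body.substring(2), 16)
                             : Integer.parseInt(body.substring(1), 10);
                return Character.isValidCodePoint(cp)
                        ? new String(Character.toChars(cp)) : null;
            } catch (NumberFormatException e) {
                return null; // e.g. "&#;" or "&#xZZ;"
            }
        }
        return NAMED.get(body);
    }
}
```

With this sketch, EntityDecoder.decodeAll("Caf&eacute; &amp; Bar") yields
"Café & Bar", while a String containing no references is returned as the same
object.  Decoding is single-pass, so "&amp;lt;" becomes "&lt;" rather than "<".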

I've left the name= and http-equiv= values alone.  It's not entirely clear (to
me) whether character references are allowed, and it would be perverse to use
them here.  I also appreciate the convenience of having a single combined
namespace, with downcased names, even though this is technically wrong.
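
The arrangement described above (one combined, lower-cased namespace for name=
and http-equiv=, with the content= value stored unchanged) could be sketched as
follows.  The class MetaTagTable and its methods are hypothetical names used
only for this example; they are not part of the parser.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MetaTagTable {
    // Keys are lower-cased names; values keep their original case.
    private final Map<String, String> tags = new LinkedHashMap<>();

    /** Records one meta tag; name and httpEquiv may each be null. */
    public void add(String name, String httpEquiv, String content) {
        if (content == null) {
            return; // nothing to record without a content= value
        }
        if (name != null) {
            tags.put(name.toLowerCase(Locale.ROOT), content);
        }
        if (httpEquiv != null) {
            tags.put(httpEquiv.toLowerCase(Locale.ROOT), content);
        }
    }

    /** Looks a tag up by name, ignoring the case of the name only. */
    public String get(String key) {
        return tags.get(key.toLowerCase(Locale.ROOT));
    }
}
```

Only the key is normalised here, so a lookup for "KEYWORDS" or "keywords" finds
the same entry and the stored value comes back exactly as it was supplied.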
