Posted to dev@nutch.apache.org by "stack@archive.org (JIRA)" <ji...@apache.org> on 2005/10/13 02:13:13 UTC

[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

OpenSearchServlet outputs illegal xml characters
------------------------------------------------

         Key: NUTCH-110
         URL: http://issues.apache.org/jira/browse/NUTCH-110
     Project: Nutch
        Type: Bug
  Components: searcher  
    Versions: 0.7    
 Environment: linux, jdk 1.5
    Reporter: stack@archive.org


OpenSearchServlet does not check text-to-output for illegal XML characters; depending on the search result, it's possible for OSS to output XML that is not well-formed.  For example, if the text contains the form feed (FF) character, i.e. the ASCII character at position (decimal) 12, the produced XML will show the FF character as '&#12;'. The character reference '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
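
For reference, the Char production at the URL above can be written as a small Java predicate. This is only a sketch: the names XmlCharCheck and isLegalXmlChar are made up for illustration and are not part of Nutch or of any patch attached to this issue.

```java
/** Sketch only: checks code points against the XML 1.0 Char production. */
final class XmlCharCheck {

    /**
     * Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
     */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlChar(0xC));  // form feed: false
        System.out.println(isLegalXmlChar('\n')); // line feed: true
    }
}
```

Form feed (decimal 12, 0xC) falls below 0x20 and is not one of the three permitted control characters, which is why neither the raw character nor '&#12;' is acceptable.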

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Attachment: NUTCH-110-version2.patch

Patch version 2.  This patch benefits from the discussion on the nutch dev list. It differs from the first in that it handles ALL illegal XML characters, entity-encoding the 5 'special characters' AND (silently) dropping characters outside the legal XML character range. The previous patch did only the latter, leaving entity escaping to the configured transformer/DOM serializer.

This patch also differs from patch version 1 in that it moves the method that processes the xml out into util.StringUtil, on the assumption that OpenSearchServlet is not the only code that needs to make text safe to include in xml.
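
As a rough illustration of the behaviour this patch aims for (entity-encode the five special characters, silently drop anything outside the legal range), something along these lines would do. This is a sketch written for this note, not the code in the patch and not the carrot2 method; XmlTextSketch, escapeAndDrop and isLegalXmlChar are hypothetical names.

```java
/** Sketch only: names are hypothetical, not taken from the patch. */
final class XmlTextSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes the five special characters and drops illegal ones. */
    static String escapeAndDrop(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (isLegalXmlChar(cp)) {
                        out.appendCodePoint(cp);
                    }
                    // Illegal code points (including unpaired
                    // surrogates) are silently dropped.
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAndDrop("a<b&c\fd"));
    }
}
```

Output from a method like this is only safe to write to the stream as-is; handing it to a serializer that escapes again would encode the entities a second time.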

The core method, StringUtil#toValidXmlText, was authored by Dawid Weiss and was taken from carrot2 XMLSerializerHelper.  Below is an excerpt from the mail on nutch dev where he grants permission to copy toValidXmlText.

Message-ID: <43...@cs.put.poznan.pl>
Date: Fri, 14 Oct 2005 08:42:48 +0200
From: Dawid Weiss <da...@cs.put.poznan.pl>
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
 xml characters

...

> So, will I amend the patch in NUTCH-110 so it uses 
> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

Copy the method's contents. It doesn't really make sense to copy the 
entire class just for this method. Good luck.

D. 

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Resolved: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
     
Sami Siren resolved NUTCH-110:
------------------------------

    Fix Version: 0.8-dev
     Resolution: Fixed

I just committed this with small changes (moved the test to a test case), thanks.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>     Assignee: Sami Siren
>      Fix For: 0.8-dev
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ] 

Jerome Charron commented on NUTCH-110:
--------------------------------------

This patch processes the String twice if it contains some illegal characters!

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by stack <st...@archive.org>.

Dawid Weiss wrote:

> ...
>
>> So, will I amend the patch in NUTCH-110 so it uses 
>> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
>
>
> Copy the method's contents. It doesn't really make sense to copy the 
> entire class just for this method. Good luck.  

Thanks Dawid.

I've just uploaded a new patch that puts toValidXmlText into StringUtil, 
adds a few basic unit tests for the just-added method, and has 
OpenSearchServlet call StringUtil#toValidXmlText on all text added to 
DOM nodes.

St.Ack

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> Yes but (I think -- I haven't confirmed) this basic escaping is being 
> done by the DOM streaming. It at least is converting characters like 0xC 
> to &#12;.

I'd have to look at the code and see how the XML is serialized... Most 
DOM streaming classes will encode entities somehow, so you shouldn't 
worry about it. But while we're at it, it doesn't make sense to build a 
DOM tree to output the XML -- it is much faster to simply serialize it 
directly to the output stream.

> So, will I amend the patch in NUTCH-110 so it uses 
> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

Copy the method's contents. It doesn't really make sense to copy the 
entire class just for this method. Good luck.

D.

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by stack <st...@archive.org>.

Dawid Weiss wrote:

> ...
>
>>
>> 1. XMLSerializerHelper#toValidXmlText throws an exception when an 
>> invalid character whereas NUTCH-110 just drops it.
>
>
> Not really, it is governed by a boolean flag. If the flag is set to 
> true, it'll throw an exception, otherwise it silently ignores bad 
> characters.
>
Ok.

>> 2. XMLSerializerHelper#toValidXmlText escapes all characters 
>> including the 5 xml 'special characters' whereas the NUTCH-110 patch 
>> only looks for the characters outside of the allowed XML character 
>> range.
>
>
> If you intend to put this string in any XML text block (such as the 
> content of an attribute, or between tags and not enclosed in a CDATA 
> block), you'll have to deal with special characters such as &lt; and 
> &gt;. If you don't, your XML will be simply incorrect.

Yes but (I think -- I haven't confirmed) this basic escaping is being 
done by the DOM streaming. It at least is converting characters like 0xC 
to &#12;.

>> 3. NUTCH-110 first scans to see if text has 'bad xml' before it goes 
>> about creating new 'safe' string instance.
>
>
> So does this routine, actually. The string buffer is only created if 
> there are changes to be made to the string.
>
Ok.

So, will I amend the patch in NUTCH-110 so it uses 
XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?

St.Ack

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.


 > The differences between this method and the patch supplied in NUTCH-110
 > are:

Take a closer look at the source code --

> 
> 1. XMLSerializerHelper#toValidXmlText throws an exception when an 
> invalid character whereas NUTCH-110 just drops it.

Not really, it is governed by a boolean flag. If the flag is set to 
true, it'll throw an exception, otherwise it silently ignores bad 
characters.

> 2. XMLSerializerHelper#toValidXmlText escapes all characters including 
> the 5 xml 'special characters' whereas the NUTCH-110 patch only looks 
> for the characters outside of the allowed XML character range.

If you intend to put this string in any XML text block (such as the 
content of an attribute, or between tags and not enclosed in a CDATA 
block), you'll have to deal with special characters such as &lt; and 
&gt;. If you don't, your XML will be simply incorrect.

> 3. NUTCH-110 first scans to see if text has 'bad xml' before it goes 
> about creating new 'safe' string instance.

So does this routine, actually. The string buffer is only created if 
there are changes to be made to the string.
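
To illustrate points 1 and 3, here is a minimal sketch of a filter that allocates a buffer only once it meets an illegal character, with a boolean flag choosing between throwing and silently dropping. It is not the carrot2 source; LazyXmlFilter, filter and failOnIllegal are names assumed for the example, and it deliberately leaves entity escaping out.

```java
/** Sketch only: hypothetical names, not the carrot2 implementation. */
final class LazyXmlFilter {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the input instance itself when nothing has to be removed.
     * With failOnIllegal set, the first illegal character raises an
     * exception; otherwise illegal characters are dropped.
     */
    static String filter(String text, boolean failOnIllegal) {
        StringBuilder out = null; // created on the first illegal character
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else {
                if (failOnIllegal) {
                    throw new IllegalArgumentException(
                        "Illegal XML character 0x" + Integer.toHexString(cp)
                        + " at index " + i);
                }
                if (out == null) {
                    // Copy the clean prefix seen so far, then skip this one.
                    out = new StringBuilder(text.length());
                    out.append(text, 0, i);
                }
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }
}
```

For text that is already clean, which should be the common case for search results, the method makes a single pass and allocates nothing.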

> XMLSerializerHelper#toValidXmlText does because we can't depend on the 
> underlying jdk parser instance doing the right thing?

It's not really about the parser, it's about the XML. If you emit blocks 
like

<searchresult>TEXT</searchresult>

then if TEXT happens to be "2 + 2 > 3" then it has to be either escaped 
or put in a CDATA section. Otherwise any parser will complain (because 
it should).
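
A quick way to see this with the JDK's own parser is sketched below. The class and method names (WellFormedCheck, isWellFormed) are assumptions made for the example; the sketch uses '<', which a parser rejects whenever it appears raw in element content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: asks the JDK parser whether a string is well-formed XML. */
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without logging them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser refused the document
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<searchresult>2 + 2 < 5</searchresult>";
        String escaped = "<searchresult>2 + 2 &lt; 5</searchresult>";
        String cdata = "<searchresult><![CDATA[2 + 2 < 5]]></searchresult>";
        System.out.println(isWellFormed(raw));
        System.out.println(isWellFormed(escaped));
        System.out.println(isWellFormed(cdata));
    }
}
```

Only the raw form should be refused; the escaped and CDATA forms carry the same text and parse cleanly.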

Also keep in mind that the URL at searchmorph.com --

http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html 


shows incorrect source code (entities are replaced to their characters 
which might be confusing), so:

108                    case '>': // '>'
109                        entity = ">";
110
111                        break;

Should actually read

108                    case '>': // '>'
109                        entity = "&gt;";
110
111                        break;

and similar with other entities.

D.

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by stack <st...@archive.org>.

Andrzej Bialecki wrote:

> ....
>
> Then we should take the best of both worlds - escape valid characters, 
> and replace invalid ones with '?' or space, or nothing. I know a place 
> where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
> ... ;-) )
>
Thanks for the pointer. See starting at line 92, 
XMLSerializerHelper#toValidXmlText: 
http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html

The differences between this method and the patch supplied in NUTCH-110 are:

1. XMLSerializerHelper#toValidXmlText throws an exception when it meets an 
invalid character whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters including 
the 5 xml 'special characters' whereas the NUTCH-110 patch only looks 
for the characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if text has 'bad xml' before it goes 
about creating new 'safe' string instance.

I think throwing an exception is inappropriate at search-results-drawing 
time. Dropping the character or replacing it with '?' or some such seems 
a better way to go.

Should I change the NUTCH-110 patch to do entity escaping too as 
XMLSerializerHelper#toValidXmlText does because we can't depend on the 
underlying jdk parser instance doing the right thing?

Yours,
St.Ack

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> Right, I didn't think about this... somehow I thought this was all about 
> special characters like ' " & <.

Oh, believe me: this knowledge came from sour experience not from book 
wisdom... I know for sure some XML parsers complain about invalid 
characters, while others don't.

> Then we should take the best of both worlds - escape valid characters, 
> and replace invalid ones with '?' or space, or nothing. I know a place 
> where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
> ... ;-) )

Feel free to take anything you need; I don't claim it's the best way to 
implement it, but it is certainly better than passing through incorrect 
character codes. Alternatively you could correct everything that is 
indexed so that it does not contain invalid characters (via a token filter?).

Dawid

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Andrzej Bialecki <ab...@getopt.org>.

Dawid Weiss wrote:
> 
>> We should not drop the offending characters, but escape them. Either 
>> the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
> 
> 
> This isn't entirely true, Andrzej -- escaping a character, or putting it 
> in a CDATA section is just about different ways of expressing the same 
> character code in an XML structure. The same and ILLEGAL character code 
> in terms of XML spec (there is a fragment specifying legal character 
> ranges there), so a conforming XML parser should throw an exception if 
> it encounters anything outside of the legal range. The only way of 
> transferring a full binary is to encode it to legal unicode characters 
> (using uuencode or such).

> I agree with the person who submitted this patch that it is a potential 
> issue and should be addressed somehow.

Right, I didn't think about this... somehow I thought this was all about 
special characters like ' " & <.

Then we should take the best of both worlds - escape valid characters, 
and replace invalid ones with '?' or space, or nothing. I know a place 
where we could find some inspiration (Carrot2 XMLSerializerHelper.java 
... ;-) )

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

> We should not drop the offending characters, but escape them. Either the 
> Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).

This isn't entirely true, Andrzej -- escaping a character and putting it 
in a CDATA section are just different ways of expressing the same 
character code in an XML structure. It is still the same ILLEGAL character 
code in terms of the XML spec (there is a fragment specifying the legal 
character ranges there), so a conforming XML parser should throw an 
exception if it encounters anything outside of the legal range. The only 
way of transferring full binary data is to encode it as legal Unicode 
characters (using uuencode or such).
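
This is easy to check against the JDK's built-in parser. The sketch below is an illustration only, with hypothetical names (IllegalCharDemo, isWellFormed); it feeds the parser a form feed once as a character reference and once raw inside a CDATA section.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch only: shows that neither &#12; nor CDATA legalises a form feed. */
final class IllegalCharDemo {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without logging them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser refused the document
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Form feed as a numeric character reference.
        System.out.println(isWellFormed("<r>&#12;</r>"));
        // Form feed as a raw character inside CDATA.
        System.out.println(isWellFormed("<r><![CDATA[a\fb]]></r>"));
        // Tab is one of the three permitted control characters.
        System.out.println(isWellFormed("<r>&#9;</r>"));
    }
}
```

A conforming parser should refuse the first two documents and accept the third, so the offending character has to be removed or replaced before output rather than re-expressed.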

I agree with the person who submitted this patch that it is a potential 
issue and should be addressed somehow.

D.

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Mattmann wrote:
> Hi,
> 
>  I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
> any text where illegal chars are possible in a <![CDATA[ ... ]]> section? That
> way, the offending characters won't be dropped and the process won't be
> lossy, no?
> 
>   If the CDATA method won't work, and there's no other way to solve the
> problem without losing text, then your patch has my +1.

We should not drop the offending characters, but escape them. Either the 
Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).

So, this is -1 for the patch.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi,

 I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a <![CDATA[ ... ]]> section? That
way, the offending characters won't be dropped and the process won't be
lossy, no?

  If the CDATA method won't work, and there's no other way to solve the
problem without losing text, then your patch has my +1.

Cheers,
 Chris


______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov 
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

> -----Original Message-----
> From: stack@archive.org (JIRA) [mailto:jira@apache.org]
> Sent: Wednesday, October 12, 2005 5:19 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
> characters
> 
>      [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
> 
> stack@archive.org updated NUTCH-110:
> ------------------------------------
> 
>     Attachment: fixIllegalXmlChars.patch
> 
> Attached patch runs all xml text through a check for bad xml characters.
> This patch is brutal dropping silently illegal characters.  Patch was made
> after hunting xalan, jdk, and nutch itself for a method that would do the
> above filtering but was unable to find any such method -- perhaps an
> oversight on my part?
> 
> > OpenSearchServlet outputs illegal xml characters
> > ------------------------------------------------
> >
> >          Key: NUTCH-110
> >          URL: http://issues.apache.org/jira/browse/NUTCH-110
> >      Project: Nutch
> >         Type: Bug
> >   Components: searcher
> >     Versions: 0.7
> >  Environment: linux, jdk 1.5
> >     Reporter: stack@archive.org
> >  Attachments: fixIllegalXmlChars.patch
> >
> > OpenSearchServlet does not check text-to-output for illegal xml
> characters; dependent on  search result, its possible for OSS to output
> xml that is not well-formed.  For example, if text has the character FF
> character in it -- -- i.e. the ascii character at position (decimal) 12 --
> the produced XML will show the FF character as '&#12;' The
> character/entity '&#12;' is not legal in XML according to
> http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
> 

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Attachment: fixIllegalXmlChars.patch

Attached patch runs all xml text through a check for bad xml characters.  This patch is brutal: it silently drops illegal characters.  I made the patch after hunting through xalan, the jdk, and nutch itself for a method that would do the above filtering, but I was unable to find one -- perhaps an oversight on my part?

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: fixIllegalXmlChars.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Assigned: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

Sami Siren reassigned NUTCH-110:
--------------------------------

    Assign To: Sami Siren

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>     Assignee: Sami Siren
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

Stefan Neufeind updated NUTCH-110:
----------------------------------

    Attachment: fixIllegalXmlChars08.patch

Since original patch didn't cleanly apply for me on 0.8-dev (nightly-2006-05-20) I re-did it for 0.8 ...

With this patch the XML is fine. Without it, I had big trouble parsing the RSS feed in another application.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416932 ] 

Sami Siren commented on NUTCH-110:
----------------------------------

in method  addAttribute(...)

line:
attribute.setValue(getLegalXml(getLegalXml(value)));

intentional?


> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>     Assignee: Sami Siren
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "John VanDyk (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

John VanDyk updated NUTCH-110:
------------------------------

    Attachment: fixIllegalXmlChars08-v2.patch

Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did it.

This patch fixes the illegal XML characters and prevents opensearch clients from choking on that bad XML previously emitted.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug

>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ] 

stack@archive.org commented on NUTCH-110:
-----------------------------------------

Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities (first in the new toValidXmlText method, then again in the javax.xml.transform.Transformer used by OpenSearchServlet).
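
The double encoding is easy to reproduce outside Nutch. In the sketch below, DoubleEncodingDemo and serialize are hypothetical names and the element name is arbitrary; only the javax.xml and org.w3c.dom calls are real JDK API. It puts text into a DOM text node and serializes the document with a default Transformer, roughly as a servlet building its response through a DOM would.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Sketch only: shows the serializer escaping text a second time. */
final class DoubleEncodingDemo {

    /** Wraps text in a single element and serializes the document. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element el = doc.createElement("title");
            el.appendChild(doc.createTextNode(text));
            doc.appendChild(el);

            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Raw text: the serializer escapes the ampersand once.
        System.out.println(serialize("Tom & Jerry"));
        // Pre-escaped text: the serializer escapes it again.
        System.out.println(serialize("Tom &amp; Jerry"));
    }
}
```

The second call yields '&amp;amp;' in the output, which is the symptom described above. When the DOM and Transformer already handle the special characters, the text handed to them only needs the out-of-range characters removed.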

Use the original patch, fixIllegalXmlChars.patch, to address the problem described in this issue.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.

[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Attachment: fixIllegalXmlChars08-v4.patch

v3 mistakenly included debugging code.

Attached cleaned up v4.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.


[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Attachment: fixIllegalXmlChars08-v3.patch

Version of the patch that doesn't "...process the String twice if it contains some illegal characters!".  Its name is fixIllegalXmlChars08-v3.patch (be careful, it's not the last patch in the list).  It was made against 414852.

At least 3 different people have run into this awkward issue, going by the comments in this issue.  I petition that this is sufficient to earn a commit.

Thanks.
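
For readers without the attachment to hand, a single-pass filter along the lines described above might look like the following. This is a sketch under stated assumptions, not the contents of fixIllegalXmlChars08-v3.patch: the name getLegalXml appears in this issue, but the signature, the isLegalXml helper and the code-point handling here are illustrative. The legal ranges are those of the Char production in the XML 1.0 recommendation linked from the issue description.

```java
class LegalXmlSketch {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isLegalXml(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Walks the text once. A buffer is allocated only when the first
    // illegal character is found, so clean input is returned as-is.
    static String getLegalXml(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!isLegalXml(cp)) {
                if (buffer == null) {
                    // First illegal character: copy everything seen so far.
                    buffer = new StringBuilder(text.length());
                    buffer.append(text, 0, i);
                }
                // The illegal character itself is silently dropped.
            } else if (buffer != null) {
                buffer.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return buffer == null ? text : buffer.toString();
    }

    public static void main(String[] args) {
        // The form feed between 'a' and 'b' is removed.
        System.out.println(getLegalXml("a\fb"));
    }
}
```

Because the buffer is created lazily, a string with no illegal characters costs one scan and no copying, and a string with illegal characters is still scanned only once.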

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.7
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.


[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Version: 0.8-dev
                 (was: 0.7)

Was version 0.7.  Changed 'Affects Version' to 0.8-dev.

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.


[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]

stack@archive.org updated NUTCH-110:
------------------------------------

    Attachment: fixIllegalXmlChars08-v5.patch

No, the double call to getLegalXml is not intentional.  It's a mistake.  Thanks for finding it.

I've attached yet another version (Any prizes for most revisions to a patch?).

> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
>          Key: NUTCH-110
>          URL: http://issues.apache.org/jira/browse/NUTCH-110
>      Project: Nutch
>         Type: Bug
>   Components: searcher
>     Versions: 0.8-dev
>  Environment: linux, jdk 1.5
>     Reporter: stack@archive.org
>     Assignee: Sami Siren
>  Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, fixIllegalXmlChars08.patch
>
> OpenSearchServlet does not check text-to-output for illegal xml characters; dependent on  search result, its possible for OSS to output xml that is not well-formed.  For example, if text has the character FF character in it -- -- i.e. the ascii character at position (decimal) 12 --  the produced XML will show the FF character as '&#12;' The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
