Posted to dev@nutch.apache.org by "stack@archive.org (JIRA)" <ji...@apache.org> on 2005/10/13 02:13:13 UTC
[jira] Created: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
OpenSearchServlet outputs illegal xml characters
------------------------------------------------
Key: NUTCH-110
URL: http://issues.apache.org/jira/browse/NUTCH-110
Project: Nutch
Type: Bug
Components: searcher
Versions: 0.7
Environment: linux, jdk 1.5
Reporter: stack@archive.org
OpenSearchServlet does not check text-to-output for illegal xml characters; depending on the search result, it's possible for OSS to output xml that is not well-formed. For example, if the text has the FF (form feed) character in it -- i.e. the ascii character at position (decimal) 12 -- the produced XML will show the FF character as '&#12;'. The character/entity '&#12;' is not legal in XML according to http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char.
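For illustration, the Char production referenced above can be written as a small Java check. This is only a minimal sketch: the class and method names (XmlCharCheck, isLegalXmlChar) are made up for this example and are not taken from Nutch or from any of the attached patches.

```java
final class XmlCharCheck {
    /**
     * True if the code point matches the XML 1.0 Char production:
     * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
     * Hypothetical helper, written for illustration only.
     */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isLegalXmlChar(0x0C)); // form feed, decimal 12: false
        System.out.println(isLegalXmlChar('\n')); // line feed: true
    }
}
```

Form feed falls in the gap between #xD and #x20, which is why it is rejected.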
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Attachment: NUTCH-110-version2.patch
Patch version 2. This patch benefits from the discussion held on the nutch dev list. It differs from the first in that it handles ALL illegal XML characters, entity-encoding the 5 'special characters' AND (silently) dropping characters outside the legal XML character range. The previous patch did only the latter, leaving entity escaping to the configured transformer/DOM serializer.
This patch also differs from patch version 1 in that it moves the method that processes the xml out into util.StringUtil, the assumption being that OpenSearchServlet is not the only class that needs to make text safe to include in xml.
The core method, StringUtil#toValidXmlText, was authored by Dawid Weiss and was taken from carrot2 XMLSerializerHelper. Below is an excerpt from mail on the nutch dev list where he grants permission to copy toValidXmlText.
Message-ID: <43...@cs.put.poznan.pl>
Date: Fri, 14 Oct 2005 08:42:48 +0200
From: Dawid Weiss <da...@cs.put.poznan.pl>
To: nutch-dev@lucene.apache.org
Subject: Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
...
> So, will I amend the patch in NUTCH-110 so it uses
> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
Copy the method's contents. It doesn't really make sense to copy the
entire class just for this method. Good luck.
D.
[jira] Resolved: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
Sami Siren resolved NUTCH-110:
------------------------------
Fix Version: 0.8-dev
Resolution: Fixed
I just committed this with small changes (moved the test to a test case). Thanks.
> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
> Key: NUTCH-110
> URL: http://issues.apache.org/jira/browse/NUTCH-110
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.8-dev
> Environment: linux, jdk 1.5
> Reporter: stack@archive.org
> Assignee: Sami Siren
> Fix For: 0.8-dev
> Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, fixIllegalXmlChars08.patch
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "Jerome Charron (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416523 ]
Jerome Charron commented on NUTCH-110:
--------------------------------------
This patch processes the String twice if it contains some illegal characters!
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by stack <st...@archive.org>.
Dawid Weiss wrote:
> ...
>
>> So, will I amend the patch in NUTCH-110 so it uses
>> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
>
>
> Copy the method's contents. It doesn't really make sense to copy the
> entire class just for this method. Good luck.
Thanks Dawid.
I've just uploaded a new patch that puts toValidXmlText into StringUtil,
adds a few basic unit tests for the just-added method, and has
OpenSearchServlet call StringUtil#toValidXmlText on all text added to
DOM nodes.
St.Ack
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Yes but (I think -- I haven't confirmed) this basic escaping is being
> done by the DOM streaming. It at least is converting characters like 0xC
> to &#12;.
I'd have to look at the code and see how the XML is serialized... Most
DOM streaming classes will encode entities somehow, so you shouldn't
worry about it. But once we're at it, it doesn't make sense to build a
DOM tree to output the XML -- it is much faster to simply serialize it
directly to the output stream.
> So, will I amend the patch in NUTCH-110 so it uses
> XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
Copy the method's contents. It doesn't really make sense to copy the
entire class just for this method. Good luck.
D.
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by stack <st...@archive.org>.
Dawid Weiss wrote:
> ...
>
>>
>> 1. XMLSerializerHelper#toValidXmlText throws an exception when an
>> invalid character is found, whereas NUTCH-110 just drops it.
>
>
> Not really, it is governed by a boolean flag. If the flag is set to
> true, it'll throw an exception, otherwise it silently ignores bad
> characters.
>
Ok.
>> 2. XMLSerializerHelper#toValidXmlText escapes all characters
>> including the 5 xml 'special characters' whereas the NUTCH-110 patch
>> only looks for the characters outside of the allowed XML character
>> range.
>
>
> If you intend to put this string in any XML text block (such as the
> content of an attribute, or between tags and not enclosed in a CDATA
> block), you'll have to deal with special characters such as < and
> >. If you don't, your XML will be simply incorrect.
Yes but (I think -- I haven't confirmed) this basic escaping is being
done by the DOM streaming. It at least is converting characters like 0xC
to &#12;.
>> 3. NUTCH-110 first scans to see if text has 'bad xml' before it goes
>> about creating new 'safe' string instance.
>
>
> So does this routine, actually. The string buffer is only created if
> there are changes to be made to the string.
>
Ok.
So, will I amend the patch in NUTCH-110 so it uses
XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
St.Ack
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> The differences between this method and the patch supplied in NUTCH-110
> are:
Take a closer look at the source code --
>
> 1. XMLSerializerHelper#toValidXmlText throws an exception when an
> invalid character is found, whereas NUTCH-110 just drops it.
Not really, it is governed by a boolean flag. If the flag is set to
true, it'll throw an exception, otherwise it silently ignores bad
characters.
> 2. XMLSerializerHelper#toValidXmlText escapes all characters including
> the 5 xml 'special characters' whereas the NUTCH-110 patch only looks
> for the characters outside of the allowed XML character range.
If you intend to put this string in any XML text block (such as the
content of an attribute, or between tags and not enclosed in a CDATA
block), you'll have to deal with special characters such as < and
>. If you don't, your XML will be simply incorrect.
> 3. NUTCH-110 first scans to see if text has 'bad xml' before it goes
> about creating new 'safe' string instance.
So does this routine, actually. The string buffer is only created if
there are changes to be made to the string.
> XMLSerializerHelper#toValidXmlText does because we can't depend on the
> underlying jdk parser instance doing the right thing?
It's not really about the parser, it's about the XML. If you emit blocks
like
<searchresult>TEXT</searchresult>
then if TEXT happens to be "2 + 2 > 3" then it has to be either escaped
or put in a CDATA section. Otherwise any parser will complain (because
it should).
Also keep in mind that the URL at searchmorph.com --
http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html
shows incorrect source code (entities are replaced with their characters,
which might be confusing), so lines 108-111:
    case '>': // '>'
        entity = ">";
        break;
Should actually read
    case '>': // '>'
        entity = "&gt;";
        break;
and similar with other entities.
D.
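The entity escaping discussed in this message can be sketched in Java roughly as follows. This is a minimal illustration only and is not the carrot2 XMLSerializerHelper source; the class and method names (XmlEscapeSketch, escapeXmlText) are hypothetical.

```java
final class XmlEscapeSketch {
    /**
     * Replaces the five XML 'special characters' with their predefined
     * entities. Hypothetical helper for illustration; it does not drop
     * characters outside the legal XML range.
     */
    static String escapeXmlText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("<searchresult>" + escapeXmlText("2 + 2 > 3") + "</searchresult>");
    }
}
```

A helper like this is only appropriate when the text is written to the output directly; if a DOM serializer already escapes text nodes, applying it first would escape the text twice.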
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by stack <st...@archive.org>.
Andrzej Bialecki wrote:
> ....
>
> Then we should take the best of both worlds - escape valid characters,
> and replace invalid ones with '?' or space, or nothing. I know a place
> where we could find some inspiration (Carrot2 XMLSerializerHelper.java
> ... ;-) )
>
Thanks for the pointer. See starting at line 92,
XMLSerializerHelper#toValidXmlText:
http://www.searchmorph.com/pub/carrot2/jd/src-html/com/dawidweiss/carrot/util/common/XMLSerializerHelper.html
The differences between this method and the patch supplied in NUTCH-110 are:
1. XMLSerializerHelper#toValidXmlText throws an exception when an
invalid character is found, whereas NUTCH-110 just drops it.
2. XMLSerializerHelper#toValidXmlText escapes all characters including
the 5 xml 'special characters' whereas the NUTCH-110 patch only looks
for the characters outside of the allowed XML character range.
3. NUTCH-110 first scans to see if text has 'bad xml' before it goes
about creating new 'safe' string instance.
I think throwing an exception is inappropriate at search-results-drawing
time. Dropping the character or replacing it with '?' or some such seems
a better way to go.
Should I change the NUTCH-110 patch to do entity escaping too as
XMLSerializerHelper#toValidXmlText does because we can't depend on the
underlying jdk parser instance doing the right thing?
Yours,
St.Ack
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Right, I didn't think about this... somehow I thought this was all about
> special characters like ' " & <.
Oh, believe me: this knowledge came from sour experience not from book
wisdom... I know for sure some XML parsers complain about invalid
characters, while others don't.
> Then we should take the best of both worlds - escape valid characters,
> and replace invalid ones with '?' or space, or nothing. I know a place
> where we could find some inspiration (Carrot2 XMLSerializerHelper.java
> ... ;-) )
Feel free to take anything you need; I don't claim it's the best way to
implement it, but it is certainly better than passing through incorrect
character codes. Alternatively you could correct everything that is
indexed not to contain invalid characters (via a token filter?).
Dawid
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Andrzej Bialecki <ab...@getopt.org>.
Dawid Weiss wrote:
>
>> We should not drop the offending characters, but escape them. Either
>> the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
>
>
> This isn't entirely true, Andrzej -- escaping a character, or putting it
> in a CDATA section is just about different ways of expressing the same
> character code in an XML structure. The same and ILLEGAL character code
> in terms of XML spec (there is a fragment specifying legal character
> ranges there), so a conforming XML parser should throw an exception if
> it encounters anything outside of the legal range. The only way of
> transferring a full binary is to encode it to legal unicode characters
> (using uuencode or such).
> I agree with the person who submitted this patch that it is a potential
> issue and should be addressed somehow.
Right, I didn't think about this... somehow I thought this was all about
special characters like ' " & <.
Then we should take the best of both worlds - escape valid characters,
and replace invalid ones with '?' or space, or nothing. I know a place
where we could find some inspiration (Carrot2 XMLSerializerHelper.java
... ;-) )
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> We should not drop the offending characters, but escape them. Either the
> Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
This isn't entirely true, Andrzej -- escaping a character, or putting it
in a CDATA section is just about different ways of expressing the same
character code in an XML structure. The same and ILLEGAL character code
in terms of XML spec (there is a fragment specifying legal character
ranges there), so a conforming XML parser should throw an exception if
it encounters anything outside of the legal range. The only way of
transferring a full binary is to encode it to legal unicode characters
(using uuencode or such).
I agree with the person who submitted this patch that it is a potential
issue and should be addressed somehow.
D.
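The point above, that a character reference or a CDATA section does not make an illegal character code legal, can be checked against the JDK's own parser. The following is a minimal sketch; the class and method names (WellFormedCheck, isWellFormed) are made up for this example, and the behaviour shown is that of the parser bundled with the JDK, so other parsers may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    /** Returns false if the parser reports a fatal (well-formedness) error. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a>ok</a>"));
        System.out.println(isWellFormed("<a>&#12;</a>"));           // escaped form feed
        System.out.println(isWellFormed("<a><![CDATA[\f]]></a>"));  // raw form feed in CDATA
    }
}
```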
Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
Posted by Andrzej Bialecki <ab...@getopt.org>.
Chris Mattmann wrote:
> Hi,
>
> I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
> any text where illegal chars are possible in a <![CDATA[ ... ]]> section? That
> way, the offending characters won't be dropped and the process won't be
> lossy, no?
>
> If the CDATA method won't work, and there's no other way to solve the
> problem without losing text, then your patch has my +1.
We should not drop the offending characters, but escape them. Either the
Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
So, this is -1 for the patch.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by Chris Mattmann <ch...@jpl.nasa.gov>.
Hi,
I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a <![CDATA[ ... ]]> section? That
way, the offending characters won't be dropped and the process won't be
lossy, no?
If the CDATA method won't work, and there's no other way to solve the
problem without losing text, then your patch has my +1.
Cheers,
Chris
______________________________________________
Chris A. Mattmann
Chris.Mattmann@jpl.nasa.gov
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
> -----Original Message-----
> From: stack@archive.org (JIRA) [mailto:jira@apache.org]
> Sent: Wednesday, October 12, 2005 5:19 PM
> To: nutch-dev@incubator.apache.org
> Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
> characters
>
> [ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
>
> stack@archive.org updated NUTCH-110:
> ------------------------------------
>
> Attachment: fixIllegalXmlChars.patch
>
> Attached patch runs all xml text through a check for bad xml characters.
> This patch is brutal: it silently drops illegal characters. The patch was
> made after hunting through xalan, the jdk, and nutch itself for a method
> that would do the above filtering; I was unable to find any such method --
> perhaps an oversight on my part?
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Attachment: fixIllegalXmlChars.patch
Attached patch runs all xml text through a check for bad xml characters. This patch is brutal: it silently drops illegal characters. The patch was made after hunting through xalan, the jdk, and nutch itself for a method that would do the above filtering; I was unable to find any such method -- perhaps an oversight on my part?
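A minimal sketch of this drop-silently approach, for readers without the patch to hand. It is not the code in fixIllegalXmlChars.patch; the class and method names (IllegalXmlCharFilter, dropIllegalXmlChars) are hypothetical, and the lazy allocation of the buffer is an assumption modelled on the "scan first, copy only if needed" behaviour discussed elsewhere in this thread.

```java
final class IllegalXmlCharFilter {
    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the input unchanged if every character is legal; otherwise
     * returns a copy with the illegal characters silently dropped.
     */
    static String dropIllegalXmlChars(String text) {
        StringBuilder out = null; // allocated only once an illegal character is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                // First illegal character: copy the clean prefix, skip this one.
                out = new StringBuilder(text.length());
                out.append(text, 0, i);
            }
            i += Character.charCount(cp);
        }
        return out == null ? text : out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dropIllegalXmlChars("page one\fpage two"));
    }
}
```

Because the filter only removes characters, calling it twice on the same value gives the same result as calling it once.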
[jira] Assigned: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
Sami Siren reassigned NUTCH-110:
--------------------------------
Assign To: Sami Siren
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
Stefan Neufeind updated NUTCH-110:
----------------------------------
Attachment: fixIllegalXmlChars08.patch
Since the original patch didn't apply cleanly for me on 0.8-dev (nightly-2006-05-20), I re-did it for 0.8 ...
With this patch the XML is fine. Without it I had big trouble parsing the RSS feed in another application.
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12416932 ]
Sami Siren commented on NUTCH-110:
----------------------------------
in method addAttribute(...)
line:
attribute.setValue(getLegalXml(getLegalXml(value)));
intentional?
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "John VanDyk (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
John VanDyk updated NUTCH-110:
------------------------------
Attachment: fixIllegalXmlChars08-v2.patch
Stefan's patch didn't apply cleanly for me on svn revision 413155 so I re-did it.
This patch fixes the illegal XML characters and prevents opensearch clients from choking on the bad XML previously emitted.
[jira] Commented: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=comments#action_12357300 ]
stack@archive.org commented on NUTCH-110:
-----------------------------------------
Scrub NUTCH-110-version2.patch. This patch double-encodes certain entities (first by the new toValidXmlText method, then by the javax.xml.transform.Transformer used by OpenSearchServlet).
Use the original patch, fixIllegalXmlChars.patch, to address the problem described in this issue.
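The double encoding can be reproduced with a few lines of standard JAXP code. This is a minimal sketch, assuming the JDK's default DocumentBuilderFactory and TransformerFactory; the class name DoubleEscapeDemo, the method name serialize, and the element name "title" are made up for this example and do not come from OpenSearchServlet.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DoubleEscapeDemo {
    /** Puts the text into a DOM text node and serializes the document. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode(text));
            doc.appendChild(title);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Raw text: the serializer escapes '<' once.
        System.out.println(serialize("a < b"));
        // Text that was already escaped: the '&' is escaped again.
        System.out.println(serialize("a &lt; b"));
    }
}
```

The second call shows the problem: text that was entity-escaped before being added to the DOM comes out with its ampersand escaped a second time.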
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Attachment: fixIllegalXmlChars08-v4.patch
v3 mistakenly included debugging code.
Attached cleaned up v4.
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Attachment: fixIllegalXmlChars08-v3.patch
Version of the patch that doesn't "...process the String twice if it contains some illegal characters!". Its name is fixIllegalXmlChars08-v3.patch (be careful, it's not the last patch in the list). It was made against 414852.
Going by the comments on this issue, at least 3 different people have run into this awkward problem. I petition that this is sufficient to earn a commit.
Thanks.
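For readers without the attachments to hand, here is a minimal sketch of what "not processing the String twice" could look like. It is not the code from any of the attached patches: the class name LegalXmlSketch and the methods isLegalXml and stripIllegalXml are hypothetical names chosen for illustration. It only drops characters outside the XML 1.0 Char production and assumes that escaping of the five special characters is left to the serializer. The output buffer is allocated lazily, so a String with no illegal characters is scanned once and returned unchanged.

```java
// Hypothetical sketch; the class and method names are illustrative and are
// not taken from the attached patches.
final class LegalXmlSketch {

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXml(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Drops illegal characters in a single pass. The buffer is only created
     * when the first illegal character is seen, so clean input is returned
     * as the same String instance.
     */
    static String stripIllegalXml(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder buffer = null;
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int width = Character.charCount(cp);
            if (isLegalXml(cp)) {
                if (buffer != null) {
                    buffer.appendCodePoint(cp);
                }
            } else if (buffer == null) {
                // First illegal character: copy the clean prefix seen so far.
                buffer = new StringBuilder(text.length());
                buffer.append(text, 0, i);
            }
            i += width;
        }
        return buffer == null ? text : buffer.toString();
    }

    public static void main(String[] args) {
        // The form feed (decimal 12) from the issue description is dropped.
        System.out.println(stripIllegalXml("form\ffeed")); // prints "formfeed"
    }
}
```

Scanning by code point rather than by char means an unpaired surrogate is treated as illegal and dropped, while a valid surrogate pair above U+FFFF is kept intact.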
> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
> Key: NUTCH-110
> URL: http://issues.apache.org/jira/browse/NUTCH-110
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.7
> Environment: linux, jdk 1.5
> Reporter: stack@archive.org
> Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08.patch
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Version: 0.8-dev
(was: 0.7)
Was version 0.7. Changed 'Affects Version' to 0.8-dev.
> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
> Key: NUTCH-110
> URL: http://issues.apache.org/jira/browse/NUTCH-110
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.8-dev
> Environment: linux, jdk 1.5
> Reporter: stack@archive.org
> Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08.patch
[jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters
Posted by "stack@archive.org (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
stack@archive.org updated NUTCH-110:
------------------------------------
Attachment: fixIllegalXmlChars08-v5.patch
No, the double call to getLegalXml is not intentional. It's a mistake. Thanks for finding it.
I've attached yet another version (Any prizes for most revisions to a patch?).
> OpenSearchServlet outputs illegal xml characters
> ------------------------------------------------
>
> Key: NUTCH-110
> URL: http://issues.apache.org/jira/browse/NUTCH-110
> Project: Nutch
> Type: Bug
> Components: searcher
> Versions: 0.8-dev
> Environment: linux, jdk 1.5
> Reporter: stack@archive.org
> Assignee: Sami Siren
> Attachments: NUTCH-110-version2.patch, fixIllegalXmlChars.patch, fixIllegalXmlChars08-v2.patch, fixIllegalXmlChars08-v3.patch, fixIllegalXmlChars08-v4.patch, fixIllegalXmlChars08-v5.patch, fixIllegalXmlChars08.patch