Posted to solr-user@lucene.apache.org by Mike Richmond <ri...@gmail.com> on 2006/06/20 17:04:35 UTC

Invalid XML returned from Solr

I have an application that I recently ported to Solr and am running
into a few problems with the XML responses from Solr.  An XML response
to a Solr query contained data that was not properly escaped (no CDATA
section or entity substitution).  In particular, the "summary" field
contains '<' characters.  An example of such a response can be found
here: http://www.willetts.com/mike/response.xml

I looked through the source code for XMLWriter and it appears to be
using util.XML.escape to escape the data, so I do not see how this
response is able to occur.  Does anyone have any ideas?
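
For readers unfamiliar with entity substitution, here is a minimal Java sketch of the idea. The class and method names (XmlEscapeSketch, escapeCharData) are hypothetical and chosen for illustration only; this is not the source of Solr's util.XML.escape, which is not shown in this thread.

```java
class XmlEscapeSketch {
    // Hypothetical helper for illustration; NOT Solr's util.XML.escape.
    // Replaces the characters that are unsafe in XML character data.
    static String escapeCharData(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A field value containing markup-like text, as in the "summary" field.
        System.out.println("<str>" + escapeCharData("I'll <email> you & Bob") + "</str>");
    }
}
```

Escaping '>' is optional in character data (outside the "]]>" sequence), so a writer that leaves '>' untouched still produces well-formed XML.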

Here is the requestHandler tag in the Solr config file:
<requestHandler name="standard" class="solr.StandardRequestHandler" />

On another note:
I also noticed that I get non-UTF-8 characters in the response even
though the XML declaration at the top of the document specifies UTF-8
encoding.  I did not see anywhere in the XMLWriter code that checked
the encoding of the output.  Is this by design, or am I missing
something?
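
One way to confirm that a response body really is not UTF-8 is to run its raw bytes through a strict decoder. The sketch below assumes the bytes have already been read from the response; the class and method names (Utf8CheckSketch, isValidUtf8) are hypothetical, and the sample bytes are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8CheckSketch {
    // Returns true only if the bytes form a valid UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xB9 alone is a single-byte (Latin-1 style) encoding of a char >= 128.
        byte[] singleByte = {'I', (byte) 0xB9, 'l', 'l'};
        // 0xC2 0xB9 is the UTF-8 encoding of the same code point, U+00B9.
        byte[] utf8 = {'I', (byte) 0xC2, (byte) 0xB9, 'l', 'l'};
        System.out.println(isValidUtf8(singleByte)); // false
        System.out.println(isValidUtf8(utf8));       // true
    }
}
```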


Thanks in advance.  The feedback I have received from the user lists
has been invaluable.


Regards,

Mike

Re: Invalid XML returned from Solr

Posted by Mike Richmond <ri...@gmail.com>.
Hi Yonik,

Thanks again for the quick help.  I switched to Tomcat and all the
problems went away.

I'm not sure what the process would be, but I'd be willing to migrate
the example application to Tomcat and update the existing
documentation.  I would like to give back to this project, as it has
done quite a bit for me.


--Mike


On 6/20/06, Yonik Seeley <ys...@gmail.com> wrote:
> I've confirmed this is a Jetty bug related to international chars
> (>=128) and their output writer.  When I moved the example to Tomcat
> 5.5, everything worked as expected.
>
> For the exact same Lucene index file,
> Tomcat outputs
>   <str>I¹ll &lt;email></str>
> and Jetty outputs
>   <str>I¹ll <email>&lt;email></str>
>
> We should really look into switching the appserver we bundle for the example.
>
> -Yonik
>

Re: Invalid XML returned from Solr

Posted by Yonik Seeley <ys...@gmail.com>.
I've confirmed this is a Jetty bug related to international chars
(>=128) and their output writer.  When I moved the example to Tomcat
5.5, everything worked as expected.

For the exact same Lucene index file,
Tomcat outputs
  <str>I¹ll &lt;email></str>
and Jetty outputs
  <str>I¹ll <email>&lt;email></str>
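
The difference between the two outputs can be checked mechanically by feeding each one to an XML parser. The following is a sketch using the JDK's built-in DOM parser; the class and method names (WellFormedSketch, isWellFormed) are hypothetical, and the two strings are the outputs quoted above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {
    // Returns true if the string parses as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // parse error: not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String tomcat = "<str>I\u00B9ll &lt;email></str>";
        String jetty = "<str>I\u00B9ll <email>&lt;email></str>";
        System.out.println(isWellFormed(tomcat)); // true
        System.out.println(isWellFormed(jetty));  // false: <email> is never closed
    }
}
```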

We should really look into switching the appserver we bundle for the example.

-Yonik

Re: Invalid XML returned from Solr

Posted by Mike Richmond <ri...@gmail.com>.
Hi Yonik,

Thanks for the quick reply.  I am willing to give you access to my
index, config files, or any other pieces that you may need if it would
help.  I am basically running the example application (which uses
Jetty), but with a modified schema.xml and a couple other small
changes.

I'll look into giving Tomcat a try over Jetty.


--Mike


On 6/20/06, Yonik Seeley <ys...@gmail.com> wrote:
> On 6/20/06, Mike Richmond <ri...@gmail.com> wrote:
> > I have an application that I recently ported to Solr and am running
> > into a few problems with the XML responses from Solr.  An XML response
> > which came from a Solr query, returned XML data that was not properly
> > escaped (no CDATA tag, or entity substitution).  In particular the
> > "summary" field contains '<' characters. An example of such a response
> > can be found here: http://www.willetts.com/mike/response.xml
>
> Hmmm, that is interesting... I haven't seen that before.
> I'll try and duplicate it with your example "summary" field.
>
> > On another note:
> > I also noticed that I get non-utf8 characters in the response even
> > though the encoding line at the top of the XML document specifies utf8
> > encoding.
>
> Are you using the bundled version of Jetty?  People have been having
> problems with international chars with that.  You might try using
> Tomcat.
>
> > I did not see anywhere in the XMLWriter code that checked
> > the encoding of the output.  Is this by design, or am I missing
> > something?
>
> By design... XMLWriter writes Java characters and strings, and the
> servlet container handles encoding to UTF-8.
>
> -Yonik
>

Re: Invalid XML returned from Solr

Posted by Yonik Seeley <ys...@gmail.com>.
On 6/20/06, Mike Richmond <ri...@gmail.com> wrote:
> I have an application that I recently ported to Solr and am running
> into a few problems with the XML responses from Solr.  An XML response
> which came from a Solr query, returned XML data that was not properly
> escaped (no CDATA tag, or entity substitution).  In particular the
> "summary" field contains '<' characters. An example of such a response
> can be found here: http://www.willetts.com/mike/response.xml

Hmmm, that is interesting... I haven't seen that before.
I'll try and duplicate it with your example "summary" field.

> On another note:
> I also noticed that I get non-utf8 characters in the response even
> though the encoding line at the top of the XML document specifies utf8
> encoding.

Are you using the bundled version of Jetty?  People have been having
problems with international chars with that.  You might try using
Tomcat.

> I did not see anywhere in the XMLWriter code that checked
> the encoding of the output.  Is this by design, or am I missing
> something?

By design... XMLWriter writes Java characters and strings, and the
servlet container handles encoding to UTF-8.
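
The split between writing characters and encoding bytes can be sketched in plain Java. This is an illustration under assumptions, not Solr or servlet container code: EncodingSketch and encode are hypothetical names, and an OutputStreamWriter stands in for whatever writer the container supplies.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // The caller writes characters; the charset given to the Writer decides
    // which bytes are produced. This stands in for the container's writer.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "I\u00B9ll"; // contains one char >= 128
        // Same characters, different bytes depending on the writer's charset.
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // 4
    }
}
```

If the writer's charset does not match the encoding named in the XML declaration, the code that produced the characters cannot detect the mismatch, which is consistent with the fix being a change of container.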

-Yonik