Posted to ruby-dev@lucene.apache.org by Thiago Jackiw <tj...@gmail.com> on 2007/06/20 03:55:38 UTC

Can't render html entities when adding documents

There's something funky with solr-ruby's xml processing when adding
documents, but I don't really know what it is yet. It can't process
html entities at all, not even an html blank space "&nbsp;":

SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve
entity named 'nbsp' (position: START_TAG seen ... to participate and
contribute to the Open Source Community.&nbsp;... @1:1085)

Please look into it as soon as possible; acts_as_solr uses solr-ruby
as its backend, so it can't afford buggy behavior.

Thanks.

--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com

Re: Can't render html entities when adding documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Shedding more light on the REXML issue, with Ruby 1.8.6 it works!

irb(main):003:0> REXML::Text.new("&nbsp;",false,nil,false).to_s
=> "&amp;nbsp;"
irb(main):004:0> REXML::Text.new("&",false,nil,false).to_s
=> "&amp;"

So, do we require a higher version of Ruby (I was using 1.8.4
before)? Or...?

	Erik


On Jun 24, 2007, at 11:02 AM, Erik Hatcher wrote:

> Firstly: REXML Sucks!
>
> good grief: <http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/161603>
>
> Text.new("&nbsp;",false,nil,false).to_s
> => "&nbsp;"
>
> I've added this currently failing test to server_test.rb:
>
>   def test_entities
>     @connection.add(:id => 1, :title_text => "&nbsp;")
>     response = @connection.query('nbsp')
>     assert_equal 1, response.total_hits
>     assert_equal '1', response.hits[0]['id']
>   end
>
> This works fine with libxml, but fails with REXML because of  
> REXML's ridiculous escape-everything-not-already-escaped policy.   
> At the moment I'm not sure how to resolve this, and I'm not  
> currently sure how acts_as_solr worked with REXML any differently.   
> Thiago - can you shed any light on that?
>
> My vote is to get rid of REXML support in solr-ruby and either  
> require libxml-ruby to be installed or find some other lighter  
> weight replacement.
>
> Thoughts?
>
> 	Erik


Re: Can't render html entities when adding documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Firstly: REXML Sucks!

good grief: <http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/161603>

Text.new("&nbsp;",false,nil,false).to_s
=> "&nbsp;"

I've added this currently failing test to server_test.rb:

   def test_entities
     @connection.add(:id => 1, :title_text => "&nbsp;")
     response = @connection.query('nbsp')
     assert_equal 1, response.total_hits
     assert_equal '1', response.hits[0]['id']
   end

This works fine with libxml, but fails with REXML because of REXML's  
ridiculous escape-everything-not-already-escaped policy.  At the  
moment I'm not sure how to resolve this, and I'm not currently sure  
how acts_as_solr worked with REXML any differently.  Thiago - can you  
shed any light on that?

My vote is to get rid of REXML support in solr-ruby and either  
require libxml-ruby to be installed or find some other lighter weight  
replacement.

Thoughts?

	Erik



On Jun 19, 2007, at 9:55 PM, Thiago Jackiw wrote:

> There's something funky with solr-ruby's xml processing when adding
> documents, but I don't really know what it is yet. It can't process
> html entities at all, not even an html blank space "&nbsp;":
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve
> entity named 'nbsp' (position: START_TAG seen ... to participate and
> contribute to the Open Source Community.&nbsp;... @1:1085)
>
> Please look into it as soon as possible; acts_as_solr uses solr-ruby
> as its backend, so it can't afford buggy behavior.
>
> Thanks.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com


Re: Can't render html entities when adding documents

Posted by Yonik Seeley <yo...@apache.org>.
On 6/19/07, Yonik Seeley <yo...@apache.org> wrote:
> On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> > There's something funky with solr-ruby's xml processing when adding
> > documents, but I don't really know what it is yet. It can't process
> > html entities at all, not even an html blank space "&nbsp;":
>
> nbsp is not a default XML entity.
> Try replacing it with &#160;

Even though the current Solr behavior is correct, I'm practical over
purist... if we could find a way to seed the XML parser with common
HTML entities, I don't think I'd be opposed to it.

-Yonik

Re: Can't render html entities when adding documents

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Thiago,

I'll have to look late this week/weekend if I get a chance then, but  
how did acts_as_solr create the XML passed to Solr?   I think you  
used my original hack for that communication which used REXML,  
right?   solr-ruby now supports both REXML and libxml2 - and I've  
found that libxml2 does things properly whereas REXML was screwing  
things up.

I suspect we can come up with a simple test case that shows where  
things are wacky.  If you can submit one of those I'll be glad to  
look into this as soon as I can (this weekend at the earliest).

	Erik


On Jun 20, 2007, at 2:06 AM, Thiago Jackiw wrote:

> Replying to my own post, I just tried with solr 1.2 with the last 2
> previous versions of acts_as_solr and it worked great, so I'm pretty
> sure this is a solr-ruby issue. I'll do some more testing with the way
> solr-ruby adds documents to Solr.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com


Re: Can't render html entities when adding documents

Posted by Thiago Jackiw <tj...@gmail.com>.
Replying to my own post, I just tried with solr 1.2 with the last 2
previous versions of acts_as_solr and it worked great, so I'm pretty
sure this is a solr-ruby issue. I'll do some more testing with the way
solr-ruby adds documents to Solr.

--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com


On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> What's interesting is that on the previous versions of acts_as_solr
> (without solr-ruby) the html entities were getting indexed fine
> without passing through ERB's html_escape method. That's what I did as
> a fast fix before starting this thread.
>
> Did anything change in Solr 1.2 in regards to xml parsing? And I guess
> I should try the previous version of the acts_as_solr plugin with Solr
> 1.2 to see if I get the same error.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com

Re: Can't render html entities when adding documents

Posted by Thiago Jackiw <tj...@gmail.com>.
What's interesting is that on the previous versions of acts_as_solr
(without solr-ruby) the html entities were getting indexed fine
without passing through ERB's html_escape method. That's what I did as
a fast fix before starting this thread.

Did anything change in Solr 1.2 in regards to xml parsing? And I guess
I should try the previous version of the acts_as_solr plugin with Solr
1.2 to see if I get the same error.

--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com


On 6/19/07, Aaron Suggs <aa...@ktheory.com> wrote:
> I was getting the same XmlPullParserException from solr while using
> solr-ruby to index HTML.
>
> I solved things by running text through the html_escape() method in
> ERB::Util before submitting to Solr.
>
> In the console, the following generates the XmlPullParserException in
> solr, which manifests itself as a Net::HTTPFatalError in solr-ruby:
>
>   Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
> :on).add(:id => 1, :value_t => '&nbsp;')
> Net::HTTPFatalError: 500...XmlPullParserException...
>
> But running the text through html_escape (aliased as the h() method by
> default) works like a charm:
>
>   include ERB::Util
>   Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
> :on).add(:id => 1, :value_t => h('&nbsp;'))
> => true
>
> Subsequently, searching for strings like 'nbsp' returns hits on those
> escaped entities, which may or may not be what you want:
> >> Solr::Connection.new(SOLR_URL, :autocommit => :on).query('value_t:nbsp').hits
> => [{"score"=>10.771498, "id"=>1, "value_t"=>"&nbsp;"}]
>
> If you don't want searches for 'nbsp' to return all documents with
> escaped non-breaking spaces, the solution lies in defining some new
> fieldtype in solr/conf/schema.xml
>
> -Aaron Suggs

Re: Can't render html entities when adding documents

Posted by Aaron Suggs <aa...@ktheory.com>.
I was getting the same XmlPullParserException from solr while using
solr-ruby to index HTML.

I solved things by running text through the html_escape() method in
ERB::Util before submitting to Solr.

In the console, the following generates the XmlPullParserException in
solr, which manifests itself as a Net::HTTPFatalError in solr-ruby:

  Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
:on).add(:id => 1, :value_t => '&nbsp;')
Net::HTTPFatalError: 500...XmlPullParserException...

But running the text through html_escape (aliased as the h() method by
default) works like a charm:

  include ERB::Util
  Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
:on).add(:id => 1, :value_t => h('&nbsp;'))
=> true

Subsequently, searching for strings like 'nbsp' returns hits on those
escaped entities, which may or may not be what you want:
>> Solr::Connection.new(SOLR_URL, :autocommit => :on).query('value_t:nbsp').hits
=> [{"score"=>10.771498, "id"=>1, "value_t"=>"&nbsp;"}]
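
To make the effect of the workaround concrete, here is a small sketch that uses only ERB::Util from the standard library and needs no Solr instance. The tokenize helper is hypothetical: a crude stand-in for whatever analyzer the Solr field type applies, included only to suggest why a query for 'nbsp' can match the escaped value.

```ruby
require 'erb'

raw     = '&nbsp;'
escaped = ERB::Util.html_escape(raw)   # the form that ends up in the XML sent to Solr
puts escaped

# Escaping an already-escaped string escapes the ampersand again,
# so apply html_escape exactly once.
double_escaped = ERB::Util.html_escape(escaped)
puts double_escaped

# Hypothetical, simplified tokenizer: lowercases and splits on anything
# that is not a letter or digit. Real Solr analyzers are configurable
# and may behave differently.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/)
end

puts tokenize(raw).inspect
```

Under that assumption the stored text "&nbsp;" yields the single token "nbsp", which would explain the hit shown above.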

If you don't want searches for 'nbsp' to return all documents with
escaped non-breaking spaces, the solution lies in defining some new
fieldtype in solr/conf/schema.xml

-Aaron Suggs

On 6/19/07, Yonik Seeley <yo...@apache.org> wrote:
> On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> > There's something funky with solr-ruby's xml processing when adding
> > documents, but I don't really know what it is yet. It can't process
> > html entities at all, not even an html blank space "&nbsp;":
>
> nbsp is not a default XML entity.
> Try replacing it with &#160;
>
> -Yonik
>

Re: Can't render html entities when adding documents

Posted by Yonik Seeley <yo...@apache.org>.
On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> There's something funky with solr-ruby's xml processing when adding
> documents, but I don't really know what it is yet. It can't process
> html entities at all, not even an html blank space "&nbsp;":

nbsp is not a default XML entity.
Try replacing it with &#160;

-Yonik
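
The replacement suggested above can also be applied on the client before the document is handed to solr-ruby. What follows is a minimal sketch, not part of solr-ruby or Solr: HTML_ENTITY_CODEPOINTS, XML_PREDEFINED and html_entities_to_numeric are hypothetical names, and the lookup table is deliberately tiny. A real table would need the full HTML entity list.

```ruby
# Hypothetical, hand-picked subset of HTML named entities and their code points.
HTML_ENTITY_CODEPOINTS = {
  'nbsp'   => 160,
  'copy'   => 169,
  'eacute' => 233
}.freeze

# The five entities every XML parser already understands; leave these alone.
XML_PREDEFINED = %w[amp lt gt quot apos].freeze

# Rewrites known HTML named entities as numeric character references.
# Unknown names are left untouched so the caller can decide what to do with them.
def html_entities_to_numeric(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    name = Regexp.last_match(1)
    if !XML_PREDEFINED.include?(name) && HTML_ENTITY_CODEPOINTS.key?(name)
      format('&#%d;', HTML_ENTITY_CODEPOINTS[name])
    else
      match
    end
  end
end

puts html_entities_to_numeric('contribute to the Open Source Community.&nbsp;')
```

Unlike html_escape, this keeps the non-breaking space as a character in the indexed text rather than as the literal string "&nbsp;".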