Posted to ruby-dev@lucene.apache.org by Thiago Jackiw <tj...@gmail.com> on 2007/06/20 03:55:38 UTC
Can't render html entities when adding documents
There's something funky with solr-ruby's xml processing when adding
documents, but I don't really know what it is yet. It can't process
html entities at all, not even an html blank space "&nbsp;":
SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve
entity named 'nbsp' (position: START_TAG seen ... to participate and
contribute to the Open Source Community. ... @1:1085)
Please look into it as soon as possible; acts_as_solr uses
solr-ruby as its backend, so it cannot have buggy behavior.
Thanks.
--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com
Re: Can't render html entities when adding documents
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Shedding more light on the REXML issue, with Ruby 1.8.6 it works!
irb(main):003:0> REXML::Text.new("&nbsp;",false,nil,false).to_s
=> "&amp;nbsp;"
irb(main):004:0> REXML::Text.new("&",false,nil,false).to_s
=> "&amp;"
So, do we require a higher version of Ruby (I was using 1.8.4
before)? Or.. ?
Erik
On Jun 24, 2007, at 11:02 AM, Erik Hatcher wrote:
> Firstly: REXML Sucks!
>
> good grief: <http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/161603>
>
> Text.new("&nbsp;",false,nil,false).to_s
> => "&nbsp;"
>
> I've added this currently failing test to server_test.rb:
>
> def test_entities
>   @connection.add(:id => 1, :title_text => "&nbsp;")
>   response = @connection.query('nbsp')
>   assert_equal 1, response.total_hits
>   assert_equal '1', response.hits[0]['id']
> end
>
> This works fine with libxml, but fails with REXML because of
> REXML's ridiculous escape-everything-not-already-escaped policy.
> At the moment I'm not sure how to resolve this, and I'm not
> currently sure how acts_as_solr worked with REXML any differently.
> Thiago - can you shed any light on that?
>
> My vote is to get rid of REXML support in solr-ruby and either
> require libxml-ruby to be installed or find some other lighter
> weight replacement.
>
> Thoughts?
>
> Erik
Re: Can't render html entities when adding documents
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Firstly: REXML Sucks!
good grief: <http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/161603>
Text.new("&nbsp;",false,nil,false).to_s
=> "&nbsp;"
I've added this currently failing test to server_test.rb:
def test_entities
  @connection.add(:id => 1, :title_text => "&nbsp;")
  response = @connection.query('nbsp')
  assert_equal 1, response.total_hits
  assert_equal '1', response.hits[0]['id']
end
This works fine with libxml, but fails with REXML because of REXML's
ridiculous escape-everything-not-already-escaped policy. At the
moment I'm not sure how to resolve this, and I'm not currently sure
how acts_as_solr worked with REXML any differently. Thiago - can you
shed any light on that?
My vote is to get rid of REXML support in solr-ruby and either
require libxml-ruby to be installed or find some other lighter weight
replacement.
Thoughts?
Erik
On Jun 19, 2007, at 9:55 PM, Thiago Jackiw wrote:
> There's something funky with solr-ruby's xml processing when adding
> documents, but I don't really know what it is yet. It can't process
> html entities at all, not even an html blank space "&nbsp;":
>
> SEVERE: org.xmlpull.v1.XmlPullParserException: could not resolve
> entity named 'nbsp' (position: START_TAG seen ... to participate and
> contribute to the Open Source Community. ... @1:1085)
>
> Please look into it as soon as possible; acts_as_solr uses
> solr-ruby as its backend, so it cannot have buggy behavior.
>
> Thanks.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com
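To make the difference concrete, here is a minimal sketch of the two escaping policies under discussion. It is not REXML's code: the regular expression and both method names are assumptions chosen for illustration. The first method leaves anything that already looks like an entity untouched, which matches the behavior described above; the second escapes every ampersand.

```ruby
# Illustration only: neither method is taken from REXML or solr-ruby.

# Matches an ampersand that does NOT start something shaped like an
# entity or character reference (&name; &#160; &#xA0;).
BARE_AMPERSAND = /&(?![A-Za-z_][\w.-]*;|#\d+;|#x\h+;)/

# Policy 1: escape only ampersands that are not already part of an
# entity-looking sequence. "&nbsp;" passes through unchanged, so the
# XML sent to Solr contains an entity its parser cannot resolve.
def escape_unless_entity(text)
  text.gsub(BARE_AMPERSAND, '&amp;')
end

# Policy 2: escape every ampersand. "&nbsp;" becomes "&amp;nbsp;",
# which an XML parser reads back as the literal text "&nbsp;".
def escape_always(text)
  text.gsub('&', '&amp;')
end

puts escape_unless_entity('Tom & Jerry&nbsp;')  # Tom &amp; Jerry&nbsp;
puts escape_always('Tom & Jerry&nbsp;')         # Tom &amp; Jerry&amp;nbsp;
```

Under the second policy the test above would be expected to pass, since the indexed text is the literal string "&nbsp;" and a query for 'nbsp' can match it.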
Re: Can't render html entities when adding documents
Posted by Yonik Seeley <yo...@apache.org>.
On 6/19/07, Yonik Seeley <yo...@apache.org> wrote:
> On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> > There's something funky with solr-ruby's xml processing when adding
> > documents, but I don't really know what it is yet. It can't process
> > html entities at all, not even an html blank space "&nbsp;":
>
> nbsp is not a default XML entity.
> Try replacing it with &#160;
Even though the current Solr behavior is correct, I'm practical over
purist... if we could find a way to seed the XML parser with common
HTML entities, I don't think I'd be opposed to it.
-Yonik
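On the client side, one way to follow this advice is to rewrite named HTML entities as numeric character references before the document is sent. The sketch below is only an illustration: the helper name and the small entity table are assumptions, and nothing like it is known to exist in solr-ruby or Solr.

```ruby
# Hypothetical helper, not part of solr-ruby or Solr.
# Small sample table; a real one would cover the full HTML entity list.
HTML_ENTITY_CODEPOINTS = {
  'nbsp'   => 160,
  'copy'   => 169,
  'reg'    => 174,
  'mdash'  => 8212,
  'hellip' => 8230
}.freeze

# Rewrites known named HTML entities as numeric character references.
# Names missing from the table, including XML's own entities
# (amp, lt, gt, quot, apos), are left exactly as they were.
def numeric_entities(text)
  text.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    codepoint = HTML_ENTITY_CODEPOINTS[Regexp.last_match(1)]
    codepoint ? format('&#%d;', codepoint) : match
  end
end

puts numeric_entities('Open&nbsp;Source &amp; friends &copy; 2007')
# Open&#160;Source &amp; friends &#169; 2007
```

Numeric references need no entity declarations, so any conforming XML parser can resolve them. This only helps if the client library then passes the references through without escaping the ampersand again.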
Re: Can't render html entities when adding documents
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Thiago,
I'll have to look late this week/weekend if I get a chance then, but
how did acts_as_solr create the XML passed to Solr? I think you
used my original hack for that communication which used REXML,
right? solr-ruby now supports both REXML and libxml2 - and I've
found that libxml2 does things properly whereas REXML was screwing
things up.
I suspect we can come up with a simple test case that shows where
things are wacky. If you can submit one of those I'll be glad to
look into this as soon as I can (this weekend at the earliest).
Erik
On Jun 20, 2007, at 2:06 AM, Thiago Jackiw wrote:
> Replying to my own post, I just tried with solr 1.2 with the last 2
> previous versions of acts_as_solr and it worked great, so I'm pretty
> sure this is a solr-ruby issue. I'll do some more testing with the way
> solr-ruby adds documents to Solr.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com
Re: Can't render html entities when adding documents
Posted by Thiago Jackiw <tj...@gmail.com>.
Replying to my own post, I just tried with solr 1.2 with the last 2
previous versions of acts_as_solr and it worked great, so I'm pretty
sure this is a solr-ruby issue. I'll do some more testing with the way
solr-ruby adds documents to Solr.
--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com
On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> What's interesting is that on the previous versions of acts_as_solr
> (without solr-ruby) the html entities were getting indexed fine
> without passing through ERB's html_escape method. That's what I did as
> a quick fix before starting this thread.
>
> Did anything change in Solr 1.2 in regards to xml parsing? And I guess
> I should try the previous version of the acts_as_solr plugin with Solr
> 1.2 to see if I get the same error.
>
> --
> Thiago Jackiw
> acts_as_solr => http://acts-as-solr.railsfreaks.com
Re: Can't render html entities when adding documents
Posted by Thiago Jackiw <tj...@gmail.com>.
What's interesting is that on the previous versions of acts_as_solr
(without solr-ruby) the html entities were getting indexed fine
without passing through ERB's html_escape method. That's what I did as
a quick fix before starting this thread.
Did anything change in Solr 1.2 in regards to xml parsing? And I guess
I should try the previous version of the acts_as_solr plugin with Solr
1.2 to see if I get the same error.
--
Thiago Jackiw
acts_as_solr => http://acts-as-solr.railsfreaks.com
On 6/19/07, Aaron Suggs <aa...@ktheory.com> wrote:
> I was getting the same XmlPullParserException from solr while using
> solr-ruby to index HTML.
>
> I solved things by running text through the html_escape() method in
> ERB::Util before submitting to Solr.
>
> In the console, the following generates the XmlPullParserException in
> solr, which manifests itself as a Net::HTTPFatalError in solr-ruby:
>
> Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
> :on).add(:id => 1, :value_t => '&nbsp;')
> Net::HTTPFatalError: 500...XmlPullParserException...
>
> But html_escape (aliased as the h() method by default) works like a
> charm:
>
> include ERB::Util
> Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
> :on).add(:id => 1, :value_t => h('&nbsp;'))
> => true
>
> Subsequently, searching for strings like 'nbsp' returns hits on those
> escaped entities, which may or may not be what you want:
> >> Solr::Connection.new(SOLR_URL, :autocommit => :on).query('value_t:nbsp').hits
> => [{"score"=>10.771498, "id"=>1, "value_t"=>"&nbsp;"}]
>
> If you don't want searches for 'nbsp' to return all documents with
> escaped non-breaking spaces, the solution lies in defining some new
> fieldtype in solr/conf/schema.xml.
>
> -Aaron Suggs
Re: Can't render html entities when adding documents
Posted by Aaron Suggs <aa...@ktheory.com>.
I was getting the same XmlPullParserException from solr while using
solr-ruby to index HTML.
I solved things by running text through the html_escape() method in
ERB::Util before submitting to Solr.
In the console, the following generates the XmlPullParserException in
solr, which manifests itself as a Net::HTTPFatalError in solr-ruby:
Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
:on).add(:id => 1, :value_t => '&nbsp;')
Net::HTTPFatalError: 500...XmlPullParserException...
But html_escape (aliased as the h() method by default) works like a
charm:
include ERB::Util
Solr::Connection.new('http://localhost:8083/solr', :autocommit =>
:on).add(:id => 1, :value_t => h('&nbsp;'))
=> true
Subsequently, searching for strings like 'nbsp' returns hits on those
escaped entities, which may or may not be what you want:
>> Solr::Connection.new(SOLR_URL, :autocommit => :on).query('value_t:nbsp').hits
=> [{"score"=>10.771498, "id"=>1, "value_t"=>"&nbsp;"}]
If you don't want searches for 'nbsp' to return all documents with
escaped non-breaking spaces, the solution lies in defining some new
fieldtype in solr/conf/schema.xml.
-Aaron Suggs
On 6/19/07, Yonik Seeley <yo...@apache.org> wrote:
> On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> > There's something funky with solr-ruby's xml processing when adding
> > documents, but I don't really know what it is yet. It can't process
> > html entities at all, not even an html blank space "&nbsp;":
>
> nbsp is not a default XML entity.
> Try replacing it with &#160;
>
> -Yonik
>
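The workaround above can be sketched in isolation, without a running Solr. The helper name escape_fields is hypothetical and only shows what html_escape does to the field values before a client such as solr-ruby would send them.

```ruby
require 'erb'

# Hypothetical helper: escape every String value in a document hash
# before handing it to a client such as solr-ruby. Non-string values
# (the numeric id here) are passed through unchanged.
def escape_fields(doc)
  doc.each_with_object({}) do |(key, value), escaped|
    escaped[key] = value.is_a?(String) ? ERB::Util.html_escape(value) : value
  end
end

doc = escape_fields(:id => 1, :value_t => 'Open&nbsp;Source <b>')
puts doc[:value_t]  # Open&amp;nbsp;Source &lt;b&gt;
```

Because the ampersand itself is escaped, the XML parser sees the plain text "&nbsp;" rather than an unresolvable entity, which is also why a search for 'nbsp' matches. If the client library already escapes every ampersand on its own, pre-escaping like this would escape the text twice.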
Re: Can't render html entities when adding documents
Posted by Yonik Seeley <yo...@apache.org>.
On 6/19/07, Thiago Jackiw <tj...@gmail.com> wrote:
> There's something funky with solr-ruby's xml processing when adding
> documents, but I don't really know what it is yet. It can't process
> html entities at all, not even an html blank space "&nbsp;":
nbsp is not a default XML entity.
Try replacing it with &#160;
-Yonik