Posted to solr-user@lucene.apache.org by Ellery Leung <el...@be-o.com> on 2011/05/27 09:41:23 UTC
HTMLStripTransformer will remove the content in XML??
I have an XML string like this:
<?xml version="1.0" encoding="UTF-8"?><language><intl><![CDATA[hello]]></intl><loc><![CDATA[solr]]></loc></language>
By using HTMLStripTransformer, I expect to get 'hello,solr'.
But this transformer actually removes ALL THE TEXT INSIDE!
Did I do something silly, or is it a bug?
Thank you
RE: HTMLStripTransformer will remove the content in XML??
Posted by Chris Hostetter <ho...@fucit.org>.
FYI: There's a new patch specifically for dealing with XML tags and entities
that handles the CDATA case...
https://issues.apache.org/jira/browse/SOLR-2597
: Date: Fri, 27 May 2011 17:01:26 +0800
: From: Ellery Leung <el...@be-o.com>
: Reply-To: solr-user@lucene.apache.org, elleryleung@be-o.com
: To: solr-user@lucene.apache.org
: Subject: RE: HTMLStripTransformer will remove the content in XML??
-Hoss
RE: HTMLStripTransformer will remove the content in XML??
Posted by Ellery Leung <el...@be-o.com>.
Got it. I first use solr.MappingCharFilterFactory to replace <![CDATA[ and ]]> with empty strings, and then use solr.HTMLStripCharFilterFactory to get "hello" and "solr".
For future reference, here is part of schema.xml:
<fieldType name="textMaxWord" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mappings.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    ...
In mappings.txt (2 lines):
"<![CDATA[" => ""
"]]>" => ""
Then restart Solr.
It works.
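The two-stage idea above (map the CDATA delimiters to nothing, then strip tags) can be sketched outside Solr in a few lines of Python. This is only a rough model under stated assumptions: `apply_mappings`, `naive_strip_tags`, and `char_filter_chain` are hypothetical helper names, the regex is a stand-in for tag stripping rather than what HTMLStripCharFilterFactory actually does internally, and replacing each tag with a space is a choice made here so that neighbouring words stay separate.

```python
import re

# Sample input from the original post, joined onto one line.
XML = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<language><intl><![CDATA[hello]]></intl>'
       '<loc><![CDATA[solr]]></loc></language>')

# The same two rules as mappings.txt: each CDATA delimiter maps to "".
MAPPINGS = {'<![CDATA[': '', ']]>': ''}


def apply_mappings(text, mappings=MAPPINGS):
    """Stage 1 (hypothetical helper): literal string replacement."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def naive_strip_tags(text):
    """Stage 2 (assumed model): replace each '<' ... '>' run with a space."""
    return re.sub(r'<[^>]*>', ' ', text)


def char_filter_chain(text):
    """Run the stages in the same order as the charFilter elements."""
    return naive_strip_tags(apply_mappings(text))


print(char_filter_chain(XML).split())  # prints ['hello', 'solr']
```

The order matters in this model: if the tag stripping ran first, each whole CDATA section would already have been removed before the mapping had a chance to unwrap it.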
Thank you
-----Original Message-----
From: bryan rasmussen [mailto:rasmussen.bryan@gmail.com]
Sent: May 27, 2011 4:20 PM
To: solr-user@lucene.apache.org; elleryleung@be-o.com
Subject: Re: HTMLStripTransformer will remove the content in XML??
Re: HTMLStripTransformer will remove the content in XML??
Posted by bryan rasmussen <ra...@gmail.com>.
I would expect that it doesn't understand CDATA and thinks of
everything between < and > as a 'tag'.
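To illustrate that guess, here is a minimal Python sketch of a stripper that treats every run from '<' to the next '>' as a tag. It is an assumption-laden model, not the actual Solr code: `naive_strip_tags` is a hypothetical name and the regex is just the simplest rule that matches the description above. Because a CDATA section contains no '>' until its closing ']]>', the whole section, text included, counts as one "tag".

```python
import re

# Sample input from the original post, joined onto one line.
XML = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<language><intl><![CDATA[hello]]></intl>'
       '<loc><![CDATA[solr]]></loc></language>')


def naive_strip_tags(text):
    """Assumed model: replace each '<' ... '>' run with a space."""
    return re.sub(r'<[^>]*>', ' ', text)


# The CDATA sections are swallowed along with the real tags,
# so no words survive.
print(naive_strip_tags(XML).split())  # prints []
```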
Best Regards,
Bryan Rasmussen