Posted to solr-user@lucene.apache.org by aseem cheema <as...@gmail.com> on 2009/11/11 18:50:35 UTC

XmlUpdateRequestHandler with HTMLStripCharFilterFactory

I am trying to post a document with the following content using SolrJ:
<center>content</center>
I need the xml/html tags to be ignored. Even though this works fine in
analysis.jsp, this does not work with SolrJ, as the client escapes the
< and > with &lt; and &gt; and HTMLStripCharFilterFactory does not
strip those escaped tags. How can I achieve this? Any ideas will be
highly appreciated.

There is escapedTags in the HTMLStripCharFilterFactory constructor. Is
there a way to get that to work?
Thanks
-- 
Aseem
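
A note on the escaping described above: turning < and > into &lt; and &gt;
is what any client has to do to carry a literal tag inside an XML update
message, and an XML parser on the receiving side reverses it. The sketch
below uses Python's standard library rather than SolrJ or Solr itself, so it
only illustrates general XML behaviour; the field name "body" and the message
layout are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_value = "<center>content</center>"

# What a client has to do to embed the value in an XML update message.
escaped = escape(raw_value)
update_xml = '<add><doc><field name="body">' + escaped + "</field></doc></add>"

# What an XML parser on the receiving side hands back for that field.
field = ET.fromstring(update_xml).find("./doc/field")
received_value = field.text

print(escaped)         # the value as it travels inside the XML message
print(received_value)  # the value after XML parsing
```

If Solr's XML handling behaves like this parser, the field value reaches
analysis with its tags intact, which suggests the escaping alone would not
hide the tags from the char filter.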

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by Chris Hostetter <ho...@fucit.org>.

: stored without tags. But looks like the html tags are removed and terms are
: indexed purely for indexing, and the actual text is stored in raw format.

Correct. Analysis is all about "indexing"; it has nothing to do with
"stored" content.

You can write UpdateProcessors that modify the content before it is either
indexed or stored, but there aren't a lot of Processors provided out of
the box at the moment.

-Hoss
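
One way to get tag-free stored content without writing an UpdateProcessor is
to strip the markup in the client before the document is sent. Below is a
minimal sketch in Python using only the standard library. The names
TagStripper and strip_tags and the field names "id" and "body" are
hypothetical, and this is not Solr's own HTML stripping code, so results may
differ for malformed markup.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(html_text):
    """Return html_text without markup, with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered at the end
    # Joining with a space keeps words in adjacent block elements apart,
    # at the cost of splitting a word that has inline markup inside it.
    return " ".join(" ".join(parser.chunks).split())


# Hypothetical document as it would be handed to whatever client sends it.
doc = {"id": "1", "body": strip_tags("<p>This article is for blah </p>")}
print(doc["body"])
```

Because the value is cleaned before it leaves the client, the same text is
both indexed and stored, so no char filter is needed for that field.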

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Jan 13, 2010 at 7:48 AM, Lance Norskog <go...@gmail.com> wrote:

> You can do this stripping in the DataImportHandler. You would have to
> write your own stripping code using regular expressions.

Note that DIH has a HTMLStripTransformer which wraps Solr's HTMLStripReader.

-- 
Regards,
Shalin Shekhar Mangar.

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by Lance Norskog <go...@gmail.com>.

You can do this stripping in the DataImportHandler. You would have to
write your own stripping code using regular expressions. Also, the
ExtractingRequestHandler strips out the html markup when you use it to
index an html file:

http://wiki.apache.org/solr/ExtractingRequestHandler

-- 
Lance Norskog
goksron@gmail.com

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by darniz <rn...@edmunds.com>.

no problem

Erick Erickson wrote:
> 
> Ah, I read your post too fast and ignored the title. Sorry 'bout that.
> 
> Erick

-- 
View this message in context: http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27118304.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by Erick Erickson <er...@gmail.com>.

Ah, I read your post too fast and ignored the title. Sorry 'bout that.

Erick

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by darniz <rn...@edmunds.com>.

Well, that's the whole discussion we are having.
I had the impression that the html tags are filtered and then the field is
stored without tags. But it looks like the html tags are removed only when
the terms are indexed, and the actual text is stored in raw format.

Let's say, for example, I enter a field like
<field name="body"><p>honda car road review</field>
When I do analysis on the body field, the html filter removes the <p> tag and
indexes the words honda, car, road, review. But when I fetch the body field to
display in my document, it returns <p>honda car road review

I hope I make sense.
Thanks
darniz
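
The behaviour described in this message can be modelled with a toy index.
This is only a sketch of the idea, not how Solr is implemented: analyze,
add_document, and the two dictionaries are hypothetical stand-ins, and the
regular expression is a crude substitute for the real HTML char filter.

```python
import re


def analyze(value):
    """Stand-in for the index-time chain: drop tags, lowercase, split."""
    without_tags = re.sub(r"<[^>]*>", " ", value)
    return without_tags.lower().split()


def add_document(index, stored, doc_id, value):
    """Store the raw value verbatim and index only the analyzed terms."""
    stored[doc_id] = value
    for term in analyze(value):
        index.setdefault(term, set()).add(doc_id)


index = {}   # term -> set of document ids (what searches run against)
stored = {}  # document id -> original field value (what fetches return)

add_document(index, stored, "1", "<p>honda car road review")

print(sorted(index))  # the searchable terms, with no trace of the <p> tag
print(stored["1"])    # the stored value, still carrying the <p> tag
```

Searching uses the first structure and fetching uses the second, which is
why the analysis page and the returned document disagree.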



-- 
View this message in context: http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116601.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by Erick Erickson <er...@gmail.com>.

This page: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
shows you many of the Solr analyzers and filters. Would one of
the various *HTMLStrip* classes work?

HTH
Erick

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by darniz <rn...@edmunds.com>.

Thanks, we were having the same issue.
We are trying to store article content, and we are storing a field like
<p>This article is for blah </p>.
When I look at the analysis.jsp page, it does strip out the <p> tags before
indexing, but when we fetch the document it returns the field with the <p>
tags.
From Solr's point of view this is correct, but our issue is that these html
tags are screwing up the display of our page. Is there an easy way to
strip out the html tags, or do we have to take care of it manually?

Thanks
Rashid


-- 
View this message in context: http://old.nabble.com/XmlUpdateRequestHandler-with-HTMLStripCharFilterFactory-tp26305561p27116434.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: XmlUpdateRequestHandler with HTMLStripCharFilterFactory

Posted by aseem cheema <as...@gmail.com>.

Alright. It turns out that escapedTags is not for what I thought it was for.
The problem that I am having with HTMLStripCharFilterFactory is that
it strips the html while indexing the field, but not while storing the
field. That is why what I see in analysis.jsp, which is index
analysis, does not match what gets stored... because, well, HTML is
stripped only for indexing. Makes so much sense.

Thanks to Ryan McKinley for clarifying this.
Aseem

-- 
Aseem