Posted to solr-user@lucene.apache.org by Indika Tantrigoda <in...@gmail.com> on 2010/03/27 06:13:26 UTC

SolrJ and HTMLStripCharFilterFactory

Hello to all,

I've been working with Solr for a few weeks and have gotten indexing and
searching to work.
However, I am having trouble indexing HTML content with
HTMLStripCharFilterFactory.

My schema.xml looks like this:

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
         <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
      ------
  --------/>

and I am indexing the HTML content using SolrJ as the client (with Spring
being the framework).

However, when I search for all documents, the HTML markup is still present in
my text field in the results.

But when I run an analysis in the Solr admin panel with HTML content, it
shows the tokens extracted properly, with the HTML tags removed.

I found a similar issue at
http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html
but I am still unable to get it working. I am using Solr 1.4.

Any help regarding this is much appreciated.

Thanks in advance.

Regards,
Indika

Re: SolrJ and HTMLStripCharFilterFactory

Posted by Indika Tantrigoda <in...@gmail.com>.
Hi Erick,

Thank you very much for the explanation. The example you gave made things
clear. I ran some queries with my existing index and the results were as
expected.

Regards,
Indika

On 27 March 2010 17:09, Erick Erickson <er...@gmail.com> wrote:

> I think you're getting confused by the difference between indexing and
> storing. These are orthogonal operations, even though they are configured in
> the same field definition.
>
> When you index something, the input is put through your analyzer chain, and
> the resulting tokens are written to the index after all appropriate
> transformations. That is what you're seeing when you run your HTML through
> the analysis page of the admin panel and report that the HTML is stripped.
> This is what's searched.
>
> But when you fetch a field that has been stored, the original raw text is
> returned. This is never searched, just kept around for retrieval.
>
> The idea here is to be able to have your index contain some displayable
> text. Think about the title of a book, for instance "The Grapes of Wrath".
> You want to search it after it's been lower-cased, stop words removed, etc.
> But if you wanted to present it to a user, you sure wouldn't want to display
> "grapes wrath", which might be the tokens after lowercasing and removing
> stopwords.
>
> HTH
> Erick

Re: SolrJ and HTMLStripCharFilterFactory

Posted by Erick Erickson <er...@gmail.com>.
I think you're getting confused by the difference between indexing and
storing. These are orthogonal operations, even though they are configured in
the same field definition.

When you index something, the input is put through your analyzer chain, and
the resulting tokens are written to the index after all appropriate
transformations. That is what you're seeing when you run your HTML through the
analysis page of the admin panel and report that the HTML is stripped. This is
what's searched.

But when you fetch a field that has been stored, the original raw text is
returned. This is never searched, just kept around for retrieval.
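
To make that concrete, here is a minimal Java sketch of a field that is both
indexed and stored. It is a toy in-memory model, not Solr or SolrJ code: the
class ToyField and its methods are names made up for this illustration, and
the regex is only a crude stand-in for real HTML stripping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of one field that is both indexed and stored. Not Solr code. */
final class ToyField {
    // token -> ids of documents containing it (the searchable side)
    private final Map<String, Set<Integer>> index = new HashMap<>();
    // id -> untouched input (the retrievable side)
    private final Map<Integer, String> stored = new HashMap<>();

    /** Crude stand-in for HTML stripping followed by whitespace tokenizing. */
    static List<String> analyze(String raw) {
        String text = raw.replaceAll("<[^>]*>", " ");
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Keeps the raw value for retrieval and indexes only the analyzed tokens. */
    void add(int id, String raw) {
        stored.put(id, raw);
        for (String token : analyze(raw)) {
            index.computeIfAbsent(token, k -> new HashSet<>()).add(id);
        }
    }

    /** Searching consults the token index only. */
    boolean matches(int id, String term) {
        return index.getOrDefault(term, Collections.emptySet()).contains(id);
    }

    /** Fetching returns the stored value exactly as it was submitted. */
    String fetch(int id) {
        return stored.get(id);
    }

    public static void main(String[] args) {
        ToyField field = new ToyField();
        field.add(1, "<p>Hello <b>Solr</b> users</p>");
        System.out.println(field.matches(1, "Solr"));        // text is searchable
        System.out.println(field.matches(1, "<b>Solr</b>")); // markup is not in the index
        System.out.println(field.fetch(1));                  // raw HTML comes back
    }
}
```

In this sketch the analyzed tokens and the stored value never influence each
other, which mirrors why a query for all documents still shows the HTML.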

The idea here is to be able to have your index contain some displayable
text. Think about the title of a book, for instance "The Grapes of Wrath".
You want to search it after it's been lower-cased, stop words removed, etc.
But if you wanted to present it to a user, you sure wouldn't want to display
"grapes wrath", which might be the tokens after lowercasing and removing
stopwords.
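
The title example can be sketched the same way in Java. This is only an
illustration under assumptions: the class name TitleAnalysis and the
stop-word list are hypothetical, and a real Solr analyzer chain is configured
in schema.xml rather than written like this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis of a book title: lower-case, then drop stop words. */
final class TitleAnalysis {
    // Hypothetical stop-word list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "an", "and"));

    /** Returns the tokens that would be searched, leaving the input untouched. */
    static List<String> analyze(String raw) {
        List<String> tokens = new ArrayList<>();
        for (String piece : raw.trim().split("\\s+")) {
            String token = piece.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String stored = "The Grapes of Wrath";   // what a stored field hands back
        List<String> indexed = analyze(stored);  // what a search is matched against
        System.out.println("stored:  " + stored);
        System.out.println("indexed: " + indexed); // prints "indexed: [grapes, wrath]"
    }
}
```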

HTH
Erick
