Posted to solr-user@lucene.apache.org by Damian Bursztyn <db...@gmail.com> on 2010/06/01 15:43:25 UTC

Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Did anybody find a way to fix this other than removing the
HTMLStripCharFilter from the analyzer at index time?
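
In case it helps anyone with the original snippet-encoding problem in the
meantime: one workaround that may apply is to set the highlighter's
hl.simple.pre and hl.simple.post parameters to sentinel strings, HTML-escape
the whole returned snippet on the client, and only then swap the sentinels
for real tags. A minimal Python sketch, not tested against a real Solr
instance; the sentinel values and the render_snippet helper are made up for
illustration, and it assumes the sentinels never occur in the indexed content:

```python
import html

# Hypothetical sentinel markers, assumed to be passed to Solr as
# hl.simple.pre / hl.simple.post and never to occur in the indexed content.
HL_PRE = "@@HL_START@@"
HL_POST = "@@HL_END@@"


def render_snippet(snippet: str, open_tag: str = "<em>", close_tag: str = "</em>") -> str:
    """Escape everything in a highlighted snippet except the highlighter's markers."""
    # Escape the whole snippet first. The sentinels contain no HTML-special
    # characters, so html.escape leaves them unchanged.
    escaped = html.escape(snippet, quote=True)
    # Only after escaping, swap the sentinels for the real markup.
    return escaped.replace(HL_PRE, open_tag).replace(HL_POST, close_tag)
```

For example, render_snippet('<b>foo</b> @@HL_START@@bar@@HL_END@@') should
give '&lt;b&gt;foo&lt;/b&gt; <em>bar</em>'. This only sidesteps the encoding
step; it does not address the InvalidTokenOffsetsException.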

Thanks

On Sat, Mar 13, 2010 at 7:55 PM, Lance Norskog <go...@gmail.com> wrote:

> HTMLStripCharFilter is only in the analyzer: it creates searchable
> terms from the HTML input. The raw HTML is stored and fetched.
>
> There are some bugs in term positions and highlighting. An
> EntityProcessor wrapping the HTMLStripCharFilter would be really
> useful.
>
> On Tue, Mar 9, 2010 at 5:31 AM, Mark Roberts <ma...@red-gate.com>
> wrote:
> > Sounds like "solr.HTMLStripCharFilter" may work... except I'm getting a
> > couple of problems:
> >
> > 1) HTML still seems to be getting into my content field
> >
> > All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" />
> > to the index analyzer for my "text" fieldType.
> >
> >
> > 2) It seems to have broken my highlighting; I get this error:
> >
> > 'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
> > wrong exceeds length of provided text sized 3862'
> >
> >
> >
> > Any ideas how I can fix this?
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Lance Norskog [mailto:goksron@gmail.com]
> > Sent: 09 March 2010 04:36
> > To: solr-user@lucene.apache.org
> > Subject: Re: HTML encode extracted docs
> >
> > A Tika integration with the DataImportHandler is in the Solr trunk.
> > With this, you can copy the raw HTML into different fields and process
> > one copy with Tika.
> >
> > If it's just straight HTML, would the HTMLStripCharFilter be good enough?
> >
> > http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2
> >
> > On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts <ma...@red-gate.com>
> > wrote:
> >> I'm uploading .htm files to be extracted - some of these files are
> >> "include" files that have snippets of HTML rather than fully formed HTML
> >> documents.
> >>
> >> solr-cell stores the raw HTML for these items, rather than extracting
> >> the text. Is there any way I can get Solr to encode this content prior to
> >> storing it?
> >>
> >> At the moment, I have the problem that when the highlighted snippets are
> >> retrieved via search, I need to parse the snippet and HTML encode the bits
> >> of HTML that were indexed, whilst *not* encoding the bits that were added
> >> by the highlighter, which is messy and time-consuming.
> >>
> >> Thanks! Mark
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>



-- 
"A person who never made a mistake never tried anything new."
Albert Einstein