Posted to solr-user@lucene.apache.org by Mark Roberts <ma...@red-gate.com> on 2010/03/08 14:50:13 UTC

HTML encode extracted docs

I'm uploading .htm files to be extracted. Some of these files are "include" files that contain snippets of HTML rather than fully formed HTML documents.

solr-cell stores the raw HTML for these items rather than extracting the text. Is there any way I can get Solr to encode this content before storing it?

At the moment, when the highlighted snippets are retrieved via search, I have to parse each snippet and HTML-encode the bits of HTML that were indexed, whilst *not* encoding the bits that were added by the highlighter, which is messy and time-consuming.

Thanks! Mark,
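
One way to avoid parsing each snippet, sketched here under assumptions rather than taken from the thread: ask the highlighter to wrap hits in plain-text placeholder markers (Solr's hl.simple.pre and hl.simple.post parameters control the wrapping strings), HTML-encode the whole snippet on the client, and only then swap the markers for real tags. The class name, method names and marker strings below are hypothetical, and the markers must be strings that never occur in the indexed content.

```java
/** Sketch: HTML-encode a highlighted snippet while keeping the highlighter's own markup. */
final class HighlightSnippetEncoder {

    // Hypothetical placeholder markers. The search request would pass these as
    // hl.simple.pre / hl.simple.post instead of the default <em> and </em>.
    static final String PRE_MARKER = "@@HL_START@@";
    static final String POST_MARKER = "@@HL_END@@";

    /** Escapes the characters that are significant in HTML text and attribute values. */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Encodes everything in the snippet, then turns the markers into real tags. */
    static String encodeSnippet(String snippet) {
        // The markers contain no HTML-special characters, so escaping leaves them intact.
        return escapeHtml(snippet)
                .replace(PRE_MARKER, "<em>")
                .replace(POST_MARKER, "</em>");
    }

    public static void main(String[] args) {
        String snippet = "<td>see the " + PRE_MARKER + "include" + POST_MARKER + " file</td>";
        System.out.println(encodeSnippet(snippet));
    }
}
```

Because the whole snippet is escaped in one pass, a fragment that starts or ends in the middle of a stored tag is still rendered safely; only the markers become markup.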

Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Posted by Damian Bursztyn <db...@gmail.com>.

Did anybody find a way to fix this other than removing the
HTMLStripCharFilter from the analyzer at index time?

Thanks

On Sat, Mar 13, 2010 at 7:55 PM, Lance Norskog <go...@gmail.com> wrote:

> HTMLStripCharFilter is only in the analyzer: it creates searchable
> terms from the HTML input. The raw HTML is stored and fetched.
>
> There are some bugs in term positions and highlighting. An
> EntityProcessor wrapping the HTMLStripCharFilter would be really
> useful.



-- 
"A person who never made a mistake never tried anything new."
Albert Einstein

Re: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Posted by Lance Norskog <go...@gmail.com>.

HTMLStripCharFilter is only in the analyzer: it creates searchable
terms from the HTML input. The raw HTML is stored and fetched.

There are some bugs in term positions and highlighting. An
EntityProcessor wrapping the HTMLStripCharFilter would be really
useful.
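
Since the char filter only changes the indexed terms, one possible workaround is to strip the markup on the client before the document is sent to Solr, so that the stored value is already plain text. The following is a minimal sketch of that idea using the HTML parser bundled with the JDK (javax.swing.text.html), not anything provided by Solr. The class and method names are hypothetical, and how the parser treats script or style content has not been checked here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Sketch: reduce an HTML document or fragment to its text before indexing. */
final class HtmlTextExtractor {

    static String toPlainText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives each run of character data; tags are simply ignored.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' '); // keep text runs from running together
                }
                text.append(data);
            }
        };

        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(toPlainText("<div class=\"note\">Use <b>Solr</b> with Tika</div>"));
    }
}
```

The result could be sent as the stored and highlighted field, with the raw HTML kept in a separate field if it is still needed.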

On Tue, Mar 9, 2010 at 5:31 AM, Mark Roberts <ma...@red-gate.com> wrote:
> Sounds like "solr.HTMLStripCharFilter" may work... except, I'm getting a couple of problems:
>
> 1) HTML still seems to be getting into my content field
>
> All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" /> to the index analyzer for my "text" fieldType.
>
>
> 2) Somehow it seems to have broken my highlighting; I get this error:
>
> 'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong exceeds length of provided text sized 3862'
>
>
>
> Any ideas how I can fix this?



-- 
Lance Norskog
goksron@gmail.com

RE: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

Posted by Mark Roberts <ma...@red-gate.com>.

Sounds like "solr.HTMLStripCharFilter" may work... except I'm getting a couple of problems:

1) HTML still seems to be getting into my content field

All I did was add <charFilter class="solr.HTMLStripCharFilterFactory" /> to the index analyzer for my "text" fieldType.


2) Somehow it seems to have broken my highlighting; I get this error:

'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong exceeds length of provided text sized 3862'



Any ideas how I can fix this?





-----Original Message-----
From: Lance Norskog [mailto:goksron@gmail.com] 
Sent: 09 March 2010 04:36
To: solr-user@lucene.apache.org
Subject: Re: HTML encode extracted docs

A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.

If it's just straight HTML, would the HTMLStripCharFilter be good enough?

http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2




-- 
Lance Norskog
goksron@gmail.com

Re: HTML encode extracted docs

Posted by Lance Norskog <go...@gmail.com>.

A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.

If it's just straight HTML, would the HTMLStripCharFilter be good enough?

http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2




-- 
Lance Norskog
goksron@gmail.com