Posted to solr-user@lucene.apache.org by okayndc <bo...@gmail.com> on 2012/04/30 16:07:48 UTC

Solr: extracting/indexing HTML via cURL

Hello,

Over the weekend I experimented with extracting HTML content via cURL, and I am
wondering why the extraction/indexing process does not include the HTML tags.
It seems as though the HTML tags are either being ignored or stripped somewhere
in the pipeline. If this is the case, is it possible to include the HTML tags?
I would like to keep the formatted HTML intact.

Any help is greatly appreciated.

Re: Solr: extracting/indexing HTML via cURL

Posted by Lance Norskog <go...@gmail.com>.
You can have two fields: one that holds the stripped text and another that
stores the original data. Use <copyField> directives to make the stripped
field indexed but not stored, and the original field stored but not indexed.
You only have to upload the file once, and you only store the text once.

If you look in the default schema, you'll find a bunch of text fields that
are all copied to "text" or "text_all", which is indexed but not stored.
This catch-all field is the default search field.

http://lucidworks.lucidimagination.com/display/solr/Copying+Fields
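
As a rough sketch in schema.xml terms (the field and type names here are
invented for illustration, and the "html_text" type is assumed to have an
analyzer that strips tags, e.g. via HTMLStripCharFilterFactory):

  <!-- raw HTML, kept only for display -->
  <field name="content_html" type="string" indexed="false" stored="true"/>
  <!-- stripped text, searchable only -->
  <field name="content_text" type="html_text" indexed="true" stored="false"/>
  <!-- copyField passes along the raw incoming value; the destination
       field's analyzer does the stripping at index time -->
  <copyField source="content_html" dest="content_text"/>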


-- 
Lance Norskog
goksron@gmail.com

Re: extracting/indexing HTML via cURL

Posted by okayndc <bo...@gmail.com>.
Awesome, I'll give it a try. Thanks, Jack!


Re: extracting/indexing HTML via cURL

Posted by Jack Krupansky <ja...@basetechnology.com>.
Sorry for the confusion. It is doable. If you feed the raw HTML into a field
that has the HTMLStripCharFilter, the stored value will retain the HTML tags,
while the indexed text will be stripped of the tags during analysis and be
searchable just like a normal text field. A search will then not see "<p>".
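
A minimal field type sketch along those lines (the names are illustrative,
not taken from any shipped schema):

  <fieldType name="html_text" class="solr.TextField">
    <analyzer>
      <!-- strip markup before tokenizing; the stored value is untouched -->
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="content" type="html_text" indexed="true" stored="true"/>

Because the char filter runs before the tokenizer, only the visible text gets
indexed, while the stored value keeps the markup for display.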

-- Jack Krupansky



Re: extracting/indexing HTML via cURL

Posted by okayndc <bo...@gmail.com>.
Thank you, Jack.

So, it's not doable to search and highlight keywords within a field that
contains the raw formatted HTML, while stripping out the HTML tags during
analysis, so that a user would get back nothing if they searched for, e.g.,
"<p>"?


Re: extracting/indexing HTML via cURL

Posted by Jack Krupansky <ja...@basetechnology.com>.
I was thinking that you wanted to index the actual text from the HTML page,
but have the stored field value still contain the raw HTML with tags. If you
just want to store only the raw HTML, a simple string field is sufficient,
but then you can't easily do a text search on it.

Or, you can have two fields: one string field for the raw HTML (stored, but
not indexed), and then do a copyField to a text field that has the
HTMLStripCharFilter to strip the HTML tags and index only the text (indexed,
but not stored).
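
With that two-field setup, a query would search the stripped copy but fetch
the raw markup, along these lines (the core and field names are invented for
illustration):

  curl "http://localhost:8983/solr/collection1/select?q=content_text:lucene&fl=id,content_html"

One caveat: the highlighter works off stored values, so to highlight matches
you would either store the stripped copy as well, or use the single stored
field with HTMLStripCharFilter described elsewhere in this thread.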

-- Jack Krupansky



Re: Solr: extracting/indexing HTML via cURL

Posted by okayndc <bo...@gmail.com>.
Great, thank you for the input.  My understanding of HTMLStripCharFilter is
that it strips HTML tags, which is not what I want ~ is this correct?  I
want to keep the HTML tags intact.


Re: Solr: extracting/indexing HTML via cURL

Posted by Jack Krupansky <ja...@basetechnology.com>.
If by "extracting HTML content via cURL" you mean using SolrCell to parse 
html files, this seems to make sense. The sequence is that regardless of the 
file type, each file extraction "parser" will strip off all formatting and 
produce a raw text stream. Office, PDF, and HTML files are all treated the 
same in that way. Then, the unformatted text stream is sent through the 
field type analyzers to be tokenized into terms that Lucene can index. The 
input string to the field type analyzer is what gets stored for the field, 
but this occurs after the extraction file parser has already removed 
formatting.
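
For reference, the SolrCell route usually looks something like this (the
handler path, id, and file name will vary with your setup); everything Tika
extracts here is already plain text, which is why the tags never reach your
field:

  curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@page.html"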

There is no way for the formatting to be preserved in that case, other than
to go back to the original input document before extraction parsing.

If you really do want to preserve the full HTML formatted text, you would
need to define a field whose field type uses the HTMLStripCharFilter and then
directly add documents that send the raw HTML to that field.
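
Directly adding a document would mean posting a normal update message
yourself, with the raw HTML as the field value. A sketch (the "content" field
is the hypothetical stripped-on-index field from earlier in the thread, and
the markup has to be escaped or wrapped in CDATA):

  curl "http://localhost:8983/solr/update?commit=true" \
       -H "Content-Type: text/xml" \
       --data-binary '<add><doc>
         <field name="id">doc1</field>
         <field name="content"><![CDATA[<p>Hello <b>world</b></p>]]></field>
       </doc></add>'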

There may be some other way to hook into the update processing chain, but 
that may be too much effort compared to the HTML strip filter.

-- Jack Krupansky
