Posted to user@nutch.apache.org by Elisabeth Adler <el...@gmail.com> on 2011/09/12 10:58:10 UTC

Separately indexing headings of the content

Hi,

Since I'm relatively new to Nutch/Solr, I was wondering whether the
following would make sense:

Headings in web pages (h1, h2, h3) should carry more weight than the
rest of the page content, so if a query matches text in a heading, the
document should rank higher. To boost a field, I would need to index it
separately. That means that when parsing the crawled pages, I would need
to extract the h1, h2 and h3 headings, index them in separate fields,
and remove them from the content field. I presume I would have to modify
the HTML Parser and Index Basic plugins for this, or is there an easier
solution?
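
As a rough illustration of the splitting step described above, here is a
minimal standalone sketch in Python. It is not Nutch code and does not use
any Nutch or Solr API: the names HeadingSplitter, split_headings and
build_qf, the field names h1/h2/h3/content, and the boost values are all
assumptions made for the example. The last helper only formats a
"field^boost" string of the kind accepted by the qf parameter of Solr's
DisMax query parser.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3")   # hypothetical field names, one per heading level
SKIP_TAGS = ("script", "style")     # text inside these is never indexed


class HeadingSplitter(HTMLParser):
    """Collects h1-h3 text into separate fields and all other text into 'content'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.fields = {tag: [] for tag in HEADING_TAGS}
        self.fields["content"] = []
        self._heading = None    # heading tag currently open, if any
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in HEADING_TAGS and self._heading is None:
            self._heading = tag
            self.fields[tag].append("")  # start a new heading entry

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == self._heading:
            self._heading = None

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._heading is not None:
            # Inline markup such as <em> splits heading text into several chunks.
            self.fields[self._heading][-1] += data
        elif data.strip():
            self.fields["content"].append(data.strip())


def split_headings(html):
    """Return a dict with one list per heading level and a single 'content' string."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    doc = {
        tag: [" ".join(h.split()) for h in parser.fields[tag] if h.strip()]
        for tag in HEADING_TAGS
    }
    doc["content"] = " ".join(parser.fields["content"])
    return doc


def build_qf(boosts):
    """Format per-field boosts as a 'field^boost' string (boost values are made up)."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts.items())
```

For example, split_headings("<h1>Nutch</h1><p>Body</p>") puts "Nutch" in
the h1 field and leaves only "Body" in content, so a heading match can be
boosted without also being counted in the body text.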

Any input appreciated,
Elisabeth

Re: Separately indexing headings of the content

Posted by Elisabeth Adler <el...@gmail.com>.

Thanks! Somehow I missed that JIRA issue.

On 12.09.2011 11:20, Markus Jelsma wrote:
> https://issues.apache.org/jira/browse/NUTCH-1005

Re: Separately indexing headings of the content

Posted by Markus Jelsma <ma...@openindex.io>.

https://issues.apache.org/jira/browse/NUTCH-1005
