Posted to user@nutch.apache.org by Joel Halbert <jo...@storequery.com> on 2009/07/09 15:31:43 UTC

Weighting different html text nodes - h1,h2 etc..

Hi, would I be correct in thinking that Nutch, when indexing an HTML
document, does not weight the different text nodes (h1, h2, anchor etc.)
differently, and instead just lumps all the text together as one? (This
is the impression I get from looking at
org.apache.nutch.parse.html.HtmlParser.)

Rgs, 
Joel

Re: Weighting different html text nodes - h1,h2 etc..

Posted by Ken Krugler <kk...@transpac.com>.

>Hi, Would I be correct in thinking that Nutch, when indexing an html
>document, does not weight the different text nodes (h1, h2, anchor etc)
>differently - instead it just lumps together all text as one? (this is
>the impression I get from looking at
>org.apache.nutch.parse.html.HtmlParser)

Yes, AFAIK there's no special weighting given to text pulled from the 
body of the HTML.

I believe Nutch does give higher weight to the anchor text found for 
links that point to the page, which is a key factor in generating 
better search results.
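
To make the idea of per-node weighting concrete, here is a minimal,
self-contained sketch of what it could look like. This is not Nutch or
Lucene code: the class name FieldWeightSketch, the field names and the
boost values are all hypothetical, and the scoring is a plain boosted
term count rather than anything a real search engine uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Nutch or Lucene.
class FieldWeightSketch {

    // Counts case-insensitive whole-word occurrences of term in text.
    static int termFrequency(String text, String term) {
        String needle = term.toLowerCase();
        int count = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.equals(needle)) {
                count++;
            }
        }
        return count;
    }

    // Sums (boost * term frequency) over all fields.
    // Fields with no entry in boosts get a neutral boost of 1.0.
    static double score(Map<String, String> fields,
                        Map<String, Double> boosts,
                        String term) {
        double total = 0.0;
        for (Map.Entry<String, String> field : fields.entrySet()) {
            double boost = boosts.getOrDefault(field.getKey(), 1.0);
            total += boost * termFrequency(field.getValue(), term);
        }
        return total;
    }

    public static void main(String[] args) {
        // Text as it might be split out per node type (assumed field names).
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("h1", "Nutch indexing");
        fields.put("body", "Nutch parses the page and Nutch indexes the text");
        fields.put("anchor", "nutch tutorial");

        // Assumed boosts: headings and inbound anchor text count for more.
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("h1", 3.0);
        boosts.put("anchor", 2.0);

        System.out.println(score(fields, boosts, "nutch"));
    }
}
```

With every boost left at 1.0 this collapses to treating all the text
as one lump, which is the behaviour described above for body text.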

-- Ken
-- 
Ken Krugler
+1 530-210-6378