Posted to user@nutch.apache.org by "Naess, Ronny" <Ro...@avinor.no> on 2007/07/02 09:41:03 UTC

Re: The ranking is wrong

Hi.

The solution for me was to create a taglib that we add wherever there is code (HTML) in our intranet that we do not want to include in the index. The taglib adds a comment before and after the part that should not be included.

<!-- START - DO NOT INCLUDE IN INDEX -->
Stuff not to be included
<!-- STOP - DO NOT INCLUDE IN INDEX -->

Then I added a "Filter" that does a non-greedy regex pattern search for the start and stop comments above. It is non-greedy so that nothing between a STOP and the next START is removed, in case you have more than one excluded block on a page.

<!-- START - DO NOT INCLUDE IN INDEX -->.*?<!-- STOP - DO NOT INCLUDE IN INDEX -->

The filter is added in HtmlParser in the parse(...) method 
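
A minimal sketch of such a filter in Java, for illustration only. The class name ExcludeMarkerFilter and the strip(...) method are hypothetical, and the use of Pattern.DOTALL is an assumption (the marked regions normally span several lines, which '.' does not match by default). It works on a String; how the page content is decoded before filtering inside parse(...) is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes everything between the
// START and STOP marker comments, including the markers themselves.
final class ExcludeMarkerFilter {

    // '.*?' is reluctant (non-greedy), so each START pairs with the nearest
    // STOP and text between two excluded blocks is preserved.
    // DOTALL (an assumption) lets '.' match line breaks as well.
    private static final Pattern EXCLUDED = Pattern.compile(
        "<!-- START - DO NOT INCLUDE IN INDEX -->.*?<!-- STOP - DO NOT INCLUDE IN INDEX -->",
        Pattern.DOTALL);

    static String strip(String html) {
        return EXCLUDED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html =
            "<p>Intro</p>\n"
            + "<!-- START - DO NOT INCLUDE IN INDEX -->\n"
            + "<ul><li>Menu item</li></ul>\n"
            + "<!-- STOP - DO NOT INCLUDE IN INDEX -->\n"
            + "<p>Body</p>";
        // The menu list is removed; both paragraphs remain.
        System.out.println(strip(html));
    }
}
```

With a greedy '.*' instead, a page with two excluded blocks would lose everything between the first START and the last STOP, including the wanted content in between.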

We found that the comment markers worked better than searching for a specific div tag with a unique class name, since finding the matching end tag for that div would have given us an unwanted headache.

It works like a charm :-)

Best regards,

Ronny N.

-----Original Message-----
From: Naess, Ronny [mailto:Ronny.Naess@avinor.no]
Sent: 29 June 2007 15:54
To: nutch-user@lucene.apache.org
Subject: Re: The ranking is wrong

Thanks both of you.

I think this might be something I must do.

I have played around with the parse-html plugin, but I haven't found the correct place to hook into yet. I can print out the text (even the menu text), but it is already stripped of HTML markup, so I must be in the wrong place. I printed the text out from getTextHelper(...) in DOMContentUtils.

Any pointers to where I should start? Actually, I would have liked to specify a regex matching the class or id of a div to skip; that would have been really nice. Of course that is not something everyone would want, but for intranet searching, where you normally know what to index, it might be helpful.

This is something I would like to go away. 

<div class="menuContainer">
...lots of unwanted content...
</div>

-Ronny


-----Original Message-----
From: Doğacan Güney [mailto:dogacan@gmail.com]
Sent: 27 June 2007 13:06
To: nutch-user@lucene.apache.org
Subject: Re: The ranking is wrong

On 6/27/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> Naess, Ronny wrote:
> > Thanks, Ann.
> >
> > You gave me some good pointers.
> >
> > I see that the navigation menu is giving me all the trouble with 
> > ranking. Does somebody know a way to make the parser skip some content?
> > I would like the parser to skip the global header and navigation menu so 
> > the content contains only the unique stuff, not everything. Guess this is 
> > not a simple thing.
>
>
> No, it's not. Do a Google search for "template detection".
>
> A crude approach, which still might be sufficient in your case, is to 
> do the following:
>
> * remove all font/color/style formatting elements, and coalesce their 
> text children with their parents. E.g.
>
>         this is <span style="abc">a text</span>
>         <b>with bold</b> fragment
>
> becomes:
>         this is a text with bold fragment
>
> * do the same with all non-divisional (structural) tags, i.e. any 
> formatting tags except for div-s, tables and iframe-s.
>
> * sort the remaining text blocks by size
>
> * drop a certain number (or percentage) of the smallest of the text blocks.
>
> * put the blocks back in order, and extract only their text content.
> This is the "main body" text.
>
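
The last three steps of the approach quoted above (sort by size, drop the smallest blocks, restore the original order) could be sketched in Java roughly as follows. This is only an illustration: the class MainBodyExtractor, the method mainBody(...) and the dropFraction parameter are hypothetical names, and the input is assumed to be the list of text blocks already produced by the first two steps, in document order.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch, not Nutch code.
final class MainBodyExtractor {

    // blocks: text of each divisional block, in document order.
    // dropFraction: share of the smallest blocks to discard (assumed tunable).
    static String mainBody(List<String> blocks, double dropFraction) {
        int dropCount = (int) Math.floor(blocks.size() * dropFraction);

        // Block indices ordered by text length, smallest first.
        List<Integer> bySize = IntStream.range(0, blocks.size()).boxed()
            .sorted(Comparator.comparingInt((Integer i) -> blocks.get(i).length()))
            .collect(Collectors.toList());

        // Skip the smallest blocks, then sort the surviving indices to put
        // the blocks back in document order before joining their text.
        return bySize.stream()
            .skip(dropCount)
            .sorted()
            .map(i -> blocks.get(i))
            .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList(
            "Home",
            "A long paragraph with the actual article text.",
            "Login",
            "Another substantial paragraph of body content.");
        System.out.println(mainBody(blocks, 0.5));
    }
}
```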

Alternatively, for any given divisional tag, you might measure the amount of anchor text versus non-anchor text. If a table/div/... contains mostly anchor text (and all anchor texts consist of a couple of words), you can assume that it is a menu and not relevant content.
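
A rough sketch of that anchor-text heuristic in Java, assuming the parser hands you org.w3c.dom nodes. The class MenuDetector, its methods and both thresholds are hypothetical values chosen for illustration and would need tuning on real pages. The parse(...) helper only exists to make the example runnable on a well-formed fragment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical sketch, not Nutch code.
final class MenuDetector {
    static final double MIN_ANCHOR_RATIO = 0.8;   // assumed threshold
    static final int MAX_WORDS_PER_ANCHOR = 3;    // assumed "a couple of words"

    // acc[0] = total text chars, acc[1] = chars inside <a>, acc[2] = 1 if any anchor is too long
    private static void collect(Node node, boolean inAnchor, int[] acc) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            int len = node.getNodeValue().trim().length();
            acc[0] += len;
            if (inAnchor) acc[1] += len;
            return;
        }
        boolean isAnchor = "a".equalsIgnoreCase(node.getNodeName());
        if (isAnchor && node.getTextContent().trim().split("\\s+").length > MAX_WORDS_PER_ANCHOR) {
            acc[2] = 1;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), inAnchor || isAnchor, acc);
        }
    }

    // True if the block is mostly short anchor text, i.e. probably a menu.
    static boolean isLikelyMenu(Node block) {
        int[] acc = new int[3];
        collect(block, false, acc);
        if (acc[0] == 0 || acc[2] == 1) return false;
        return (double) acc[1] / acc[0] >= MIN_ANCHOR_RATIO;
    }

    // Demo helper: parses a well-formed XML/XHTML fragment into a DOM element.
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xml, e);
        }
    }

    public static void main(String[] args) {
        Node menu = parse("<div><a href='/a'>Home</a> <a href='/b'>About us</a></div>");
        Node body = parse("<div>Some longer paragraph text with <a href='/x'>one link</a> in it.</div>");
        System.out.println(isLikelyMenu(menu));
        System.out.println(isLikelyMenu(body));
    }
}
```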

>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


--
Doğacan Güney



