Posted to user@nutch.apache.org by Emmanuel <jo...@gmail.com> on 2007/08/31 16:58:10 UTC

Re: The ranking is wrong

Hi guys,

Indexing page menus has become very painful. It makes the scoring less effective.
I've seen that you discussed this a few weeks ago.

I've followed Andrzej's advice and done some research on the web.
My understanding is that we need to parse the crawled pages twice:
first, to isolate all the blocks of content;
second, to filter out the blocks that appear frequently.

Am I correct?
Has anybody worked on this subject?
If not, could you please give me some pointers on where I should
implement these steps?

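The two passes described above could be sketched roughly as follows. This is only an illustration in Python, not Nutch code: the function names, the blank-line block splitting and the `max_share` threshold are all assumptions made for the example.

```python
from collections import Counter


def split_blocks(page_text):
    """Pass 1: isolate blocks. Assumption: blocks are separated by blank lines."""
    return [b.strip() for b in page_text.split("\n\n") if b.strip()]


def filter_frequent_blocks(pages, max_share=0.5):
    """Pass 2: drop blocks that occur on more than max_share of the pages."""
    blocks_per_page = [split_blocks(p) for p in pages]
    # Count each block once per page, so repeats inside one page do not skew it.
    doc_freq = Counter(b for blocks in blocks_per_page for b in set(blocks))
    limit = max_share * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit]
            for blocks in blocks_per_page]


# Hypothetical sample pages sharing one menu block.
pages = [
    "Home | Tours | Contact\n\nParis city guide with museum tips.",
    "Home | Tours | Contact\n\nRome walking routes and food.",
    "Home | Tours | Contact\n\nMadrid nightlife overview.",
]
cleaned = filter_frequent_blocks(pages)
```

Exact string matching is the simplest possible comparison; menus that vary slightly from page to page would need a fuzzier match.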

> Naess, Ronny wrote:
>> Thanks, Ann.
>>
>> You gave me some good pointers.
>>
>> I see that the navigation menu is giving me all the trouble with
>> ranking. Does somebody know a way to make the parser skip some content?
>> I would like the parser to skip the global header and navigation menu so the
>> content contains only the unique stuff, not everything. Guess this is not a
>> simple thing.
>
>
> No, it's not. Do a Google search for "template detection".
>
> A crude approach, which still might be sufficient in your case, is to do
> the following:
>
> * remove all font/color/style formatting elements, and coalesce their
> text children with their parents. E.g.
>
>     this is <span style="abc">a text</span>
>     <b>with bold</b> fragment
>
> becomes:
>     this is a text with bold fragment
>
> * do the same with all tags that are not divisional (structural), i.e. any
> formatting tags except for div-s, tables and iframe-s.
>
> * sort the remaining text blocks by size
>
> * drop a certain number (or percentage) of the smallest of the text
> blocks.
>
> * put the blocks back in order, and extract only their text content.
> This is the "main body" text.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
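The crude approach quoted above might look something like this in Python, using only the standard library. It is a sketch under assumptions, not Andrzej's implementation: the class and function names, the `drop_fraction` parameter and the skipping of script/style contents are all choices made for the example.

```python
from html.parser import HTMLParser

# Structural tags named in the post; every other tag is treated as formatting.
BLOCK_TAGS = {"div", "table", "iframe"}
# Assumption: script/style contents are not page text, so they are skipped.
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects text blocks, coalescing text across formatting tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip = 0

    def _flush(self):
        # Join buffered text and normalise whitespace into single spaces.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def main_body(html, drop_fraction=0.5):
    """Drop the smallest blocks, then return the rest in document order."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    blocks = parser.blocks
    n_drop = int(len(blocks) * drop_fraction)
    by_size = sorted(range(len(blocks)), key=lambda i: len(blocks[i]))
    smallest = set(by_size[:n_drop])
    return "\n".join(b for i, b in enumerate(blocks) if i not in smallest)


# Hypothetical page: three short navigation blocks and one content block.
sample = ('<div>Home</div><div>About</div>'
          '<div>this is <span style="abc">a text</span> '
          '<b>with bold</b> fragment</div><div>Contact us</div>')
body = main_body(sample)
```

The right `drop_fraction` depends on the site, so it would have to be tuned per crawl.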

Re: The ranking is wrong

Posted by purpureleaf <pu...@gmail.com>.

I have just done this, but my problem may be a little different from yours.
Since all my target pages have the same layout, it is easier for me.
Before indexing, I use the diff utility to find the parts the pages have in
common (at least 100 words), then I remove these parts completely from all
the pages.
I got very good results:
http://www.wegoround.com
You can try "France" as a keyword; you will see that the results are correct.
Almost every page has France on it (the site is about vacations), but the
occurrences in the menu and ads are removed. If I didn't remove them, searches
for France, Spain, Italy... and even Iran would be completely useless in my case.

Actually, my pages have several types of layout, so I have to group them with a
custom URL filter.
I would like to share my work with the community someday.

Pan

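For illustration, the diff-based idea above can be approximated with Python's difflib instead of the diff utility. This is a sketch under assumptions, not Pan's actual code: the function name and sample pages are made up, and the demo uses a small `min_words` value, whereas the post uses a threshold of at least 100 words.

```python
from difflib import SequenceMatcher


def strip_shared_runs(page_a, page_b, min_words=100):
    """Remove word runs of at least min_words that both pages share."""
    a, b = page_a.split(), page_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    drop_a, drop_b = set(), set()
    for m in matcher.get_matching_blocks():
        if m.size >= min_words:
            drop_a.update(range(m.a, m.a + m.size))
            drop_b.update(range(m.b, m.b + m.size))

    def keep(words, drop):
        return " ".join(w for i, w in enumerate(words) if i not in drop)

    return keep(a, drop_a), keep(b, drop_b)


# Hypothetical pages that share a six-word menu.
page_a = "Home France Spain Italy Iran Contact Paris hotel near the Louvre"
page_b = "Home France Spain Italy Iran Contact Rome hostel by the Colosseum"
clean_a, clean_b = strip_shared_runs(page_a, page_b, min_words=4)
```

Short accidental matches (such as the single word "the" here) stay in place because they fall below the threshold.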

-- 
View this message in context: http://www.nabble.com/Re%3A-The-ranking-is-wrong-tf4360656.html#a12436408
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: The ranking is wrong

Posted by purpureleaf <pu...@gmail.com>.

You have to be careful if you want to do this, because if even one page doesn't
match the pattern, you will get into trouble.
And don't remove what you need :)

