Posted to dev@nutch.apache.org by jamie <ja...@fastmail.fm> on 2006/03/10 10:39:02 UTC

quality of search text

hi everyone

I don't know if we're doing something wrong, but the quality of the text
in the Nutch search results is appalling.


To give you an example:

the text output for http://www.gamingalmanac.com/ is the following:

 ... Gaming Industry Research Publications, Worldwide Gaming Almanacs,
 Bear Stearns Gaming Almanac, Gaming Revenue and Statistics PRODUCT
 OVERVIEW COMPLETE ANALYST PACKAGE NORTH AMERICAN ALMANAC INDIAN GAMING
 INDUSTRY REPORT NEVADA GAMING ALMANAC GLOBAL GAMING ALMANAC GLOBAL
 GAMBLING REPORT MARKET RESEARCH HANDBOOK MICROSOFT MAP POINT Save up to
 45% with a Gaming Analyst Package! The Gaming Almanac Family of
 Products Find every fact, figure, and trend you need on the gaming
 industry. With current property profiles and statistics, historical and
 forward-looking financial data, local, regional, and worldwide gaming
 market summaries, and key player profiles, the Gaming Almanac products
 from Casino City Press offer information essential to every gaming
 executive, supplier, and analyst. Titles Include: Casino City ... 

whereas Google outputs:

Gaming Industry Research Publications, Worldwide Gaming Almanacs ...
The Gaming Almanac products from Casino City Press serve as excellent
reference tools for anyone interested in the worldwide and domestic
gaming markets.
gamingalmanac.com/

Is there any easy way to fix this? The Nutch search results appear to
include text from the website menus, etc., which affects the usability of
the search results.

Where in Nutch would I go about fixing this?

Thanks

Jamie


RE: quality of search text

Posted by Richard Braman <rb...@bramantax.com>.
touché 

-----Original Message-----
From: Jérôme Charron [mailto:jerome.charron@gmail.com] 
Sent: Friday, March 10, 2006 4:34 PM
To: nutch-dev@lucene.apache.org; rbraman@bramantax.com
Subject: Re: quality of search text


> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed
in summaries.

But if you can contribute an HtmlParseFilter with the ability to remove
menus and navigation, it would be a real improvement. A first step, which
I developed in a previous project many years ago, is to remove pages that
contain textual content only in links: it avoids indexing frames or
iframes that contain only navigation text...

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/


Re: quality of search text

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
> Hmm... I'm not convinced. How would you generate the best snippet from a 
> relevant, but ignored chunk?

Good point... I guess you simply wouldn't generate anything at all (show 
the title?). I guess site-structure text should not be relevant enough to 
actually cause a hit at the top of the search results by itself; there should 
be some other continuous block of text, more relevant to the query, that 
caused the hit. It's a bit like assigning priority to longer chunks of 
text over shorter ones; I don't know if my intuition is clear...
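
Something like this, as a very rough sketch (the Chunk class, the word
count, and the log damping are all placeholders I'm making up, not
anything that exists in Nutch):

import java.util.List;

// Placeholder type: a contiguous block of page text plus its query relevance.
class Chunk {
  String text;
  double queryScore;
  Chunk(String text, double queryScore) { this.text = text; this.queryScore = queryScore; }
}

class SnippetSource {
  // Prefer longer chunks: weight relevance by log(word count) so that
  // short structural fragments lose to continuous body text.
  static Chunk pickSnippetChunk(List<Chunk> chunks) {
    Chunk best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Chunk c : chunks) {
      int words = c.text.trim().isEmpty() ? 0 : c.text.trim().split("\\s+").length;
      double score = c.queryScore * Math.log(1 + words);
      if (score > bestScore) { bestScore = score; best = c; }
    }
    return best; // null -> nothing usable, fall back to showing the title
  }
}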

D.

Re: quality of search text

Posted by Howie Wang <ho...@hotmail.com>.
>>I'd agree that (2) is quite important for the end user; Richard's 
>>continuous text heuristic may actually work for that. I'd extend the 
>>meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT, 
>>etc., so only certain tags would actually break the content into chunks. 
>>Snippets then would be generated from these chunks alone, ignoring the 
>>rest of the content. If this heuristic is applied only at 
>>snippet-generation time then Andrzej's concern about missing content is 
>>not relevant anymore.
>
>Hmm... I'm not convinced. How would you generate the best snippet from a 
>relevant, but ignored chunk?

Maybe eventually this could be the start of using tags to boost
certain sections of the page, as Google probably does. Normal
text blocks would have a boost of 1.0, while text within <B> or <H*>
might be boosted to 1.5. Text within suspected navigation sections
could be de-boosted to 0.25 or so. Maybe that would
be a more appropriate way of handling the relevance of navigation
text: it should have some relevance, just not as much as content.

Maybe the summary text could somehow skip the de-boosted
sections to improve readability, unless the main content has no
better match. You basically construct the snippet, giving preference
according to the boost value of each section of text.
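
A rough sketch of what I mean (the tag-to-boost table and the navigation
flag are hypothetical, just to make the numbers concrete; this isn't how
Nutch currently scores anything):

import java.util.HashMap;
import java.util.Map;

class SectionBoost {
  // Hypothetical weights: normal text 1.0, emphasized text 1.5, nav 0.25.
  private static final Map<String, Double> BOOSTS = new HashMap<>();
  static {
    BOOSTS.put("b", 1.5);
    BOOSTS.put("h1", 1.5);
    BOOSTS.put("h2", 1.5);
    BOOSTS.put("h3", 1.5);
  }

  // Boost for text enclosed by the given tag; suspected navigation text
  // is de-boosted regardless of its markup.
  static double boostFor(String tag, boolean suspectedNavigation) {
    if (suspectedNavigation) return 0.25;
    return BOOSTS.getOrDefault(tag.toLowerCase(), 1.0);
  }
}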

This all sounds like a lot of work though :)

Howie



Re: quality of search text

Posted by Andrzej Bialecki <ab...@getopt.org>.
Dawid Weiss wrote:
>
> It seems to me that there are two separate problems:
>
> 1) content parsing to avoid site structure -> influences the index and 
> rankings
> 2) content parsing for KWIC snippet generation -> influences the user 
> perception of the engine.
>
> I'd agree that (2) is quite important for the end user; Richard's 
> continuous text heuristic may actually work for that. I'd extend the 
> meaning of "continuous block" to ignore inline tags such as SPAN, I, 
> B, TT, etc., so only certain tags would actually break the content into 
> chunks. Snippets then would be generated from these chunks alone, 
> ignoring the rest of the content. If this heuristic is applied only at 
> snippet-generation time then Andrzej's concern about missing content 
> is not relevant anymore. 

Hmm... I'm not convinced. How would you generate the best snippet from a 
relevant, but ignored chunk?

But I agree that for some (perhaps large) percentage of sites this 
heuristic could work well, and it's simple enough to be easily implemented.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: quality of search text

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.
It seems to me that there are two separate problems:

1) content parsing to avoid site structure -> influences the index and 
rankings
2) content parsing for KWIC snippet generation -> influences the user 
perception of the engine.

I'd agree that (2) is quite important for the end user; Richard's 
continuous text heuristic may actually work for that. I'd extend the 
meaning of "continuous block" to ignore inline tags such as SPAN, I, B, 
TT, etc., so only certain tags would actually break the content into 
chunks. Snippets then would be generated from these chunks alone, 
ignoring the rest of the content. If this heuristic is applied only at 
snippet-generation time then Andrzej's concern about missing content is 
not relevant anymore. Of course I realize it is tricky in the current 
architecture because different filters would be used for KWICs and 
indexing...
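
To make the chunking rule concrete, a minimal sketch over a standard
org.w3c.dom tree (the inline-tag set is the one above; this is not
existing Nutch code, and a real version would have to live in the
snippet-generation path):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.w3c.dom.Node;

class Chunker {
  // Inline tags do not break a chunk; every other element does.
  private static final Set<String> INLINE =
      new HashSet<>(Arrays.asList("span", "i", "b", "tt", "em", "strong"));

  static List<String> chunks(Node root) {
    List<String> out = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    walk(root, current, out);
    flush(current, out);
    return out;
  }

  private static void walk(Node node, StringBuilder current, List<String> out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      current.append(node.getNodeValue()).append(' ');
      return;
    }
    boolean breaks = node.getNodeType() == Node.ELEMENT_NODE
        && !INLINE.contains(node.getNodeName().toLowerCase());
    if (breaks) flush(current, out); // block-level tag closes the open chunk
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      walk(c, current, out);
    }
    if (breaks) flush(current, out);
  }

  private static void flush(StringBuilder current, List<String> out) {
    String text = current.toString().trim();
    if (!text.isEmpty()) out.add(text);
    current.setLength(0);
  }
}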

D.



Jérôme Charron wrote:
>> I think algorithm #1 is what Google uses.
>> Google ignores content that does not change from page to page, as well
>> as content that isn't part of a block of text.
> 
> Are you sure?
> Take a look at these search results:
> http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
> ... and you will notice that menus are indexed by Google and displayed in
> summaries.
> 
> But if you can contribute an HtmlParseFilter with the ability to remove menus
> and navigation, it would be a real improvement.
> A first step, which I developed in a previous project many years ago, is
> to remove pages that contain textual content only in links: it avoids
> indexing frames or iframes that contain only navigation text...
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 

Re: quality of search text

Posted by Jérôme Charron <je...@gmail.com>.
> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by Google and displayed in
summaries.

But if you can contribute an HtmlParseFilter with the ability to remove menus
and navigation, it would be a real improvement.
A first step, which I developed in a previous project many years ago, is
to remove pages that contain textual content only in links: it avoids
indexing frames or iframes that contain only navigation text...
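
That first step could be sketched like this (a standalone org.w3c.dom
walk; the 95% threshold is an arbitrary choice, and a real version would
be wired into an HtmlParseFilter):

import org.w3c.dom.Node;

class LinkOnlyPageFilter {
  // True if virtually all of the page's text sits inside <a> elements,
  // i.e. the page is probably a navigation frame. 0.95 is arbitrary.
  static boolean isNavigationOnly(Node root) {
    long[] counts = new long[2]; // [0] = total chars, [1] = chars inside links
    count(root, false, counts);
    return counts[0] > 0 && counts[1] >= 0.95 * counts[0];
  }

  private static void count(Node node, boolean insideLink, long[] counts) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      int len = node.getNodeValue().trim().length();
      counts[0] += len;
      if (insideLink) counts[1] += len;
      return;
    }
    boolean link = insideLink
        || (node.getNodeType() == Node.ELEMENT_NODE
            && "a".equalsIgnoreCase(node.getNodeName()));
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      count(c, link, counts);
    }
  }
}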

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

RE: quality of search text

Posted by Richard Braman <rb...@bramantax.com>.
> nowadays many pages freely mix in markup in the main content area...
Yes, but if that content were nested in a larger block of content, then
it would be included.

I will probably end up implementing some of these algorithms, but I would
like some good feedback before I go out on a limb.


-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Friday, March 10, 2006 2:51 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text


Richard Braman wrote:
> Here is a potential algorithm:
>
> Look first to the meta description; if none exists,
> look for continuous blocks of text, and ignore content that doesn't
> contain a continuous block of text.  If a given HTML tag contains only
> a few words of text, it is not content but rather part of the nav
> structure of the page.
>
>   

You may potentially miss a lot of content this way; nowadays many pages 
freely mix markup into the main content area...

> Here is yet another algorithm.
>
> When fetching pages from a particular site, analyze the structure of
> the page and try to determine what content stays similar from page to
> page within that site.  That would usually be menus, headers,
> footers, etc.
>   

This requires collecting pages in advance to train the structure 
recognizer, and preparing "profiles" for groups of pages with common
layout.

> Granted the menus may change slightly from page to page, which is why
> the algorithm would be pattern-based instead of literal. When you
> determine what is navigation and what is content, you would only parse
> and index the content.
>
> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.
>
> Comments please
>   

The best way to evaluate this would be to ..erhm.. evaluate these 
algorithms on a set of reference pages. Would you like to implement one 
or both algorithms and test them?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: quality of search text

Posted by Andrzej Bialecki <ab...@getopt.org>.
Richard Braman wrote:
> Here is a potential algorithm:
>
> Look first to the meta description; if none exists,
> look for continuous blocks of text, and ignore content that doesn't
> contain a continuous block of text.  If a given HTML tag contains only
> a few words of text, it is not content but rather part of the nav
> structure of the page.
>
>   

You may potentially miss a lot of content this way; nowadays many pages 
freely mix markup into the main content area...

> Here is yet another algorithm.
>
> When fetching pages from a particular site, analyze the structure of the
> page and try to determine what content stays similar from page to page
> within that site.  That would usually be menus, headers, footers, etc.
>   

This requires collecting pages in advance to train the structure 
recognizer, and preparing "profiles" for groups of pages with common layout.

> Granted the menus may change slightly from page to page, which is why
> the algorithm would be pattern-based instead of literal.
> When you determine what is navigation and what is content, you would
> only parse and index the content.
>
> I think algorithm #1 is what Google uses.
> Google ignores content that does not change from page to page, as well
> as content that isn't part of a block of text.
>
> Comments please
>   

The best way to evaluate this would be to ..erhm.. evaluate these 
algorithms on a set of reference pages. Would you like to implement one 
or both algorithms and test them?

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: quality of search text

Posted by Richard Braman <rb...@bramantax.com>.
> it doesn't say in pages "this is menu, this is body text",
Agreed, it doesn't say that.

> this is definitely NOT trivial
This isn't trivial, but it is rather important.

> it's hard to come up with a method that works for any layout.

Here is a potential algorithm:

Look first to the meta description; if none exists,
look for continuous blocks of text, and ignore content that doesn't
contain a continuous block of text.  If a given HTML tag contains only
a few words of text, it is not content but rather part of the nav
structure of the page.
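
As a minimal sketch of that test (the five-word threshold is a guess
that would need tuning):

import org.w3c.dom.Node;

class ContinuousTextHeuristic {
  static final int MIN_WORDS = 5; // guess; tune against real pages

  // An element whose entire text adds up to only a few words is treated
  // as nav structure rather than content.
  static boolean looksLikeNavigation(Node element) {
    String text = element.getTextContent().trim();
    int words = text.isEmpty() ? 0 : text.split("\\s+").length;
    return words < MIN_WORDS;
  }
}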

Here is yet another algorithm.

When fetching pages from a particular site, analyze the structure of the
page and try to determine what content stays similar from page to page
within that site.  That would usually be menus, headers, footers, etc.
Granted the menus may change slightly from page to page, which is why
the algorithm would be pattern-based instead of literal.
When you determine what is navigation and what is content, you would
only parse and index the content.
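
A sketch of how the page-to-page comparison could work (hash normalized
text blocks and drop any block that recurs on most pages of the site;
the normalization and the 50% threshold are guesses on my part):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RepeatedBlockDetector {
  private final Map<Integer, Integer> blockFrequency = new HashMap<>();
  private int pagesSeen;

  // Training pass: feed the text blocks of each fetched page of the site.
  void addPage(List<String> blocks) {
    pagesSeen++;
    for (String b : blocks) {
      blockFrequency.merge(normalize(b).hashCode(), 1, Integer::sum);
    }
  }

  // A block seen on more than half the pages is likely a menu/header/footer.
  boolean isBoilerplate(String block) {
    int seen = blockFrequency.getOrDefault(normalize(block).hashCode(), 0);
    return pagesSeen > 1 && seen > pagesSeen / 2;
  }

  // Crude "pattern-based instead of literal" matching: collapse case and
  // whitespace so slightly varying menus still hash to the same value.
  private static String normalize(String block) {
    return block.toLowerCase().replaceAll("\\s+", " ").trim();
  }
}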

I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.

Comments please

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Friday, March 10, 2006 1:57 PM
To: nutch-dev@lucene.apache.org
Subject: Re: quality of search text


Richard Braman wrote:
> I too have noticed menu text appearing in the search results.
>   

The proper place to fix it would be in parse-html, perhaps in 
DOMContentUtils.

However, be warned that this is definitely NOT trivial - i.e., it doesn't
say in pages "this is menu, this is body text"; you have to figure it
out, and it's hard to come up with a method that works for any layout.
You may hardcode something that works well for your target group of 
hosts, with pre-determined page layouts.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: quality of search text

Posted by Andrzej Bialecki <ab...@getopt.org>.
Richard Braman wrote:
> I too have noticed menu text appearing in the search results.
>   

The proper place to fix it would be in parse-html, perhaps in 
DOMContentUtils.

However, be warned that this is definitely NOT trivial - i.e., it doesn't
say in pages "this is menu, this is body text"; you have to figure it
out, and it's hard to come up with a method that works for any layout.
You may hardcode something that works well for your target group of 
hosts, with pre-determined page layouts.
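
For illustration only, a hardcoded rule of that kind might look like
this (a standalone DOM walk, not the actual DOMContentUtils code; the
id/class markers are assumptions about one particular layout):

import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NavSkippingTextExtractor {
  // Accumulate text, dropping subtrees whose id/class matches navigation
  // markers. The marker strings are hardcoded for one assumed layout.
  static void getText(StringBuilder sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element e = (Element) node;
      String marker =
          (e.getAttribute("id") + " " + e.getAttribute("class")).toLowerCase();
      if (marker.contains("menu") || marker.contains("nav")
          || marker.contains("footer")) {
        return; // skip the whole subtree
      }
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      sb.append(node.getNodeValue()).append(' ');
    }
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      getText(sb, c);
    }
  }
}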

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: quality of search text

Posted by Richard Braman <rb...@bramantax.com>.
I too have noticed menu text appearing in the search results.

-----Original Message-----
From: jamie [mailto:jamieb@fastmail.fm] 
Sent: Friday, March 10, 2006 4:39 AM
To: nutch-dev@lucene.apache.org
Subject: quality of search text


hi everyone

I don't know if we're doing something wrong, but the quality of the text
in the Nutch search results is appalling.


To give you an example:

the text output for http://www.gamingalmanac.com/ is the following:

 ... Gaming Industry Research Publications, Worldwide Gaming Almanacs,
Bear Stearns Gaming Almanac, Gaming Revenue and Statistics PRODUCT
OVERVIEW COMPLETE ANALYST PACKAGE NORTH AMERICAN ALMANAC INDIAN GAMING
INDUSTRY REPORT NEVADA GAMING ALMANAC GLOBAL GAMING ALMANAC GLOBAL
GAMBLING REPORT MARKET RESEARCH HANDBOOK MICROSOFT MAP POINT Save up to
45% with a Gaming Analyst Package! The Gaming Almanac Family of
Products Find every fact, figure, and trend you need on the gaming
industry. With current property profiles and statistics, historical and
forward-looking financial data, local, regional, and worldwide gaming
market summaries, and key player profiles, the Gaming Almanac products
from Casino City Press offer information essential to every gaming
executive, supplier, and analyst. Titles Include: Casino City ... 

whereas Google outputs:

Gaming Industry Research Publications, Worldwide Gaming Almanacs ...
The Gaming Almanac products from Casino City Press serve as excellent
reference tools for anyone interested in the worldwide and domestic
gaming markets.
gamingalmanac.com/

Is there any easy way to fix this? The Nutch search results appear to
include text from the website menus, etc., which affects the usability of
the search results.

Where in Nutch would I go about fixing this?

Thanks

Jamie