Posted to user@nutch.apache.org by winz <cw...@yahoo.com> on 2009/10/10 14:20:31 UTC

Re: How to ignore search results that don't have related keywords in main body?


Venkateshprasanna wrote:
> 
> Hi,
> 
> You can very well think of doing that if you know that you would crawl and
> index only a selected set of web pages, which follow the same design.
> Otherwise, it would turn out to be a never ending process - i.e., finding
> out the sections, frames, divs, spans, css classes and the likes - from
> each of the web pages. Scalability would obviously be an issue.
> 

Hi,
Could you please tell me how to ignore template items like the header, footer, and
menu/navigation while crawling and indexing pages which follow the same
design?
I'm using a content management system called Infoglue to develop my website.
A standard template is applied for all the pages on the website.

The search results from Nutch show content from the menu/navigation bar
multiple times.
I need to get rid of the menu/navigation content in the search results.

Please advise regarding this.

Thanks,
Vinay



RE: How to ignore search results that don't have related keywords in main body?

Posted by BELLINI ADAM <mb...@msn.com>.
As I explained in my post, the sections I don't want to index are the headers, top menus, right menus, and left menus:
this is what I mean by garbage.
<div id = 'header'>     bla bla </div>
<div id = 'top_menu'>   bla bla </div>
<div id = 'left_menu'>  bla bla </div>
<div id = 'right_menu'> bla bla </div>
Each page contains the same header and menu sections, and I don't want to index them because they are identical on every page...
So in each page I want to parse those sections to get their outlinks, but I don't want to index them... so I have to create a filtered content (without those sections). But how do I construct this content, since I don't know all the blocks and tags the pages will contain, and I don't even know whether they are well formed... (it's just HTML)...
The only thing I'm sure about is that there is a template which applies to all pages; this template consists of the div sections described above... (menus, left menus, etc.).
So I guess the easiest solution is to find a Java class which takes an HTML file and certain sections, e.g.
<div id = 'header'>....
as parameters, deletes those sections from the HTML file, and produces the new, cleaned HTML...
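For illustration, here is a sketch of that idea in Python rather than Java (the div ids below are the ones from the template above; treat this as an assumption about your markup, not a drop-in tool). Because the pages may not be well formed, it streams tags with the stdlib HTMLParser instead of building a DOM:

```python
# Sketch: drop everything inside <div> elements whose id is on a blocklist,
# and re-emit the rest of the page unchanged.
from html.parser import HTMLParser

BLOCKED_IDS = {"header", "top_menu", "left_menu", "right_menu"}

class TemplateStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0   # > 0 while inside a blocked div

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1   # track nested divs inside the block
            return
        if tag == "div" and dict(attrs).get("id") in BLOCKED_IDS:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

def strip_template(html):
    """Return the HTML with the blocked template sections removed."""
    p = TemplateStripper()
    p.feed(html)
    p.close()
    return "".join(p.out)
```

Since it is stream-based, it tolerates tag soup reasonably well, but it only counts nesting for `div` tags, so a stray unclosed `div` inside a blocked section would swallow more content than intended.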







> Date: Sat, 10 Oct 2009 18:21:47 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: How to ignore search results that don't have related keywords in main body?
> 
> BELLINI ADAM wrote:
> > hi guys... it's just what I'm talking about in my post 'indexing
> > just certain content'... you can read it, maybe it could help you... I
> > was asking how to get rid of the garbage sections in a document and
> > to parse only the important data... so I guess you will create your
> > own parser and indexer... but the problem is how we could delete those
> > garbage sections from an HTML page... try to read my post... maybe we
> > can merge our two threads... I don't know if we can merge threads on
> > this mailing list... so we keep tracking only one thread...
> 
> What is garbage? Can you define it in terms of regex pattern or XPath 
> expression that points to specific elements in DOM tree? If you crawl a 
> single (or few) sites with well defined templates then you can hardcode 
> some rules for removing unwanted parts of the page.
> 
> If you can't do this, then there are some heuristic methods to solve 
> this. There are two groups of methods:
> 
> * page at a time (local): this group of methods considers only the 
> current page that you analyze. The quality of filtering is usually limited.
> 
> * groups of pages (e.g. per site): these methods consider many pages at
> a time and try to find a recurring theme among them. Since you first need
> to accumulate some pages, this can't be done on the fly, i.e. it requires
> a separate post-processing step.
> 
> The easiest to implement in Nutch is the first approach (page at a 
> time). There are many possible implementations - e.g. based on text 
> patterns, on visual position of elements, on DOM tree patterns, on 
> "block of content" characteristics, etc.
> 
> Here is, for example, a simple method:
> 
> * collect text from the page in blocks, where each block fits within 
> structural tags (div and table tags). Collect also the number of <a> 
> links in each block.
> 
> * remove a percentage of the smallest blocks where the link count is high
> - these are likely navigational elements.
> 
> * reconstruct the whole page from the remaining blocks.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: How to ignore search results that don't have related keywords in main body?

Posted by Andrzej Bialecki <ab...@getopt.org>.
BELLINI ADAM wrote:
> hi guys... it's just what I'm talking about in my post 'indexing
> just certain content'... you can read it, maybe it could help you... I
> was asking how to get rid of the garbage sections in a document and
> to parse only the important data... so I guess you will create your
> own parser and indexer... but the problem is how we could delete those
> garbage sections from an HTML page... try to read my post... maybe we
> can merge our two threads... I don't know if we can merge threads on
> this mailing list... so we keep tracking only one thread...

What is garbage? Can you define it in terms of regex pattern or XPath 
expression that points to specific elements in DOM tree? If you crawl a 
single (or few) sites with well defined templates then you can hardcode 
some rules for removing unwanted parts of the page.

If you can't do this, then there are some heuristic methods to solve 
this. There are two groups of methods:

* page at a time (local): this group of methods considers only the 
current page that you analyze. The quality of filtering is usually limited.

* groups of pages (e.g. per site): these methods consider many pages at
a time and try to find a recurring theme among them. Since you first need
to accumulate some pages, this can't be done on the fly, i.e. it requires
a separate post-processing step.

The easiest to implement in Nutch is the first approach (page at a 
time). There are many possible implementations - e.g. based on text 
patterns, on visual position of elements, on DOM tree patterns, on 
"block of content" characteristics, etc.

Here is, for example, a simple method:

* collect text from the page in blocks, where each block fits within 
structural tags (div and table tags). Collect also the number of <a> 
links in each block.

* remove a percentage of the smallest blocks where the link count is high
- these are likely navigational elements.

* reconstruct the whole page from the remaining blocks.
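For illustration, the three steps above might be sketched like this in Python, outside Nutch (the thresholds are made-up examples, not values from any implementation):

```python
# Sketch of the heuristic: segment the page text at <div>/<table> boundaries,
# count <a> links per block, then drop blocks that are both short and
# link-heavy (likely navigation), and rebuild the text from the rest.
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    STRUCTURAL = {"div", "table"}

    def __init__(self):
        super().__init__()
        self.blocks = []        # (text, link_count) per structural block
        self._text = []
        self._links = 0

    def flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._links))
        self._text, self._links = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in self.STRUCTURAL:
            self.flush()        # a structural tag starts a new block
        elif tag == "a":
            self._links += 1

    def handle_endtag(self, tag):
        if tag in self.STRUCTURAL:
            self.flush()

    def handle_data(self, data):
        self._text.append(data)

def main_text(html, max_nav_len=80, min_nav_links=3):
    """Reconstruct the page text from the blocks that survive filtering."""
    p = BlockCollector()
    p.feed(html)
    p.flush()
    kept = [t for t, n in p.blocks
            if not (len(t) <= max_nav_len and n >= min_nav_links)]
    return "\n".join(kept)
```

A real implementation would tune `max_nav_len` and `min_nav_links` per site, or use link-text density (link characters over total characters) instead of a raw link count.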

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: How to ignore search results that don't have related keywords in main body?

Posted by BELLINI ADAM <mb...@msn.com>.
hi guys...
it's just what I'm talking about in my post 'indexing just certain content'...
you can read it, maybe it could help you...
I was asking how to get rid of the garbage sections in a document and to parse only the important data... so I guess you will create your own parser and indexer... but the problem is how we could delete those garbage sections from an HTML page... try to read my post... maybe we can merge our two threads... I don't know if we can merge threads on this mailing list... so we keep tracking only one thread...

Best regards




> Date: Sat, 10 Oct 2009 17:31:57 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: How to ignore search results that don't have related keywords in main body?
> 
> winz wrote:
> > 
> > Venkateshprasanna wrote:
> >> Hi,
> >>
> >> You can very well think of doing that if you know that you would crawl and
> >> index only a selected set of web pages, which follow the same design.
> >> Otherwise, it would turn out to be a never ending process - i.e., finding
> >> out the sections, frames, divs, spans, css classes and the likes - from
> >> each of the web pages. Scalability would obviously be an issue.
> >>
> > 
> > Hi,
> > Could you please tell me how to ignore template items like the header, footer, and
> > menu/navigation while crawling and indexing pages which follow the same
> > design?
> > I'm using a content management system called Infoglue to develop my website.
> > A standard template is applied for all the pages on the website.
> > 
> > The search results from Nutch show content from the menu/navigation bar
> > multiple times.
> > I need to get rid of the menu/navigation content in the search results.
> 
> If all you index is this particular site, then you know the positions of 
> navigation items, right? Then you can remove these elements in your 
> HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
> elements.
> 
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: How to ignore search results that don't have related keywords in main body?

Posted by Andrzej Bialecki <ab...@getopt.org>.
winz wrote:
> 
> Venkateshprasanna wrote:
>> Hi,
>>
>> You can very well think of doing that if you know that you would crawl and
>> index only a selected set of web pages, which follow the same design.
>> Otherwise, it would turn out to be a never ending process - i.e., finding
>> out the sections, frames, divs, spans, css classes and the likes - from
>> each of the web pages. Scalability would obviously be an issue.
>>
> 
> Hi,
> Could you please tell me how to ignore template items like the header, footer, and
> menu/navigation while crawling and indexing pages which follow the same
> design?
> I'm using a content management system called Infoglue to develop my website.
> A standard template is applied for all the pages on the website.
> 
> The search results from Nutch show content from the menu/navigation bar
> multiple times.
> I need to get rid of the menu/navigation content in the search results.

If all you index is this particular site, then you know the positions of 
navigation items, right? Then you can remove these elements in your 
HtmlParseFilter, or modify DOMContentUtils (in parse-html) to skip these 
elements.
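For illustration, here is the same skip-while-walking idea sketched in Python outside Nutch. The actual hook would be an HtmlParseFilter or a change to DOMContentUtils; this standalone version assumes well-formed XHTML, since minidom is an XML parser, and the skipped ids are hypothetical:

```python
# Sketch: extract the indexable text from a parsed DOM, pruning whole
# subtrees whose id marks them as template/navigation elements.
from xml.dom.minidom import parseString

SKIP_IDS = {"header", "top_menu", "left_menu", "right_menu"}

def extract_text(node, out):
    if node.nodeType == node.ELEMENT_NODE:
        if node.getAttribute("id") in SKIP_IDS:
            return                      # prune the whole subtree
        for child in node.childNodes:
            extract_text(child, out)
    elif node.nodeType == node.TEXT_NODE:
        out.append(node.data)

def index_text(xhtml):
    """Text that would go to the index, with template sections skipped."""
    doc = parseString(xhtml)
    out = []
    extract_text(doc.documentElement, out)
    return " ".join(" ".join(out).split())
```

The point of doing this at parse time, as suggested above, is that the outlinks in the skipped sections can still be collected in a separate pass before pruning, so crawling is unaffected while the index stays clean.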



-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com