Posted to user@nutch.apache.org by Nayanish Hinge <na...@gmail.com> on 2010/09/02 07:38:11 UTC

Why does Nutch have content parsing in two places

Hi,
I was wondering why Nutch has an option of parsing
1. right within the fetcher and
2. also as a separate map-reduce job

In Crawl.java, there is a separate step for parsing. But depending on the
"fetcher.parse" property in nutch-default.xml, the Fetcher will also parse the
content.

Thanks
-- 
Nayanish
Hyderabad

RE: Why does Nutch have content parsing in two places

Posted by Markus Jelsma <ma...@buyways.nl>.

In small crawls, you could parse the document right away. For large crawls, however, there may not be enough resources to fetch and parse at the same time.
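
As a rough illustration of choosing between the two modes, here is a minimal sketch of the override. It assumes the setting goes in conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml); the value shown is only an example, not a recommendation.

```xml
<!-- conf/nutch-site.xml: local override of the default in nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.parse</name>
    <!-- false: the fetcher only stores the raw content, and parsing
         runs later as its own map-reduce job.
         true: the fetcher parses each document as it is fetched. -->
    <value>false</value>
  </property>
</configuration>
```

With the property set to false, the separate parse job would typically be run on the fetched segment afterwards, for example with bin/nutch parse followed by the segment path. With true, no separate parse step should be needed.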
