Posted to user@nutch.apache.org by "igor.k" <ig...@thesearchagency.com> on 2010/01/06 03:00:52 UTC

Nutch with Hadoop : Inconsistent # of Crawls

Hey guys,

I have successfully installed Nutch and Hadoop and set up a two-machine DFS
(one master and one slave). When I run crawls with Nutch in standalone mode,
everything works fine. The problem is that when crawls run with Nutch +
Hadoop across the two machines, the number of pages crawled is inconsistent.
Sometimes all of the pages are crawled and stored in the index; other times,
only a portion of the pages are crawled and indexed. As a test base I have
been using -depth 10 and -topN 200 for my crawls.


Any ideas on what might be going wrong? Please let me know if you need any
additional information.

Much Thanks,
-Igor
-- 
View this message in context: http://old.nabble.com/Nutch-with-Hadoop-%3A-Inconsistent---of-Crawls-tp27026759p27026759.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch with Hadoop : Inconsistent # of Crawls

Posted by "igor.k" <ig...@thesearchagency.com>.

*Update*

Solution Found. The two machines' date and time were out of sync.
Synchronizing them fixed the problem.
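
For anyone who wants to check for the same condition, here is a minimal
sketch in Python of how clock skew between nodes could be flagged. It is
not part of Nutch or Hadoop. The host names, the sample readings, and the
one-second tolerance are all hypothetical, and it assumes you have already
collected one epoch timestamp per machine at roughly the same moment (for
example by running "date +%s" on each node), so small offsets may just be
collection delay.

```python
from typing import Dict, List


def clock_offsets(epoch_by_host: Dict[str, float], reference: str) -> Dict[str, float]:
    """Return each host's clock offset in seconds relative to the reference host."""
    base = epoch_by_host[reference]
    return {host: ts - base for host, ts in epoch_by_host.items()}


def hosts_out_of_sync(
    epoch_by_host: Dict[str, float],
    reference: str,
    tolerance_s: float = 1.0,  # hypothetical threshold, tune for your cluster
) -> List[str]:
    """Return the hosts whose clock differs from the reference by more than tolerance_s."""
    offsets = clock_offsets(epoch_by_host, reference)
    return sorted(host for host, off in offsets.items() if abs(off) > tolerance_s)


if __name__ == "__main__":
    # Hypothetical readings: the slave's clock is five minutes behind the master's.
    readings = {"master": 1262746852.0, "slave": 1262746552.0}
    print(hosts_out_of_sync(readings, "master"))  # prints ['slave']
```

If a host shows up in the result, synchronizing its clock with the master
(for example through NTP) before re-running the crawl is the fix that worked
here.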


