Posted to user@nutch.apache.org by "igor.k" <ig...@thesearchagency.com> on 2010/01/06 03:00:52 UTC
Nutch with Hadoop : Inconsistent # of Crawls
Hey guys,
I have successfully installed Nutch and Hadoop and set up a two-machine
DFS (one master and one slave). When I run crawls with Nutch in standalone
mode, everything works fine. The problem is that when crawls are run using
Nutch + Hadoop across the two machines, the number of pages crawled is
inconsistent: sometimes all of the pages are crawled and stored in the
index; other times only a portion of them are crawled and indexed. As a
test case I have been using -depth 10 and -topN 200 for my crawls.
Any ideas on what might be going wrong? Please let me know if you need any
additional information.
Much Thanks,
-Igor
--
View this message in context: http://old.nabble.com/Nutch-with-Hadoop-%3A-Inconsistent---of-Crawls-tp27026759p27026759.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Nutch with Hadoop : Inconsistent # of Crawls
Posted by "igor.k" <ig...@thesearchagency.com>.
*Update*
Solution Found. The two machines' date and time were out of sync.
Synchronizing them fixed the problem.
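
For anyone who runs into the same symptom, here is a rough sketch of one way
to check whether the clocks on the cluster machines have drifted apart. It is
not part of Nutch or Hadoop. It assumes passwordless ssh to each node and a
`date` command on the remote side that supports `+%s`. The function names and
the 5-second tolerance are made up for illustration, so pick a threshold that
suits your setup.

```python
import subprocess
import sys
import time


def remote_epoch_seconds(host):
    """Ask `host` for its current Unix time by running `date +%s` over ssh.

    Assumes passwordless ssh to `host` is already configured.
    """
    result = subprocess.run(
        ["ssh", host, "date", "+%s"],
        capture_output=True,
        text=True,
        check=True,
        timeout=30,
    )
    return float(result.stdout.strip())


def estimate_offset(local_before, remote_time, local_after):
    """Estimate how far the remote clock is ahead of the local clock.

    The remote reading is compared against the midpoint of the local
    timestamps taken just before and just after the ssh call, which
    roughly cancels out the round-trip delay.
    """
    midpoint = (local_before + local_after) / 2.0
    return remote_time - midpoint


def check_host(host, tolerance_seconds=5.0):
    """Return (offset, in_sync) for `host`.

    The default tolerance is an arbitrary illustrative value; `date +%s`
    only has one-second resolution, so very small tolerances are not
    meaningful with this approach.
    """
    before = time.time()
    remote = remote_epoch_seconds(host)
    after = time.time()
    offset = estimate_offset(before, remote, after)
    return offset, abs(offset) <= tolerance_seconds


if __name__ == "__main__":
    # Hosts to check are passed on the command line.
    for host in sys.argv[1:]:
        offset, in_sync = check_host(host)
        status = "OK" if in_sync else "OUT OF SYNC"
        print(f"{host}: offset {offset:+.1f}s ({status})")
```

Saved under a name of your choosing and run on the master with the slave's
hostname as an argument, this prints the estimated offset for that node. If it
reports a large offset, synchronizing the clocks (for example with NTP) is the
fix described above.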
--
View this message in context: http://old.nabble.com/Nutch-with-Hadoop-%3A-Inconsistent---of-Crawls-tp27026759p27064896.html