Posted to user@nutch.apache.org by tittutomen <su...@gmail.com> on 2009/10/05 10:21:33 UTC

Nutch - DFS environment. Is it stable?

Hi,

I've been trying to set up a distributed Nutch/Hadoop environment to crawl a
list of 3 million URLs.

My experience so far:

1. Nutch works fine on a single machine. There I wrote a script that calls
the nutch crawl command to crawl the first 1000 URLs, then the next 1000.
The indexes produced by these two crawls are merged into a single
merged.index. The script then repeatedly crawls another 1000 URLs and merges
the result with the previous merged index. This is stable and runs
smoothly.

2. I then tried a distributed environment with 4 machines. Two are master
nodes with 2 GB RAM each, one running the Namenode and the other the
JobTracker. The remaining 2 machines have 1 GB RAM each. All 4 machines also
act as slave nodes. I run the same script to take 5000 URLs from the list of
3 million and crawl them. Then the next 5000 are crawled and merged with the
previous result. Here the DFS environment is not stable: after 2 or 3 cycles
it breaks in different ways. Either the crawl fails or the merge fails.
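
The crawl-and-merge cycle described in the two points above can be sketched
roughly as follows. This is not the original script, which was not posted.
The names `batches` and `run_cycles` are hypothetical, and `crawl` and
`merge` are placeholder callbacks standing in for whatever actually invokes
the nutch crawl command and the index merge.

```python
from itertools import islice


def batches(urls, size):
    """Yield successive lists of at most `size` URLs from any iterable."""
    it = iter(urls)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def run_cycles(urls, batch_size, crawl, merge):
    """Crawl the URL list batch by batch, merging each new index.

    `crawl(cycle, batch)` and `merge(cycle, previous, new)` are hypothetical
    callbacks supplied by the caller; each returns a handle (for example a
    path) to the index it produced. Returns the final merged index handle,
    or None if the URL list was empty.
    """
    merged = None
    for cycle, batch in enumerate(batches(urls, batch_size)):
        index = crawl(cycle, batch)
        # The first cycle has nothing to merge with yet.
        merged = index if merged is None else merge(cycle, merged, index)
    return merged
```

Passing the cycle number to both callbacks lets each cycle write to its own
directories, which may help when hunting for failures that only show up
after a few cycles.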

I have tried several different configurations, such as running both masters
on a single node or running only 3 slaves, but it still does not get beyond
2 or 3 cycles.

Could anybody suggest where I'm going wrong, or whether there is a better
alternative? I have read docs claiming that Nutch runs on 100+ machines.
Does that mean it runs only once? For how long can the DFS environment be
kept stable? Do I have to restart DFS before every crawl/merge cycle?

There are a lot of errors, such as Datanode missing,
FileAlreadyCreatedException, JobFailed, RPCExceptions etc.

I would appreciate any help with this, and I'm happy to share what I have
learned so far. Please write!

Thanks in advance!!!


-- 
View this message in context: http://www.nabble.com/Nutch---DFS-environment.-Is-it-stable--tp25746827p25746827.html
Sent from the Nutch - User mailing list archive at Nabble.com.


Re: Nutch - DFS environment. Is it stable?

Posted by tittutomen <su...@gmail.com>.



One improvement I found came from restarting the DFS environment. It takes
time, but I think it makes the system stable. I don't know whether it is
the correct way to go, though...

Thanks
-Subas
-- 
View this message in context: http://www.nabble.com/Nutch---DFS-environment.-Is-it-stable--tp25746827p25763446.html