Posted to user@nutch.apache.org by Alexander Aristov <al...@gmail.com> on 2009/04/21 14:21:58 UTC

running two crawlers at the same time

Hi all

I want to run two crawlers using a single server at the same time across
different seed lists.

The question is:

Is it safe to use a single set of binaries? I have developed scripts to
specify different input/output locations, but I wonder whether Nutch
creates temporary folders during its work that I cannot control, in
which case the two crawlers could overwrite each other's working data.

Thanks
Alexander Aristov
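On the temporary-folder worry: Nutch in local mode runs on Hadoop, which keeps its scratch data under the hadoop.tmp.dir property (default /tmp/hadoop-${user.name}). One way to keep two instances fully apart is to give each crawler its own conf directory overriding that property, then point bin/nutch at it via NUTCH_CONF_DIR. A minimal sketch, with all paths assumed:

```shell
# Sketch (all paths are examples): give each crawler instance its own
# conf dir whose nutch-site.xml overrides hadoop.tmp.dir, so Hadoop's
# scratch data for the two crawls cannot collide.
CONF=conf-crawler1
mkdir -p "$CONF"
cat > "$CONF/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/crawler1/tmp</value>
  </property>
</configuration>
EOF
# bin/nutch adds $NUTCH_CONF_DIR to the classpath ahead of the default conf/
export NUTCH_CONF_DIR="$PWD/$CONF"
```

Repeat with conf-crawler2 and a second tmp path for the other instance.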

Re: running two crawlers at the same time

Posted by Dennis Kubes <ku...@apache.org>.

Alexander Aristov wrote:
> Hi all
> 
> I want to run two crawlers using a single server at the same time
> across different seed lists.
> 
> The question is:
> 
> Is it safe to use a single set of binaries? I have developed scripts
> to specify different input/output locations, but I wonder whether
> Nutch creates temporary folders during its work that I cannot
> control, in which case the two crawlers could overwrite each other's
> working data.

There aren't any conflicts in having multiple crawling jobs going and 
outputting to different directories at the same time.  You do need to be 
careful about ordering if you are generating the crawl lists from a 
single crawldb and then updating back into that crawldb.
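That ordering caution could be sketched like this (the lock-based approach and all paths are my assumptions, and the real bin/nutch invocations are stubbed out with echo): if two crawl loops generate from, and update back into, one shared crawldb, serialize just those two steps so neither reads the db while the other rewrites it. Fetching and parsing into separate segment directories needs no lock.

```shell
# Sketch: wrap the crawldb-touching steps of each crawl loop in flock
# so concurrent crawlers never run generate/updatedb at the same time.
# The nutch calls are stubbed with echo; swap in bin/nutch for real use.
CRAWLDB=crawl/crawldb
LOCK=/tmp/crawldb.lock   # flock creates this file if it is missing

generate_segment() {
  # stub for: bin/nutch generate "$CRAWLDB" "segments-$1" -topN 1000
  flock "$LOCK" echo "generate for crawler $1"
}

update_crawldb() {
  # stub for: bin/nutch updatedb "$CRAWLDB" "segments-$1"/*
  flock "$LOCK" echo "updatedb for crawler $1"
}

generate_segment A
update_crawldb A
```

With fully separate crawldbs per crawler (as in the scripts described above), no locking is needed at all.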

Dennis


> 
> Thanks
> Alexander Aristov
> 

Re: running two crawlers at the same time

Posted by Alex Basa <al...@yahoo.com>.
It's not a problem.  I've done it with up to 30 at a time on a single blade server, each using different lists and outputting to different directories.  I didn't see any cross-pollination happening.


--- On Tue, 4/21/09, Alexander Aristov <al...@gmail.com> wrote:

> From: Alexander Aristov <al...@gmail.com>
> Subject: running two crawlers at the same time
> To: nutch-user@lucene.apache.org
> Date: Tuesday, April 21, 2009, 7:21 AM
> Hi all
> 
> I want to run two crawlers using a single server at the same time
> across different seed lists.
> 
> The question is:
> 
> Is it safe to use a single set of binaries? I have developed scripts
> to specify different input/output locations, but I wonder whether
> Nutch creates temporary folders during its work that I cannot
> control, in which case the two crawlers could overwrite each other's
> working data.
> 
> Thanks
> Alexander Aristov