Posted to user@nutch.apache.org by Chris Alexander <ch...@kusiri.com> on 2011/07/13 16:38:04 UTC

Concurrently running multiple nutch crawls

Hi again,

Continuing my investigations into Nutch, I attempted running two Nutch
whole-web crawls simultaneously, against two different target URL sets and
with different crawl directories. All seemed to be going very well until the
exception below appeared in one of the instances. It looks like something
under the hood is using lock files that overlap between the two. Is it
possible to run two Nutch instances side by side, or would it be a better
architecture to have a single instance of the script running and have it
pick up updates to the set of URLs it has to crawl (e.g. the user specifying
new top-level URLs to crawl)?

Cheers

Chris


Exception in thread "main" java.io.FileNotFoundException: File
file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

Re: Concurrently running multiple nutch crawls

Posted by Julien Nioche <li...@gmail.com>.
Having a single instance is a good solution, as it makes the fetching more
efficient (more domains => more threads working in parallel) and simplifies
the management of the crawls. You can modify the scoring so that URLs added
as seeds are fetched first -> see the OPIC scoring plugin for the default
implementation.
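
For example, the Injector can read per-URL metadata from the seed file
(URL and metadata separated by tabs), so freshly added seeds can be injected
with a boosted initial score for the Generator to sort by. A rough sketch,
with illustrative URLs, score values and paths -- check them against your
Nutch version before relying on them:

# urls/priority/seed.txt -- one URL per line, tab-separated metadata
http://www.example.com/    nutch.score=10.0
http://www.example.org/    nutch.score=10.0

# inject the boosted seeds into the existing crawldb
bin/nutch inject crawl/crawldb urls/priority/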

Julien


Continuing my investigations into nutch, I attempted running two nutch
> whole-web crawls against two different target URL sets simultaneously and
> with different crawl directories. All seemed to be going very well until
> the
> exception below appeared in one of the threads. It looks like something
> under the hood is using some lock files that seem to be overlapping. Is it
> possible to run two nutch instances side by side, or would it be a better
> architecture to prefer to have a single instance of the script running and
> have it pick up updates to the URLs it has to crawl (e.g. the user
> specifying new top-level URLs to crawl).
>
> Cheers
>
> Chris
>
>
> Exception in thread "main" java.io.FileNotFoundException: File
> file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
>        at
>
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>        at
>
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
>        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
>        at
>
> org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
>        at
> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
>        at
> org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
>        at
> org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Concurrently running multiple nutch crawls

Posted by Chris Alexander <ch...@kusiri.com>.
Ok, thanks for the answer. It looks like I will have to queue them up.

Cheers!

Chris
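
Queueing them up can be as simple as running the one-shot crawl command back
to back from a shell script, so the two jobs never share /tmp at the same
time. A minimal sketch, with illustrative seed directories and parameters:

#!/bin/sh
# run the crawls in sequence rather than concurrently
bin/nutch crawl urls/siteA -dir crawlA -depth 3 -topN 1000
bin/nutch crawl urls/siteB -dir crawlB -depth 3 -topN 1000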

On 13 July 2011 15:50, Markus Jelsma <ma...@openindex.io> wrote:

> You're running locally? You cannot run multiple Nutch instances locally if
> they share the same /tmp/ directory: change the /tmp/ location per crawl,
> run on Hadoop, or run them in sequence if you can live with that.

Re: Concurrently running multiple nutch crawls

Posted by Markus Jelsma <ma...@openindex.io>.
You're running locally? You cannot run multiple Nutch instances locally if
they share the same /tmp/ directory: change the /tmp/ location per crawl,
run on Hadoop, or run them in sequence if you can live with that.
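
The path in the stack trace (file:/tmp/hadoop-root/...) comes from Hadoop's
hadoop.tmp.dir property, which defaults to /tmp/hadoop-${user.name}, so two
local runs started by the same user collide there. One way to separate them,
assuming each crawl runs from its own Nutch directory, is to override the
property in that instance's conf/nutch-site.xml (inside the existing
<configuration> element; the path below is only an example):

<property>
  <name>hadoop.tmp.dir</name>
  <!-- per-instance scratch space so concurrent local jobs don't collide -->
  <value>/tmp/hadoop-crawl-a</value>
</property>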


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350