Posted to user@nutch.apache.org by Markus Jelsma <ma...@buyways.nl> on 2010/09/09 17:52:17 UTC

Input path does not exist revisited

Hi,

Well, today it happened again. I had quite a large fetch list and in the end it
all failed. I had added a hadoop.tmp.dir setting to my nutch-site.xml file and
pointed it to a large enough drive. After that, larger and larger fetch lists
all went well, until a fetch list of about 20k pages finally failed for unclear
reasons. Madness!
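
For reference, the setting I added looks roughly like this (the path is just an
example from my setup; any disk with enough free space should do):

<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path: point this at the large drive -->
  <value>/data/hadoop-tmp</value>
</property>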

Can anyone try to explain what's really going on and why so many users suffer 
from this issue?

FYI: I'm still running Nutch locally. A Hadoop cluster isn't set up yet.

Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: [Solved] Input path does not exist revisited

Posted by Mike Baranczak <mb...@gmail.com>.
That's good to know.

Besides hadoop.tmp.dir, is there any other resource that can cause problems when multiple local processes try to share it? 



On Sep 14, 2010, at 2:10 PM, Markus Jelsma wrote:

> Hi,
> 
>  
> 
> It seems the problem is solved now, although it looks like I cannot completely reproduce it under all circumstances. It has everything to do with the hadoop.tmp.dir setting and running multiple jobs on the local machine. Whenever I run a fetch job, it stores data in the tmp dir. If, in the meantime, I also run e.g. a readdb job, the fetch job's data in the tmp dir is lost, hence the error.
> 
>  
> 
> Maybe I could have known this if I had read more on Hadoop's behavior, but I haven't. It is also, in my case, a bit unexpected, as I assume processes won't mess around with other processes' tmp data.
> 
>  
> 
> So, don't run multiple jobs on the local machine using the same hadoop.tmp.dir setting.
> 
>  
> 
> Cheers,
>  


RE: [Solved] Input path does not exist revisited

Posted by Markus Jelsma <ma...@buyways.nl>.
Hi,

 

It seems the problem is solved now, although it looks like I cannot completely reproduce it under all circumstances. It has everything to do with the hadoop.tmp.dir setting and running multiple jobs on the local machine. Whenever I run a fetch job, it stores data in the tmp dir. If, in the meantime, I also run e.g. a readdb job, the fetch job's data in the tmp dir is lost, hence the error.

 

Maybe I could have known this if I had read more on Hadoop's behavior, but I haven't. It is also, in my case, a bit unexpected, as I assume processes won't mess around with other processes' tmp data.

 

So, don't run multiple jobs on the local machine using the same hadoop.tmp.dir setting.
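
One workaround that seems to work here (directory and path names are just
examples): keep a second conf directory whose nutch-site.xml overrides the tmp
dir, and select it with the NUTCH_CONF_DIR environment variable when starting
the second job.

<!-- conf-readdb/nutch-site.xml: override for jobs run alongside a fetch -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- example path: kept separate from the fetch job's tmp dir -->
  <value>/data/tmp-readdb</value>
</property>

Then run e.g. NUTCH_CONF_DIR=conf-readdb bin/nutch readdb crawl/crawldb -stats
in the second shell, so each job spills into its own directory.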

 

Cheers,
 

RE: Input path does not exist revisited

Posted by Markus Jelsma <ma...@buyways.nl>.
The first error in the sequence comes immediately after the fetcher finishes, before the content is parsed.

 

2010-09-10 15:29:59,817 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_local_0001/attempt_local_0001_m_000000_0/output/spill0.out in any of the configured local directories
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
        at org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:94)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1443)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
2010-09-10 15:30:00,638 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1107)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1145)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1116)
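
For what it's worth: the spill file it cannot find is looked up via
mapred.local.dir, which, if left unset, defaults to a directory under
hadoop.tmp.dir. A sketch of that default, as I understand it:

<property>
  <name>mapred.local.dir</name>
  <!-- the default value; map output spill files end up here -->
  <value>${hadoop.tmp.dir}/mapred/local</value>
</property>

So whatever clears or moves the tmp dir would take the spill files with it.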

 

I've still got no idea why, how, or when it happens. Disk space is not an issue and there is plenty of RAM.
 