Posted to user@nutch.apache.org by ML mail <ml...@yahoo.com> on 2007/10/23 18:03:26 UTC

Fetch failed due to space problems on /tmp (?)

Dear Nutch users,

I am currently using nutch 0.9 to crawl some local websites from the region and my fetch process just failed with the following error:

Fetcher: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)          

The updatedb (which comes after the fetch) also failed:

CrawlDb update: starting
CrawlDb update: db: /data/02/nutch/crawl/crawldb
CrawlDb update: segments: [/data/02/nutch/crawl/segments/20071023000503]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
 - skipping invalid segment /data/02/nutch/crawl/segments/20071023000503
CrawlDb update: Merging segment data into db.                          


I just noticed that somehow nutch (fetch) uses my /tmp directory to store some temporary data in /tmp/hadoop-nutch, which filled my /tmp partition to 100%, so I guess this is the problem. Now regarding this I have 3 quick questions:

- What is nutch exactly storing in /tmp/hadoop-nutch ?
- How can I force nutch to use another directory (which has more space) ?
- How can I recover my fetched sites and continue the process without losing all the work that the fetcher already did up to the point where it stopped ?
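For reference, the disk-pressure diagnosis described above can be confirmed with two standard commands (the /tmp/hadoop-nutch path is the default temp location mentioned in this thread; it may differ on other setups):

```shell
# Confirm that the /tmp partition is the one that filled up,
# and see how much Hadoop's temporary directory is holding.
df -h /tmp
du -sh /tmp/hadoop-nutch 2>/dev/null || echo "no /tmp/hadoop-nutch on this machine"
```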

Many thanks in advance for your help

Best regards


 __________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Fetch failed due to space problems on /tmp (?)

Posted by ML mail <ml...@yahoo.com>.
Thanks for this tip, I have now adapted my hadoop-site.xml to use a bigger disk for temporary storage.

Regards


Andrzej Bialecki <ab...@getopt.org> wrote: ML mail wrote:
> Thanks for your answer! So I will move on and use the latest nightly build instead of the 0.9 stable version. Hopefully the nightly build is stable enough to use in a production environment.
> 
> 
> Lyndon Maydwell  wrote: From what I have read, this has been solved in recent revisions, so
> downloading a new build or checking out the latest source should solve
> the problem. I am still using a version that has this problem, but
> should be switching shortly. My solution in the mean time has been to
> delete the temporary files after crawling. This works for me, and I
> suspect it is due to the failure of Nutch to delete files.
> 

In fact, I doubt this would solve your problem. The latest trunk doesn't 
change in any significant way the temporary space usage, so if you ran 
out of space before, you would do the same with the latest nightly build.

The solution is to configure Hadoop to use a different place than /tmp 
for temporary files, a place where you have enough disk space to fit all 
downloaded and temporary data. You can configure this by adding the 
following to conf/hadoop-site.xml:



<property>
	<name>hadoop.tmp.dir</name>
	<value>/my/large/disk/space/hadoop-${user.name}</value>
</property>



(if you run Hadoop in non-local mode, you need to restart the cluster).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Fetch failed due to space problems on /tmp (?)

Posted by Andrzej Bialecki <ab...@getopt.org>.
ML mail wrote:
> Thanks for your answer! So I will move on and use the latest nightly build instead of the 0.9 stable version. Hopefully the nightly build is stable enough to use in a production environment.
> 
> 
> Lyndon Maydwell <ma...@gmail.com> wrote: From what I have read, this has been solved in recent revisions, so
> downloading a new build or checking out the latest source should solve
> the problem. I am still using a version that has this problem, but
> should be switching shortly. My solution in the mean time has been to
> delete the temporary files after crawling. This works for me, and I
> suspect it is due to the failure of Nutch to delete files.
> 

In fact, I doubt this would solve your problem. The latest trunk doesn't 
change in any significant way the temporary space usage, so if you ran 
out of space before, you would do the same with the latest nightly build.

The solution is to configure Hadoop to use a different place than /tmp 
for temporary files, a place where you have enough disk space to fit all 
downloaded and temporary data. You can configure this by adding the 
following to conf/hadoop-site.xml:

<property>
	<name>hadoop.tmp.dir</name>
	<value>/my/large/disk/space/hadoop-${user.name}</value>
</property>

(if you run Hadoop in non-local mode, you need to restart the cluster).
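For completeness, that property goes inside the <configuration> element; a minimal conf/hadoop-site.xml carrying only this override might look like the following sketch (the path is just an example from this thread -- point it at whatever partition has enough room):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Redirect Hadoop's temporary data away from the small /tmp partition -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/my/large/disk/space/hadoop-${user.name}</value>
  </property>
</configuration>
```

In non-local mode the change takes effect after a cluster restart; with the stock Hadoop 0.x scripts that would be bin/stop-all.sh followed by bin/start-all.sh.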

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Fetch failed due to space problems on /tmp (?)

Posted by ML mail <ml...@yahoo.com>.
Thanks for your answer! So I will move on and use the latest nightly build instead of the 0.9 stable version. Hopefully the nightly build is stable enough to use in a production environment.


Lyndon Maydwell <ma...@gmail.com> wrote: From what I have read, this has been solved in recent revisions, so
downloading a new build or checking out the latest source should solve
the problem. I am still using a version that has this problem, but
should be switching shortly. My solution in the mean time has been to
delete the temporary files after crawling. This works for me, and I
suspect it is due to the failure of Nutch to delete files.



Re: Fetch failed due to space problems on /tmp (?)

Posted by Lyndon Maydwell <ma...@gmail.com>.
From what I have read, this has been solved in recent revisions, so
downloading a new build or checking out the latest source should solve
the problem. I am still using a version that has this problem, but
should be switching shortly. My solution in the mean time has been to
delete the temporary files after crawling. This works for me, and I
suspect it is due to the failure of Nutch to delete files.
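The manual cleanup described above might look like the following one-liner (destructive: run it only when no Nutch/Hadoop jobs are active, since running jobs still need their temporary files; the path is the default mentioned earlier in this thread):

```shell
# Remove Hadoop's leftover temporary data after a crawl has fully finished.
rm -rf /tmp/hadoop-nutch
```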