Posted to user@nutch.apache.org by Patricio Galeas <pg...@yahoo.de> on 2010/02/27 12:11:28 UTC

recover from hadoop.tmp.dir?

Hello,

Two weeks ago we started a web crawl (depth=6, threads=10), and today the process aborted because our hard disk is full. We had defined a 100GB partition for hadoop.tmp.dir.
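
For reference, hadoop.tmp.dir is set in our conf/hadoop-site.xml roughly like this (the mount path below is just an example, not our real one):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop-tmp</value>
  </property>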

Last night I checked the size of hadoop.tmp.dir during the last crawl and it was at 23GB. A few hours later the 100GB partition was full.

Analyzing the hadoop.log (see below), we found that a merge operation failed while fetching the last segment:
/bin/nutch fetch  20100215165123 -threads 10

Questions:
1. What type of merge operation is executed by running /bin/nutch fetch <segment>?
2. Is it possible to recover the crawled data from hadoop.tmp.dir to rebuild the last segment?

Thanks
Pat


-------------------------------------------
Our segments
(with their current sizes)
-------------------------------------------
Segment1:  2010-02-05 22:53 20100205224324  - 4.9M
Segment2:  2010-02-05 23:22 20100205225321  - 64M
Segment3:  2010-02-06 01:44 20100205232237  - 258M
Segment4:  2010-02-06 08:27 20100206014546  - 955M
Segment5:  2010-02-08 06:36 20100206083157  - 4.4G 
Segment6:  2010-02-15 18:27 20100215165123  - 100M


--------------------------------------
The last lines of our hadoop.log
--------------------------------------
…
… 
2010-02-26 22:16:38,630 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=31
2010-02-26 22:16:39,630 INFO  fetcher.Fetcher - -activeThreads=10, spinWaiting=10, fetchQueues.totalSize=31
2010-02-26 22:16:39,630 WARN  fetcher.Fetcher - Aborting with 10 hung threads.
2010-02-26 22:36:30,465 WARN  mapred.LocalJobRunner - job_local_0001
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for attempt_local_0001_m_000000_0/intermediate.91
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:427)
        at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:326)
        at org.apache.hadoop.mapred.Merger.merge(Merger.java:83)
        at org.apache.hadoop.mapred.Merger.merge(Merger.java:71)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1268)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
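
If I read the stack trace right, LocalDirAllocator could not find space for the intermediate merge file in any directory under mapred.local.dir, which by default points to ${hadoop.tmp.dir}/mapred/local. As far as I understand, one workaround would be to spread it over several partitions in hadoop-site.xml, roughly like this (the paths are only illustrative):

  <property>
    <name>mapred.local.dir</name>
    <value>/data1/mapred/local,/data2/mapred/local</value>
  </property>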
