Posted to user@nutch.apache.org by qi wu <ch...@gmail.com> on 2007/04/11 15:01:30 UTC

How to reduce the tmp disk space usage during the linkdb process?

Hi,
  I have crawled nearly 3 million pages, which are kept in 13 segments, and there are 10 million entries in the crawldb. I use Nutch 0.8.1 on a single Linux box. Currently the disk space occupied by the crawldb and segments is about 20G, and the machine still has 36G free. I always fail when building the linkdb; the error is caused by running out of space during the reduce phase. The exception is listed below:
job_f506pk
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
        at org.apache.hadoop.fs.LocalFileSystem$LocalFSFileOutputStream.write(LocalFileSystem.java:150)
        at org.apache.hadoop.fs.FSDataOutputStream$Summer.write(FSDataOutputStream.java:83)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:112)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:208)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:913)
        at org.apache.hadoop.io.SequenceFile$Sorter$MergePass.run(SequenceFile.java:800)
        at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:738)
        at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:542)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:218)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:112)

I wonder why so much space is required by the linkdb reduce job. Can I configure some Nutch or Hadoop setting to reduce the disk space usage for the linkdb? Any hints to help me overcome the problem? //bow

Thanks
-Qi



Re: How to reduce the tmp disk space usage during the linkdb process?

Posted by Andrzej Bialecki <ab...@getopt.org>.
qi wu wrote:
> 
> I wonder why so much space is required by the linkdb reduce job. Can I
> configure some Nutch or Hadoop setting to reduce the disk space usage
> for the linkdb? Any hints to help me overcome the problem? //bow

* What is your limit of links per URL (the db.max.inlinks property)? You
could try lowering it.
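For example, an override in conf/nutch-site.xml could look roughly like
this (a sketch only; the limit of 100 is an illustrative value, not a
recommendation):

	<property>
	  <name>db.max.inlinks</name>
	  <!-- maximum number of inlinks kept per URL; a smaller cap means
	       smaller intermediate files for the LinkDb reduce to sort -->
	  <value>100</value>
	</property>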

* Please try the following modification: somewhere around
LinkDb.java:283 add the following line:

	job.setCombinerClass(LinkDb.class);
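For orientation, the job setup around that spot looks roughly like the
sketch below (paraphrased, not an exact copy of the 0.8.1 source; only
the marked line is new). LinkDb implements both the mapper and the
reducer, so reusing it as a combiner pre-merges inlink lists on the map
side and shrinks the data the reduce phase has to sort and spill to disk:

	JobConf job = new NutchJob(config);
	job.setJobName("linkdb " + linkDb);
	// ... input/output setup as in the existing code ...
	job.setMapperClass(LinkDb.class);
	job.setCombinerClass(LinkDb.class);  // <-- the added line
	job.setReducerClass(LinkDb.class);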

Recompile and re-run.

* Also, as others suggested, you may want to turn on compression.
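For instance, compressing the intermediate map output (and optionally the
job output) is controlled by properties along these lines in
conf/hadoop-site.xml; please check the hadoop-default.xml shipped with
your Nutch to confirm the exact names exist in the bundled Hadoop version:

	<property>
	  <name>mapred.compress.map.output</name>
	  <!-- compress map output before it is spilled/merged on disk -->
	  <value>true</value>
	</property>
	<property>
	  <name>mapred.output.compress</name>
	  <!-- compress the final job output -->
	  <value>true</value>
	</property>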


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com