Posted to dev@nutch.apache.org by cao yuzhong <ca...@hotmail.com> on 2005/06/02 10:12:29 UTC

Can Nutch index over 90G html pages ?

Has anyone used Nutch to index over 90 GB of HTML pages (about 6 million pages)?
Is it possible? How much RAM does it require?

I tried to use Nutch to index 90 GB of HTML pages.
My PC has 1 GB of RAM and the JVM parameter is set to -Xmx1000m.
This is the problem I ran into:

Exception in thread "main" java.lang.OutOfMemoryError
	at java.io.FileInputStream.readBytes(Native Method)
	at java.io.FileInputStream.read(FileInputStream.java:194)
	at net.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:68)
	at net.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:24)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
	at java.io.DataInputStream.readFully(DataInputStream.java:176)
	at net.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:42)
	at net.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:76)
	at net.nutch.io.SequenceFile$Reader.next(SequenceFile.java:241)
	at net.nutch.io.MapFile$Reader.seek(MapFile.java:263)

	at net.nutch.io.MapFile$Reader.get(MapFile.java:306)
	at net.nutch.io.ArrayFile$Reader.get(ArrayFile.java:62)
	at net.nutch.segment.SegmentReader.get(SegmentReader.java:284)
	at net.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:110)
	at net.nutch.indexer.IndexSegment.main(IndexSegment.java:241)

Any suggestions?

Best regards!
cyz



Re: Can Nutch index over 90G html pages ?

Posted by Christophe Noel <ch...@cetic.be>.
Couldn't it simply be the number of threads you use to fetch the pages?

Doug Cutting wrote:

> The latest code in SVN requires less RAM.  If you still have problems, 
> try setting the config option io.map.index.skip to 8, and 
> indexer.termIndexInterval to 1024.  These will both cause less RAM to 
> be used.  On a 1GB machine I have built Nutch systems with over 40M 
> pages using these settings.
>
> Doug


Re: Can Nutch index over 90G html pages ?

Posted by Doug Cutting <cu...@nutch.org>.
The latest code in SVN requires less RAM.  If you still have problems, 
try setting the config option io.map.index.skip to 8, and 
indexer.termIndexInterval to 1024.  These will both cause less RAM to be 
used.  On a 1GB machine I have built Nutch systems with over 40M pages 
using these settings.

Doug
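
For reference, the two settings mentioned above could be overridden along these lines. This is a minimal sketch, not a tested configuration: it assumes site-specific overrides live in conf/nutch-site.xml with a <nutch-conf> root element, as in Nutch releases of this period. Only the two property names and their values come from the message above.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for site-specific overrides) -->
<nutch-conf>

  <!-- Load only a fraction of each MapFile index into memory. -->
  <property>
    <name>io.map.index.skip</name>
    <value>8</value>
  </property>

  <!-- Keep fewer entries of the Lucene term index in RAM. -->
  <property>
    <name>indexer.termIndexInterval</name>
    <value>1024</value>
  </property>

</nutch-conf>
```

Both values trade lookup speed for memory: a larger skip or interval means less index data is held in RAM, at the cost of more scanning per lookup.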
