Posted to dev@nutch.apache.org by cao yuzhong <ca...@hotmail.com> on 2005/06/02 10:12:29 UTC
Can Nutch index over 90G html pages ?
Has anyone used Nutch to index over 90 GB of HTML pages (about 6 million pages)?
Is it possible? How much RAM does it require?
I tried to use Nutch to index 90 GB of HTML pages.
My PC has 1 GB of RAM, and the JVM heap is set with -Xmx1000m.
Here is the error I get:
Exception in thread "main" java.lang.OutOfMemoryError
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:194)
at net.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.read(LocalFileSystem.java:68)
at net.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:24)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.
at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
at java.io.DataInputStream.readFully(DataInputStream.java:176)
at net.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:42)
at net.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:76)
at net.nutch.io.SequenceFile$Reader.next(SequenceFile.java:241)
at net.nutch.io.MapFile$Reader.seek(MapFile.java:263)
at net.nutch.io.MapFile$Reader.get(MapFile.java:306)
at net.nutch.io.ArrayFile$Reader.get(ArrayFile.java:62)
at net.nutch.segment.SegmentReader.get(SegmentReader.java:284)
at net.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:110)
at net.nutch.indexer.IndexSegment.main(IndexSegment.java:241)
Any suggestions?
Best regards!
cyz
Re: Can Nutch index over 90G html pages ?
Posted by Christophe Noel <ch...@cetic.be>.
Wouldn't it simply be the number of threads that you use to fetch the
pages?
Doug Cutting wrote:
> The latest code in SVN requires less RAM. If you still have problems,
> try setting the config option io.map.index.skip to 8, and
> indexer.termIndexInterval to 1024. These will both cause less RAM to
> be used. On a 1GB machine I have built Nutch systems with over 40M
> pages using these settings.
>
> Doug
Re: Can Nutch index over 90G html pages ?
Posted by Doug Cutting <cu...@nutch.org>.
The latest code in SVN requires less RAM. If you still have problems,
try setting the config option io.map.index.skip to 8, and
indexer.termIndexInterval to 1024. These will both cause less RAM to be
used. On a 1GB machine I have built Nutch systems with over 40M pages
using these settings.
Doug
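
For reference, the two settings mentioned above might be written as overrides in a site configuration file along the following lines. This is only a sketch: the file name (conf/nutch-site.xml) and the <nutch-conf> root element are assumptions about the layout of Nutch releases from this period and are not stated in the thread, so check them against the nutch-default.xml shipped with your version. Only the property names and values come from the message above.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (overrides nutch-default.xml). -->
<!-- The root element name is an assumption; match your nutch-default.xml. -->
<nutch-conf>

  <!-- Value suggested in the thread to reduce RAM use when reading map files. -->
  <property>
    <name>io.map.index.skip</name>
    <value>8</value>
  </property>

  <!-- Value suggested in the thread to reduce RAM use for the term index. -->
  <property>
    <name>indexer.termIndexInterval</name>
    <value>1024</value>
  </property>

</nutch-conf>
```

Both values trade some lookup speed for a smaller in-memory index, so they are worth trying one at a time if indexing still runs out of memory.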