Posted to user@nutch.apache.org by vishal vachhani <vi...@gmail.com> on 2009/08/24 10:30:05 UTC

Exception while slicing and parsing old segments without fetching

Hi All,
         I had a big segment (size = 25 GB). Using the mergesegs utility with
slice=20000, I have divided the segment into around 400 small segments. I
re-parsed all the segments (using the parse command) because we have made
changes to the parsing modules of Nutch. Parsing completed successfully for
all segments. The linkdb was also generated successfully.
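
For reference, the slicing and re-parsing were done roughly like this (the
directory names below are placeholders, not our actual paths):

# Slice the big segment into pieces of 20000 URLs each, then re-parse each
# slice with the updated parsing plugins:
bin/nutch mergesegs segments_sliced -dir crawl/segments -slice 20000
for seg in segments_sliced/*; do
  bin/nutch parse "$seg"
done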

I have the following questions.

1. Do I need to run "updatedb" on the re-parsed segments again? When I run
the updatedb command on these segments, I get the following exception (an
example invocation is sketched after the trace).
----------------------------------------------------------------------------------------
2009-08-17 20:09:33,679 WARN  fs.FSInputChecker - Problem reading checksum
file: java.io.EOFException. Ignoring.
2009-08-17 20:09:33,700 WARN  mapred.LocalJobRunner - job_fmwtmv
java.lang.RuntimeException: Summer buffer overflow b.len=4096, off=0,
summed=3584, read=4096, bytesPerSum=1, inSum=512
        at
org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:201)
        at
org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:167)
        at
org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:41)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
        at
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
        at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1525)
        at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1436)
        at
org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1482)
        at
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:73)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.lang.ArrayIndexOutOfBoundsException
        at java.util.zip.CRC32.update(CRC32.java:43)
        at
org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:199)
        ... 16 more
2009-08-17 20:09:33,749 FATAL crawl.CrawlDb - CrawlDb update:
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
        at org.apache.nutch.crawl.CrawlDb.run(CrawlDb.java:199)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.crawl.CrawlDb.main(CrawlDb.java:152)
-----------------------------------------------------------------------------------------------------------------------------
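
For completeness, the crawldb update is being run roughly like this (the
crawldb and segment paths are placeholders); the checksum warnings above seem
to refer to the local .crc files that Hadoop keeps next to the segment data:

# A sketch of the updatedb invocation over all sliced segments; -dir (if your
# Nutch version supports it) picks up every segment under the directory:
bin/nutch updatedb crawl/crawldb -dir segments_sliced

# Just to see which local checksum files the warning could be about
# (listing only, not deleting anything):
find segments_sliced -name "*.crc" -ls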

2. When I run the "index" command on the segments,crawldb and linkdb, I am
getting "java heap space" error. While with single big segments and same
configuration of Java heap, we were able to index the segments. Are we doing
something wrong? We will be thankful if somebody could give us some pointers
in the problems.
--------------------------------------------------------------------------------

 java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2786)
        at
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.Text.writeString(Text.java:399)
        at org.apache.nutch.metadata.Metadata.write(Metadata.java:225)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:165)
        at
org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:154)
        at org.apache.hadoop.io.ObjectWritable.write(ObjectWritable.java:65)
        at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:315)
        at org.apache.nutch.indexer.Indexer.map(Indexer.java:362)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
2009-08-23 23:19:28,569 FATAL indexer.Indexer - Indexer:
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:329)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:351)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:334)
----------------------------------------------------------------------------------------------------------------
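
(For reference, one way to give the local job runner more heap before
re-running the index step might be the following; NUTCH_HEAPSIZE and
mapred.child.java.opts are assumptions about our setup, not a confirmed fix.)

# A sketch of raising the JVM heap for bin/nutch and for map/reduce child
# tasks (the values are examples only):
export NUTCH_HEAPSIZE=2000          # in MB, read by the stock bin/nutch script
# and/or, for child tasks, in conf/hadoop-site.xml:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xmx2000m</value>
#   </property>
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb segments_sliced/*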


-- 
Thanks and Regards,
Vishal Vachhani

Re: Exception while slicing and parsing old segments without fetching

Posted by srinivasarao v <sr...@gmail.com>.
Hi Vishal,

I got the same problem while running updatedb and invertlinks.
Have you found a solution to the problem?
Please let me know if you find one.

Thank You,
Srinivas



-- 
http://cheyuta.wordpress.com

Re: LinkDB size difference

Posted by reinhard schwab <re...@aon.at>.
Are you sure that you have used the same config?

In nutch-default.xml and nutch-site.xml you have (or may have) the config
property

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

I'm only aware of the difference described below.
You may look into the Crawl.java code to check whether there are other
differences.

OK, I have done this now.
Crawl.java uses crawl-tool.xml as an additional config file,
and there I have (it is the default, I guess)

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

This conforms to your observation: the "crawl" command does not ignore
internal links because of this additional crawl-tool.xml config option,
which seems to override nutch-default.xml and nutch-site.xml.
If you set this property to false in nutch-site.xml, the two should behave
the same.
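
A quick way to check which config file ends up setting the property
(this assumes the standard conf/ layout):

grep -A 1 "db.ignore.internal.links" \
    conf/nutch-default.xml conf/nutch-site.xml conf/crawl-tool.xml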

reinhard





RE: LinkDB size difference

Posted by Hrishikesh Agashe <hr...@persistent.co.in>.
Thanks Reinhard. I checked this, but both files are the same.

Just to elaborate: I am downloading images using Nutch, so I have changed both files and removed jpg, gif, png, etc. from the extensions to be skipped. What I see is that if I use the "crawl" command, I get all image URLs in the LinkDB, but if I execute the commands separately I see only absolute links to images; all relative links are missing from the LinkDB. (I.e., if an HTML page has a URL like "http://www.abc.com/img/img.jpg" for an image, I can see it in the LinkDB in both cases, but if it has a URL like "/img/img.jpg", it is missing from the LinkDB when the commands are run separately.)

Any thoughts?

TIA,
--Hrishi


Re: LinkDB size difference

Posted by reinhard schwab <re...@aon.at>.
You can dump the linkdb and analyze where it differs.
My guess is that you have different URLs there, because crawl uses
crawl-urlfilter.txt to filter URLs and fetch uses regex-urlfilter.txt,
so different filters are applied.
I can't explain why; I have not implemented this, I have only experienced
the difference myself.
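
A quick check of whether the two filter files really differ (assuming the
stock conf/ layout):

diff conf/crawl-urlfilter.txt conf/regex-urlfilter.txt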

How to dump the linkdb:

reinhard@thord:>bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> {-dump <out_dir> | -url <url>)
        -dump <out_dir> dump whole link db to a text file in <out_dir>
        -url <url>      print information about <url> to System.out
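
To actually compare the two runs, something like this could work (the crawl
directory names and the part-00000 output name are assumptions about the
layout):

bin/nutch readlinkdb crawl-with-crawl-cmd/linkdb -dump dump_crawl
bin/nutch readlinkdb crawl-step-by-step/linkdb -dump dump_steps
diff <(sort dump_crawl/part-00000) <(sort dump_steps/part-00000) | less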






LinkDB size difference

Posted by Hrishikesh Agashe <hr...@persistent.co.in>.
Hi,

I am observing that the size of the LinkDB is different when I do a run for the same URLs with the "crawl" command (intranet crawling) as compared to running the individual commands (inject, generate, fetch, invertlinks, etc., i.e. an Internet crawl).
Are there any parameters that Nutch passes to invertlinks when running with the "crawl" option?

TIA,
--Hrishi
