Posted to dev@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2007/01/18 21:08:12 UTC

java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

I wanted to try last night's nightly for the new freegen command.
My test case is:

rm -rf crawl
bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/2007...
bin/nutch updatedb crawl/crawldb crawl/segments/2007...

# generate a new segment with 5 URIs
bin/nutch generate crawl/crawldb crawl/segments -topN 10
bin/nutch fetch crawl/segments/2007... # new segment
bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment

# merge the segments and index
bin/nutch mergesegs crawl/merged -dir crawl/segments
..

We get a crash in mergesegs. This crash, with the exact same script,
start URI, configuration, and plugins, does not happen with a nightly
from a week ago.


2007-01-18 14:57:11,411 INFO  segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
2007-01-18 14:57:11,482 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20070118145628
2007-01-18 14:57:11,489 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20070118145641
2007-01-18 14:57:11,495 INFO  segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2007-01-18 14:57:11,594 INFO  mapred.InputFormatBase - Total input paths to process : 12
2007-01-18 14:57:11,819 INFO  mapred.JobClient - Running job: job_5ug2ip
2007-01-18 14:57:12,073 WARN  mapred.LocalJobRunner - job_5ug2ip
java.io.EOFException
         at java.io.DataInputStream.readFully(DataInputStream.java:178)
         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
         at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
         at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:100)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)






--
http://variogr.am/
brian.whitman@variogr.am




Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Paul Sponagl <pa...@sponagl.de>.
seed: http://www.koeln.de
crawl: bin/nutch crawl urls -dir crawl -depth 3 -topN 10

On 19.01.2007 at 10:29, Andrzej Bialecki wrote:

> Paul Sponagl wrote:
>> +1 for a bug (tested two days ago - was not sure if I simply
>> missed something)
>
>
> Could you guys come up with the exact data that causes this bug?
> Primarily I'm interested in a seed list, so that I can reproduce it
> by simply using the crawl tool and then running mergesegs. Thanks!
>
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Sami Siren wrote:
> Brian Whitman wrote:
>   
>> On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:
>>
>>     
>>>> However, I cannot find in the Hadoop change logs which change is
>>>> causing Nutch these problems.
>>>>         
>>> It's HADOOP-331, so I guess at least the changes/additions in map()
>>> are required.
>>>       
>> Hi, just following up here-- does this indicate that if I get a hadoop
>> nightly that was patched for HADOOP-331 and have Nutch use it, the
>> EOFException will go away in the latest nightlies?
>>     
>
> No, I mean that HADOOP-331 is the change that is _causing_ these, so we
> need to adapt the Nutch code to cope with the change in sorting.
>
> Can somebody tell me why the various utilities (like Indexer) do the
> wrapping to ObjectWritable in the InputFormat and not in Mapper.map in
> the first place? Is this an optimization of some kind?
>   

This is a legacy from the (very recent) times when you had to set the 
key/value classes of the InputFormat in your mapred job. You don't have to 
do this now - it's handled transparently by 
InputFormat.getRecordReader().createKey() and createValue().

In fact, there's a lot of this cruft left over in Nutch. We should also 
use GenericWritable in most of these places, and indeed we could wrap 
the values in Mapper.map().
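
A rough sketch of what such a GenericWritable subclass might look like
(hypothetical - the class name and the set of value types below are only
examples, nothing like this is committed):

import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Writable;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseText;
import org.apache.nutch.protocol.Content;

// Hypothetical sketch: one GenericWritable covering the value types that
// can appear in a segment. GenericWritable serializes a small type index
// instead of the full class name that ObjectWritable writes out.
public class SegmentValueWritable extends GenericWritable {

  @SuppressWarnings("unchecked")
  private static final Class<? extends Writable>[] TYPES =
      (Class<? extends Writable>[]) new Class[] {
        CrawlDatum.class, Content.class, ParseData.class, ParseText.class
      };

  protected Class<? extends Writable>[] getTypes() {
    return TYPES;
  }
}

In Mapper.map() the value would then be wrapped before collecting,
e.g. w.set(value); output.collect(key, w);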

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Sami Siren <ss...@gmail.com>.
Brian Whitman wrote:
> On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:
> 
>>> However, I cannot find in the Hadoop change logs which change is
>>> causing Nutch these problems.
>>
>> It's HADOOP-331, so I guess at least the changes/additions in map()
>> are required.
> 
> Hi, just following up here-- does this indicate that if I get a hadoop
> nightly that was patched for HADOOP-331 and have Nutch use it, the
> EOFException will go away in the latest nightlies?

No, I mean that HADOOP-331 is the change that is _causing_ these, so we
need to adapt the Nutch code to cope with the change in sorting.

Can somebody tell me why the various utilities (like Indexer) do the
wrapping to ObjectWritable in the InputFormat and not in Mapper.map in
the first place? Is this an optimization of some kind?

--
 Sami Siren

Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Brian Whitman <br...@variogr.am>.
On Jan 21, 2007, at 6:47 AM, Sami Siren wrote:

>> However, I cannot find in the Hadoop change logs which change is
>> causing Nutch these problems.
>
> It's HADOOP-331, so I guess at least the changes/additions in map()
> are required.

Hi, just following up here-- does this indicate that if I get a  
hadoop nightly that was patched for HADOOP-331 and have Nutch use it,  
the EOFException will go away in the latest nightlies?

I tried that; it now crashes in a different spot, during fetching:

2007-01-22 11:34:53,051 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 1.0 pages/s, 20 kb/s,
2007-01-22 11:34:53,134 WARN  mapred.LocalJobRunner - job_yzavye
java.lang.NoSuchMethodError: org.apache.hadoop.io.MapFile$Writer.<init>(Lorg/apache/hadoop/fs/FileSystem;Ljava/lang/String;Ljava/lang/Class;Ljava/lang/Class;)V
         at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:58)
         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:303)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:137)
2007-01-22 11:34:53,398 FATAL fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:441)
         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)




Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Sami Siren <ss...@gmail.com>.
> However, I cannot find in the Hadoop change logs which change is
> causing Nutch these problems.

It's HADOOP-331, so I guess at least the changes/additions in map()
are required.

--
 Sami Siren



Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Sami Siren <ss...@gmail.com>.
Brian Whitman wrote:
> 
> On Jan 19, 2007, at 4:29 AM, Andrzej Bialecki wrote:
>>
> 
>> Could you guys come up with the exact data that causes this bug?
>> Primarily I'm interested in a seed list, so that I can reproduce it
>> by simply using the crawl tool and then running mergesegs. Thanks!

I am also experiencing NPEs in SegmentReader and Indexer. I'm not yet
100% sure what exactly causes these problems, which happen when Hadoop
"spills", but I got rid of them with a little patching:

- added/changed Mapper.map to emit an ObjectWritable instead of the plain object (see the sketch after this list).

- patched SequenceFile slightly because of NPE in
SequenceFile.Sorter.MergeQueue
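
To illustrate the first change, the idea is roughly the following
(only an illustration against the mapred API of that time, not the
actual patch; the class name is made up):

import java.io.IOException;
import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative only: wrap each value in an ObjectWritable inside map()
// instead of relying on the InputFormat to hand out pre-wrapped values.
public class ObjectWrappingMapper extends MapReduceBase implements Mapper {

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    // ObjectWritable records the concrete class of the value together with
    // the value itself, so mixed value types survive the map-output
    // sort/spill and can be unwrapped again in the reducer.
    output.collect(key, new ObjectWritable(value));
  }
}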

However, I cannot find in the Hadoop change logs which change is
causing Nutch these problems.

--
 Sami Siren

Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Brian Whitman <br...@variogr.am>.
On Jan 19, 2007, at 4:29 AM, Andrzej Bialecki wrote:
>

> Could you guys come up with the exact data that causes this bug?
> Primarily I'm interested in a seed list, so that I can reproduce it
> by simply using the crawl tool and then running mergesegs. Thanks!

My seed list is simply my personal website http://variogr.am/, one  
line in urls/urls

I don't use the crawl command; I use a variation on the whole-internet
script from the wiki. The crash is at mergesegs.

rm -rf crawl
bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls
bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/2007...
bin/nutch updatedb crawl/crawldb crawl/segments/2007...

# generate a new segment with 10 URIs
bin/nutch generate crawl/crawldb crawl/segments -topN 10
bin/nutch fetch crawl/segments/2007... # new segment
bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment

# merge the segments and index
bin/nutch mergesegs crawl/merged -dir crawl/segments
..




Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Sponagl wrote:
> +1 for a bug (tested two days ago - was not sure if I simply missed
> something)


Could you guys come up with the exact data that causes this bug?
Primarily I'm interested in a seed list, so that I can reproduce it by
simply using the crawl tool and then running mergesegs. Thanks!

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Paul Sponagl <pa...@sponagl.de>.
+1 for a bug (tested two days ago - was not sure if I simply missed
something)


2007-01-17 12:03:07,691 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2007-01-17 12:03:07,722 WARN  mapred.LocalJobRunner - job_6cexok
java.io.EOFException
         at java.io.DataInputStream.readFully(DataInputStream.java:178)
         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
         at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:61)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.spill(MapTask.java:427)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:385)
         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$200(MapTask.java:239)
         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:188)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)



On 18.01.2007 at 23:09, Brian Whitman wrote:

>
> On Jan 18, 2007, at 4:44 PM, Andrzej Bialecki wrote:
>
>>
>>> java.io.EOFException
>>>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>>>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>>>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
>>>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
>>>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
>>
>> UTF8? How weird - recent versions of Nutch tools, such as Crawl,
>> Generate et al (and SegmentMerger) do NOT use UTF8, they use Text.
>> It seems this data was created with older versions. Please check
>> that you don't have older versions of Hadoop or Nutch classes on
>> your classpath.
>
> I printed my CLASSPATH in the bin/nutch script before it calls
> anything, and all the jars and jobs are local to the nightly
> directory which I downloaded today except for
> /usr/local/java/lib/tools.jar. All are dated 2007-01-17 19:42.
>
> hadoop-0.10.1-core is in there.
>
> And the data is brand new (I delete the crawl dir before doing my  
> test run.)
>
> -Brian
>
>


Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Brian Whitman <br...@variogr.am>.
On Jan 18, 2007, at 4:44 PM, Andrzej Bialecki wrote:

>
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
>>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
>>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)
>
> UTF8? How weird - recent versions of Nutch tools, such as Crawl,
> Generate et al (and SegmentMerger) do NOT use UTF8, they use Text.
> It seems this data was created with older versions. Please check
> that you don't have older versions of Hadoop or Nutch classes on
> your classpath.

I printed my CLASSPATH in the bin/nutch script before it calls
anything, and all the jars and jobs are local to the nightly
directory which I downloaded today except for
/usr/local/java/lib/tools.jar. All are dated 2007-01-17 19:42.

hadoop-0.10.1-core is in there.

And the data is brand new (I delete the crawl dir before doing my  
test run.)

-Brian



Re: java.io.EOFException in latest nightly in mergesegs from hadoop.io.DataOutputBuffer

Posted by Andrzej Bialecki <ab...@getopt.org>.
Brian Whitman wrote:
> I wanted to try last night's nightly for the new freegen command.
> On my test case, which is:
>
> rm -rf crawl
> bin/nutch inject crawl/crawldb urls/  # a single URL is in urls/urls
> bin/nutch generate crawl/crawldb crawl/segments
> bin/nutch fetch crawl/segments/2007...
> bin/nutch updatedb crawl/crawldb crawl/segments/2007...
>
> # generate a new segment with 5 URIs
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> bin/nutch fetch crawl/segments/2007... # new segment
> bin/nutch updatedb crawl/crawldb crawl/segments/2007... # new segment
>
> # merge the segments and index
> bin/nutch mergesegs crawl/merged -dir crawl/segments
> ..
>
> We get a crash in the mergesegs. This crash, with the exact same 
> script and start URI, configuration and plugins, does not happen on a 
> nightly from a week ago.
>
>
> 2007-01-18 14:57:11,411 INFO  segment.SegmentMerger - Merging 2 segments to crawl/merged_07_01_18_14_56_22/20070118145711
> 2007-01-18 14:57:11,482 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20070118145628
> 2007-01-18 14:57:11,489 INFO  segment.SegmentMerger - SegmentMerger:   adding crawl/segments/20070118145641
> 2007-01-18 14:57:11,495 INFO  segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
> 2007-01-18 14:57:11,594 INFO  mapred.InputFormatBase - Total input paths to process : 12
> 2007-01-18 14:57:11,819 INFO  mapred.JobClient - Running job: job_5ug2ip
> 2007-01-18 14:57:12,073 WARN  mapred.LocalJobRunner - job_5ug2ip
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:57)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:91)
>         at org.apache.hadoop.io.UTF8.readChars(UTF8.java:212)
>         at org.apache.hadoop.io.UTF8.readString(UTF8.java:204)
>         at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:173)

UTF8? How weird - recent versions of Nutch tools, such as Crawl,
Generate et al (and SegmentMerger) do NOT use UTF8, they use Text. It
seems this data was created with older versions. Please check that you
don't have older versions of Hadoop or Nutch classes on your classpath.
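
For example, something along these lines (an untested sketch, just a
suggestion) will print which jar a suspect class was actually loaded
from:

// Untested sketch: report the jar a class was loaded from, to spot
// stale Hadoop/Nutch jars lingering on the classpath.
public class WhichJar {
  public static void main(String[] args) throws Exception {
    Class c = Class.forName("org.apache.hadoop.io.ObjectWritable");
    System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
  }
}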

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com