Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/19 17:01:36 UTC

Fetcher NPE's

Hi,

We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's 
clear URLs are actually fetched until, for some reason, an NPE occurs. The 
thread then dies and seems to output 0 records.

The URLs themselves are fetchable using the index or parser checker, no problem 
there. Any ideas how we can pinpoint the source of the issue? 

Thanks,

A sample exception:

2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of 
http://<SOME_URL>/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: 
java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at 
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher 
caught:java.lang.NullPointerException

The code catching the error:

801   } catch (Throwable t) {    // unexpected exception
802     // unblock
803     fetchQueues.finishFetchItem(fit);
804     logError(fit.url, t.toString());
805     output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED,
            CrawlDatum.STATUS_FETCH_RETRY);
806   }
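[Editorial note: the top frame of the trace, System.arraycopy, throws an NPE whenever either array argument is null. That is consistent with a fetcher thread still calling collect() after the task's map-output buffer has been torn down (e.g. following a task-level timeout). A minimal sketch of that failure mode, with the null destination standing in for the released buffer (this is an illustration, not the actual Hadoop code path):]

```java
// Demonstrates the NPE seen at the top of the stack trace:
// System.arraycopy throws NullPointerException if either array is null.
// A null destination here stands in for a map-output buffer that was
// already released while a fetcher thread was still writing to it.
public class ArraycopyNpeDemo {
    public static void main(String[] args) {
        byte[] src = new byte[] {1, 2, 3};
        byte[] dst = null; // hypothetical: buffer torn down by the framework
        try {
            System.arraycopy(src, 0, dst, 0, src.length);
        } catch (NullPointerException e) {
            // This is the exception the catch (Throwable t) block above
            // logs and reports as STATUS_FETCH_RETRY.
            System.out.println("caught NullPointerException");
        }
    }
}
```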


Re: Fetcher NPE's

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Sebastian,

Thanks for reminding me of this issue. Our case ran into trouble when we 
decreased mapred.map.timeout in combination with a larger value for the 
fetcher's timeout divisor, causing a fetcher thread to time out very quickly if 
it didn't see any progress.

We tuned the divisor back to a more reasonable number and all goes well, which 
makes sense; we simply overlooked the divisor when we decreased the mapred 
timeout value.
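[Editorial note: if the effective per-thread timeout is derived by dividing the MapReduce task timeout by the fetcher's divisor, the interaction described above is easy to see: lowering the timeout and raising the divisor at the same time compounds. A rough sketch; the formula, values, and names below are assumptions for illustration, not taken from the thread:]

```java
// Sketch of how a per-thread fetch timeout shrinks when the task timeout
// is lowered AND the divisor is raised at the same time.
public class TimeoutDivisorDemo {
    // Assumed relation: per-thread timeout = task timeout / divisor.
    static long effectiveTimeoutMs(long taskTimeoutMs, int divisor) {
        return taskTimeoutMs / divisor;
    }

    public static void main(String[] args) {
        // Illustrative defaults: 10 min task timeout, divisor 2 -> 5 min.
        System.out.println(effectiveTimeoutMs(10 * 60 * 1000L, 2)); // 300000
        // Lowered timeout (2 min) plus larger divisor (4) -> only 30 s,
        // so any slow host makes the thread appear hung and it is killed.
        System.out.println(effectiveTimeoutMs(2 * 60 * 1000L, 4));  // 30000
    }
}
```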

Cheers,

On Wednesday 26 October 2011 21:23:16 Sebastian Nagel wrote:
> Hi Markus,
> 
> the error resembles a problem I've observed some time ago but never managed
> to open an issue. Opened right now:
> https://issues.apache.org/jira/browse/NUTCH-1182 The stack you observed is
> the same.
> 
> Sebastian
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

Re: Fetcher NPE's

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Markus,

the error resembles a problem I've observed some time ago but never managed
to open an issue. Opened right now: https://issues.apache.org/jira/browse/NUTCH-1182
The stack you observed is the same.

Sebastian



Re: Fetcher NPE's

Posted by Markus Jelsma <ma...@openindex.io>.
I should add that these URLs not only pass the index and parser checkers but 
also manual local test crawl cycles. There's also nothing significant in the 
syslog. Dmesg shows messages about too little memory, but that's normal.
