Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2011/10/19 17:01:36 UTC
Fetcher NPE's
Hi,
We sometimes see a fetcher task failing with 0 pages. Inspecting the logs it's
clear URLs are actually fetched until, for some reason, an NPE occurs. The
thread then dies and seems to output 0 records.
The URLs themselves are fetchable using the index- or parser checker, no problem
there. Any ideas how we can pinpoint the source of the issue?
Thanks,
A sample exception:
2011-10-19 14:30:50,145 INFO org.apache.nutch.fetcher.Fetcher: fetch of
http://<SOME_URL>/ failed with: java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher:
java.lang.NullPointerException
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
java.lang.System.arraycopy(Native Method)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1276)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1193)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.WritableUtils.writeVLong(WritableUtils.java:264)
2011-10-19 14:30:50,145 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.WritableUtils.writeVInt(WritableUtils.java:244)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.Text.write(Text.java:281)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1060)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:591)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:936)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: at
org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:805)
2011-10-19 14:30:50,146 ERROR org.apache.nutch.fetcher.Fetcher: fetcher
caught:java.lang.NullPointerException
The code catching the error:
801     } catch (Throwable t) {                   // unexpected exception
802       // unblock
803       fetchQueues.finishFetchItem(fit);
804       logError(fit.url, t.toString());
805       output(fit.url, fit.datum, null, ProtocolStatus.STATUS_FAILED,
                 CrawlDatum.STATUS_FETCH_RETRY);
806     }
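[Editor's note: the stack trace suggests the NPE is raised inside output() itself, in the Hadoop map-output buffer, so the catch block's own call to output() at line 805 throws the same NPE again, and that second, uncaught throw is what kills the thread. A minimal, self-contained sketch of that failure mode follows; all class and field names in it are hypothetical stand-ins, not actual Nutch/Hadoop code.]

```java
// Sketch: why the Fetcher thread dies. Once the framework has torn down the
// map-output buffer, every write through it NPEs -- including the "retry"
// write issued from the catch block.
public class FetcherCatchSketch {
    // Stand-in for the Hadoop collector after the task framework has
    // released its output buffer.
    static class ClosedCollector {
        byte[] buffer = null; // torn down; any write now dereferences null
        void collect(byte[] record) {
            // arraycopy with a null destination throws NullPointerException,
            // matching the top frame of the reported stack.
            System.arraycopy(record, 0, buffer, 0, record.length);
        }
    }

    static ClosedCollector collector = new ClosedCollector();

    static void output(byte[] record) {
        collector.collect(record);
    }

    public static void main(String[] args) {
        try {
            output(new byte[]{1, 2, 3});       // first failure
        } catch (Throwable t) {                // the Fetcher's catch block
            try {
                output(new byte[]{1, 2, 3});   // retry record -> same NPE
            } catch (Throwable t2) {
                // Without this inner guard the second NPE escapes and the
                // thread dies, which is why the task reports 0 pages.
                System.out.println("caught again: "
                        + t2.getClass().getSimpleName());
                // prints "caught again: NullPointerException"
            }
        }
    }
}
```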
Re: Fetcher NPE's
Posted by Markus Jelsma <ma...@openindex.io>.
Hello Sebastian,
Thanks for reminding me of this issue. Our case ran into trouble when we
decreased mapred.map.timeout in combination with a larger value for the
fetcher's timeout divisor, causing a fetcher thread to time out very quickly if
it didn't see any progress.
We tuned the divisor back to a more reasonable number and all goes well now,
which makes sense; we simply overlooked this interaction when we decreased the
mapred timeout value.
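[Editor's note: for reference, the interaction described above involves two properties, named here as in Hadoop 0.20-era and Nutch 1.x configuration (verify against your versions); the values are illustrative, not recommendations. The effective fetcher timeout is roughly the task timeout divided by the divisor, so shrinking the former while growing the latter makes it very small.]

```xml
<!-- mapred-site.xml / nutch-site.xml: illustrative values only.
     Effective fetcher timeout ~= mapred.task.timeout / divisor,
     so the two values must be tuned together. -->
<property>
  <name>mapred.task.timeout</name>
  <value>600000</value> <!-- 10 minutes -->
</property>
<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value> <!-- effective fetcher timeout: ~5 minutes -->
</property>
```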
Cheers,
On Wednesday 26 October 2011 21:23:16 Sebastian Nagel wrote:
> Hi Markus,
>
> the error resembles a problem I've observed some time ago but never managed
> to open an issue. Opened right now:
> https://issues.apache.org/jira/browse/NUTCH-1182 The stack you observed is
> the same.
>
> Sebastian
>
> On 10/19/2011 05:01 PM, Markus Jelsma wrote:
> > [...]
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: Fetcher NPE's
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Markus,
the error resembles a problem I've observed some time ago but never managed
to open an issue. Opened right now: https://issues.apache.org/jira/browse/NUTCH-1182
The stack you observed is the same.
Sebastian
On 10/19/2011 05:01 PM, Markus Jelsma wrote:
> [...]
Re: Fetcher NPE's
Posted by Markus Jelsma <ma...@openindex.io>.
I should add that these URLs not only pass the index- and parser checkers but
also manual local test crawl cycles. There's also nothing significant in the
syslog. Dmesg shows messages about too little memory, but that's normal.
> [...]