Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2014/01/02 17:33:17 UTC
SegmentReader broken in distributed mode
Hi,
We can read segments fine from the local disk like this:
bin/nutch readseg -list 20140102161258 -nocontent -nogenerate
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20140102161258  0          2014-01-02T16:13:09  2014-01-02T16:20:39  1227     1096
But we get the following exception when reading the same segment in distributed mode:
bin/nutch readseg -list 20140102161258 -nocontent -nogenerate
Exception in thread "main" java.lang.NullPointerException
        at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
        at java.util.Arrays.sort(Arrays.java:472)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
        at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
        at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
        at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Can anyone confirm this issue?
Thanks,
Markus
RE: SegmentReader broken in distributed mode
Posted by Markus Jelsma <ma...@openindex.io>.
It seems that the getStats code ignores the nocontent and nogenerate options:
public void getStats(Path segment, final SegmentReaderStats stats) throws Exception {
  SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(getConf(), new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
  long cnt = 0L;
  Text key = new Text();
  for (int i = 0; i < readers.length; i++) {
    while (readers[i].next(key)) cnt++;
    readers[i].close();
  }
  ...
But then it should fail in local mode as well. Or does Hadoop just silently ignore it in local mode?
It seems we should open an issue so that getStats does not ignore the -no* flags.
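For illustration, here is a minimal sketch of the two guards such a fix might need: skip the generate directory when the flag says so, and never sort or iterate a listing of a directory that does not exist. This is not the Nutch or Hadoop code. It uses plain java.nio so it runs on its own, and the class SegmentStatsSketch, the methods listParts and countGenerated, and the -1 "skipped" marker are all hypothetical names made up for this example. It also counts part files rather than records, purely to keep the example short.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the relevant part of SegmentReader; not Nutch code.
final class SegmentStatsSketch {

    // Lists the part files of one segment subdirectory, sorted by name.
    // Returns an empty list when the subdirectory is missing, so the caller
    // never sorts or iterates a null array.
    static List<Path> listParts(Path segment, String subdir) throws IOException {
        Path dir = segment.resolve(subdir);
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    // Counts part files under crawl_generate. The boolean mirrors the
    // -nogenerate switch: when false, the directory is not touched at all
    // and -1 is returned as an (assumed) "skipped" marker.
    static long countGenerated(Path segment, boolean generate) throws IOException {
        if (!generate) {
            return -1L;
        }
        return listParts(segment, "crawl_generate").size();
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempDirectory("segment");
        Path generateDir = Files.createDirectory(segment.resolve("crawl_generate"));
        Files.createFile(generateDir.resolve("part-00000"));
        System.out.println("with generate: " + countGenerated(segment, true));
        System.out.println("with -nogenerate: " + countGenerated(segment, false));
    }
}
```

In the real getStats the same idea would presumably mean checking the option before calling SequenceFileOutputFormat.getReaders for each directory, but how the flags are passed down is something the actual patch has to decide.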
RE: SegmentReader broken in distributed mode
Posted by Markus Jelsma <ma...@openindex.io>.
Created issue:
https://issues.apache.org/jira/browse/NUTCH-1692