Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2014/01/02 17:33:17 UTC

SegmentReader broken in distributed mode

Hi,

We can read segments fine from the local disk like this:

bin/nutch readseg -list 20140102161258 -nocontent -nogenerate
NAME            GENERATED       FETCHER START           FETCHER END             FETCHED PARSED
20140102161258  0               2014-01-02T16:13:09     2014-01-02T16:20:39     1227    1096

But we get the following exception when reading the same segment in distributed mode:

bin/nutch readseg -list 20140102161258 -nocontent -nogenerate
Exception in thread "main" java.lang.NullPointerException
        at java.util.ComparableTimSort.sort(ComparableTimSort.java:146)
        at java.util.Arrays.sort(Arrays.java:472)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:85)
        at org.apache.nutch.segment.SegmentReader.getStats(SegmentReader.java:463)
        at org.apache.nutch.segment.SegmentReader.list(SegmentReader.java:441)
        at org.apache.nutch.segment.SegmentReader.main(SegmentReader.java:587)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:160)

Can anyone confirm this issue?

Thanks,
Markus

RE: SegmentReader broken in distributed mode

Posted by Markus Jelsma <ma...@openindex.io>.
It seems that the getStats code ignores the -nocontent and -nogenerate options:

  public void getStats(Path segment, final SegmentReaderStats stats) throws Exception {
    SequenceFile.Reader[] readers = SequenceFileOutputFormat.getReaders(getConf(), new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
    long cnt = 0L;
    Text key = new Text();
    for (int i = 0; i < readers.length; i++) {
      while (readers[i].next(key)) cnt++;
      readers[i].close();
    }
...

But then it should not work in local mode either, or does Hadoop just silently ignore it in local mode?

It seems we should open an issue so that getStats does not ignore the -no* flags.
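
To illustrate the kind of guard meant here, below is a minimal sketch using only the JDK, not Nutch or Hadoop code. It assumes the NullPointerException comes from sorting a null file listing for a segment subdirectory that is missing, which the stack trace suggests but does not prove. The class SegmentStatsSketch and the methods listParts and countParts are hypothetical names made up for this example; java.io.File stands in for the Hadoop FileSystem API.

```java
import java.io.File;
import java.util.Arrays;

public class SegmentStatsSketch {

    // Hypothetical stand-in for the reader lookup: list and sort the part
    // files of one segment subdirectory. File.list() returns null when the
    // directory does not exist, so the null check avoids the
    // NullPointerException that an unguarded Arrays.sort(null) would throw.
    static String[] listParts(File dir) {
        String[] names = dir.list();
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names);
        return names;
    }

    // Hypothetical guard: only touch the subdirectory when the matching
    // option is enabled; -1 signals "not counted".
    static long countParts(File segment, String subdir, boolean enabled) {
        if (!enabled) {
            return -1L;
        }
        return listParts(new File(segment, subdir)).length;
    }

    public static void main(String[] args) {
        File segment = new File(args.length > 0 ? args[0] : "20140102161258");
        // With -nogenerate the generate directory is never listed.
        System.out.println("generate parts: " + countParts(segment, "crawl_generate", false));
        // An enabled but missing directory yields 0 instead of an exception.
        System.out.println("fetch parts: " + countParts(segment, "crawl_fetch", true));
    }
}
```

In getStats itself, the equivalent change would presumably be to skip the CrawlDatum.GENERATE_DIR_NAME readers when -nogenerate is given, and likewise for the other -no* flags.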
 

RE: SegmentReader broken in distributed mode

Posted by Markus Jelsma <ma...@openindex.io>.
Created issue:
https://issues.apache.org/jira/browse/NUTCH-1692

 
 