Posted to user@nutch.apache.org by Ali Safdar Kureishy <sa...@gmail.com> on 2012/05/09 10:01:52 UTC

Consistent Checksum error using SequenceFileInputFormat against /content & /parse_text folders output by Nutch.

Hi,

I've included both the Nutch and Hadoop mailing lists, since I don't know
which of the two is the root cause of this issue, and it might be possible
to pursue a resolution from both sides.

What I'm trying to do is dump the contents of all the fetched pages from
my Nutch crawl -- about 600K of them. I initially tried extracting this
information from the *<segment>/parse_text* folder, but I kept receiving
the error below, so I switched over to the *<segment>/content* folder.
BOTH of these *consistently* give me the following ChecksumException,
which fails the map-reduce job. At the very least I'm hoping to get some
tips on how to ignore this error and let my job complete.

org.apache.hadoop.fs.ChecksumException: Checksum Error
    at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
    at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:156)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:499)
    at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:381)
    at org.apache.hadoop.mapred.Merger.merge(Merger.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1522)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

I'm using the *SequenceFileInputFormat* to read the data in each case.

I have also attached the Hadoop output (checksum-error.txt). I have no idea
how to ignore this error or how to debug it. I've tried setting the boolean
*io.skip.checksum.errors* property to *true* on the MapReduce JobConf
object, but it makes no difference. The error still happens consistently,
so it seems I'm either not setting the right property, or it is being
ignored by Hadoop. Since the error is thrown deep in the internals of
Hadoop, there doesn't seem to be any other way to ignore it without
changing Hadoop code (which I'm not able to do at this point). Is this a
problem with the data output by Nutch, or is this a bug in Hadoop? *Btw, I
ran Nutch in local mode (without Hadoop), and I'm running the Hadoop job
(below) purely as an application from Eclipse (not via the bin/hadoop
script).*
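
For completeness, this is roughly how I'm setting the property, plus one
more thing I could try: disabling client-side checksum verification on the
local filesystem. That second call is only a guess on my part, since the
stack trace points at the map-output merge (IFile/Merger) and may not go
through this code path at all:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

public class ChecksumSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Ask sequence-file readers to skip entries with bad checksums
        // instead of throwing (the same property I set on the JobConf below).
        conf.setBoolean("io.skip.checksum.errors", true);

        // Guess: additionally disable client-side checksum verification on
        // the local filesystem, in case the failing read goes through
        // ChecksumFileSystem rather than the map-output merge.
        LocalFileSystem localFs = FileSystem.getLocal(conf);
        localFs.setVerifyChecksum(false);
    }
}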

Any help or pointers on how to dig further into this would be greatly
appreciated. If there is any other way for me to ignore these checksum
errors and let the job complete, please do share that with me as well.

Here is the code for the reader job using MapReduce:

package org.q.alt.sc.nutch.readerjobs;

import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader extends Configured implements Tool {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SegmentContentReader(), args);
        System.exit(exitCode);
    }

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf(
                "Usage: %s [generic options] <input dir> <output dir>\n",
                getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }

        JobConf conf = new JobConf(getConf(), SegmentContentReader.class);
        conf.setBoolean("io.skip.checksum.errors", true);
        conf.setJobName(this.getClass().getName());
        conf.setJarByClass(SegmentContentReader.class);

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        conf.setInputFormat(SequenceFileInputFormat.class);

        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(Mapper1.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);

        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);

        JobClient.runJob(conf);
        return 0;
    }

    public static class Mapper1 extends MapReduceBase
            implements Mapper<Text, Content, Text, Text> {

        @Override
        public void map(Text key, Content value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
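            // Note: new String(bytes) decodes with the platform default
            // charset; the page's actual encoding is available from the
            // Content metadata if a faithful conversion is needed.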
            String content = new String(value.getContent());
            //System.out.println("Content: " + content);
            output.collect(key, new Text(content));
        }
    }
}
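
In case it helps narrow down whether the stored segment data or the
map-output merge is at fault, below is a minimal standalone reader I could
also run over the same files. This is only a sketch: it assumes the usual
Nutch segment layout, where each part-NNNNN directory under content/ is a
MapFile whose "data" file is a SequenceFile of <Text, Content> records,
and the DirectSegmentDump name is just for illustration.

package org.q.alt.sc.nutch.readerjobs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

/** Walks the "data" file of each part under a segment folder, without MapReduce. */
public class DirectSegmentDump {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path contentDir = new Path(args[0]); // e.g. <segment>/content

        for (FileStatus part : fs.listStatus(contentDir)) {
            Path data = new Path(part.getPath(), "data");
            if (!fs.exists(data)) {
                continue; // skip anything that isn't a map-file part
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
            try {
                Text key = new Text();
                Content value = new Content();
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value.getContent().length
                        + " bytes");
                }
            } finally {
                reader.close();
            }
        }
    }
}

If this walks all ~600K records cleanly, the problem is presumably in the
spill/merge of the map output rather than in the Nutch data itself.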

Regards,
Safdar

Re: Consistent Checksum error using SequenceFileInputFormat against /content & /parse_text folders output by Nutch.

Posted by Ali Safdar Kureishy <sa...@gmail.com>.
Actually, the ChecksumException happens every time, but it can happen at
different points in the execution: sometimes at the beginning, and
sometimes at the tail end of the map phase.

Hoping to hear from someone with a workaround...

Regards,
Safdar

Re: Consistent Checksum error using SequenceFileInputFormat against /content & /parse_text folders output by Nutch.

Posted by Ali Safdar Kureishy <sa...@gmail.com>.
Hi Subbu!

Thanks so much for this tip. Strangely, it doesn't seem to work for me ...
I still get the checksum error (though it appears to happen later on in the
job).

Has this workaround always worked for you? I also tried using the
setMaxMapperFailurePercentage() and setMaxReducerFailurePercentage()
settings (set them to 20% each), but I still see this checksum error.
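
(For reference, the standard JobConf calls I believe these correspond to
are setMaxMapTaskFailuresPercent() and setMaxReduceTaskFailuresPercent();
a quick sketch of how the 20% values could be applied is below -- the
helper class name is just for illustration.)

import org.apache.hadoop.mapred.JobConf;

public class FailureToleranceSettings {

    /** Allows a fraction of task failures without failing the whole job. */
    public static void apply(JobConf conf) {
        // Let up to 20% of map tasks fail without failing the job.
        conf.setMaxMapTaskFailuresPercent(20);
        // Likewise for reduce tasks.
        conf.setMaxReduceTaskFailuresPercent(20);
    }
}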

Any thoughts/suggestions?

Thanks again!

Regards,
Safdar


Re: Consistent Checksum error using SequenceFileInputFormat against /content & /parse_text folders output by Nutch.

Posted by Kasi Subrahmanyam <ka...@gmail.com>.
Hi Ali,
I also faced this error when I ran jobs either locally or on a cluster.
I was able to solve the problem by removing the .crc file created in the
input folder for the job.
Please check that there is no .crc file in the input.
I hope this solves the problem.
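
A quick way to check is to walk the input folder (and its part-NNNNN
subfolders) and look for anything ending in ".crc". This is just a sketch
using plain java.io, since your segment is on local disk; the class name
is only for illustration:

import java.io.File;

/** Recursively lists .crc checksum files under the job's input folder. */
public class CrcFileCheck {

    public static void main(String[] args) {
        // args[0]: the job's input folder, e.g. <segment>/content
        scan(new File(args[0]));
    }

    private static void scan(File f) {
        if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {
                for (File child : children) {
                    scan(child);
                }
            }
        } else if (f.getName().endsWith(".crc")) {
            // Print (or delete) the stale checksum file.
            System.out.println("Found checksum file: " + f.getAbsolutePath());
        }
    }
}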

Thanks,
Subbu
