Posted to common-user@hadoop.apache.org by Boyu Zhang <bo...@gmail.com> on 2010/10/15 23:02:08 UTC

Corrupted input data to map

Hi all,

I am running a program whose input is 1 million lines of data; of those, 5 or
6 lines are corrupted. They are corrupted as follows: in a position where a
float number such as 3.4 is expected, there is something like 3.4.5.6 instead.
So when the map runs, it throws a NumberFormatException ("multiple points").
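
For reference, a minimal repro of that failure, assuming the field is parsed
with Java's standard Float.parseFloat; the exact message text can vary by JVM:

public class ParseRepro {
    public static void main(String[] args) {
        try {
            Float.parseFloat("3.4.5.6");   // the corrupted token
        } catch (NumberFormatException e) {
            // On Sun/Oracle JVMs this prints "multiple points".
            System.out.println(e.getMessage());
        }
    }
}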

My question is: the map tasks that hit the exception are marked as failed, but
what about the data the same map processed before the exception? Does it reach
the reduce task, or is it treated as garbage? Thank you very much, any help is
appreciated.

Boyu

Re: Corrupted input data to map

Posted by Lance Norskog <go...@gmail.com>.
There is a small but measurable bit error rate in copying data around:
some RAM chips see a bit error per GB per century, others per hour.
HDFS itself has (I believe) a checksum, but moving the gigabytes
around is still vulnerable. At this point, new file systems should
include optional checksums for all files.

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
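
For what it's worth, you can ask HDFS for a file's checksum directly. A
minimal sketch, assuming a vanilla Configuration; the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowChecksum {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path; point it at your own file.
        FileChecksum sum = fs.getFileChecksum(new Path("/user/boyu/input.txt"));
        // Returns null if the underlying filesystem keeps no checksum.
        System.out.println(sum == null ? "no checksum" : sum.toString());
    }
}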




-- 
Lance Norskog
goksron@gmail.com

Re: Corrupted input data to map

Posted by Raymond Jennings III <ra...@yahoo.com>.
I am curious whether your data got corrupted when you transferred your file
into HDFS. I recently had a very similar situation, where about 5 lines with
decimal points got corrupted. Only when I transferred the file back out of
HDFS and compared it to the original did I finally figure out what was wrong.
I don't have an answer to your specific question, but I am curious whether
you experienced the same thing that I did.
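
One quick way to check for that is to compare digests instead of eyeballing a
diff. A minimal sketch, not from the original exchange, that hashes the local
file and the HDFS copy; both paths are hypothetical:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareDigests {
    // Drains a stream and returns its MD5 digest as hex.
    static String md5(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buf = new byte[8192];
            while (dis.read(buf) != -1) { /* just read to the end */ }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Both paths are hypothetical; substitute your own.
        String local = md5(new FileInputStream("/tmp/input.txt"));
        String hdfs  = md5(fs.open(new Path("/user/boyu/input.txt")));
        System.out.println(local.equals(hdfs) ? "match" : "MISMATCH");
    }
}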





Re: Corrupted input data to map

Posted by Jeff Zhang <zj...@gmail.com>.
You can read the input as plain text and do the type conversion in the
mapper. If a NumberFormatException happens, you can decide how to handle it,
e.g., increment a custom Counter to record it, or fall back to a default
value.
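
A minimal sketch of that approach, not from the original message; the counter
enum and the tab-separated record layout are made-up assumptions, and it uses
the org.apache.hadoop.mapreduce API:

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FloatParsingMapper
        extends Mapper<LongWritable, Text, Text, FloatWritable> {

    // Hypothetical counter for tracking bad records.
    enum BadRecords { MALFORMED_FLOAT }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical record layout: "id<TAB>float".
        String[] fields = value.toString().split("\t");
        try {
            float f = Float.parseFloat(fields[1]);
            context.write(new Text(fields[0]), new FloatWritable(f));
        } catch (NumberFormatException e) {
            // A corrupted line like "3.4.5.6": count it and skip it
            // instead of failing the whole task.
            context.getCounter(BadRecords.MALFORMED_FLOAT).increment(1);
        }
    }
}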




-- 
Best Regards

Jeff Zhang