Posted to user@hive.apache.org by Aaron Kimball <ak...@gmail.com> on 2011/02/01 06:57:58 UTC

Re: tons of bugs and problem found

In MapReduce, filenames that begin with an underscore are "hidden" files and
are not enumerated by FileInputFormat (Hive, I believe, processes tables
with TextInputFormat and SequenceFileInputFormat, both descendants of this
class).

Using "_foo" as a hidden/ignored filename is conventional in the Hadoop
world. This is different from the UNIX convention of using ".foo", but
that's software engineering for you. ;)

This is unlikely to change soon; MapReduce emits files with names like
"_SUCCESS" into directories to indicate successful job completion.
Directories such as "_tmp" and "_logs" also appear in datasets, and are
therefore ignored as input by MapReduce-based tools, but those metadata
names are established in other projects.
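The filtering rule itself is simple; a minimal sketch of the convention in plain shell (illustrative only, no cluster required):

```shell
# Illustrates FileInputFormat's default hidden-file rule: path names
# whose last component begins with "_" or "." are excluded from job input.
for f in _top.sql top.sql _SUCCESS _logs part-00000; do
  case "$f" in
    _*|.*) echo "skipped: $f" ;;
    *)     echo "input:   $f" ;;
  esac
done
```

This is why `_top.sql` loads into the warehouse directory fine (HDFS itself has no objection to the name) but then never shows up as query input.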

If you run 'hadoop fs -mv /path/to/_top.sql /path/to/top.sql', that should
make things work for you.

- Aaron

On Mon, Jan 31, 2011 at 10:21 AM, yongqiang he <he...@gmail.com> wrote:

> You can first try to set io.skip.checksum.errors to true, which will
> ignore bad checksum.
>
> >>In facebook, we also had a requirement to ignore corrupt/bad data - but
> it has not been committed yet. Yongqiang, what is the jira number ?
> there seems to be no jira for this issue.
>
> thanks
> yongqiang
> 2011/1/31 Namit Jain <nj...@fb.com>:
> >
> >
> > On 1/31/11 7:46 AM, "Laurent Laborde" <ke...@gmail.com> wrote:
> >
> >>On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde <ke...@gmail.com>
> >>wrote:
> >>> On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain <nj...@fb.com> wrote:
> >>>> Hi Laurent,
> >>>>
> >>>> 1. Are you saying that _top.sql did not exist in the home directory.
> >>>> Or that, _top.sql existed, but hive was not able to read it after
> >>>>loading
> >>>
> >>> It exist, it's loaded, and i can see it in the hive's warehouse
> >>>directory.
> >>> it's just impossible to query it.
> >>>
> >>>> 2. I don't think reserved words are documented somewhere. Can you file
> >>>>a
> >>>> jira for this ?
> >>>
> >>> Ok; will do that today.
> >>>
> >>>> 3. The bad row is printed in the task log.
> >>>>
> >>>> 1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker:
> >>>>Found
> >>>> checksum error: b[1024,
> >>>>
>
> >>>>1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370
> >>>>3b2
> >>>>
>
> >>>>66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616
> >>>>e3e
> >>>>
>
> >>>>3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374
> >>>>796
> >>>>
>
> >>>>c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737
> >>>>479
> >>>>
>
> >>>>6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370
> >>>>3b2
> >>>>
>
> >>>>66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2
> >>>>66e
> >>>>
>
> >>>>6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265
> >>>>637
> >>>>
>
> >>>>46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203
> >>>>c70
> >>>>
>
> >>>>3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c
> >>>>766
> >>>>
>
> >>>>5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743
> >>>>b22
> >>>>
>
> >>>>3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e
> >>>>627
> >>>>
>
> >>>>3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6
> >>>>e20
> >>>>
>
> >>>>6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020
> >>>>3c2
> >>>> f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d
> >>>
> >>> Is this the actual data ?
> >>>
> >>>> 2. org.apache.hadoop.fs.ChecksumException: Checksum error:
> >>>> /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy
> >>>>at
> >>>> 23446528
> >>>
> >>> 23446528 is the line number ?
> >>>
> >>> thank you
> >>
> >>optional question (the previous ones are still open) :
> >>is there a way to tell hive to ignore invalid data ? (if the problem
> >>is invalid data)
> >>
> >
> > Currently, not.
> > In facebook, we also had a requirement to ignore corrupt/bad data - but
> it
> > has not
> > been committed yet. Yongqiang, what is the jira number ?
> >
> >
> > Thanks,
> > -namit
> >
> >
> >>
> >>--
> >>Laurent "ker2x" Laborde
> >>Sysadmin & DBA at http://www.over-blog.com/
> >
> >
>

Re: tons of bugs and problem found

Posted by Laurent Laborde <ke...@gmail.com>.
after a lot of trial and error and doubt...
it turns out to be a memory hardware problem (confirmed by memtest) :(
The file gets corrupted whenever the 130GB file is moved, written, or read.

thank you for your help and thanks to #hadoop@freenode

-- 
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/

Re: tons of bugs and problem found

Posted by yongqiang he <he...@gmail.com>.
I just noticed that your input file is actually a text file. There is a
SkipBadRecords feature in Hadoop for text files, but I think Hive does
not support it yet. However, I think you can hack around that by doing
the setting yourself.

Just look at the SkipBadRecords code to find the conf names and
values, and set them manually before running your query.
Good luck.
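A hedged sketch of what "doing the setting yourself" could look like from the Hive CLI. The property names below come from 0.20-era Hadoop's SkipBadRecords class; double-check them against the SkipBadRecords source for your Hadoop version before relying on them, as suggested above:

```shell
# Illustrative config fragment only -- verify each property name against
# the SkipBadRecords class in your Hadoop release.
hive <<'EOF'
-- Allow the framework to skip up to 1 bad record per failing range
set mapred.skip.map.max.skip.records=1;
-- Start skipping after 2 failed attempts of the same task
set mapred.skip.attempts.to.start.skipping=2;
-- Give tasks enough attempts to narrow down the bad record range
set mapred.map.max.attempts=8;
SELECT COUNT(1) FROM article;
EOF
```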

On Tue, Feb 1, 2011 at 12:54 PM, Laurent Laborde <ke...@gmail.com> wrote:
> thank you for your replies.
> i reinstalled hadoop and hive, switched from Cloudera CDH3 to CDH2,
> restarted everything from scratch
> i've set io.skip.checksum.errors=true
>
> and i still have the same error :(
>
> what's wrong ? :(
> the dataset comes from a postgresql database and is consistent.
>
>
> On Tue, Feb 1, 2011 at 6:57 AM, Aaron Kimball <ak...@gmail.com> wrote:
>> In MapReduce, filenames that begin with an underscore are "hidden" files and
>> are not enumerated by FileInputFormat (Hive, I believe, processes tables
>> with TextInputFormat and SequenceFileInputFormat, both descendants of this
>> class).
>> Using "_foo" as a hidden/ignored filename is conventional in the Hadoop
>> world. This is different than the UNIX convention of using ".foo", but
>> that's software engineering for you. ;)
>> This is unlikely to change soon; MapReduce emits files with names like
>> "_SUCCESS" into directories to indicate successful job completion.
>> Directories such as "_tmp" and "_logs" also appear in datasets, and are
>> therefore ignored as input by MapReduce-based tools, but those metadata
>> names are established in other projects.
>> If you run 'hadoop fs -mv /path/to/_top.sql /path/to/top.sql', that should
>> make things work for you.
>> - Aaron
>>
>> On Mon, Jan 31, 2011 at 10:21 AM, yongqiang he <he...@gmail.com>
>> wrote:
>>>
>>> You can first try to set io.skip.checksum.errors to true, which will
>>> ignore bad checksum.
>>>
>>> >>In facebook, we also had a requirement to ignore corrupt/bad data - but
>>> >> it has not been committed yet. Yongqiang, what is the jira number ?
>>> there seems no jira for this issue.
>>>
>>> thanks
>>> yongqiang
>>> 2011/1/31 Namit Jain <nj...@fb.com>:
>>> >
>>> >
>>> > On 1/31/11 7:46 AM, "Laurent Laborde" <ke...@gmail.com> wrote:
>>> >
>>> >>On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde <ke...@gmail.com>
>>> >>wrote:
>>> >>> On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain <nj...@fb.com> wrote:
>>> >>>> Hi Laurent,
>>> >>>>
>>> >>>> 1. Are you saying that _top.sql did not exist in the home directory.
>>> >>>> Or that, _top.sql existed, but hive was not able to read it after
>>> >>>>loading
>>> >>>
>>> >>> It exist, it's loaded, and i can see it in the hive's warehouse
>>> >>>directory.
>>> >>> it's just impossible to query it.
>>> >>>
> >>> >>>> 2. I don't think reserved words are documented somewhere. Can you
>>> >>>> file
>>> >>>>a
>>> >>>> jira for this ?
>>> >>>
>>> >>> Ok; will do that today.
>>> >>>
>>> >>>> 3. The bad row is printed in the task log.
>>> >>>>
>>> >>>> 1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker:
>>> >>>>Found
>>> >>>> checksum error: b[1024,
>>> >>>>
>>>
>>> >>>> >>>>1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370
>>> >>>>3b2
>>> >>>>
>>>
>>> >>>> >>>>66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616
>>> >>>>e3e
>>> >>>>
>>>
>>> >>>> >>>>3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374
>>> >>>>796
>>> >>>>
>>>
>>> >>>> >>>>c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737
>>> >>>>479
>>> >>>>
>>>
>>> >>>> >>>>6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370
>>> >>>>3b2
>>> >>>>
>>>
>>> >>>> >>>>66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2
>>> >>>>66e
>>> >>>>
>>>
>>> >>>> >>>>6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265
>>> >>>>637
>>> >>>>
>>>
>>> >>>> >>>>46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203
>>> >>>>c70
>>> >>>>
>>>
>>> >>>> >>>>3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c
>>> >>>>766
>>> >>>>
>>>
>>> >>>> >>>>5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743
>>> >>>>b22
>>> >>>>
>>>
>>> >>>> >>>>3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e
>>> >>>>627
>>> >>>>
>>>
>>> >>>> >>>>3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6
>>> >>>>e20
>>> >>>>
>>>
>>> >>>> >>>>6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020
>>> >>>>3c2
>>> >>>> f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d
>>> >>>
>>> >>> Is this the actual data ?
>>> >>>
>>> >>>> 2. org.apache.hadoop.fs.ChecksumException: Checksum error:
>>> >>>> /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy
>>> >>>>at
>>> >>>> 23446528
>>> >>>
>>> >>> 23446528 is the line number ?
>>> >>>
>>> >>> thank you
>>> >>
>>> >>optional question (the previous ones are still open) :
>>> >>is there a way to tell hive to ignore invalid data ? (if the problem
>>> >>is invalid data)
>>> >>
>>> >
>>> > Currently, not.
>>> > In facebook, we also had a requirement to ignore corrupt/bad data - but
>>> > it
>>> > has not
>>> > been committed yet. Yongqiang, what is the jira number ?
>>> >
>>> >
>>> > Thanks,
>>> > -namit
>>> >
>>> >
>>> >>
>>> >>--
>>> >>Laurent "ker2x" Laborde
>>> >>Sysadmin & DBA at http://www.over-blog.com/
>>> >
>>> >
>>
>>
>
>
>
> --
> Laurent "ker2x" Laborde
> Sysadmin & DBA at http://www.over-blog.com/
>

Re: tons of bugs and problem found

Posted by Laurent Laborde <ke...@gmail.com>.
thank you for your replies.
i reinstalled hadoop and hive, switched from Cloudera CDH3 to CDH2,
restarted everything from scratch
i've set io.skip.checksum.errors=true

and i still have the same error :(

what's wrong ? :(
the dataset comes from a postgresql database and is consistent.
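For reference, the flag above can be set per-session or cluster-wide; a sketch of both (paths illustrative). One caveat worth double-checking: in 0.20-era Hadoop source, io.skip.checksum.errors appears to be consulted by the SequenceFile reader, so it may simply not apply to plain text input:

```shell
# Per-session, from the Hive CLI:
#   hive> set io.skip.checksum.errors=true;
#
# Or cluster-wide via a core-site.xml property (snippet written to a
# scratch file here purely for illustration):
cat > /tmp/core-site-snippet.xml <<'EOF'
<property>
  <name>io.skip.checksum.errors</name>
  <value>true</value>
</property>
EOF
```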


On Tue, Feb 1, 2011 at 6:57 AM, Aaron Kimball <ak...@gmail.com> wrote:
> In MapReduce, filenames that begin with an underscore are "hidden" files and
> are not enumerated by FileInputFormat (Hive, I believe, processes tables
> with TextInputFormat and SequenceFileInputFormat, both descendants of this
> class).
> Using "_foo" as a hidden/ignored filename is conventional in the Hadoop
> world. This is different than the UNIX convention of using ".foo", but
> that's software engineering for you. ;)
> This is unlikely to change soon; MapReduce emits files with names like
> "_SUCCESS" into directories to indicate successful job completion.
> Directories such as "_tmp" and "_logs" also appear in datasets, and are
> therefore ignored as input by MapReduce-based tools, but those metadata
> names are established in other projects.
> If you run 'hadoop fs -mv /path/to/_top.sql /path/to/top.sql', that should
> make things work for you.
> - Aaron
>
> On Mon, Jan 31, 2011 at 10:21 AM, yongqiang he <he...@gmail.com>
> wrote:
>>
>> You can first try to set io.skip.checksum.errors to true, which will
>> ignore bad checksum.
>>
>> >>In facebook, we also had a requirement to ignore corrupt/bad data - but
>> >> it has not been committed yet. Yongqiang, what is the jira number ?
>> there seems no jira for this issue.
>>
>> thanks
>> yongqiang
>> 2011/1/31 Namit Jain <nj...@fb.com>:
>> >
>> >
>> > On 1/31/11 7:46 AM, "Laurent Laborde" <ke...@gmail.com> wrote:
>> >
>> >>On Fri, Jan 28, 2011 at 8:05 AM, Laurent Laborde <ke...@gmail.com>
>> >>wrote:
>> >>> On Fri, Jan 28, 2011 at 1:12 AM, Namit Jain <nj...@fb.com> wrote:
>> >>>> Hi Laurent,
>> >>>>
>> >>>> 1. Are you saying that _top.sql did not exist in the home directory.
>> >>>> Or that, _top.sql existed, but hive was not able to read it after
>> >>>>loading
>> >>>
>> >>> It exist, it's loaded, and i can see it in the hive's warehouse
>> >>>directory.
>> >>> it's just impossible to query it.
>> >>>
> >> >>>> 2. I don't think reserved words are documented somewhere. Can you
>> >>>> file
>> >>>>a
>> >>>> jira for this ?
>> >>>
>> >>> Ok; will do that today.
>> >>>
>> >>>> 3. The bad row is printed in the task log.
>> >>>>
>> >>>> 1. 2011-01-27 11:11:07,046 INFO org.apache.hadoop.fs.FSInputChecker:
>> >>>>Found
>> >>>> checksum error: b[1024,
>> >>>>
>>
>> >>>> >>>>1536]=7374796c653d22666f6e742d73697a653a20313270743b223e3c623e266e627370
>> >>>>3b2
>> >>>>
>>
>> >>>> >>>>66e6273703b266e6273703b202a202838302920416d69656e733a3c2f623e3c2f7370616
>> >>>>e3e
>> >>>>
>>
>> >>>> >>>>3c2f7370616e3e5c6e20203c2f703e5c6e20203c703e5c6e202020203c7370616e207374
>> >>>>796
>> >>>>
>>
>> >>>> >>>>c653d22666f66742d66616d696c793a2068656c7665746963613b223e3c7370616e20737
>> >>>>479
>> >>>>
>>
>> >>>> >>>>6c653d22666f6e742d73697a653a20313270743b223e3c623e266e6273703b266e627370
>> >>>>3b2
>> >>>>
>>
>> >>>> >>>>66e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b2
>> >>>>66e
>> >>>>
>>
>> >>>> >>>>6273703b206f203132682c2050697175652d6e6971756520646576616e74206c65205265
>> >>>>637
>> >>>>
>>
>> >>>> >>>>46f7261742e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e20203c2f703e5c6e20203
>> >>>>c70
>> >>>>
>>
>> >>>> >>>>3e5c6e202020203c7370616e207374796c653d22666f6e742d66616d696c793a2068656c
>> >>>>766
>> >>>>
>>
>> >>>> >>>>5746963613b223e3c7370616e207374796c653d22666f6e742d73697a653a20313270743
>> >>>>b22
>> >>>>
>>
>> >>>> >>>>3e3c623e266e6273703b266e6273703b266e6273703b266e6273703b266e6273703b266e
>> >>>>627
>> >>>>
>>
>> >>>> >>>>3703b266e6273703b266e6273703b266e6273703b206f2031346833302c204d6169736f6
>> >>>>e20
>> >>>>
>>
>> >>>> >>>>6465206c612063756c747572652e3c2f623e3c2f7370616e3e3c2f7370616e3e5c6e2020
>> >>>>3c2
>> >>>> f703e5c6e20203c703e5c6e202020203c7370616e207374796c653d
>> >>>
>> >>> Is this the actual data ?
>> >>>
>> >>>> 2. org.apache.hadoop.fs.ChecksumException: Checksum error:
>> >>>> /blk_2466764552666222475:of:/user/hive/warehouse/article/article.copy
>> >>>>at
>> >>>> 23446528
>> >>>
>> >>> 23446528 is the line number ?
>> >>>
>> >>> thank you
>> >>
>> >>optional question (the previous ones are still open) :
>> >>is there a way to tell hive to ignore invalid data ? (if the problem
>> >>is invalid data)
>> >>
>> >
>> > Currently, not.
>> > In facebook, we also had a requirement to ignore corrupt/bad data - but
>> > it
>> > has not
>> > been committed yet. Yongqiang, what is the jira number ?
>> >
>> >
>> > Thanks,
>> > -namit
>> >
>> >
>> >>
>> >>--
>> >>Laurent "ker2x" Laborde
>> >>Sysadmin & DBA at http://www.over-blog.com/
>> >
>> >
>
>



-- 
Laurent "ker2x" Laborde
Sysadmin & DBA at http://www.over-blog.com/