Posted to user@sqoop.apache.org by Bhargav Nallapu <bh...@corp.247customer.com> on 2012/11/23 06:07:01 UTC

Fwd: Sqoop export .lzo to mysql duplicates

Hi,

I'm seeing a strange issue.

Context:

Hive writes its output to an external table with LZO compression enabled, so my HDFS folder contains large_file.lzo.
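
For reference, the LZO output is produced on the Hive side roughly like this (a sketch only; the table and query names are placeholders, and it assumes the hadoop-lzo LzopCodec is installed on the cluster):

hive -e "
  SET hive.exec.compress.output=true;
  SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
  INSERT OVERWRITE TABLE my_external_table
  SELECT * FROM source_table;
"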

When I use Sqoop to export this file to the MySQL table, the number of rows is doubled.
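
For context, the export command looks roughly like this (connection string, credentials, table and directory names are placeholders; the field terminator assumes Hive's default ^A delimiter):

sqoop export \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table target_table \
  --export-dir /user/hive/warehouse/my_external_table \
  --input-fields-terminated-by '\001' \
  --num-mappers 1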

Then I decompress it:
lzop -d large_file.lzo

The problem doesn't occur if I export the same file uncompressed ("large_file"); the row count is as expected.

Both small_file and small_file.lzo, on the other hand, are loaded with the correct number of rows.

Sqoop: v1.30
Number of mappers: 1

Observation: any compressed file (gzipped or LZO) larger than about 60 MB (possibly 64 MB) ends up with double the row count when exported to the DB, probably as exact duplicates.
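
The 60-64 MB threshold lines up with the default HDFS block/split size, which suggests the file is being divided into more than one input split. To confirm the duplication I compare the line count of the file with the row count that lands in MySQL, roughly like this (paths, host, and table names are placeholders):

# rows in the compressed HDFS file
hadoop fs -cat /path/to/large_file.lzo | lzop -dc | wc -l
# rows that landed in MySQL after the export
mysql -h dbhost -u dbuser -p -e "SELECT COUNT(*) FROM mydb.target_table"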
Can anyone please help?

Re: Sqoop export .lzo to mysql duplicates

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Bhargav,
you're right that this ticket was filed only yesterday; however, I've seen this behaviour reported by multiple users and I was able to replicate it in my testing environment. I'm going to continue investigating today. You're more than welcome to do your own debugging and share what you find. Contributions are always welcome!

Jarcec


Re: Sqoop export .lzo to mysql duplicates

Posted by Bhargav Nallapu <bh...@corp.247customer.com>.
Hi Jarcec,

Thanks for the quick reply.

In fact, I checked the ticket as soon as you directed me to it, but I was skeptical since it was filed as recently as yesterday.

Since exporting a gzipped file with Sqoop is a pretty common thing to do, I was wondering whether this is already a known issue, or perhaps fixed in one of the recent versions. If not, I shall keep track of the ticket, try debugging it myself, or wait for your findings.


Thanks.



Re: Sqoop export .lzo to mysql duplicates

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Bhargav,
I believe you might be hitting the known Sqoop bug SQOOP-721 [1]. I was able to replicate the behaviour in my testing environment today, and my intention is to continue debugging tomorrow.

As a workaround, you can decompress the files manually prior to the Sqoop export for now.
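
A rough sketch of that workaround (paths are placeholders): pull the compressed file out of HDFS, decompress it, and put the plain file back in place of the .lzo before running the export:

hadoop fs -get /user/hive/warehouse/my_external_table/large_file.lzo .
lzop -d large_file.lzo            # produces large_file locally
hadoop fs -rm /user/hive/warehouse/my_external_table/large_file.lzo
hadoop fs -put large_file /user/hive/warehouse/my_external_table/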

Jarcec

Links:
1: https://issues.apache.org/jira/browse/SQOOP-721
