Posted to user@sqoop.apache.org by Felix GV <fe...@mate1inc.com> on 2013/03/21 04:27:58 UTC

Getting bogus rows from sqoop import...?

Hello,

I'm trying to import a full table from MySQL to Hadoop/Hive. It works with
certain parameters, but when I try to do an ETL that's somewhat more
complex, I start getting bogus rows in my resulting table.

This works:

sqoop import \
        --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
        --username xxxxx \
        --password xxxxx \
        --hive-import \
        --hive-overwrite \
        -m 23 \
        --direct \
        --hive-table profile_felix_test17 \
        --split-by id \
        --table Profile

But if I use --query instead of --table, then I start getting bogus
records (by that, I mean rows with a nonsensically high primary key that
doesn't exist in my source database, and nulls for the rest of the
cells).

The output I get with the above command is not exactly in the format I want.
Using --query, I can get the data into the format I want (by doing some of
the transformation inside MySQL), but then I also get the bogus rows, which
pretty much makes the Hive table unusable.

I tried various combinations of parameters and it's hard to pinpoint
exactly what causes the problem, so it could be more intricate than my
simplistic description above. That being said, removing --table and adding
the following params definitely breaks it:

        --target-dir /tests/sqoop/general/profile_felix_test \
        --query "select * from Profile WHERE \$CONDITIONS"

(Ultimately, I want to use a query that's more complex than this, but even
a simple query like this breaks...)
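
For clarity, the full failing job is essentially the working command above
with --table swapped out and those two params added (assuming everything
else stays the same), i.e. approximately:

sqoop import \
        --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
        --username xxxxx \
        --password xxxxx \
        --hive-import \
        --hive-overwrite \
        -m 23 \
        --direct \
        --hive-table profile_felix_test17 \
        --split-by id \
        --target-dir /tests/sqoop/general/profile_felix_test \
        --query "select * from Profile WHERE \$CONDITIONS"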

Any ideas why this would happen and how to solve it?

Is this the kind of problem that Sqoop2's cleaner architecture intends to
solve?

I use CDH 4.2, BTW.

Thanks :) !

--
Felix

Re: Getting bogus rows from sqoop import...?

Posted by Venkat Ranganathan <vr...@hortonworks.com>.
The --direct option is, most of the time, incompatible with many of the
parsing options in Sqoop (unless some of them have been specifically
adapted).

Some of the connectors do not allow --direct and --query together
(Postgres and Netezza, for example); maybe the other connectors should
check for this as well. But I am not sure where --query and
--hive-drop-import-delims intersect: whether the input is a table or a
query should not affect the output processing. Do you mean --direct and
--hive-drop-import-delims?

Thanks

Venkat

On Thu, Mar 21, 2013 at 5:32 PM, Felix GV <fe...@mate1inc.com> wrote:
> So... the verdict appears to be the following:
>
> - Using --direct with --table produces correct results.
> - Using --query instead of --table absolutely requires using
>   --hive-drop-import-delims.
> - --hive-drop-import-delims is not compatible with --direct.
> - Sqoop allows you to use --query and --direct together, but produces
>   bogus results.
>
> Considering the nature of --hive-drop-import-delims, I would say it's quite
> likely that the above observations only hold true if the data in the
> imported table actually contains the conflicting delimiting characters.
>
> In any case, don't take my word for it: test on your own dataset if you're
> having any problem. Hopefully the above observations can provide a good
> starting point though.
>
> Maybe Sqoop should spit out an error or a warning if --query is used without
> --hive-drop-import-delims ...?
>
> --
> Felix

Re: Getting bogus rows from sqoop import...?

Posted by Felix GV <fe...@mate1inc.com>.
So... the verdict appears to be the following:

   - Using --direct with --table produces correct results.
   - Using --query instead of --table absolutely requires using
   --hive-drop-import-delims.
   - --hive-drop-import-delims is not compatible with --direct.
   - Sqoop allows you to use --query and --direct together, but produces
   bogus results.

Considering the nature of --hive-drop-import-delims, I would say it's quite
likely that the above observations only hold true if the data in the
imported table actually contains the conflicting delimiting characters.
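
Concretely, the combination that ended up working for me looks roughly like
this (same connection and Hive settings as my original command, with
--direct dropped and --hive-drop-import-delims added):

sqoop import \
        --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
        --username xxxxx \
        --password xxxxx \
        --hive-import \
        --hive-overwrite \
        --hive-drop-import-delims \
        -m 23 \
        --hive-table profile_felix_test17 \
        --split-by id \
        --target-dir /tests/sqoop/general/profile_felix_test \
        --query "select * from Profile WHERE \$CONDITIONS"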

In any case, don't take my word for it: test on your own dataset if you're
having any problem. Hopefully the above observations can provide a good
starting point though.

Maybe Sqoop should spit out an error or a warning if --query is used
without --hive-drop-import-delims ...?

--
Felix



Re: Getting bogus rows from sqoop import...?

Posted by Felix GV <fe...@mate1inc.com>.
I seem to be getting a proper output with the above parameters BTW.

I'll try to re-integrate the rest of my more complex ETL query in that
sqoop job...

Thanks :) !

--
Felix



Re: Getting bogus rows from sqoop import...?

Posted by Felix GV <fe...@mate1inc.com>.
Thanks for your response Jarek :)

I've started a new import run with --hive-drop-import-delims added and
--direct removed (since the two are mutually exclusive); we'll see how it
goes.

Going to sleep now. I'll report back tomorrow :)

--
Felix



Re: Getting bogus rows from sqoop import...?

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Felix,
we've seen similar behaviour in the past when the data itself contains Hive special characters like new line characters. Would you mind trying your import with --hive-drop-import-delims to see if it helps?
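
If you want to confirm that this is the cause before re-running the full
import, you could check for embedded newlines (and Hive's default ^A field
delimiter) directly against MySQL, for example with sqoop eval; about_me
below is just a hypothetical text column, so substitute one from your own
schema:

sqoop eval \
        --connect 'jdbc:mysql://backup.general.db/general?tinyInt1isBit=false&zeroDateTimeBehavior=convertToNull' \
        --username xxxxx \
        --password xxxxx \
        --query "SELECT COUNT(*) FROM Profile WHERE about_me LIKE CONCAT('%', CHAR(10), '%') OR about_me LIKE CONCAT('%', CHAR(13), '%') OR about_me LIKE CONCAT('%', CHAR(1), '%')"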

Jarcec
