Posted to user@sqoop.apache.org by Arvind Prabhakar <ar...@cloudera.com> on 2012/05/09 09:07:34 UTC

Re: Misleading results from Hive tables

[Moving conversation to user@sqoop.apache.org. Please direct any sqoop
related questions there]

Hi,

> After I used the *--hive-drop-import-delims*, all went fine. ...

This implies that your data probably contained new-lines or other
characters that Hive uses as delimiters. By specifying this option, you are
telling Sqoop to strip such characters out while doing the import.
Depending on whether that is acceptable for your dataset, this may or may
not be the right solution for you.
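To make the failure mode concrete, here is a small Python sketch (it does not
use Sqoop or Hive themselves; the '~' separator just mirrors your
--fields-terminated-by option) of why an embedded newline inflates row counts
in a line-delimited text table, and why dropping the delimiters fixes it:

```python
# Minimal sketch (plain Python, no Sqoop/Hive) of why embedded newlines
# corrupt row counts in a line-delimited text store such as Hive's
# default text format.

FIELD_SEP = "~"  # mirrors --fields-terminated-by '~'

rows = [
    (925817, "normal message"),
    (925818, "message with\nan embedded newline"),  # the problem row
]

def export(rows, strip_delims=False):
    """Render rows the way a text-format import would write them."""
    lines = []
    for rid, body in rows:
        if strip_delims:
            # analogous to --hive-drop-import-delims: remove \n, \r, \x01
            body = body.replace("\n", "").replace("\r", "").replace("\x01", "")
        lines.append(f"{rid}{FIELD_SEP}{body}")
    return "\n".join(lines)

def hive_count(text):
    # a text-format reader treats every '\n' as a row terminator
    return len(text.split("\n"))

print(hive_count(export(rows)))                     # 3 -- one row too many
print(hive_count(export(rows, strip_delims=True)))  # 2 -- correct
```

The spurious extra "row" also starts with whatever text followed the newline,
which is how a garbage value can win a max() over the real id column.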

> ...Do I specially need to use *--split-by ?*

Since the PK in your case is a numeric type, you do not need to specify
--split-by unless the default splitting is causing a lot of skew in your
data files.
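For illustration, here is a rough Python sketch (not Sqoop's actual code) of
how splits can be derived for a numeric split column: the [min, max] range is
partitioned evenly across the mappers, which is why a numeric BIGINT PK
usually needs no explicit --split-by:

```python
# Rough sketch of even range-splitting on a numeric split column,
# in the spirit of what Sqoop does with min/max of the PK.

def numeric_splits(lo, hi, num_mappers):
    """Partition [lo, hi] into contiguous, non-overlapping ranges."""
    total = hi - lo + 1
    base, rem = divmod(total, num_mappers)
    splits, start = [], lo
    for i in range(num_mappers):
        # spread any remainder across the first `rem` ranges
        end = start + base - 1 + (1 if i < rem else 0)
        splits.append((start, end))
        start = end + 1
    return splits

print(numeric_splits(1, 1000, 4))
# [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

With a string key there is no such natural even partition, which is where
skewed or duplicated splits can creep in.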

Thanks,
Arvind

On Tue, May 8, 2012 at 11:27 PM, Roshan Pradeep <
codevally.mail.list@gmail.com> wrote:

> Thanks Bejoy for the reply.
>
> When importing through Sqoop, I haven't used the *--split-by* clause. The
> current import command is:
>
> */app/sqoop/bin/sqoop import --hive-table messagedestination --connect
> jdbc:postgresql://<database_server>/<database> --table messagedestination
> --username <username> --password <password> --hive-home /app/hive
> --hive-import --fields-terminated-by '~' --hive-drop-import-delims*
>
> The PK in postgreSQL is BIGINT.
>
> After I used the *--hive-drop-import-delims*, all went fine. Do I
> specifically need to use *--split-by*?
>
> Thanks.
>
> /Roshan
>
>
> On Wednesday, May 9, 2012 4:07:36 PM UTC+10, Bejoy KS wrote:
>>
>> Hi Roshan
>>        Looks like the issue happens during SQOOP import. What is data
>> type of the --split-By column? It can lead to some duplicate row imports if
>> it is String.
>>
>> On Wed, May 9, 2012 at 9:07 AM, Roshan Pradeep <
>> codevally.mail.list@gmail.com> wrote:
>>
>>> I am using
>>>
>>> Hadoop - 0.20.2
>>> Hive - 0.8.1
>>> Sqoop - sqoop-1.4.1-incubating
>>>
>>>
>>>
>>> On Wednesday, May 9, 2012 1:20:56 PM UTC+10, Roshan Pradeep wrote:
>>>>
>>>> Hi
>>>>
>>>> I have imported some large amount of data from PostgreSQL to Hadoop
>>>> using Sqoop. All went well and no error on log files.
>>>>
>>>> After that, I tried to get the maximum id value from my table via Hive,
>>>> but it gives me a very large number.
>>>>
>>>> select max(messageid) from message; ==> 235234523452345
>>>>
>>>> But the maximum messageid should actually be 925817 according to
>>>> PostgreSQL. I found the same misleading behavior when I try to get the
>>>> total number of rows (select count(messageid) from message;) via Hive.
>>>>
>>>> Why does this happen? Any help is appreciated.
>>>>
>>>> Thanks.
>>>>
>>>> /Roshan
>>>>
>>>>
>>>>
>>
>>
>> --
>> Bejoy KS
>> Customer Operations Engineer, Cloudera
>> www.cloudera.com
>>
>