Posted to user@sqoop.apache.org by yogesh kumar <yo...@gmail.com> on 2013/12/30 19:18:56 UTC

Sqoop incremental import ( can any just help me out)

Hello all,

I have done a Sqoop import for a particular table for the first time, say table
Employee:

sqoop import -libjars .....
--query "select empno, name, date, loc from Employee where
\$CONDITIONS ..  "
*--split-by empno*
--fields-terminated-by ','
.
.
.
.

I have created an external table in Hive.

*Now I want to pull data on a daily basis using an incremental import. Can
I specify a different column for --split-by?*

like

sqoop import -libjars .....
--query "select empno, name, date, loc from Employee where
\$CONDITIONS ..  "
--check-column date
--incremental append
--last-value 2013-05-01
*--split-by date*
--split-by empno


Can I change the *split-by column for an incremental Sqoop import*? If not,
how should I do it?

Please suggest.

Re: Sqoop incremental import ( can any just help me out)

Posted by yogesh kumar <yo...@gmail.com>.

Thanks a lot Devin,

Yes, my column has increasing values. Take the date column for the daily pull:
as the date keeps changing, there is another column of the same kind that
converts every date into Julian format, which is also always changing.

What I meant is that the column I originally split by keeps changing,
and the column I am planning to split by also keeps changing.

So will it be safe to change the split-by from the older
column (changing values) to the new column (values changing at a different rate)?

Please suggest.

Thanks,
yogesh


Re: Sqoop incremental import ( can any just help me out)

Posted by Devin Suiter RDX <ds...@rdx.com>.

If it's kind of a risk, and you can't take any chances, why are you
testing in that environment?

Why not set up a VM with a test database and a VM with a pseudo-cluster,
load a subset of your data, and experiment in a development environment
so that you can know for sure? Even if someone guarantees you the answer
on here, you cannot be certain that everything is identical across all the
versions of Sqoop, Hadoop, etc. for them as it would be for you. If the
data you are working with has value, you should find a safe way to
experiment rather than trust your valuable data to mailing list answers.

Now, in answer to your question:

According to my peer (I am not the Sqoop person where I work) if your
incremental split is on a column that has increasing values, you can safely
split on that, but if the value you split on is always the same, it is a
bad choice for incremental splitting - he uses a datetime column I believe,
and then the import is from the last imported datetime value up to the
current max. I am not sure if that helps your case, but it is my hope that
you find it useful.
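
To make the "last imported value up to the current max" idea concrete, here is a minimal Python sketch. It is only a simplified model of the bookkeeping described above, not Sqoop's actual implementation; the function name incremental_append, the column name "updated", and the sample rows are all made up for illustration.

```python
from datetime import datetime

def incremental_append(source_rows, check_column, last_value):
    """Return rows newer than last_value, plus the value to remember for the next run."""
    # Upper bound for this run: the current maximum of the check column.
    current_max = max((row[check_column] for row in source_rows), default=last_value)
    # Import only rows in the half-open window (last_value, current_max].
    batch = [row for row in source_rows if last_value < row[check_column] <= current_max]
    return batch, max(last_value, current_max)

# Hypothetical source table with a datetime check column.
source = [
    {"empno": 1, "updated": datetime(2013, 5, 1, 9, 0)},
    {"empno": 2, "updated": datetime(2013, 5, 2, 9, 0)},
]

# First daily run starts from a hypothetical saved value.
first_batch, last_value = incremental_append(source, "updated", datetime(2013, 5, 1, 0, 0))

# A new row arrives before the second run, which reuses the saved last_value.
source.append({"empno": 3, "updated": datetime(2013, 5, 3, 9, 0)})
second_batch, last_value = incremental_append(source, "updated", last_value)
```

In this model each run picks up only rows that arrived after the previous run, so the two batches do not overlap. A check column whose value never changes would give the window nothing to advance on.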

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


Re: Sqoop incremental import ( can any just help me out)

Posted by yogesh kumar <yo...@gmail.com>.

Thanks Chalcy, I got your point; let me try a simple test for it. But
the situation here is that for the incremental import I have to change the
split-by column.

It's kind of a risk, and I cannot take a chance. I just want to be sure that
it will not affect the Hive table and the data in it after
the incremental import. My incremental import will pull data directly
and put it where my old Sqooped data resides.

I'd like suggestions from the Sqoop champions.
Please help me out.


Re: Sqoop incremental import ( can any just help me out)

Posted by Chalcy <ch...@gmail.com>.

I have not tried this but I believe you can change the split by as you
wish.  The split by is used to split the jobs while --check-column and
--last-value are used for incremental import.
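
The separation described above can be illustrated with a small Python model. This is a sketch under stated assumptions, not Sqoop code: the helper names (select_new_rows, split_rows, imported_empnos), the even-width range splitting, and the sample Employee rows are all hypothetical. The check column and last value decide which rows are imported; the split column only decides how those rows are divided among parallel tasks.

```python
from datetime import date

def to_number(value):
    """Map a split value to a number so that ranges can be computed."""
    return value.toordinal() if isinstance(value, date) else value

def select_new_rows(rows, check_column, last_value):
    """Incremental filter: keep rows whose check column is past last_value."""
    return [row for row in rows if row[check_column] > last_value]

def split_rows(rows, split_by, num_mappers):
    """Divide rows into num_mappers buckets by even-width ranges of split_by."""
    if not rows:
        return []
    keys = [to_number(row[split_by]) for row in rows]
    lo, hi = min(keys), max(keys)
    width = (hi - lo) / num_mappers or 1  # avoid a zero width when all keys match
    buckets = [[] for _ in range(num_mappers)]
    for row, key in zip(rows, keys):
        index = min(int((key - lo) / width), num_mappers - 1)
        buckets[index].append(row)
    return buckets

def imported_empnos(buckets):
    """Flatten the buckets into a sorted list of imported employee numbers."""
    return sorted(row["empno"] for bucket in buckets for row in bucket)

# Hypothetical Employee rows.
EMPLOYEE = [
    {"empno": 101, "date": date(2013, 4, 28)},
    {"empno": 102, "date": date(2013, 5, 1)},
    {"empno": 103, "date": date(2013, 5, 2)},
    {"empno": 104, "date": date(2013, 5, 3)},
    {"empno": 105, "date": date(2013, 5, 3)},
    {"empno": 106, "date": date(2013, 5, 10)},
]

new_rows = select_new_rows(EMPLOYEE, "date", date(2013, 5, 1))
by_empno = split_rows(new_rows, "empno", 2)
by_date = split_rows(new_rows, "date", 2)

# Same rows are imported either way; only the per-task distribution differs.
print(imported_empnos(by_empno) == imported_empnos(by_date))
```

In this model, switching the split column changes how evenly the work is spread (empno gives two buckets of two rows, date gives three and one) but not which rows end up imported. Whether a real Sqoop version behaves the same way is still worth confirming with a small test import.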

I do not know the exact scenario, but if empno gives a better split, you can still
use that for the incremental import instead of changing the split-by field.

I would suggest you do a very simple test to find out.

Hope this helps,
Chalcy

