Posted to user@sqoop.apache.org by Sadananda Hegde <sa...@gmail.com> on 2012/10/26 02:38:21 UTC

Sqoop export - incremental extracts

Hello,

I am exploring Sqoop to send data from Hadoop to an EDW. I don't want to send
the same data again and again; I need to identify the changes in HDFS and
send only the data that has changed since my previous export. What is the
best way to implement such incremental export logic? I see that Sqoop
import has an incremental option, but I can't see one in export.

Any recommendations / suggestions would be greatly appreciated.

Thanks,
Sadu

Re: Sqoop export - incremental extracts

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Sadu,
thank you very much for your detailed use case. I'm afraid that Sqoop does not support this use case out of the box at the moment, so you will most likely need a workaround similar to the one you've suggested.

I do not know how you're going to automate your workflow, so I can't comment much on where/how to store the timestamp values. In the case of a bash script you might save the last values in a file in your working directory; in the case of an Oozie execution you might create supporting Java actions that get and save the values from some external database system.
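
For illustration, a bash wrapper could look roughly like this. It is an untested sketch: the state-file location, the JDBC URL, the table names and the load_ts column are all made-up placeholders, not anything Sqoop provides for you.

    #!/bin/bash
    # Untested sketch: keep the last successful export time in a local state file.
    STATE_FILE=/var/lib/sqoop-export/last_export_ts   # made-up location
    EXPORT_DIR=/user/etl/summary_export               # made-up HDFS directory

    LAST_TS=$(cat "$STATE_FILE" 2>/dev/null || echo '1970-01-01 00:00:00')
    NOW=$(date '+%Y-%m-%d %H:%M:%S')

    # 1. Materialize the incremental summary into a fresh HDFS directory.
    #    load_ts is a hypothetical column recording when each row landed in Hive.
    hive -e "
      INSERT OVERWRITE DIRECTORY '$EXPORT_DIR'
      SELECT fld1, fld2, SUM(fld3)
      FROM tableA
      WHERE load_ts > '$LAST_TS' AND load_ts <= '$NOW'
      GROUP BY fld1, fld2" || exit 1

    # 2. Export that directory. INSERT OVERWRITE DIRECTORY writes Ctrl-A
    #    delimited text, so tell Sqoop which delimiter to parse.
    sqoop export \
      --connect jdbc:mysql://edw-host/warehouse \
      --table summary_table \
      --export-dir "$EXPORT_DIR" \
      --input-fields-terminated-by '\001' || exit 1

    # 3. Move the watermark forward only after both steps have succeeded.
    echo "$NOW" > "$STATE_FILE"

The upper bound on load_ts gives each run a well-defined window, so rows that arrive while the job is running are picked up by the next run instead of being exported twice.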

Jarcec

On Fri, Oct 26, 2012 at 02:01:24PM -0500, Sadananda Hegde wrote:
> Thanks Jarek.
> 
> 
> Here is the use case.
> 
> 
> 
> 1.      My Hive table contains detailed transaction-level data and is
> continuously updated throughout the day (say every 15 minutes).
> 
> 2.      I have to send summary data from Hive/HDFS to other systems like
> the EDW, say twice a day.
> 
> 
> 
> This needs to be automated and scheduled in production. I need to implement
> incremental logic so that I export only the changes each time. I was
> reading about the incremental options in Sqoop import. It has the kind of
> features I am looking for, but I need them in Sqoop export. Since export
> does not provide that feature, I may have to track it myself. Somehow I
> need to keep track of when the export last ran successfully and what data
> has been added to Hive since then. Then I can do something like:
> 
> 
> 
> 1.      Execute a Hive query to extract the data I need to send (summary
> and only changes):
> 
>     Select fld1, fld2, sum(fld3), …
> 
>     From tableA
> 
>     Where <HDFS_File_create_timestamp> > <last_extract_timestamp>
> 
>     Group by fld1, fld2, …
> 
> 
> 
> 2.      Use Sqoop export to export the result file to the EDW
> 
> 
> 
> I am not sure where / how to get the HDFS_File_create_timestamp and
> last_extract_timestamp values so they can be used dynamically inside the
> Hive query.
> 
> 
> 
> Any ideas??? Are there any other options?
> 
> 
> Thanks for your help.
> 
> 
> 
> Sadu
> 
> 
> 
> 
> 
> On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:
> 
> > Hi Sadu,
> > unfortunately Sqoop export takes the entire input directory (--export-dir)
> > and simply exports its content to the external database/warehouse
> > system. I'm afraid that there isn't a more sophisticated way of doing
> > "incremental" exports than using different HDFS directories for each
> > "incremental" part.
> >
> > If you could describe your use case, there might be other ways to
> > achieve similar results.
> >
> > Jarcec
> >
> > On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > > Hello,
> > >
> > > I am exploring Sqoop to send data from Hadoop to an EDW. I don't want
> > > to send the same data again and again; I need to identify the changes
> > > in HDFS and send only the data that has changed since my previous
> > > export. What is the best way to implement such incremental export
> > > logic? I see that Sqoop import has an incremental option, but I can't
> > > see one in export.
> > >
> > > Any recommendations / suggestions would be greatly appreciated.
> > >
> > > Thanks,
> > > Sadu
> >

Re: Sqoop export - incremental extracts

Posted by Sadananda Hegde <sa...@gmail.com>.
Thanks Jarek.


Here is the use case.



1.      My Hive table contains detailed transaction-level data and is
continuously updated throughout the day (say every 15 minutes).

2.      I have to send summary data from Hive/HDFS to other systems like
the EDW, say twice a day.



This needs to be automated and scheduled in production. I need to implement
incremental logic so that I export only the changes each time. I was
reading about the incremental options in Sqoop import. It has the kind of
features I am looking for, but I need them in Sqoop export. Since export
does not provide that feature, I may have to track it myself. Somehow I
need to keep track of when the export last ran successfully and what data
has been added to Hive since then. Then I can do something like:



1.      Execute a Hive query to extract the data I need to send (summary
and only changes):

    Select fld1, fld2, sum(fld3), …

    From tableA

    Where <HDFS_File_create_timestamp> > <last_extract_timestamp>

    Group by fld1, fld2, …



2.      Use Sqoop export to export the result file to the EDW (sketched below)
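
For step 2, I picture roughly the following (just a sketch; the JDBC URL, table name and directory are placeholders, and the delimiter option has to match however step 1 writes its output):

    # Rough sketch of step 2; connection details and names are placeholders.
    # --input-fields-terminated-by must match the delimiter of the Hive
    # output (INSERT OVERWRITE DIRECTORY writes Ctrl-A delimited text).
    sqoop export \
      --connect jdbc:mysql://edw-host/warehouse \
      --table daily_summary \
      --export-dir /user/etl/summary_export \
      --input-fields-terminated-by '\001'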



I am not sure where / how to get the HDFS_File_create_timestamp and
last_extract_timestamp values so they can be used dynamically inside the
Hive query.



Any ideas??? Are there any other options?


Thanks for your help.



Sadu





On Thu, Oct 25, 2012 at 8:04 PM, Jarek Jarcec Cecho <ja...@apache.org> wrote:

> Hi Sadu,
> unfortunately Sqoop export takes the entire input directory (--export-dir)
> and simply exports its content to the external database/warehouse
> system. I'm afraid that there isn't a more sophisticated way of doing
> "incremental" exports than using different HDFS directories for each
> "incremental" part.
>
> If you could describe your use case, there might be other ways to
> achieve similar results.
>
> Jarcec
>
> On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> > Hello,
> >
> > I am exploring Sqoop to send data from Hadoop to an EDW. I don't want
> > to send the same data again and again; I need to identify the changes
> > in HDFS and send only the data that has changed since my previous
> > export. What is the best way to implement such incremental export
> > logic? I see that Sqoop import has an incremental option, but I can't
> > see one in export.
> >
> > Any recommendations / suggestions would be greatly appreciated.
> >
> > Thanks,
> > Sadu
>

Re: Sqoop export - incremental extracts

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Sadu,
unfortunately Sqoop export takes the entire input directory (--export-dir) and simply exports its content to the external database/warehouse system. I'm afraid that there isn't a more sophisticated way of doing "incremental" exports than using different HDFS directories for each "incremental" part.
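
For example, each run could write its extract into its own timestamped directory and export just that directory. This is only a rough sketch; the paths, table names and the is_new filter are made up:

    #!/bin/bash
    # Rough sketch: one HDFS directory per incremental run.
    RUN_DIR=/user/etl/exports/run_$(date +%Y%m%d%H%M)

    # Write only this run's new rows into the fresh directory
    # (some_table and its is_new flag are hypothetical).
    hive -e "
      INSERT OVERWRITE DIRECTORY '$RUN_DIR'
      SELECT * FROM some_table WHERE is_new = true" || exit 1

    # Export just this run's directory; earlier runs are never re-read.
    sqoop export \
      --connect jdbc:mysql://edw-host/warehouse \
      --table target_table \
      --export-dir "$RUN_DIR" \
      --input-fields-terminated-by '\001'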

If you could describe your use case, there might be other ways to achieve similar results.

Jarcec

On Thu, Oct 25, 2012 at 07:38:21PM -0500, Sadananda Hegde wrote:
> Hello,
> 
> I am exploring Sqoop to send data from Hadoop to an EDW. I don't want to send
> the same data again and again; I need to identify the changes in HDFS and
> send only the data that has changed since my previous export. What is the
> best way to implement such incremental export logic? I see that Sqoop
> import has an incremental option, but I can't see one in export.
> 
> Any recommendations / suggestions would be greatly appreciated.
> 
> Thanks,
> Sadu