You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@sqoop.apache.org by Matthieu Labour <ma...@actionx.com> on 2013/05/17 23:44:03 UTC

How does one preprocess the data so that they can be exported using sqoop

Hi

I would be grateful for any tips on how to "prepare" the data so they can
be exported to a Postgesql Database using sqoop.

As an example:

Provided some files of events. (user events, product events,
productActivity events)

[file0001]
event:user propertes:{name:"john" ...}
event:product properties:{ref:123,color:"blue",...
event:productActivity properties:{user:"john", product:"ref", action:"buy"}
.....

How does one come up with the primary keys and the user_product join table
ready to be exported?

On other words.

function(Input:eventfile) => output:[productFile, userFile,
user_productFile with auto generated primary keys ]

what goes into function?

I hope this makes sense!

Thank you in advance for any help

-matt

Re: How does one preprocess the data so that they can be exported using sqoop

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Matthieu,
as Sqoop is import/exporting data as they are, using natural keys is not usually working as expected. You should prepare your data with proper surrogate keys and use Sqoop as a bulk load tool to get all your data from/to database.

Jarcec

On Mon, May 20, 2013 at 09:27:52AM -0700, Matthieu Labour wrote:
> Jarcec
> 
> Thank you for your response
> 
> May I ask you a tip on best practices with respect to keys. When preparing
> the data for sqoop do people end up using natural keys? Or do folks
> generate surrogate keys while preparing the data?
> 
> As an example:
> 
> instead of having
> user: naturalUserIdentifier, address ....
> product: productRef, productProperties...
> user_product: natural user Identifier, productRef
> 
> I would like
> user: userSurrogateKey, userNaturaIdentifier, address ....
> product: productSurrogateKey, productRef, productProperties...
> user_product: userSurrogateKey, productSurrogateKey
> 
> Thanks a lot
> -matt
> 
> 
> On Sun, May 19, 2013 at 11:37 PM, Jarek Jarcec Cecho <ja...@apache.org>wrote:
> 
> > Hi Matthieu,
> > Sqoop is currently highly specialized EL tool (extract-load) and not a
> > generic ETL tool (extract-transform-load). Thus you need to execute custom
> > mapreduce/pig/hive job that will separate all three different logical
> > tables and prepare data into format that Sqoop can process.
> >
> > Jarcec
> >
> > On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> > > Hi
> > >
> > > I would be grateful for any tips on how to "prepare" the data so they can
> > > be exported to a Postgesql Database using sqoop.
> > >
> > > As an example:
> > >
> > > Provided some files of events. (user events, product events,
> > > productActivity events)
> > >
> > > [file0001]
> > > event:user propertes:{name:"john" ...}
> > > event:product properties:{ref:123,color:"blue",...
> > > event:productActivity properties:{user:"john", product:"ref",
> > action:"buy"}
> > > .....
> > >
> > > How does one come up with the primary keys and the user_product join
> > table
> > > ready to be exported?
> > >
> > > On other words.
> > >
> > > function(Input:eventfile) => output:[productFile, userFile,
> > > user_productFile with auto generated primary keys ]
> > >
> > > what goes into function?
> > >
> > > I hope this makes sense!
> > >
> > > Thank you in advance for any help
> > >
> > > -matt
> >

Re: How does one preprocess the data so that they can be exported using sqoop

Posted by Matthieu Labour <ma...@actionx.com>.
Jarcec

Thank you for your response

May I ask you a tip on best practices with respect to keys. When preparing
the data for sqoop do people end up using natural keys? Or do folks
generate surrogate keys while preparing the data?

As an example:

instead of having
user: naturalUserIdentifier, address ....
product: productRef, productProperties...
user_product: natural user Identifier, productRef

I would like
user: userSurrogateKey, userNaturaIdentifier, address ....
product: productSurrogateKey, productRef, productProperties...
user_product: userSurrogateKey, productSurrogateKey

Thanks a lot
-matt


On Sun, May 19, 2013 at 11:37 PM, Jarek Jarcec Cecho <ja...@apache.org>wrote:

> Hi Matthieu,
> Sqoop is currently highly specialized EL tool (extract-load) and not a
> generic ETL tool (extract-transform-load). Thus you need to execute custom
> mapreduce/pig/hive job that will separate all three different logical
> tables and prepare data into format that Sqoop can process.
>
> Jarcec
>
> On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> > Hi
> >
> > I would be grateful for any tips on how to "prepare" the data so they can
> > be exported to a Postgesql Database using sqoop.
> >
> > As an example:
> >
> > Provided some files of events. (user events, product events,
> > productActivity events)
> >
> > [file0001]
> > event:user propertes:{name:"john" ...}
> > event:product properties:{ref:123,color:"blue",...
> > event:productActivity properties:{user:"john", product:"ref",
> action:"buy"}
> > .....
> >
> > How does one come up with the primary keys and the user_product join
> table
> > ready to be exported?
> >
> > On other words.
> >
> > function(Input:eventfile) => output:[productFile, userFile,
> > user_productFile with auto generated primary keys ]
> >
> > what goes into function?
> >
> > I hope this makes sense!
> >
> > Thank you in advance for any help
> >
> > -matt
>

Re: How does one preprocess the data so that they can be exported using sqoop

Posted by Jarek Jarcec Cecho <ja...@apache.org>.
Hi Matthieu,
Sqoop is currently highly specialized EL tool (extract-load) and not a generic ETL tool (extract-transform-load). Thus you need to execute custom mapreduce/pig/hive job that will separate all three different logical tables and prepare data into format that Sqoop can process.

Jarcec

On Fri, May 17, 2013 at 05:44:03PM -0400, Matthieu Labour wrote:
> Hi
> 
> I would be grateful for any tips on how to "prepare" the data so they can
> be exported to a Postgesql Database using sqoop.
> 
> As an example:
> 
> Provided some files of events. (user events, product events,
> productActivity events)
> 
> [file0001]
> event:user propertes:{name:"john" ...}
> event:product properties:{ref:123,color:"blue",...
> event:productActivity properties:{user:"john", product:"ref", action:"buy"}
> .....
> 
> How does one come up with the primary keys and the user_product join table
> ready to be exported?
> 
> On other words.
> 
> function(Input:eventfile) => output:[productFile, userFile,
> user_productFile with auto generated primary keys ]
> 
> what goes into function?
> 
> I hope this makes sense!
> 
> Thank you in advance for any help
> 
> -matt