Posted to user@sqoop.apache.org by Srivathsan Srinivas <pe...@gmail.com> on 2011/08/21 08:06:06 UTC

denormalize data during import

Hi,
   I am looking for a way to denormalize my tables when importing from
MySQL to HDFS/Hive. What is the best approach? Is there a way to add columns
dynamically when using Sqoop, or by any other means?

For example, an RDBMS schema for students might keep student names and
addresses in separate, normalized tables. What I want from the import
process is a single row per student, with multiple columns holding all of
that student's addresses (like addr_1, addr_2, etc.). I can write UDFs for
this, but I do not know how to attach them to Sqoop. I understand that Sqoop
can run a SQL query (such as a join) during import, but a join adds more
rows, and I want a single row with multiple columns.

Should I put the UDFs in Sqoop, or in an Oozie workflow that denormalizes
the data on the fly? I have not tried either. Is there a suggested approach?
I am also beginning to look at Solr's DataImportHandler; perhaps that could
solve this.

Suggestions are appreciated.

Thanks,
PD.

Re: denormalize data during import

Posted by Arvind Prabhakar <ar...@apache.org>.
Hi Bejoy,

Your suggested solution sounds good to me, as it gives the user the
flexibility to decide how best to reshape the data.

One alternative that can work in a limited manner is to create views that
already contain the denormalized relation and import from those instead of
from the base tables. That would let the user run a single Sqoop import and
directly get the necessary data set out in one shot.
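
For instance, with a hypothetical schema in which a students table and an
addresses table are linked by student_id, and addresses carries a small
per-student sequence number seq (all table and column names here are
illustrative, not from the original poster's schema), such a pivot view in
MySQL might look like:

    CREATE VIEW student_denorm AS
    SELECT s.id,
           s.name,
           MAX(CASE WHEN a.seq = 1 THEN a.address END) AS addr_1,
           MAX(CASE WHEN a.seq = 2 THEN a.address END) AS addr_2
    FROM students s
    LEFT JOIN addresses a ON a.student_id = s.id
    GROUP BY s.id, s.name;

Sqoop should then be able to import the view as if it were a table (e.g.
--table student_denorm); since a view has no primary key to split on, you
would add --split-by id or use -m 1. The obvious limitation is that the
number of addr_N columns has to be fixed up front in the view definition.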

Thanks,
Arvind


Re: denormalize data during import

Posted by be...@gmail.com.
Hi Srivathsan
          I did get your requirement, and if you are open to developing a MapReduce program, you can achieve this in two steps:
1. Sqoop-import the data into HDFS.
2. Run a custom MapReduce job to denormalize it.
     In your MapReduce job, join the two data sets on the foreign key. For this:

In the Mapper
- Use MultipleInputs to handle the two different inputs.
- In each mapper, emit the foreign key as the output key and the combination of required column values as the output value.
- Prefix an identifier to each value so that the reducer can tell which source it came from.

In the Reducer
- The identifier tells you the source of each value (which table the data is from).
- Append the data from the different sources as required (choosing the output key and value as needed).
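
Put together, a minimal sketch of such a reduce-side join (assuming, purely
for illustration, that both Sqoop imports produced comma-separated text
files with the student id in the first field; the class, field, and path
names below are made up, not from any real schema):

    // ReduceSideJoin.java - sketch of a reduce-side join over two Sqoop
    // imports. Assumed input layout (illustrative):
    //   students/:  student_id,name
    //   addresses/: student_id,address
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceSideJoin {

      // Tags each student record with "S|" so the reducer knows its source.
      public static class StudentMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] f = value.toString().split(",", 2);
          if (f.length == 2) {
            ctx.write(new Text(f[0]), new Text("S|" + f[1]));
          }
        }
      }

      // Tags each address record with "A|".
      public static class AddressMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] f = value.toString().split(",", 2);
          if (f.length == 2) {
            ctx.write(new Text(f[0]), new Text("A|" + f[1]));
          }
        }
      }

      // Joins on the student id and emits one row per student:
      //   student_id <TAB> name,addr_1,addr_2,...
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          String name = "";
          List<String> addrs = new ArrayList<String>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("S|")) {
              name = s.substring(2);      // came from the students table
            } else {
              addrs.add(s.substring(2));  // came from the addresses table
            }
          }
          StringBuilder row = new StringBuilder(name);
          for (String a : addrs) {
            row.append(',').append(a);
          }
          ctx.write(key, new Text(row.toString()));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "denormalize students");
        job.setJarByClass(ReduceSideJoin.class);
        // One mapper class per input directory, via MultipleInputs.
        MultipleInputs.addInputPath(job, new Path(args[0]),
            TextInputFormat.class, StudentMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
            TextInputFormat.class, AddressMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.setReducerClass(JoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

You would launch it with the two import directories and an output
directory, something like:

    hadoop jar join.jar ReduceSideJoin students/ addresses/ joined/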

The solution is a crude one and has its flaws; I just scribbled down a
possibility from my limited experience.

Experts, please share any more sophisticated approach that you feel is
optimal for handling such requirements.

Regards
Bejoy K S
