Posted to common-user@hadoop.apache.org by Srinivas Surasani <hi...@gmail.com> on 2012/06/01 00:02:46 UTC

Hadoop with Sharded MySql

All,

We are trying to implement Sqoop in our environment, which has 30 sharded
MySQL servers. Each server holds around 30 databases, with 150 tables in
each database, all of them horizontally sharded (that is, the data is
split across those tables/shards in MySQL).

The problem is that we have a total of around 70K tables that need
to be pulled from MySQL into HDFS.

So, my question is: is it feasible to generate 70K Sqoop commands and
run them in parallel?

Also, doing incremental updates would mean invoking another 70K Sqoop
jobs, which in turn kick off MapReduce jobs.

The main problem is monitoring and managing this huge number of jobs.

Can anyone suggest the best way of doing this? Is Sqoop a good
candidate for this type of scenario?
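
For reference, each generated import would look roughly like the below
(hostnames, database/table names, and paths are just placeholders,
credentials omitted), and the incremental version would be a saved Sqoop
job per table:

  # one full import per shard / database / table (all names are placeholders)
  sqoop import \
    --connect jdbc:mysql://shard01.db.example.com:3306/db_001 \
    --table orders_0001 \
    --target-dir /data/raw/shard01/db_001/orders_0001 \
    --fields-terminated-by '\t' \
    -m 4

  # incremental version: a saved job that remembers --last-value between runs
  sqoop job --create shard01_db_001_orders_0001_inc -- import \
    --connect jdbc:mysql://shard01.db.example.com:3306/db_001 \
    --table orders_0001 \
    --target-dir /data/raw/shard01/db_001/orders_0001 \
    --incremental append --check-column id --last-value 0 \
    -m 4
  sqoop job --exec shard01_db_001_orders_0001_inc

Multiplied by ~70K tables, that is what we would have to generate,
schedule, and watch.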

Currently the same process is done by generating TSV files on the MySQL
servers, dumping them onto a staging server, and from there generating
hdfs put statements.
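
Concretely, the current per-table flow is roughly the following (all
hosts and paths are just illustrative):

  # on the mysql host: dump one table as TSV (--batch gives tab-separated output)
  mysql -h shard01.db.example.com db_001 --batch --skip-column-names \
    -e "SELECT * FROM orders_0001" \
    > /dumps/shard01_db_001_orders_0001.tsv

  # ship it to the staging server
  scp /dumps/shard01_db_001_orders_0001.tsv etl@staging01:/staging/inbox/

  # on the staging server: run the generated hdfs put
  hadoop fs -put /staging/inbox/shard01_db_001_orders_0001.tsv \
    /data/raw/shard01/db_001/orders_0001.tsv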

Appreciate your suggestions !!!


Thanks,
Srinivas Surasani

Re: Hadoop with Sharded MySql

Posted by Michael Segel <mi...@hotmail.com>.
Ok just tossing out some ideas... Take them with a grain of salt...

With Hive you can create external tables over the files once they are in HDFS.

Write a custom Java app that creates one thread per server, then iterate through each table, selecting the rows you want. Each thread can then write its output directly to HDFS.
It's not MapReduce, but it should be fairly efficient.

You can even expand on this if you want.
Just Java and JDBC...
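
Something along these lines, purely as a sketch -- the hosts, tables,
credentials, and paths are all made up, and you'd obviously want batching,
retries, and proper escaping/null handling in real code:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  import java.io.OutputStreamWriter;
  import java.io.Writer;
  import java.net.URI;
  import java.sql.*;
  import java.util.Arrays;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  // One thread per MySQL server; each thread streams SELECT results straight into HDFS as TSV.
  public class ShardDumper {

      // placeholder shard hosts and tables -- substitute the real lists
      static final List<String> SHARDS = Arrays.asList("shard01.db.example.com", "shard02.db.example.com");
      static final List<String> TABLES = Arrays.asList("orders_0001", "orders_0002");
      static final String HDFS_URI = "hdfs://namenode.example.com:8020";

      public static void main(String[] args) throws Exception {
          Class.forName("com.mysql.jdbc.Driver");
          ExecutorService pool = Executors.newFixedThreadPool(SHARDS.size());
          for (final String shard : SHARDS) {
              pool.submit(new Runnable() {
                  public void run() {
                      try { dumpShard(shard); } catch (Exception e) { e.printStackTrace(); }
                  }
              });
          }
          pool.shutdown();
      }

      static void dumpShard(String shard) throws Exception {
          FileSystem fs = FileSystem.get(URI.create(HDFS_URI), new Configuration());
          Connection conn = DriverManager.getConnection(
                  "jdbc:mysql://" + shard + ":3306/db_001", "etl", "secret");
          for (String table : TABLES) {
              Path out = new Path("/data/raw/" + shard + "/" + table + ".tsv");
              Writer w = new OutputStreamWriter(fs.create(out, true), "UTF-8");
              Statement st = conn.createStatement();
              ResultSet rs = st.executeQuery("SELECT * FROM " + table);
              int cols = rs.getMetaData().getColumnCount();
              while (rs.next()) {
                  StringBuilder line = new StringBuilder();
                  for (int i = 1; i <= cols; i++) {
                      if (i > 1) line.append('\t');
                      line.append(rs.getString(i));   // null handling left as an exercise
                  }
                  w.write(line.append('\n').toString());
              }
              rs.close(); st.close(); w.close();
          }
          conn.close();
      }
  }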


Sent from my iPhone

Re: Hadoop with Sharded MySql

Posted by Srinivas Surasani <hi...@gmail.com>.
All,

I'm trying to get data into HDFS directly from the sharded databases and
expose it to our existing Hive infrastructure.

(We are currently doing it this way: MySQL -> staging server -> hdfs put
commands -> HDFS, and it is taking a lot of time.)

If we had a way of running a single Sqoop job across all shards for a single
table, I believe it would make life easier in terms of monitoring and
exception handling.
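
For example, for a single table I'm imagining something that collapses to
roughly this kind of wrapper (shard list, database name, and paths below
are made up, credentials omitted):

  #!/bin/bash
  # one wrapper per *table*: pull that table from every shard into one HDFS tree
  TABLE=$1
  SHARDS="shard01.db.example.com shard02.db.example.com"   # ...all 30 shard hosts

  for HOST in $SHARDS; do
    sqoop import \
      --connect jdbc:mysql://$HOST:3306/db_001 \
      --table "$TABLE" \
      --target-dir /data/raw/$TABLE/shard=$HOST \
      -m 2 &
  done
  wait
  # a Hive external table partitioned by shard can then be pointed at /data/raw/$TABLE/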

Thanks,
Srinivas

-- 
Regards,
-- Srinivas
Srinivas@cloudwick.com

Re: Hadoop with Sharded MySql

Posted by anil gupta <an...@gmail.com>.
Hi Sujith,

Srinivas is asking how to import data into HDFS using Sqoop. I believe he
must have thought this out well before designing the entire
architecture/solution. He has not specified whether he would like to modify
the data or not. Whether to use Hive or HBase is a different question
altogether and depends on his use case.

Thanks,
Anil


-- 
Thanks & Regards,
Anil Gupta

Re: Hadoop with Sharded MySql

Posted by Sujit Dhamale <su...@gmail.com>.
Hi,
Instead of pulling 70K tables from MySQL into HDFS, take a dump of all 30
databases and put them into HBase.

If you pull 70K tables from MySQL into HDFS, you will need to use Hive, but
modification will not be possible in Hive :(

@ common-user: please correct me if I am wrong.
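
(Roughly something like this, if that route were taken -- the table name,
column family, and paths below are only examples:)

  # on the mysql host: dump a table as TSV (mysqldump --tab writes <table>.txt tab-separated)
  mysqldump --tab=/dumps db_001 orders_0001

  # push the TSV into HDFS, then bulk load it with the stock ImportTsv tool
  # (the HBase table must already exist, e.g. create 'orders', 'd' in the hbase shell)
  hadoop fs -mkdir /tmp/orders_0001
  hadoop fs -put /dumps/orders_0001.txt /tmp/orders_0001/
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,d:customer_id,d:amount \
    orders /tmp/orders_0001/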

Kind Regards
Sujit Dhamale
(+91 9970086652)

Re: Hadoop with Sharded MySql

Posted by Edward Capriolo <ed...@gmail.com>.
Maybe you can create some VIEWs, UNIONs, or MERGE tables on the MySQL
side to get around launching so many Sqoop jobs.
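
For example, a per-database view that UNIONs the shard tables, which a
single free-form-query import could then pull (all names below are invented):

  # on each mysql database: one view over that database's shard tables
  mysql db_001 -e "
    CREATE OR REPLACE VIEW orders_all AS
      SELECT * FROM orders_0001
      UNION ALL SELECT * FROM orders_0002
      UNION ALL SELECT * FROM orders_0003;"

  # then one sqoop job per database instead of one per shard table
  sqoop import \
    --connect jdbc:mysql://shard01.db.example.com:3306/db_001 \
    --query 'SELECT * FROM orders_all WHERE $CONDITIONS' \
    --split-by id \
    --target-dir /data/raw/shard01/db_001/orders \
    -m 4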
