Posted to common-user@hadoop.apache.org by Justin Vincent <ju...@gmail.com> on 2011/12/05 19:43:08 UTC

Multiple Mappers for Multiple Tables

I would like to join some db tables, possibly from different databases, in
an MR job.

I would essentially like to use MultipleInputs, but that seems file
oriented. I need a different mapper for each db table.

Suggestions?

Thanks!

Justin Vincent

Re: Multiple Mappers for Multiple Tables

Posted by Praveen Sripati <pr...@gmail.com>.
MultipleInputs takes multiple Paths (files), not databases, as input. As
mentioned earlier, export the tables into HDFS using either Sqoop or the
database's native export tool, and then do the processing there. Sqoop is
configured to use the native DB export tool whenever possible.
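
Once both tables are exported to HDFS files, MultipleInputs can give each
set of files its own mapper. Below is a minimal sketch of the join logic in
plain Java, with no Hadoop dependency so that it runs on its own. The table
names (customers, orders), the comma-separated line layouts, and all class
and method names are assumptions for illustration. In a real job the two map
methods would be Mapper classes registered through
MultipleInputs.addInputPath, and the grouping by key would be done by the
shuffle rather than by a TreeMap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedJoinSketch {

    // A value tagged with the table it came from, so the reduce step can
    // tell the two sides of the join apart.
    static final class Tagged {
        final String table;
        final String payload;
        Tagged(String table, String payload) {
            this.table = table;
            this.payload = payload;
        }
    }

    // Stand-in for Hadoop's context.write(key, value).
    static void emit(Map<String, List<Tagged>> out, String key, Tagged value) {
        out.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Mapper for the hypothetical customers export. Assumed layout: customerId,name
    static void mapCustomer(String line, Map<String, List<Tagged>> out) {
        String[] f = line.split(",", 2);
        emit(out, f[0], new Tagged("customers", f[1]));
    }

    // Mapper for the hypothetical orders export. Assumed layout: orderId,customerId,amount
    static void mapOrder(String line, Map<String, List<Tagged>> out) {
        String[] f = line.split(",", 3);
        emit(out, f[1], new Tagged("orders", f[0] + ":" + f[2]));
    }

    // Runs both "mappers", groups by join key (the shuffle), then performs
    // an inner join per key (the reducer).
    public static List<String> join(List<String> customers, List<String> orders) {
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        for (String c : customers) {
            mapCustomer(c, grouped);
        }
        for (String o : orders) {
            mapOrder(o, grouped);
        }

        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> group : grouped.entrySet()) {
            List<String> names = new ArrayList<>();
            List<String> orderInfo = new ArrayList<>();
            for (Tagged t : group.getValue()) {
                if (t.table.equals("customers")) {
                    names.add(t.payload);
                } else {
                    orderInfo.add(t.payload);
                }
            }
            // Keys present on only one side produce no output (inner join).
            for (String name : names) {
                for (String info : orderInfo) {
                    joined.add(group.getKey() + "," + name + "," + info);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList("1,Alice", "2,Bob");
        List<String> orders = Arrays.asList("100,1,9.50", "101,1,3.00", "102,3,7.25");
        for (String row : join(customers, orders)) {
            System.out.println(row);
        }
    }
}
```

Tagging each value with its source table is what makes "a different mapper
for each table" useful: both mappers emit the same join key, and the reducer
separates the two sides by tag.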

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent <ju...@gmail.com> wrote:

> Thanks Bejoy,
> I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
> Path parameter. Are these paths just ignored here?

Re: Multiple Mappers for Multiple Tables

Posted by Justin Vincent <ju...@gmail.com>.
Thanks Bejoy,
I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
Path parameter. Are these paths just ignored here?

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks <be...@gmail.com> wrote:

> Hi Justin,
> Just to add to my response: if you need to fetch data from an RDBMS in
> your mappers using custom MapReduce code, you can use DBInputFormat with
> MultipleInputs. Be careful with the number of mappers in your application,
> since databases limit the maximum number of simultaneous connections. Also
> make sure that the same query is not executed n times in n mappers all
> fetching the same data; that would just waste network bandwidth. Sqoop +
> Hive would be my recommendation and is a good combination for such use
> cases. If you are comfortable with Pig, you can also look into Pig instead
> of Hive.
>
> Hope it helps!...
>
> Regards
> Bejoy.K.S

Re: Multiple Mappers for Multiple Tables

Posted by Bejoy Ks <be...@gmail.com>.
Hi Justin,
Just to add to my response: if you need to fetch data from an RDBMS in
your mappers using custom MapReduce code, you can use DBInputFormat with
MultipleInputs. Be careful with the number of mappers in your application,
since databases limit the maximum number of simultaneous connections. Also
make sure that the same query is not executed n times in n mappers all
fetching the same data; that would just waste network bandwidth. Sqoop +
Hive would be my recommendation and is a good combination for such use
cases. If you are comfortable with Pig, you can also look into Pig instead
of Hive.
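
To illustrate the two cautions above, here is a rough sketch in plain Java
of how a table could be divided so that each mapper reads a disjoint slice
and the number of slices never exceeds the connection limit. It assumes the
table has a numeric id column with a known range; the class name, the Range
type, and the generated SQL are hypothetical and are not part of Hadoop.
Hadoop's own database input formats do their own splitting, so this only
shows the idea.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangesSketch {

    // Half-open id range [lo, hi) that a single mapper would query.
    public static final class Range {
        public final long lo;
        public final long hi;

        Range(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        // Builds the query for this slice; table and column names are
        // supplied by the caller and are purely illustrative.
        public String toQuery(String table, String idColumn) {
            return "SELECT * FROM " + table
                    + " WHERE " + idColumn + " >= " + lo
                    + " AND " + idColumn + " < " + hi;
        }
    }

    // Splits [minId, maxIdExclusive) into at most maxConnections disjoint
    // ranges, so no row is fetched twice and the database never sees more
    // concurrent readers than it allows.
    public static List<Range> split(long minId, long maxIdExclusive, int maxConnections) {
        if (maxConnections < 1 || maxIdExclusive < minId) {
            throw new IllegalArgumentException("invalid range or connection limit");
        }
        long total = maxIdExclusive - minId;
        long n = Math.min(maxConnections, total); // never more ranges than ids
        List<Range> ranges = new ArrayList<>();
        long lo = minId;
        for (long i = 0; i < n; i++) {
            // Spread the remainder over the first ranges so that sizes
            // differ by at most one.
            long size = total / n + (i < total % n ? 1 : 0);
            ranges.add(new Range(lo, lo + size));
            lo += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : split(0, 10, 3)) {
            System.out.println(r.toQuery("orders", "id"));
        }
    }
}
```

Each range maps to one mapper, so the connection limit bounds the number of
mappers, and the half-open ranges guarantee that no two mappers fetch the
same row.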

Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks <be...@gmail.com> wrote:

> Justin,
> If I understand your requirement correctly, you need to pull in data from
> multiple RDBMS sources and join it, and maybe run some more custom
> operations on top. For this you don't need to write custom MapReduce code
> unless it is really required. You can achieve the same in two easy steps:
> - Import the data from the RDBMS into Hive using Sqoop (import)
> - Use Hive to do the join and processing on this data
>
> Hope it helps!..
>
> Regards
> Bejoy.K.S

Re: Multiple Mappers for Multiple Tables

Posted by Bejoy Ks <be...@gmail.com>.
Justin,
If I understand your requirement correctly, you need to pull in data from
multiple RDBMS sources and join it, and maybe run some more custom
operations on top. For this you don't need to write custom MapReduce code
unless it is really required. You can achieve the same in two easy steps:
- Import the data from the RDBMS into Hive using Sqoop (import)
- Use Hive to do the join and processing on this data

Hope it helps!..

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent <ju...@gmail.com> wrote:

> I would like to join some db tables, possibly from different databases, in
> an MR job.
>
> I would essentially like to use MultipleInputs, but that seems file
> oriented. I need a different mapper for each db table.
>
> Suggestions?
>
> Thanks!
>
> Justin Vincent
>