Posted to user@pig.apache.org by sr...@yahoo.com on 2012/11/26 19:54:55 UTC

Request for suggestions

Hi,


We have a scenario in which we want a single Hadoop job to create and manage multiple mapper tasks, where each mapper task queries a subset of columns in a relational database table. We looked into DataDrivenDBInputFormat, but that only seems to support partitioning in which each mapper task queries a subset of rows in a relational database table.

I am not sure if Pig can help us in this case.

Appreciate any suggestions in this regard.

Thanks
srinivas

Re: Request for suggestions

Posted by sr...@yahoo.com.
Let us say we have a relational table with 100 columns and 100000 rows of data.

Using the DataDrivenDBInputFormat class, we were able to provide the min and max ids (say 1 and 100000) and let Hadoop spin off as many mapper tasks as needed, where each task handles a subset of the data (i.e. some number of rows). In other words, partitioning based on rows.
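For what it's worth, the row-based partitioning described above boils down to simple range arithmetic over the id column. Below is a plain-Java sketch of that split logic (our own illustration of the idea, not code taken from DataDrivenDBInputFormat itself):

```java
import java.util.ArrayList;
import java.util.List;

public class RowSplits {
    // Carve the inclusive id range [minId, maxId] into one [lo, hi]
    // range per mapper, mirroring row-based input splits.
    static List<long[]> split(long minId, long maxId, int numMappers) {
        List<long[]> splits = new ArrayList<>();
        long total = maxId - minId + 1;
        long chunk = (total + numMappers - 1) / numMappers; // ceiling division
        for (long lo = minId; lo <= maxId; lo += chunk) {
            long hi = Math.min(lo + chunk - 1, maxId);
            splits.add(new long[]{lo, hi});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : split(1, 100000, 4)) {
            System.out.println("WHERE id >= " + s[0] + " AND id <= " + s[1]);
        }
    }
}
```

With min 1, max 100000 and 4 mappers, this yields the ranges 1-25000, 25001-50000, 50001-75000 and 75001-100000, one per mapper.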


But we also want to partition the columns, so that a single Hadoop job can spin off say 20 mapper tasks where each mapper task works with 5 columns of data. In other words, partitioning based on columns.
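To make the column-based idea concrete, here is a hypothetical sketch (the table name, id column, and column names are made up for illustration) of how 100 columns could be carved into 20 groups of 5, along with the SELECT each mapper would issue. In this sketch every query also selects the id column, so the per-column results could be re-joined by row later:

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnSplits {
    // Build one SELECT statement per group of colsPerMapper columns,
    // always including the id column so rows stay correlatable.
    static List<String> columnQueries(List<String> columns, int colsPerMapper,
                                      String table, String idColumn) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < columns.size(); i += colsPerMapper) {
            List<String> group =
                columns.subList(i, Math.min(i + colsPerMapper, columns.size()));
            queries.add("SELECT " + idColumn + ", " + String.join(", ", group)
                        + " FROM " + table);
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> cols = new ArrayList<>();
        for (int i = 1; i <= 100; i++) cols.add("col" + i);
        List<String> queries = columnQueries(cols, 5, "mytable", "id");
        System.out.println(queries.size() + " mappers");
        System.out.println(queries.get(0));
    }
}
```

A custom InputFormat could emit one such query per input split, which is essentially what we are hoping already exists somewhere.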

If we were to use Cassandra (and not a relational table) with Hadoop, Cassandra provides something called ColumnFamilyInputFormat, which seems to offer what we are looking for. I am not 100% sure though.


Hope it is clear. 

regards,
srinivas


________________________________
 From: Jonathan Coveney <jc...@gmail.com>
To: "user@pig.apache.org" <us...@pig.apache.org>; srinivasrajagopalan@yahoo.com 
Sent: Monday, November 26, 2012 1:14 PM
Subject: Re: Request for suggestions
 

Can you flesh out what you want it to do a little more? Maybe some example queries?




Re: Request for suggestions

Posted by Jonathan Coveney <jc...@gmail.com>.
Can you flesh out what you want it to do a little more? Maybe some example queries?

