Posted to user@pig.apache.org by Terry Siu <Te...@datasphere.com> on 2012/09/12 20:55:28 UTC

Batching transformations in Pig

Hi all,

I'm wondering if anyone has experience with my following scenario:

I have an HBase table loaded with millions of records. I want to load these records into Pig, process each batch of 1000 by calling an external API, and then associate the results with the Tuples in each batch. I'm having difficulty figuring out how to do this in Pig (via a UDF), if it's even possible or a valid scenario for Pig. In a nutshell, this is what I want:

[Input]
ID1,A,B
ID2,C,D
ID3,E,F
ID4,G,H
...
IDn,Y,Z

For example purposes, let's say the batch size is 2. I'd like ID1 and ID2 to be batched together, the external API called once for that batch, and the returned Tuples to include the data from the API call. Similarly, ID3 and ID4 are batched together, the external API is called, and the returned Tuples include the data from the API. So, I'd like my output to be:

[Output]
ID1,A,B,1,2
ID2,C,D,3,4
ID3,E,F,5
ID4,G,H,6,7,8
...
IDn,Y,Z

Yes, I could call the API per record, but I want to reduce the number of API calls, so I'd like to batch sets of records and then call the API once per batch.

Hope this makes sense. Is this possible in Pig with a UDF?

-Terry

PS: I did try implementing an Accumulator by grouping my records on a single constant value, hoping that accumulate() and getValue() would be called once per batch, but no luck.
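
For reference, that attempt looked roughly like this (a sketch with hypothetical names; MyAccumulatorUDF stands in for my actual UDF, and raw is the relation loaded from HBase):

-- Sketch of the failed attempt (names are hypothetical).
keyed   = FOREACH raw GENERATE 1 AS k, id, a, b;
grouped = GROUP keyed BY k;   -- a constant key -> one group, one reducer
-- MyAccumulatorUDF implements org.apache.pig.Accumulator. Pig, not the
-- script, decides how many tuples each accumulate() call receives, so
-- the 1000-record batch boundaries can't be controlled from here.
result  = FOREACH grouped GENERATE FLATTEN(MyAccumulatorUDF(keyed));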


RE: Batching transformations in Pig

Posted by Terry Siu <Te...@datasphere.com>.
Thanks, Dmitriy, that's what I eventually concluded I needed to do.

-Terry


Re: Batching transformations in Pig

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Group, and pass the grouped sets to your batch-processing UDF?

so:

data:
id1 bucket1
id2 bucket2
id3 bucket2
id4 bucket1

bucketized = group data by bucket_id;

bucket1, { (id1, id4) }
bucket2, { (id2, id3) }

batch_processed = foreach bucketized generate MyUDF(data);
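
Fleshed out a bit, the whole pipeline could look something like this (just a sketch: the HBase table, column names, bucket count, and MyBatchUDF are all made up, and RANDOM() bucketing gives approximate rather than exact batch sizes):

raw = LOAD 'hbase://mytable'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'cf:a cf:b', '-loadKey true')
      AS (id:chararray, a:chararray, b:chararray);

-- Spray rows into buckets. With ~1M rows and 1000 buckets, each bag
-- averages ~1000 tuples; batch sizes are approximate, not exact.
keyed = FOREACH raw GENERATE id, a, b, (int)(RANDOM() * 1000) AS bucket_id;

bucketized = GROUP keyed BY bucket_id;

-- MyBatchUDF takes the whole bag, makes ONE external API call for the
-- batch, and returns a bag of the input tuples extended with whatever
-- fields came back from the API.
batch_processed = FOREACH bucketized GENERATE FLATTEN(MyBatchUDF(keyed));

STORE batch_processed INTO 'output';

If the batches need to be reproducible across runs, a hash of the row key modulo the bucket count can stand in for RANDOM().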


D
