Posted to user@pig.apache.org by Benjamin Smedberg <be...@smedbergs.us> on 2013/03/11 16:29:20 UTC

Performing multiple reductions from a single map job

I'm working on a crash processing system and trying to group large
amounts of data on multiple facets. Loading the data can be expensive,
so I'd really like to use a single map job. I understand that
multi-query execution in theory allows multiple STORE commands to be
served by a single map execution. Is there a way to EXPLAIN the plan of
an entire Pig script that has multiple STORE commands, to see how it
will be split into MapReduce jobs? I can only see a way to run EXPLAIN
on a single relation, which shows a single MapReduce job but doesn't
tell me how the jobs might be combined by multi-query execution. I'm
trying to figure out whether Pig will use a single map for the
following script, or whether there is a way to make it use a single map.

raw = LOAD ...;
processed = FOREACH raw GENERATE uuid, signature, AdapterVendorID, 
ExtensionsInstalled, ModulesLoaded; /* UDFs process the raw data into 
these fields */
filtered = FILTER processed BY some conditions here;

bygraphicsvendor = GROUP filtered BY (signature, AdapterVendorID);
byvendortotals = FOREACH bygraphicsvendor GENERATE group.signature, 
group.AdapterVendorID, COUNT(filtered) AS c;

STORE byvendortotals INTO ....;

withextensions = FOREACH filtered GENERATE signature, 
flatten(ExtensionsInstalled);
byextension = GROUP withextensions BY (signature, extensionID);
byextensiontotals = FOREACH byextension GENERATE group.signature, 
group.extensionID, COUNT(withextensions) AS c;

STORE byextensiontotals INTO ...;

--BDS

Re: Performing multiple reductions from a single map job

Posted by Johnny Zhang <xi...@cloudera.com>.

Hi, Benjamin:
You can put all your commands in one script.pig file and try to run:

pig -x mapreduce -e 'explain -script script.pig'

It will explain the entire flow.

Johnny
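
For reference, the same explain command can also be typed at the Grunt
prompt instead of being passed through -e. The lines below are a minimal
sketch, assuming the script is saved as script.pig as in the reply above;
the output directory name explain-out is hypothetical.

```pig
-- Typed at the Grunt prompt (started with: pig -x mapreduce).

-- Explain the whole script, including both STORE commands, as one plan.
explain -script script.pig

-- Same plans, written to files instead of the console.
-- 'explain-out' is a hypothetical directory name chosen for illustration.
explain -script script.pig -out explain-out
```

To see what multi-query execution changes, one option is to run the
explain a second time with the optimization turned off (pig -M, also
spelled -no_multiquery) and compare how many MapReduce jobs each plan
contains. In a merged plan the two GROUP branches should appear inside
one job fed by a single LOAD rather than as two separate jobs, though
the exact operator names in the output depend on the Pig version.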

