Posted to user@pig.apache.org by Benjamin Smedberg <be...@smedbergs.us> on 2013/03/11 16:29:20 UTC
Performing multiple reductions from a single map job
I'm working on a crash processing system and trying to group large
amounts of data on multiple facets. Loading the data can be expensive,
so I'd really like to use a single map job. I understand that
multi-query execution in theory allows for multiple STORE commands to
come from a single map execution. Is there a way to EXPLAIN the plan of
an entire pig script that has multiple STORE commands, to tell how it's
going to run mapreduce? I can only see a way to run EXPLAIN on a single
relation, which shows a single mapreduce but doesn't really tell how
they might be combined with multiquery execution. I'm trying to figure
out whether pig will use a single map for the following pig statement,
or whether there is a way to make it use a single map.
raw = LOAD ...;
processed = FOREACH raw GENERATE uuid, signature, AdapterVendorID,
ExtensionsInstalled, ModulesLoaded; /* UDFs process the raw data into
these fields */
filtered = FILTER processed BY some conditions here;
bygraphicsvendor = GROUP filtered BY (signature, AdapterVendorID);
byvendortotals = FOREACH bygraphicsvendor GENERATE group.signature,
group.AdapterVendorID, COUNT(filtered) AS c;
STORE byvendortotals INTO ....;
withextensions = FOREACH filtered GENERATE signature,
flatten(ExtensionsInstalled);
byextension = GROUP withextensions BY (signature, extensionID);
byextensiontotals = FOREACH byextension GENERATE group.signature,
group.extensionID, COUNT(withextensions) AS c;
STORE byextensiontotals INTO ...;
--BDS
Re: Performing multiple reductions from a single map job
Posted by Johnny Zhang <xi...@cloudera.com>.
Hi, Benjamin:
You can put all your commands in one script.pig file and try to run:
pig -x mapreduce -e 'explain -script script.pig'
It will explain the entire flow.
Johnny
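One way to check whether the two STOREs end up sharing a map phase is to dump the explain output twice, once with multiquery on (the default) and once with it turned off via -M (-no_multiquery), and compare the two plans. This is only a rough sketch: the output file names are made up for illustration, script.pig is the file name from the reply above, and it assumes the explain plan reflects the -M setting the same way a real run does.

```shell
#!/bin/sh
# Sketch only: output file names are hypothetical; script.pig is the
# file name used in the reply above.
SCRIPT=script.pig

if command -v pig >/dev/null 2>&1; then
    # Multiquery is on by default: explain the whole script, all STOREs included.
    pig -x mapreduce -e "explain -script $SCRIPT" > explain_multiquery.txt

    # -M (-no_multiquery) turns multiquery off, for comparison.
    pig -x mapreduce -M -e "explain -script $SCRIPT" > explain_no_multiquery.txt

    # Differences in the map reduce plan section hint at which jobs were merged.
    diff explain_multiquery.txt explain_no_multiquery.txt
else
    echo "pig not found on PATH" >&2
fi
```

If the multiquery plan shows one job that loads the data once and feeds both GROUPs, while the -M plan shows a separate job (and a separate LOAD) per STORE, the script is getting the single-map behaviour asked about.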