Posted to user@pig.apache.org by Dexin Wang <wa...@gmail.com> on 2013/11/12 01:36:09 UTC
replicated join gets extra job
Hi,
I'm running a job like this:
raw_large = LOAD 'lots_of_files' AS (...);
raw_filtered = FILTER raw_large BY ...;
large_table = FOREACH raw_filtered GENERATE f1, f2, f3,....;
joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING
'replicated';
joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4)
USING 'replicated';
joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6)
USING 'replicated';
joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_4 BY (key8)
USING 'replicated';
Basically, I'm left-joining a large table with four relatively small tables
using replicated joins.
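For completeness, the full pattern including the small-table loads (which the snippet above omits) might look like the sketch below; the paths and schemas here are purely hypothetical:

```pig
-- Hypothetical loads for the four small lookup tables
-- (paths and field names are illustrative, not from the original script).
config_table_1 = LOAD 'config1' AS (key2:chararray, v1:chararray);
config_table_2 = LOAD 'config2' AS (key4:chararray, v2:chararray);
config_table_3 = LOAD 'config3' AS (key6:chararray, v3:chararray);
config_table_4 = LOAD 'config4' AS (key8:chararray, v4:chararray);

-- Chain of left fragment-replicate joins: each small table is shipped to
-- every mapper and held in memory, so these joins should run map-side.
joined_1 = JOIN large_table BY (key1) LEFT, config_table_1 BY (key2) USING 'replicated';
joined_2 = JOIN joined_1 BY (key3) LEFT, config_table_2 BY (key4) USING 'replicated';
joined_3 = JOIN joined_2 BY (key5) LEFT, config_table_3 BY (key6) USING 'replicated';
joined_4 = JOIN joined_3 BY (key7) LEFT, config_table_4 BY (key8) USING 'replicated';
```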
I see that the first job has 120 map tasks and no reducers, and seems to be
doing the load and filtering. It is followed by a second job with 26 map
tasks that seems to be doing the joins.
Shouldn't there be only one job, with the joins done in the map phase of
that first job?
The four config tables (files) have these sizes, respectively:
3MB
220kB
2kB
100kB
This is running Pig 0.9.2 on AWS EMR xlarge instances, which have 15 GB of
memory each.
Thanks!
Re: replicated join gets extra job
Posted by Pradeep Gollakota <pr...@gmail.com>.
Use the ILLUSTRATE or EXPLAIN operators to look at the details of the
physical execution plan. At first glance it doesn't look like you'd need a
second job to do the joins, but if you can post the output of
ILLUSTRATE/EXPLAIN, we can look into it.
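To produce that output, one way is to add the operators at the end of the script, using the final relation name from the original message (`joined_4`):

```pig
-- Print the logical, physical, and MapReduce plans for the final relation.
-- The MapReduce plan section shows how many jobs Pig compiled the script
-- into, which should reveal where the second job comes from.
EXPLAIN joined_4;

-- Alternatively, run the pipeline on a small synthesized sample of records
-- to see how data flows through each operator.
ILLUSTRATE joined_4;
```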