Posted to user@pig.apache.org by Padmashree Ravindra <pa...@gmail.com> on 2009/10/15 20:18:57 UTC

Simultaneous execution of Map tasks from different JOINs

Hi All,
I have a query regarding the execution of the Map tasks in Pig/Hadoop.

Suppose I have a query with 2 JOINs:
JOIN 1 - between sets A and B
JOIN 2 - between sets C and D
(The two pairs of sets are independent - no dependency between them.)
COGROUP using the results of JOIN 1 and JOIN 2
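
For concreteness, a query of this shape could look roughly like the sketch below. The input paths, field names, join keys and the output path are all hypothetical; only the structure (two independent JOINs feeding one COGROUP) comes from the description above.

```pig
-- Hypothetical inputs and schemas; only the shape of the query matters.
A = LOAD 'input/a' AS (k:int, va:chararray);
B = LOAD 'input/b' AS (k:int, vb:chararray);
C = LOAD 'input/c' AS (k:int, vc:chararray);
D = LOAD 'input/d' AS (k:int, vd:chararray);

AB = JOIN A BY k, B BY k;   -- JOIN 1
CD = JOIN C BY k, D BY k;   -- JOIN 2, shares no input with JOIN 1

-- After a join, fields are disambiguated with the source relation name.
G = COGROUP AB BY A::k, CD BY C::k;

-- Print the logical / physical / MapReduce plans for G.
EXPLAIN G;

STORE G INTO 'output/g';
```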

To check how Pig creates MapReduce tasks, we used "explain" to see the
logical / physical / MapReduce information.
As expected, we found that separate MapReduce phases are required for each
of the JOINs. Our question is: would these two map phases (corresponding to
the 2 JOINs) execute in parallel or sequentially? Is there a way we can
control this execution, and also verify the execution process?

1. We are aware that if we have 10 map tasks in both the jobs and have 15
map task slots, then we can execute 5 tasks from the 2nd job simultaneously.
But in the above scenario, how is this handled? Is there a way we can
control this in Pig?
2. Also, we assume that through the logical plan, Pig will know that the
sets involved in the JOINs are independent. Is our understanding right?

Regards,
Padmashree

Re: Simultaneous execution of Map tasks from different JOINs

Posted by Ashutosh Chauhan <as...@gmail.com>.

> Our question is: would these two map phases (corresponding to the 2
> JOINs) execute in parallel or sequentially? Is there a way we can control
> this execution, and also verify the execution process?
>

Yes, the map phases of both joins will execute in parallel. If you are on
the latest trunk, you can verify this through the grunt shell. Execute your
script in the grunt shell and observe that the following messages are printed
on screen in quick succession as soon as your script starts executing.

 Submitting job: job_200910061853_0039 to execution engine.
 Submitting job: job_200910061853_0040 to execution engine.

This indicates that both jobs are running in parallel. As soon as both jobs
finish, a third job will be launched for the COGROUP. The same message will
be printed when that happens.

> 1. We are aware that if we have 10 map tasks in both the jobs and have 15
> map task slots, then we can execute 5 tasks from the 2nd job
> simultaneously.
> But in the above scenario, how is this handled? Is there a way we can
> control this in Pig?
>

As I said, by default the map phases of both joins will run in parallel. If
for some reason you want them to execute sequentially, store the result
immediately after each join operation and follow the store with "exec" in
your script.


> 2. Also, we assume that through the logical plan, Pig will know that the
> sets involved in the JOINs are independent. Is our understanding right?
>
>
Yep, Pig knows that.

Hope it helps,
Ashutosh