Posted to common-user@hadoop.apache.org by Kristi Morton <km...@cs.washington.edu> on 2009/06/05 02:12:51 UTC

Hadoop scheduling question

Hi,

I'm a Hadoop 17 user who is doing research with Prof. Magda Balazinska 
at the University of Washington on an improved progress indicator for 
Pig Latin.  We have a question regarding how Hadoop schedules Pig Latin 
queries with JOIN operators.  Does Hadoop schedule all MapReduce jobs in 
a script sequentially, or does it ever schedule two MapReduce jobs in 
parallel?  For example, if the output of two MapReduce jobs is later 
joined and each of these jobs only needs a subset of the cluster 
resources, would they be scheduled in parallel or in series?

I apologize if I sent this to the wrong list, but please let me know 
which list is most appropriate for this type of question.

Thanks,
Kristi

Re: Hadoop scheduling question

Posted by Pankil Doshi <fo...@gmail.com>.

Hello Kristi,

I am a Research Assistant at the University of Texas at Dallas. We are working
on RDF data and our queries involve many joins, but we are not able to carry
out all of the joins in a single job. We also tried running our Hadoop code as
Pig scripts and found that a new job is used for each join in the script. So I
think it is basically a sequential process for the kinds of joins where the
output of one job is required as an input to another.
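
To make the dependency argument concrete, here is a minimal Python sketch. It
is only an illustrative model, not Pig's or Hadoop's actual planner: the
function schedule_waves and the job names job_a, job_b and join_ab are
hypothetical. It groups jobs into "waves" based purely on which outputs each
job needs. Jobs in the same wave do not depend on each other, so they could in
principle run side by side; whether Hadoop or Pig really launches them
concurrently is not something this sketch settles.

```python
def schedule_waves(deps):
    """Group jobs into waves from a {job: [jobs it depends on]} mapping.

    Every job in a wave has all of its inputs produced by earlier waves,
    so jobs within one wave are independent of each other.
    """
    remaining = {job: set(needed) for job, needed in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A job is ready once every job it depends on has finished.
        ready = sorted(job for job, needed in remaining.items() if needed <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# Hypothetical script: two independent jobs whose outputs are later joined.
deps = {
    "job_a": [],
    "job_b": [],
    "join_ab": ["job_a", "job_b"],
}
print(schedule_waves(deps))
```

In this model the join job always lands in a later wave than the two jobs that
feed it, which matches the sequential behaviour described above, while job_a
and job_b share a wave because neither needs the other's output.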

Do let me know what you think of my viewpoint.

Thanks
Pankil
