Posted to common-user@hadoop.apache.org by Kristi Morton <km...@cs.washington.edu> on 2009/06/05 02:12:51 UTC
Hadoop scheduling question
Hi,
I'm a Hadoop 17 user who is doing research with Prof. Magda Balazinska
at the University of Washington on an improved progress indicator for
Pig Latin. We have a question regarding how Hadoop schedules Pig Latin
queries with JOIN operators. Does Hadoop schedule all MapReduce jobs in
a script sequentially or does it ever schedule two MapReduce jobs in
parallel? For example, if the output of two MapReduce jobs is later
joined and each of these jobs only needs a subset of the cluster
resources, would they be scheduled in parallel or in series?
I apologize if I sent this to the wrong list, but please let me know
which list is most appropriate for this type of question.
Thanks,
Kristi
Re: Hadoop scheduling question
Posted by Pankil Doshi <fo...@gmail.com>.
Hello Kristi,
I am a Research Assistant at the University of Texas at Dallas. We are working on
RDF data, and we come across many joins in our queries, but we are not able
to carry out all the joins in a single job. We also tried our Hadoop code using
Pig scripts and found that a new job is used for each join in a Pig script. So
basically I think it is a sequential process for the types of join where the
output of one job is required as input to another.
Do let me know what you think of my viewpoint.
Thanks
Pankil
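
The distinction raised in this thread (jobs that feed a join must finish before the join starts, while jobs that do not depend on each other could in principle overlap) can be illustrated with a small sketch. This is not Hadoop's or Pig's scheduler and makes no claim about what either actually does; the job names, the DEPENDENCIES table, and the run_in_waves helper are all hypothetical, and each "job" is just a short sleep.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time

# Hypothetical job graph: each job maps to the jobs whose output it reads.
# "load_a" and "load_b" are independent; "join_ab" needs both outputs.
DEPENDENCIES = {
    "load_a": [],
    "load_b": [],
    "join_ab": ["load_a", "load_b"],
}


def run_job(name, log, lock):
    """Stand-in for submitting one MapReduce job and waiting for it."""
    with lock:
        log.append(("start", name))
    time.sleep(0.05)  # pretend to do work
    with lock:
        log.append(("end", name))
    return name


def run_in_waves(dependencies, max_parallel=2):
    """Run jobs in waves: every job whose inputs are ready is submitted together."""
    log, lock = [], threading.Lock()
    done = set()
    remaining = dict(dependencies)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while remaining:
            # A job is ready once all the jobs it reads from have finished.
            ready = [job for job, deps in remaining.items()
                     if all(dep in done for dep in deps)]
            if not ready:
                raise ValueError("cycle in job dependencies")
            futures = [pool.submit(run_job, job, log, lock) for job in ready]
            for future in futures:
                done.add(future.result())  # wait for the whole wave
            for job in ready:
                del remaining[job]
    return log


if __name__ == "__main__":
    for event, job in run_in_waves(DEPENDENCIES):
        print(event, job)
```

With max_parallel=2 the two independent jobs are allowed to overlap, and with max_parallel=1 everything runs strictly in series; in both cases the join starts only after both of its inputs have finished.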