Posted to common-user@hadoop.apache.org by Nan Zhu <zh...@gmail.com> on 2012/06/14 08:58:00 UTC

what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Hi, all

I'm using gridmix2 to test my cluster. Its README file contains
statements like the following:

+1) Three stage map/reduce job
+     Input:      500GB compressed (2TB uncompressed) SequenceFile
+                 (k,v) = (5 words, 100 words)
+                 hadoop-env: FIXCOMPSEQ
+     Compute1:   keep 10% map, 40% reduce
+     Compute2:   keep 100% map, 77% reduce
+                 Input from Compute1
+     Compute3:   keep 116% map, 91% reduce
+                 Input from Compute2
+     Motivation: Many user workloads are implemented as pipelined map/reduce
+                 jobs, including Pig workloads


Can anyone tell me what "keep 10% map, 40% reduce" means here?

Best,

-- 
Nan Zhu
School of Electronic, Information and Electrical Engineering,229
Shanghai Jiao Tong University
800,Dongchuan Road,Shanghai,China
E-Mail: zhunansjtu@gmail.com

Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Posted by Nan Zhu <zh...@gmail.com>.

yes, what's the relationship between them?

On Mon, Jun 25, 2012 at 2:50 PM, gemini alex <ge...@gmail.com> wrote:

> Did you configure map output compression?

Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Posted by gemini alex <ge...@gmail.com>.

Did you configure map output compression?


2012/6/15 Chen He <ai...@gmail.com>

> Let me know when you get the correct answer.
>
> Chen

Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Posted by Chen He <ai...@gmail.com>.

Let me know when you get the correct answer.

Chen

On Thu, Jun 14, 2012 at 11:42 AM, Nan Zhu <zh...@gmail.com> wrote:

> Hi Chen,
>
> Thank you for your reply.
>
> But the README has no value larger than 100%, which would mean that the
> size of the intermediate results is never larger than the input size.
>
> That cannot be the case: because the input data is compressed, the
> generated data will expand to be much larger.
>
> This is just my guess. Can anyone correct me?
>
> Best,
>
> Nan

Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Posted by Nan Zhu <zh...@gmail.com>.

Hi Chen,

Thank you for your reply.

But the README has no value larger than 100%, which would mean that the
size of the intermediate results is never larger than the input size.

That cannot be the case: because the input data is compressed, the
generated data will expand to be much larger.

This is just my guess. Can anyone correct me?
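
To make the compression point concrete, here is a small sketch of how the
measurement basis changes the ratio. It assumes, without confirmation from
the README, that a "keep" ratio applies to uncompressed bytes. The function
name, the optional output compression ratio, and the 2TB = 2000GB conversion
are illustrative choices only.

```python
COMPRESSED_GB = 500.0     # on-disk input size from the README
UNCOMPRESSED_GB = 2000.0  # "2TB uncompressed" from the README, taken as 2000 GB


def map_output_vs_compressed_input(keep_map, output_compression_ratio=1.0):
    """Return map output size as a fraction of the compressed input size.

    Assumption: keep_map applies to uncompressed bytes, and the map output
    is stored at output_compression_ratio (1.0 means not compressed).
    """
    map_output_gb = UNCOMPRESSED_GB * keep_map * output_compression_ratio
    return map_output_gb / COMPRESSED_GB


# Uncompressed map output, keeping 10% of the uncompressed bytes.
print(map_output_vs_compressed_input(0.10))
# Same job, with map output compressed 4:1 like the input.
print(map_output_vs_compressed_input(0.10, 0.25))
# Keeping 100% without output compression exceeds the compressed input size.
print(map_output_vs_compressed_input(1.0))
```

Under these assumptions, whether the output looks larger or smaller than the
input depends on whether sizes are compared compressed or uncompressed.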

Best,

Nan


On Thu, Jun 14, 2012 at 11:50 PM, Chen He <ai...@gmail.com> wrote:

> Hi Nan,
>
> Probably the map stage outputs 10% of the total input, and the reduce
> stage outputs 40% of the intermediate results (which are 10% of the
> total input).
>
> For example, with 500GB of input, the data will be 50GB after the map
> stage and 20GB after the reduce stage.
>
> It may be similar to the loadgen in the Hadoop test examples.
>
> Does anyone have a suggestion?
>
> Chen
> System Architect Intern @ ZData
> PhD student@CSE Dept.

Re: what does "keep 10% map, 40% reduce" mean in gridmix2's README?

Posted by Chen He <ai...@gmail.com>.

Hi Nan,

Probably the map stage outputs 10% of the total input, and the reduce
stage outputs 40% of the intermediate results (which are 10% of the
total input).

For example, with 500GB of input, the data will be 50GB after the map
stage and 20GB after the reduce stage.
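
The arithmetic of this reading can be sketched as follows. This only
illustrates the guess above and is not gridmix2's actual behavior. The
function name `stage_sizes` is hypothetical, and the sketch assumes that each
"keep" percentage is a plain byte ratio and that each later stage reads the
previous stage's reduce output.

```python
def stage_sizes(input_gb, keep_map, keep_reduce):
    """Return (map_output_gb, reduce_output_gb) for one job.

    Assumption: "keep X% map" means map output = X% of the job input,
    and "keep Y% reduce" means reduce output = Y% of the map output.
    """
    map_out = input_gb * keep_map
    reduce_out = map_out * keep_reduce
    return map_out, reduce_out


# (keep map, keep reduce) for Compute1..Compute3, from the README excerpt.
STAGES = [(0.10, 0.40), (1.00, 0.77), (1.16, 0.91)]

size_gb = 500.0  # compressed input size quoted in the README
for number, (keep_map, keep_reduce) in enumerate(STAGES, start=1):
    # Each stage's reduce output becomes the next stage's input.
    map_out, size_gb = stage_sizes(size_gb, keep_map, keep_reduce)
    print(f"Compute{number}: map output {map_out:.2f} GB, "
          f"reduce output {size_gb:.2f} GB")
```

Chaining the stages this way reproduces the 50GB and 20GB figures for
Compute1 and carries the 20GB forward as the input to Compute2.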

It may be similar to the loadgen in the Hadoop test examples.

Does anyone have a suggestion?

Chen
System Architect Intern @ ZData
PhD student@CSE Dept.
