Posted to user@hbase.apache.org by Something Something <ma...@gmail.com> on 2010/01/26 00:22:30 UTC

setNumReduceTasks(1)

If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
class be instantiated on only one machine... always?  I mean if I have a
cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
guaranteed to be instantiated on only 1 machine?

If the answer is yes, then I will use a static variable as a counter to see
how many rows have been added to my HBase table so far.  In my use case, I
want to write only N rows to a table.  Is there a better way to do this?
Please let me know.  Thanks.
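
For what it's worth, the single-reducer-with-a-counter idea can be sketched
in plain Java (this is an illustrative stand-in, not the actual Hadoop
Reducer API; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a reducer that writes at most maxRows rows.  With a single
// reduce task, every key passes through this one instance, so an ordinary
// instance field is enough - no static variable needed.
class LimitingReducer {
    private final int maxRows;
    private int written = 0;
    private final List<String> out = new ArrayList<>(); // stand-in for the HBase table

    LimitingReducer(int maxRows) { this.maxRows = maxRows; }

    // Called once per key, as a single Hadoop reducer would be.
    void reduce(String key) {
        if (written < maxRows) {
            out.add(key);   // stand-in for writing a row to the table
            written++;
        }
    }

    List<String> rows() { return out; }
}
```

One caveat even in the single-task case: speculative execution or task
re-execution after a failure can run the reducer more than once, so the
count only holds within one task attempt.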

Re: setNumReduceTasks(1)

Posted by Something Something <ma...@gmail.com>.
It is the latter, not the former.  Yes, but approximation is not allowed in
this case.  I'm not sure how splitting into 2 jobs would help, but let me
think more about it.  Thanks for the help.

On Sat, Jan 30, 2010 at 1:35 AM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:

>
> Top K is slightly more complicated (in comparison) to implement efficiently
> : you might want to look at other projects like Pig to see how they do it
> (to compare and look at ideas).
>
> Just to get an understanding - your mappers generate <key, value>, and you
> want to pick the top K based on value on the reducer side?
> Or can you have multiple keys coming in from various mappers that you need
> to aggregate at the reducer?
>
>
> If the former (that is, the key is unique), then a combiner that emits the
> top K per mapper, and then a single reducer which sorts and picks from the
> M * C * K tuples, should do the trick (M == number of mappers, C == avg
> number of combiner invocations per mapper, K == number of output tuples
> required).
>
> If the latter, you can try heuristics to approximate the value, but it
> always has an error margin (to do it efficiently : this is something I ask
> in interviews :) ) which you will need to take into account - or you can
> just split it into two jobs : aggregate in job 1, top K in job 2.
>
> Regards,
> Mridul
>
>
>
> Something Something wrote:
>
>> N could be up to 1000, and the output from the Map job could be about 5
>> million.  We only want the top 1000 because the rest of it could be just
>> noise.  Thanks for your help.
>>
>> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <alex.baranov.v@gmail.com> wrote:
>>
>>> How big is N?  How big is the outcome of the Map job?
>>>
>>> Alex.
>>>
>>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>>
>>>> I am sorry, but I forgot to add one important piece of information.
>>>>
>>>> I don't want to write any random N rows to the table.  I want to write
>>>> the *top* N rows - meaning - I want to write the "key" values of the
>>>> Reducer in descending order.  Does this make sense?  Sorry for the
>>>> confusion.
>>>>
>>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
>>>> mridulm@yahoo-inc.com> wrote:
>>>>
>>>>> A possible solution is to emit only N rows from each mapper and then
>>>>> use 1 reduce task [*] - if the value of N is not very high.
>>>>> So you end up with at most m * N rows on the reducer instead of the
>>>>> full input set - and so the limit can be done more easily.
>>>>>
>>>>> If you are ok with some variance in the number of rows inserted (and
>>>>> if the value of N is very high), you can do more interesting things
>>>>> like N/m' rows per mapper - and multiple reducers (r) : with the
>>>>> assumption that each reducer will see at least N/r rows - and so you
>>>>> can limit to N/r per reducer : of course, there is a possible error
>>>>> that gets introduced here ...
>>>>>
>>>>> Regards,
>>>>> Mridul
>>>>>
>>>>> [*] Assuming you just want a simple limit - nothing else.
>>>>> Also note, each mapper might want to emit N rows instead of 'tweaks'
>>>>> like N/m rows, since it is possible that multiple mappers might have
>>>>> fewer than N/m rows to emit to begin with!
>>>>>
>>>>> Something Something wrote:
>>>>>
>>>>>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
>>>>>> the class be instantiated on only one machine... always?  I mean if I
>>>>>> have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
>>>>>> Reducer class guaranteed to be instantiated on only 1 machine?
>>>>>>
>>>>>> If the answer is yes, then I will use a static variable as a counter
>>>>>> to see how many rows have been added to my HBase table so far.  In my
>>>>>> use case, I want to write only N rows to a table.  Is there a better
>>>>>> way to do this?  Please let me know.  Thanks.
>

Re: setNumReduceTasks(1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Top K is slightly more complicated (in comparison) to implement
efficiently : you might want to look at other projects like Pig to see
how they do it (to compare and look at ideas).

Just to get an understanding - your mappers generate <key, value>, and
you want to pick the top K based on value on the reducer side?
Or can you have multiple keys coming in from various mappers that you
need to aggregate at the reducer?


If the former (that is, the key is unique), then a combiner that emits
the top K per mapper, and then a single reducer which sorts and picks
from the M * C * K tuples, should do the trick (M == number of mappers,
C == avg number of combiner invocations per mapper, K == number of output
tuples required).

If the latter, you can try heuristics to approximate the value, but it
always has an error margin (to do it efficiently : this is something I
ask in interviews :) ) which you will need to take into account - or you
can just split it into two jobs : aggregate in job 1, top K in job 2.

Regards,
Mridul


Something Something wrote:
> N could be up to 1000, and the output from the Map job could be about 5
> million.  We only want the top 1000 because the rest of it could be just
> noise.  Thanks for your help.
> 
> On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <al...@gmail.com> wrote:
> 
>> How big is N?  How big is the outcome of the Map job?
>>
>> Alex.
>>
>> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> I am sorry, but I forgot to add one important piece of information.
>>>
>>> I don't want to write any random N rows to the table.  I want to write
>>> the *top* N rows - meaning - I want to write the "key" values of the
>>> Reducer in descending order.  Does this make sense?  Sorry for the
>>> confusion.
>>>
>>> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
>>> mridulm@yahoo-inc.com> wrote:
>>>
>>>> A possible solution is to emit only N rows from each mapper and then
>>>> use 1 reduce task [*] - if the value of N is not very high.
>>>> So you end up with at most m * N rows on the reducer instead of the
>>>> full input set - and so the limit can be done more easily.
>>>>
>>>>
>>>> If you are ok with some variance in the number of rows inserted (and
>>>> if the value of N is very high), you can do more interesting things
>>>> like N/m' rows per mapper - and multiple reducers (r) : with the
>>>> assumption that each reducer will see at least N/r rows - and so you
>>>> can limit to N/r per reducer : of course, there is a possible error
>>>> that gets introduced here ...
>>>>
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> [*] Assuming you just want a simple limit - nothing else.
>>>> Also note, each mapper might want to emit N rows instead of 'tweaks'
>>>> like N/m rows, since it is possible that multiple mappers might have
>>>> fewer than N/m rows to emit to begin with!
>>>>
>>>>
>>>>
>>>> Something Something wrote:
>>>>
>>>>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
>>>>> the class be instantiated on only one machine... always?  I mean if I
>>>>> have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
>>>>> Reducer class guaranteed to be instantiated on only 1 machine?
>>>>>
>>>>> If the answer is yes, then I will use a static variable as a counter
>>>>> to see how many rows have been added to my HBase table so far.  In my
>>>>> use case, I want to write only N rows to a table.  Is there a better
>>>>> way to do this?  Please let me know.  Thanks.
>>>>

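A minimal sketch of the selection step in that combiner/reducer approach -
keeping only the K largest values with a bounded min-heap (plain Java, not
tied to the Hadoop API, and assuming the values are simple integers):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopK {
    // Keep the K largest values seen so far.  A combiner would run this
    // over each mapper's output, and the single reducer would run it again
    // over the M * C * K surviving tuples.
    static List<Integer> topK(List<Integer> values, int k) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int v : values) {
            heap.offer(v);
            if (heap.size() > k) heap.poll(); // drop the current smallest
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // descending, as requested
        return result;
    }
}
```

Each element costs O(log K), so a mapper never holds more than K values in
memory regardless of input size.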

Re: setNumReduceTasks(1)

Posted by Something Something <ma...@gmail.com>.
N could be up to 1000, and the output from the Map job could be about 5
million.  We only want the top 1000 because the rest of it could be just
noise.  Thanks for your help.

On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <al...@gmail.com>wrote:

> How big is N?  How big is the outcome of the Map job?
>
> Alex.
>
> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
> mailinglists19@gmail.com> wrote:
>
> > I am sorry, but I forgot to add one important piece of information.
> >
> > I don't want to write any random N rows to the table.  I want to write
> > the *top* N rows - meaning - I want to write the "key" values of the
> > Reducer in descending order.  Does this make sense?  Sorry for the
> > confusion.
> >
> > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
> > mridulm@yahoo-inc.com> wrote:
> >
> > > A possible solution is to emit only N rows from each mapper and then
> > > use 1 reduce task [*] - if the value of N is not very high.
> > > So you end up with at most m * N rows on the reducer instead of the
> > > full input set - and so the limit can be done more easily.
> > >
> > >
> > > If you are ok with some variance in the number of rows inserted (and
> > > if the value of N is very high), you can do more interesting things
> > > like N/m' rows per mapper - and multiple reducers (r) : with the
> > > assumption that each reducer will see at least N/r rows - and so you
> > > can limit to N/r per reducer : of course, there is a possible error
> > > that gets introduced here ...
> > >
> > >
> > > Regards,
> > > Mridul
> > >
> > > [*] Assuming you just want a simple limit - nothing else.
> > > Also note, each mapper might want to emit N rows instead of 'tweaks'
> > > like N/m rows, since it is possible that multiple mappers might have
> > > fewer than N/m rows to emit to begin with!
> > >
> > >
> > >
> > > Something Something wrote:
> > >
> > > > If I set the # of reduce tasks to 1 using setNumReduceTasks(1),
> > > > would the class be instantiated on only one machine... always?  I
> > > > mean if I have a cluster of say 1 master, 10 workers & 3 zookeepers,
> > > > is the Reducer class guaranteed to be instantiated on only 1 machine?
> > > >
> > > > If the answer is yes, then I will use a static variable as a counter
> > > > to see how many rows have been added to my HBase table so far.  In
> > > > my use case, I want to write only N rows to a table.  Is there a
> > > > better way to do this?  Please let me know.  Thanks.
> > >
> > >
> >
>

Re: setNumReduceTasks(1)

Posted by Alex Baranov <al...@gmail.com>.
How big is N?  How big is the outcome of the Map job?

Alex.

On Fri, Jan 29, 2010 at 7:36 PM, Something Something <
mailinglists19@gmail.com> wrote:

> I am sorry, but I forgot to add one important piece of information.
>
> I don't want to write any random N rows to the table.  I want to write the
> *top* N rows - meaning - I want to write the "key" values of the Reducer
> in descending order.  Does this make sense?  Sorry for the confusion.
>
> On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <
> mridulm@yahoo-inc.com> wrote:
>
> >
> > A possible solution is to emit only N rows from each mapper and then use
> > 1 reduce task [*] - if the value of N is not very high.
> > So you end up with at most m * N rows on the reducer instead of the full
> > input set - and so the limit can be done more easily.
> >
> >
> > If you are ok with some variance in the number of rows inserted (and if
> > the value of N is very high), you can do more interesting things like
> > N/m' rows per mapper - and multiple reducers (r) : with the assumption
> > that each reducer will see at least N/r rows - and so you can limit to
> > N/r per reducer : of course, there is a possible error that gets
> > introduced here ...
> >
> >
> > Regards,
> > Mridul
> >
> > [*] Assuming you just want a simple limit - nothing else.
> > Also note, each mapper might want to emit N rows instead of 'tweaks'
> > like N/m rows, since it is possible that multiple mappers might have
> > fewer than N/m rows to emit to begin with!
> >
> >
> >
> > Something Something wrote:
> >
> > > If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
> > > the class be instantiated on only one machine... always?  I mean if I
> > > have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
> > > Reducer class guaranteed to be instantiated on only 1 machine?
> > >
> > > If the answer is yes, then I will use a static variable as a counter
> > > to see how many rows have been added to my HBase table so far.  In my
> > > use case, I want to write only N rows to a table.  Is there a better
> > > way to do this?  Please let me know.  Thanks.
> >
> >
>

Re: setNumReduceTasks(1)

Posted by Something Something <ma...@gmail.com>.
I am sorry, but I forgot to add one important piece of information.

I don't want to write any random N rows to the table.  I want to write the
*top* N rows - meaning - I want to write the "key" values of the Reducer in
descending order.  Does this make sense?  Sorry for the confusion.

On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <mridulm@yahoo-inc.com
> wrote:

>
> A possible solution is to emit only N rows from each mapper and then use 1
> reduce task [*] - if the value of N is not very high.
> So you end up with at most m * N rows on the reducer instead of the full
> input set - and so the limit can be done more easily.
>
>
> If you are ok with some variance in the number of rows inserted (and if
> the value of N is very high), you can do more interesting things like N/m'
> rows per mapper - and multiple reducers (r) : with the assumption that
> each reducer will see at least N/r rows - and so you can limit to N/r per
> reducer : of course, there is a possible error that gets introduced here ...
>
>
> Regards,
> Mridul
>
> [*] Assuming you just want a simple limit - nothing else.
> Also note, each mapper might want to emit N rows instead of 'tweaks' like
> N/m rows, since it is possible that multiple mappers might have fewer than
> N/m rows to emit to begin with!
>
>
>
> Something Something wrote:
>
>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
>> class be instantiated on only one machine... always?  I mean if I have a
>> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>> guaranteed to be instantiated on only 1 machine?
>>
>> If the answer is yes, then I will use a static variable as a counter to
>> see how many rows have been added to my HBase table so far.  In my use
>> case, I want to write only N rows to a table.  Is there a better way to
>> do this?  Please let me know.  Thanks.
>>
>
>

Re: setNumReduceTasks(1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
A possible solution is to emit only N rows from each mapper and then use
1 reduce task [*] - if the value of N is not very high.
So you end up with at most m * N rows on the reducer instead of the full
input set - and so the limit can be done more easily.


If you are ok with some variance in the number of rows inserted (and
if the value of N is very high), you can do more interesting things like
N/m' rows per mapper - and multiple reducers (r) : with the assumption
that each reducer will see at least N/r rows - and so you can limit to
N/r per reducer : of course, there is a possible error that gets
introduced here ...


Regards,
Mridul

[*] Assuming you just want a simple limit - nothing else.
Also note, each mapper might want to emit N rows instead of 'tweaks'
like N/m rows, since it is possible that multiple mappers might have
fewer than N/m rows to emit to begin with!


Something Something wrote:
> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
> class be instantiated on only one machine... always?  I mean if I have a
> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
> guaranteed to be instantiated on only 1 machine?
> 
> If the answer is yes, then I will use a static variable as a counter to
> see how many rows have been added to my HBase table so far.  In my use
> case, I want to write only N rows to a table.  Is there a better way to
> do this?  Please let me know.  Thanks.

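The per-mapper pre-limit described above can be sketched like this (a plain
Java stand-in for a Mapper with a cleanup step; the names are illustrative,
and this assumes the simple-limit case from [*], not top-N):

```java
import java.util.ArrayList;
import java.util.List;

// Each "mapper" buffers at most n rows locally and emits them at cleanup
// time, so a single reducer receives at most m * n rows (m == number of
// mappers) instead of the full input set.
class PreLimitingMapper {
    private final int n;
    private final List<String> buffer = new ArrayList<>();

    PreLimitingMapper(int n) { this.n = n; }

    void map(String row) {                 // called once per input row
        if (buffer.size() < n) buffer.add(row);
    }

    List<String> cleanup() {               // emitted to the single reducer
        return buffer;
    }
}
```

The reducer then only has to apply the final limit over at most m * n rows.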

Re: setNumReduceTasks(1)

Posted by Alex Baranov <al...@gmail.com>.
Since the MapReduce programming model defines only one "communication"
point between jobs - the one that occurs after all Map tasks are done and
before the Reduce tasks begin - I believe that any solution to your problem
will come at the price of lower performance.

That said, I don't think that with 10 workers you should use the
"setNumReduceTasks(1)" tactic, since performance will degrade badly.
Of course this depends a lot on N: if N is quite small and everything
you're going to output from your reduce task will fit on one datanode,
then *maybe* your strategy can be considered.

If N is really very big and there is a lot of work to do before the
Reducers should stop, then I'd consider communicating by storing progress
info in DFS (the implementation will not be straightforward, though, since
we don't want to hurt performance much).

Alex.

On Tue, Jan 26, 2010 at 1:22 AM, Something Something <
mailinglists19@gmail.com> wrote:

> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
> class be instantiated on only one machine... always?  I mean if I have a
> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
> guaranteed to be instantiated on only 1 machine?
>
> If the answer is yes, then I will use a static variable as a counter to
> see how many rows have been added to my HBase table so far.  In my use
> case, I want to write only N rows to a table.  Is there a better way to
> do this?  Please let me know.  Thanks.
>

Re: setNumReduceTasks(1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Jeff Zhang wrote:
> Mridul,
> 
> What do you mean by "Counters are not synchronized in 'real-time'"?
> As I know, the JT will aggregate Counters from the TTs, so I think the
> aggregated Counter in the JT should be correct.

Aggregate counters are guaranteed to be correct at the end of a logical
state - not necessarily in between.
Consider cases of mapper/reducer task re-execution, caching at the task
nodes (counters piggyback on the heartbeat - and so arrive only every XX
seconds), etc.

So trying to limit output based on a counter would typically give
sub-optimal results.

Regards,
Mridul

> 
> 
> On Tue, Jan 26, 2010 at 3:08 PM, Mridul Muralidharan
> <mr...@yahoo-inc.com>wrote:
> 
>> Jeff Zhang wrote:
>>
>>> *See my comments below*
>>>
>>>
>>> On Mon, Jan 25, 2010 at 3:22 PM, Something Something <
>>> mailinglists19@gmail.com> wrote:
>>>
>>>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
>>>> the class be instantiated on only one machine... always?  I mean if I
>>>> have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
>>>> Reducer class guaranteed to be instantiated on only 1 machine?
>>>>
>>>> *--Yes*
>>>>
>>>> If the answer is yes, then I will use a static variable as a counter
>>>> to see how many rows have been added to my HBase table so far.  In my
>>>> use case, I want to write only N rows to a table.  Is there a better
>>>> way to do this?  Please let me know.  Thanks.
>>>>
>>> *--Maybe you can use a Counter to track the number of rows you add to
>>> HBase; then you do not need to limit the reduce tasks to 1*
>>>
>>>
>> Counters are not synchronized in 'real-time' : so you can't use that to
>> limit at addition time, imo.
>> It is more for aggregation, not realtime messaging.
>>
>> - Mridul
>>
> 
> 
> 

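To see why a lazily-aggregated counter cannot enforce a hard limit, here is
a toy simulation (not Hadoop code; all names are made up): each task checks
a global total that is only updated when its local count is flushed - as
counters are when they piggyback on periodic heartbeats - so the tasks
collectively overshoot the limit:

```java
// Toy model: 'tasks' writers each consult a stale global counter before
// writing a row; local counts reach the global view only every flushEvery
// rows (standing in for counters arriving with periodic heartbeats).
class StaleCounterDemo {
    static int rowsWritten(int tasks, int limit, int flushEvery) {
        int global = 0;                  // the aggregated view (e.g. at the JT)
        int[] pending = new int[tasks];  // per-task counts not yet reported
        int written = 0;
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int t = 0; t < tasks; t++) {
                if (global < limit) {               // check a stale total
                    pending[t]++;
                    written++;
                    progress = true;
                    if (pending[t] == flushEvery) { // "heartbeat" flush
                        global += pending[t];
                        pending[t] = 0;
                    }
                }
            }
        }
        return written;
    }
}
```

For example, rowsWritten(2, 6, 4) returns 8: both tasks keep writing past
the limit of 6 because neither has flushed its local count yet.  Only with
immediate flushing (flushEvery == 1) and a single task does the limit hold
exactly.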

Re: setNumReduceTasks(1)

Posted by Jeff Zhang <zj...@gmail.com>.
Mridul,

What do you mean by "Counters are not synchronized in 'real-time'"?
As I know, the JT will aggregate Counters from the TTs, so I think the
aggregated Counter in the JT should be correct.


On Tue, Jan 26, 2010 at 3:08 PM, Mridul Muralidharan
<mr...@yahoo-inc.com>wrote:

> Jeff Zhang wrote:
>
>> *See my comments below*
>>
>>
>> On Mon, Jan 25, 2010 at 3:22 PM, Something Something <
>> mailinglists19@gmail.com> wrote:
>>
>>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would
>>> the class be instantiated on only one machine... always?  I mean if I
>>> have a cluster of say 1 master, 10 workers & 3 zookeepers, is the
>>> Reducer class guaranteed to be instantiated on only 1 machine?
>>>
>>> *--Yes*
>>>
>>> If the answer is yes, then I will use a static variable as a counter
>>> to see how many rows have been added to my HBase table so far.  In my
>>> use case, I want to write only N rows to a table.  Is there a better
>>> way to do this?  Please let me know.  Thanks.
>>>
>> *--Maybe you can use a Counter to track the number of rows you add to
>> HBase; then you do not need to limit the reduce tasks to 1*
>>
>>
> Counters are not synchronized in 'real-time' : so you can't use that to
> limit at addition time, imo.
> It is more for aggregation, not realtime messaging.
>
> - Mridul
>



-- 
Best Regards

Jeff Zhang

Re: setNumReduceTasks(1)

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Jeff Zhang wrote:
> *See my comments below*
> 
> On Mon, Jan 25, 2010 at 3:22 PM, Something Something <
> mailinglists19@gmail.com> wrote:
> 
>> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
>> class be instantiated on only one machine... always?  I mean if I have a
>> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>> guaranteed to be instantiated on only 1 machine?
>>
>> *--Yes*
>>
>> If the answer is yes, then I will use a static variable as a counter to
>> see how many rows have been added to my HBase table so far.  In my use
>> case, I want to write only N rows to a table.  Is there a better way to
>> do this?  Please let me know.  Thanks.
>>
> *--Maybe you can use a Counter to track the number of rows you add to
> HBase; then you do not need to limit the reduce tasks to 1*
> 

Counters are not synchronized in 'real-time' : so you can't use that to
limit at addition time, imo.
It is more for aggregation, not realtime messaging.

- Mridul

Re: setNumReduceTasks(1)

Posted by Jeff Zhang <zj...@gmail.com>.
*See my comments below*

On Mon, Jan 25, 2010 at 3:22 PM, Something Something <
mailinglists19@gmail.com> wrote:

> If I set the # of reduce tasks to 1 using setNumReduceTasks(1), would the
> class be instantiated on only one machine... always?  I mean if I have a
> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
> guaranteed to be instantiated on only 1 machine?
>
> *--Yes*


> If the answer is yes, then I will use a static variable as a counter to
> see how many rows have been added to my HBase table so far.  In my use
> case, I want to write only N rows to a table.  Is there a better way to
> do this?  Please let me know.  Thanks.
>

*--Maybe you can use a Counter to track the number of rows you add to
HBase; then you do not need to limit the reduce tasks to 1*


-- 
Best Regards

Jeff Zhang
