Posted to common-user@hadoop.apache.org by Edward Capriolo <ed...@gmail.com> on 2009/04/18 17:59:34 UTC

max value for a dataset

I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase
support max(). I am writing my own max() over a simple one-column
dataset.

The best solution I came up with was using MapRunner. With MapRunner I
can store the highest value in a private member variable, read through
the entire data set, and emit only one value per mapper once the map
input is exhausted. Then I can specify one reducer and carry out the
same operation.

Does anyone have a better tactic? I thought a counter could do this,
but are counters atomic?
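
Concretely, what I have in mind looks roughly like this against the
0.19 mapred API (class and key names are made up; it assumes
TextInputFormat with one whole number per line):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Iterates over the whole split itself, so only one record is emitted
// per map task: the largest value seen in that split.
public class MaxMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) { }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    LongWritable offset = input.createKey();
    Text line = input.createValue();
    long max = Long.MIN_VALUE;
    boolean seen = false;
    while (input.next(offset, line)) {
      max = Math.max(max, Long.parseLong(line.toString().trim()));
      seen = true;
      reporter.progress(); // keep the task alive on big splits
    }
    // A single reducer then takes the max over these per-split maxima.
    if (seen) {
      output.collect(new Text("max"), new LongWritable(max));
    }
  }
}

The job would then set conf.setMapRunnerClass(MaxMapRunner.class) and
conf.setNumReduceTasks(1).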

Re: max value for a dataset

Posted by jason hadoop <ja...@gmail.com>.
I worked out how to use the aggregation library with streaming to do
this; it is entertainingly simple once you have figured it out.

Full details will be in ch08 of my book - buy a copy so I can afford to
write another :)

/tmp/numbers contains a file full of whitespace-separated whole numbers.
/tmp/LongMax.pl is the attached Perl script.
The output will be a single file, part-00000, in /tmp/numbers_max_output.
Note: this job is run using the local runner (-jt local), so only 1
reduce is allowed.



hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar -jt local -fs file:/// \
  -input /tmp/numbers -output /tmp/numbers_max_output \
  -reducer aggregate -mapper LongMax.pl -file /tmp/LongMax.pl
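
The only contract the mapper has to meet is that it writes lines of
the form aggregator:key\tvalue, and LongValueMax is one of the
aggregators the aggregate reducer understands. For anyone without the
attachment, a rough equivalent of LongMax.pl (sketched here in Java
rather than the original Perl) tracks a running max and prints a
single such record:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// Streaming mapper: reads whitespace-separated whole numbers from
// stdin and emits one record the "aggregate" reducer understands.
public class LongMax {
  public static void main(String[] args) throws Exception {
    BufferedReader in =
        new BufferedReader(new InputStreamReader(System.in));
    long max = Long.MIN_VALUE;
    boolean seen = false;
    String line;
    while ((line = in.readLine()) != null) {
      StringTokenizer tok = new StringTokenizer(line);
      while (tok.hasMoreTokens()) {
        max = Math.max(max, Long.parseLong(tok.nextToken()));
        seen = true;
      }
    }
    // The aggregate reducer parses "aggregator:key\tvalue" lines;
    // LongValueMax keeps the largest value seen for the key.
    if (seen) {
      System.out.println("LongValueMax:max\t" + max);
    }
  }
}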



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: max value for a dataset

Posted by jason hadoop <ja...@gmail.com>.
There is no reason to use a combiner in this case, as there is only a
single output record from the map.

Combiners buy you data reduction when your map output values share
keys and your application can do something with those values that
results in smaller or fewer records being passed to the reduce.
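
To make that concrete with a hypothetical keyed example (per-key max,
0.19 mapred API): since max is associative and commutative, one class
can serve as both the combiner and the reducer, and each map task then
forwards at most one record per key instead of one per input value:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Per-key max; usable as both combiner and reducer because taking the
// max of partial maxima gives the same answer as one pass over all
// of the values.
public class MaxPerKeyReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long max = Long.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    output.collect(key, new LongWritable(max));
  }
}

Wiring it in is just conf.setCombinerClass(MaxPerKeyReducer.class)
alongside conf.setReducerClass(MaxPerKeyReducer.class).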


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: max value for a dataset

Posted by jason hadoop <ja...@gmail.com>.
There will be a short summary of the Hadoop aggregation tools in ch08;
it got missed in the first pass and is being added back in this week.
There are a number of how-tos in the book, particularly in ch08 and ch09.

I hope you enjoy them.

-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: max value for a dataset

Posted by Edward Capriolo <ed...@gmail.com>.
I took a look at the description of the book at
http://www.apress.com/book/view/9781430219422. Hopefully it and other
endeavors like it can fill a need I have and see quite often. I am
quite interested in practical Hadoop algorithms; most of my searching
finds repeated WordCount examples and depictions of the shuffle-sort.

The most practical lessons I took from my Fortran programming were how
to sum(), min(), max(), and average() a data set. If Hadoop had a
cookbook of sorts for algorithm design, I think many people would
benefit.

Re: max value for a dataset

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Jason,

Wouldn't this be avoided if you used a combiner to also perform the
max() operation? A minimal amount of data would be written over the
network.

I can't remember if the map output gets written to disk first and the
combine applied afterward, or if the combine is applied and then the
data is written to disk. I suspect the latter, but it'd be a big
difference.

However, the original poster mentioned he was using HBase/Pig --
certainly, there's some better way to perform max() in HBase/Pig? This
list probably isn't the right place to ask if you are using those
technologies; I'd suspect they do something more clever (certainly,
you're performing a SQL-like operation in MapReduce; not always the
best way to approach this type of problem).

Brian



Re: max value for a dataset

Posted by Edward Capriolo <ed...@gmail.com>.
Yes, I considered Shevek's tactic as well, but as Jason pointed out,
emitting the entire data set just to find the maximum value would be
wasteful. You do not want to sort the dataset; you just want to break
it into parts, find the max value of each part, then bring those
results into one part and perform the same operation again.

The way I look at it, the 'best' Hadoop algorithms are the ones that
emit the fewest key/value pairs. What Jason suggested and the
MapRunner concept I was looking at would both emit about the same
number of key/value pairs.

I am curious to see whether the MapRunner implementation would run
faster due to fewer calls to the map function. After all, MapRunner
only iterates over the data set.


Re: max value for a dataset

Posted by jason hadoop <ja...@gmail.com>.
The Hadoop framework requires that a map phase be run before the
reduce phase. By doing the initial 'reduce' in the map, a much smaller
volume of data has to flow across the network to the reduce tasks. But
yes, this could simply be done by using an IdentityMapper and then
having all of the work done in the reduce.
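
For comparison, a sketch of that reduce-only wiring (names are made
up; 0.19 mapred API, one whole number per line):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ReduceOnlyMax {

  // The single reducer scans every value and emits the max in close().
  public static class MaxReducer extends MapReduceBase
      implements Reducer<LongWritable, Text, Text, LongWritable> {
    private long max = Long.MIN_VALUE;
    private boolean seen = false;
    private OutputCollector<Text, LongWritable> out;

    public void reduce(LongWritable offset, Iterator<Text> lines,
                       OutputCollector<Text, LongWritable> output,
                       Reporter reporter) throws IOException {
      out = output; // close() gets no collector, so save it here
      while (lines.hasNext()) {
        max = Math.max(max,
            Long.parseLong(lines.next().toString().trim()));
        seen = true;
      }
    }

    public void close() throws IOException {
      if (seen) {
        out.collect(new Text("max"), new LongWritable(max));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReduceOnlyMax.class);
    conf.setJobName("reduce-only-max");

    // IdentityMapper forwards every (offset, line) pair untouched, so
    // the entire dataset crosses the network to the lone reduce task.
    conf.setMapperClass(IdentityMapper.class);
    conf.setMapOutputKeyClass(LongWritable.class);
    conf.setMapOutputValueClass(Text.class);

    conf.setReducerClass(MaxReducer.class);
    conf.setNumReduceTasks(1);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}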


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: max value for a dataset

Posted by Shevek <ha...@anarres.org>.
On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote:
> The traditional approach would be a Mapper class that maintained a member
> variable that you kept the max value record, and in the close method of your
> mapper you output a single record containing that value.

Perhaps you can forgive the question from a heathen, but why is this
first mapper not also a reducer? It seems to me that it is performing a
reduce operation, and that maps should (philosophically speaking) not
maintain data from one input to the next, since the order (and location)
of inputs is not well defined. The program to compute a maximum should
then be a tree of reduction operations, with no maps at all.

Of course in this instance what you propose works, but it does seem
puzzling. Perhaps the answer is a simple architectural limitation?

S.



Re: max value for a dataset

Posted by Farhan Husain <ru...@gmail.com>.
Thanks for the explanation. I had forgotten about the close method.


Re: max value for a dataset

Posted by jason hadoop <ja...@gmail.com>.
The traditional approach would be a Mapper class that maintains a
member variable holding the max value record seen so far; in the close
method of your mapper you output a single record containing that value.

The map method of course compares the current record against the max
and stores current in max when current is larger than max.

Then each map output is a single record, and the reduce behaves very
similarly, in that the close method outputs the final max record. A
single reduce would be the simplest.

On your question, a Mapper and a Reducer define three entry points:
configure, called once on task start; map/reduce, called once for each
record; and close, called once after the last call to map/reduce. At
least through 0.19, close is not provided with the output collector or
the reporter, so you need to save them in the map/reduce method.
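
A minimal sketch of that Mapper (names are mine; 0.19 mapred API,
assuming TextInputFormat with one whole number per line):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Tracks the max across all map() calls and emits it once from close().
public class MaxMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private long max = Long.MIN_VALUE;
  private boolean seen = false;
  private OutputCollector<Text, LongWritable> out;

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    out = output; // close() is not handed the collector, so save it
    max = Math.max(max, Long.parseLong(line.toString().trim()));
    seen = true;
  }

  public void close() throws IOException {
    // Called once after the last map(); one record leaves each task.
    if (seen) {
      out.collect(new Text("max"), new LongWritable(max));
    }
  }
}

The reducer does the same thing over the per-task maxima, emitting the
final record from its own close method.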

-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422

Re: max value for a dataset

Posted by Farhan Husain <ru...@gmail.com>.
How do you identify that the map task is ending from within the map
method? Is it possible to know which is the last call to the map method?
