Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/03/08 06:55:57 UTC

Map/Reduce algorithm needed- histogram

And now for some real machine learning. I would like to make a
histogram of values in a dataset. For simplicity let's just find the
median.

I would like to find the median value in a dataset via Map/Reduce. The
basic idea is distributed error estimation and reduction:
    start with a guess for the median.
    map each value to one side of the median.
    when the mapper finishes, also emit metadata to all reducers.
        this metadata is the largest or smallest value in this sample.

    each reducer gets the boundary metadata from all mappers for the
'guessed' median.
    It also tracks the boundary of the items it receives. It then
combines all of the boundary values into a new guess of the median.
    The job driver combines all of the reducers' boundaries and
creates a new composite guess.
    Repeat with some end condition: one can demand complete
accuracy, or tolerate a percentage of misclassified items. (A sketch
of this driver loop follows.)
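
Here is a hedged sketch of that driver loop in Java. runOnePass() is a
hypothetical helper that would configure and submit one Hadoop pass
with the given guess, returning the boundary values collected from the
reducers' output:

    // Sketch of the iterative driver described above.
    public abstract class MedianDriver {

        // Hypothetical: submits one map/reduce pass and returns the
        // boundary values the reducers wrote.
        abstract double[] runOnePass(double guess) throws Exception;

        public double findMedian(double initialGuess, double tolerance,
                                 int maxPasses) throws Exception {
            double guess = initialGuess;
            for (int pass = 0; pass < maxPasses; pass++) {
                double[] boundaries = runOnePass(guess);
                double sum = 0.0;
                for (double b : boundaries) {
                    sum += b;
                }
                double next = sum / boundaries.length;  // composite new guess
                if (Math.abs(next - guess) <= tolerance) {
                    return next;                        // end condition met
                }
                guess = next;
            }
            return guess;
        }
    }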

--------------------------------------------------
More detail

This is an iterative design. Between each pass, there is a 'current
median' value passed as metadata. At the start, pick an arbitrary
value.
Key/value pair: ('upper'/'lower', (purpose='data'/'boundary', item))
In the key/value pair there is a special boolean/enum field called
'purpose'. It marks the value as one of two things: an entry to be
classified ('data'), or metadata giving a boundary value that the
mapper found and emitted ('boundary').

    Start of map phase:
        Start with 'current median'.
    Mapper:
        The key is a boolean: 'lower' (< current median) or
        'upper' (>= current median). With the two purposes, that
        makes 4 kinds of output.
        value: (boolean purpose, number value)
        loop over input numbers:
            map number to the upper or lower bin for the key
            emit (key, (purpose = data, number))
            track the maximum of the lower bin and the minimum of the upper bin
        end
        emit ('upper', (purpose = boundary, lower-bin maximum))
        emit ('lower', (purpose = boundary, upper-bin minimum))
    end Mapper
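
A minimal sketch of this mapper against the org.apache.hadoop.mapreduce
API, assuming newline-delimited numeric input. To stay self-contained
it packs the purpose flag into a tab-separated Text value instead of a
custom Writable, and 'median.guess' is a made-up configuration key for
passing the current guess between passes:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MedianMapper extends Mapper<LongWritable, Text, Text, Text> {
        private double guess;                                // 'current median'
        private double lowerMax = Double.NEGATIVE_INFINITY;  // largest value below the guess
        private double upperMin = Double.POSITIVE_INFINITY;  // smallest value at or above it

        @Override
        protected void setup(Context ctx) {
            guess = Double.parseDouble(ctx.getConfiguration().get("median.guess", "0"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            double x = Double.parseDouble(line.toString().trim());
            if (x < guess) {
                ctx.write(new Text("lower"), new Text("data\t" + x));
                lowerMax = Math.max(lowerMax, x);
            } else {
                ctx.write(new Text("upper"), new Text("data\t" + x));
                upperMin = Math.min(upperMin, x);
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            // Each boundary goes to the *opposite* bin, as described above.
            if (lowerMax > Double.NEGATIVE_INFINITY) {
                ctx.write(new Text("upper"), new Text("boundary\t" + lowerMax));
            }
            if (upperMin < Double.POSITIVE_INFINITY) {
                ctx.write(new Text("lower"), new Text("boundary\t" + upperMin));
            }
        }
    }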

    Reducer:
        Receives all (key = boolean, (purpose = boundary, number)) tuples first.
        These are the boundary values for the other bin, because
the mapper wrote each boundary to the opposite key ('upper' vs. 'lower').
        Calculates the mean of the boundary values as the new current median.
        Then receives and assigns all (purpose = data) tuples.
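
A matching sketch of the reducer, assuming the custom sort described
below already delivers the boundary tuples ahead of the data tuples.
The misclassification counter is my extrapolation of the end condition
mentioned above:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MedianReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text bin, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            double boundarySum = 0.0;
            int boundaryCount = 0;
            double newGuess = Double.NaN;
            boolean lower = bin.toString().startsWith("lower");
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                double x = Double.parseDouble(parts[1]);
                if ("boundary".equals(parts[0])) {
                    boundarySum += x;
                    boundaryCount++;
                    newGuess = boundarySum / boundaryCount;  // mean of boundaries so far
                } else if (lower ? x >= newGuess : x < newGuess) {
                    // A data tuple that lands on the wrong side of the new
                    // guess; the driver can use this as an end condition.
                    ctx.getCounter("median", "misclassified").increment(1);
                }
            }
            // Emit this reducer's new guess for the driver to combine.
            ctx.write(bin, new DoubleWritable(newGuess));
        }
    }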

There is a problem with this: Hadoop does not guarantee the order of
values within a key, so the Reducer has to cache all of the data
tuples until it has seen all of the boundary tuples. Thus it is
memory-bound. The Mapper is not memory-bound.

A hack to fix this is to create a custom sorter that causes each
reducer to receive the boundary tuples first.
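
In Hadoop this is the standard secondary-sort pattern: move the purpose
into a composite key, partition and group on the bin name alone, and
let the default sort order deliver the boundary tuples first. A rough
sketch, writing the composite keys as plain Text such as 'lower#0'
(boundary) and 'lower#1' (data), and keeping the purpose tag in the
value as well so the reducer sketch above works unchanged:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class BoundaryFirst {

        // Route both composite keys for a bin to the same reducer.
        public static class BinPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                String bin = key.toString().split("#", 2)[0];
                return (bin.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }

        // Group on the bin name only, so one reduce() call sees the
        // boundary tuples ('#0' sorts first) followed by the data tuples.
        public static class BinGrouper extends WritableComparator {
            public BinGrouper() {
                super(Text.class, true);
            }

            @Override
            public int compare(WritableComparable a, WritableComparable b) {
                String binA = a.toString().split("#", 2)[0];
                String binB = b.toString().split("#", 2)[0];
                return binA.compareTo(binB);
            }
        }
    }

The job wires these in with job.setPartitionerClass() and
job.setGroupingComparatorClass(); the default Text sort already puts
'#0' before '#1' within a bin.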


-- 
Lance Norskog
goksron@gmail.com

Re: Map/Reduce algorithm needed- histogram

Posted by Ted Dunning <te...@gmail.com>.
You don't need either sort or iterations to get the median.

Check out the OnlineSummarizer for a sequential online O(n) algorithm for
median estimation that is about as accurate as the sort-based algorithm.
You can linearly combine multiple median estimates if you have suitably
randomized data.
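
For example, a minimal sketch assuming mahout-math is on the classpath,
with two summarizers standing in for per-mapper estimates:

    import java.util.Random;
    import org.apache.mahout.math.stats.OnlineSummarizer;

    public class OnlineMedianExample {
        public static void main(String[] args) {
            // One summarizer per split, as each mapper would keep its own.
            OnlineSummarizer[] splits = {new OnlineSummarizer(), new OnlineSummarizer()};
            Random rand = new Random(42);
            for (int i = 0; i < 100000; i++) {
                // Alternate samples between splits; random order matters here.
                splits[i % 2].add(rand.nextGaussian());
            }
            double combined = 0.0;
            for (OnlineSummarizer s : splits) {
                combined += s.getMedian();  // per-split median estimate
            }
            combined /= splits.length;      // linear combination of estimates
            System.out.println("estimated median: " + combined);  // near 0
        }
    }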

On Mon, Mar 7, 2011 at 10:10 PM, Chris Schilling <ch...@cellixis.com> wrote:

> What about running TeraSort map/reduce?  I am not sure about the run-time
> complexity differences, but if you want to find the median of a large set,
> it gets the job done

Re: Map/Reduce algorithm needed- histogram

Posted by Lance Norskog <go...@gmail.com>.
Really, the goal is to support an arbitrary number of buckets. But thank
you; TeraSort does some interesting things with the Hadoop M/R framework.

One use for this is to find and remove the 5% most and least common
terms from a Lucene index, as part of vectorizing or building spelling
dictionaries.

Lance




-- 
Lance Norskog
goksron@gmail.com

Re: Map/Reduce algorithm needed- histogram

Posted by Chris Schilling <ch...@cellixis.com>.
What about running TeraSort map/reduce?  I am not sure about the run-time complexity differences, but if you want to find the median of a large set, it gets the job done.

