Posted to dev@mahout.apache.org by "Lance Norskog (JIRA)" <ji...@apache.org> on 2011/05/29 05:33:47 UTC

[jira] [Issue Comment Edited] (MAHOUT-676) Random samplers in a modular library

    [ https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040732#comment-13040732 ] 

Lance Norskog edited comment on MAHOUT-676 at 5/29/11 3:32 AM:
---------------------------------------------------------------

I got interested again :)

This patch includes full unit tests and a new sampler. The sampler interface has changed: you can add samples, iterate over the current list, and check whether a sample would be dropped. Note that this check advances the state machine inside the sampler.
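
To make the three operations concrete, here is a rough sketch of what such an interface could look like. All names here (Sampler, addSample, isDropped, EveryNthSampler) are hypothetical and are not taken from the attached patch; the "keep every Nth sample" rule is just a stand-in chosen because its state machine is easy to see.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SamplerSketch {

    /** Hypothetical interface: names are illustrative, not the patch's API. */
    interface Sampler<T> extends Iterable<T> {
        /** Offer a sample; the sampler decides whether to keep it. */
        void addSample(T sample);

        /** True if the sample would be dropped. Calling this advances internal state. */
        boolean isDropped(T sample);
    }

    /** Stand-in sampler that keeps every n-th offered sample. */
    static final class EveryNthSampler<T> implements Sampler<T> {
        private final int n;
        private int counter = 0;                      // the "state machine"
        private final List<T> kept = new ArrayList<>();

        EveryNthSampler(int n) {
            if (n < 1) {
                throw new IllegalArgumentException("n must be >= 1");
            }
            this.n = n;
        }

        @Override
        public boolean isDropped(T sample) {
            counter++;                                // the check itself moves the state forward
            return counter % n != 0;
        }

        @Override
        public void addSample(T sample) {
            if (!isDropped(sample)) {
                kept.add(sample);
            }
        }

        @Override
        public Iterator<T> iterator() {
            return kept.iterator();
        }
    }

    /** Offers 1..10 to an every-3rd sampler and returns what was kept. */
    static List<Integer> demo() {
        Sampler<Integer> sampler = new EveryNthSampler<>(3);
        for (int i = 1; i <= 10; i++) {
            sampler.addSample(i);
        }
        List<Integer> out = new ArrayList<>();
        for (Integer x : sampler) {
            out.add(x);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());                   // prints [3, 6, 9]
    }
}
```

Because isDropped() advances the counter, calling it directly between addSample() calls would shift which samples get kept. That is the side effect mentioned above.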

The major point of interest is a brute-force implementation of "Slice Sampling": you supply a function on your samples, and the sampler keeps samples based on the "area" under the function. Example: a user who watches 2 movies is more interesting than a user who watches one, and so on up to 20 movies. After that, who cares? Let's say the user's "influence score" is the square root of the number of movies he has watched.

Slice sampling requires two functions: the function that maps a user to an X value, and a function that maps an X value to a Y value. The first gives the raw influence of the user, and the second compresses that influence. Slice sampling pulls a subset of the original samples whose density matches the area under the second function.
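
As an illustration of the two-function idea, the sketch below keeps each sample with probability proportional to its Y value. This is an assumption about what "brute-force" means here: it is plain accept/reject under the curve, and the patch may do it differently. The names (SliceSamplerSketch, sample, toX, toY, yMax) are hypothetical, and users are represented simply by their movie counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;
import java.util.function.ToDoubleFunction;

final class SliceSamplerSketch {

    /**
     * Keeps each item with probability toY(toX(item)) / yMax.
     * Assumes 0 <= toY(x) <= yMax for every x that toX can produce.
     */
    static <T> List<T> sample(List<T> items,
                              ToDoubleFunction<T> toX,
                              DoubleUnaryOperator toY,
                              double yMax,
                              Random rng) {
        List<T> kept = new ArrayList<>();
        for (T item : items) {
            double x = toX.applyAsDouble(item);       // raw influence
            double y = toY.applyAsDouble(x);          // compressed influence
            double u = rng.nextDouble() * yMax;       // uniform height in [0, yMax)
            if (u < y) {                              // point (x, u) lies under the curve
                kept.add(item);
            }
        }
        return kept;
    }

    /** Returns how many users with 1, 4, and 100 movies survive sampling. */
    static int[] demo() {
        List<Integer> users = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            users.add(1);
            users.add(4);
            users.add(100);
        }
        // X: movie count capped at 20. Y: square root of X. Curve maximum: sqrt(20).
        List<Integer> kept = sample(users,
                                    movies -> Math.min(movies, 20),
                                    Math::sqrt,
                                    Math.sqrt(20),
                                    new Random(42));
        int[] counts = new int[3];
        for (int movies : kept) {
            if (movies == 1) {
                counts[0]++;
            } else if (movies == 4) {
                counts[1]++;
            } else {
                counts[2]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] c = demo();
        System.out.println("1 movie: " + c[0] + ", 4 movies: " + c[1] + ", 100 movies: " + c[2]);
    }
}
```

With these choices a one-movie user survives with probability 1/sqrt(20), about 0.22, a four-movie user with about 0.45, and anyone at or above the 20-movie cap is always kept.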

This is interesting because it lets you shape a set of samples according to a fixed curve. If your categorizer has problems with the class of inputs you are most interested in, you can use slice sampling to trim down the less interesting samples.

      was (Author: lancenorskog):
    I got interested again :)

This includes full unit tests and a new sampler. The sampler interface is changed: you can add samples, iterate the current list, and check whether the sample would be dropped. This kicks forward the state machine inside the sampler.

The major point of interest is a brute-force implementation of "Slice Sampling": you supply a function on your samples, and the sampler keeps samples based on the "area" under the function. Example: it doesn't matter how many movies a user watched above 20 movies. So, a function on the sample returns the number of movies. 
  
> Random samplers in a modular library
> ------------------------------------
>
>                 Key: MAHOUT-676
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-676
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-676.patch, Sampler.patch
>
>
> This is a modular suite of samplers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira