Posted to user@mahout.apache.org by Lance Norskog <go...@gmail.com> on 2011/06/01 09:31:30 UTC

Measuring randomness

I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
random > percentage) vs. Reservoir (R) sampling (maintain a box of
randomly chosen samples).
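
For concreteness, a minimal sketch of the two samplers as described
above (class and method names are mine; the reservoir version is the
classic Algorithm R, which may differ from the Mahout implementation):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Samplers {
  // Bernoulli: keep each element independently with probability p.
  static List<Integer> bernoulliSample(List<Integer> data, double p, Random rnd) {
    List<Integer> kept = new ArrayList<>();
    for (int x : data) {
      if (rnd.nextDouble() < p) {    // equivalently: drop if random > p
        kept.add(x);
      }
    }
    return kept;
  }

  // Reservoir: keep exactly k elements, each with probability k/n.
  static List<Integer> reservoirSample(List<Integer> data, int k, Random rnd) {
    List<Integer> box = new ArrayList<>(k);
    int seen = 0;
    for (int x : data) {
      seen++;
      if (box.size() < k) {
        box.add(x);                  // warm-up: the first k elements fill the box
      } else {
        int j = rnd.nextInt(seen);   // replace a random slot with probability k/seen
        if (j < k) {
          box.set(j, x);
        }
      }
    }
    return box;
  }
}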

Here is a test (simplified for explanatory purposes):
* Create a list of 1000 numbers, 0-999. Permute this list.
* Subsample N values
* Add them to a summarizer and take the median
* Do this 20 times and record the medians
* Calculate the standard deviation of the 20 median values
This last number is my score for 'how good is the randomness of this sampler'.
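
A sketch of that harness, assuming plain Bernoulli sampling and an
exact median rather than any Mahout classes (class names are mine):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class MedianSpreadTest {
  public static void main(String[] args) {
    Random rnd = new Random();
    double p = 0.1;                          // sampling rate: ~100 of 1000 values
    double[] medians = new double[20];

    for (int run = 0; run < medians.length; run++) {
      List<Integer> data = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        data.add(i);
      }
      Collections.shuffle(data, rnd);        // permute the list

      List<Integer> sample = new ArrayList<>();
      for (int x : data) {                   // Bernoulli subsample
        if (rnd.nextDouble() < p) {
          sample.add(x);
        }
      }

      Collections.sort(sample);              // exact median: middle element
      medians[run] = sample.get(sample.size() / 2);
    }

    // standard deviation of the 20 medians
    double mean = 0;
    for (double m : medians) {
      mean += m / medians.length;
    }
    double var = 0;
    for (double m : medians) {
      var += (m - mean) * (m - mean) / (medians.length - 1);
    }
    System.out.println("std dev of 20 medians = " + Math.sqrt(var));
  }
}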

Does this make sense? In this measurement, is a small or large deviation
better? What is another way to measure it?

Notes: Bernoulli pulls X percent of the samples and ignores the rest.
Reservoir pulls all of the samples and saves X of them. However, it
starts by saving the first N samples and only slowly replaces them,
which suppresses the deviation for small samples. This realization came
just now; I'll cut that warm-up phase from the measurement.
In the real test I used the OnlineSummarizer and computed deviations of
the mean, median, 25th percentile, and 75th percentile.
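
Usage is roughly like this, assuming Mahout's
org.apache.mahout.math.stats.OnlineSummarizer and its getQuartile
indexing (0=min .. 4=max); worth double-checking against the javadoc:

import java.util.Random;
import org.apache.mahout.math.stats.OnlineSummarizer;

public class SummarizerExample {
  public static void main(String[] args) {
    OnlineSummarizer summarizer = new OnlineSummarizer();
    Random rnd = new Random();
    for (int i = 0; i < 1000; i++) {
      summarizer.add(rnd.nextGaussian());    // feed unordered values
    }
    System.out.println("mean   = " + summarizer.getMean());
    System.out.println("median = " + summarizer.getMedian());
    System.out.println("q25    = " + summarizer.getQuartile(1));  // 25th percentile
    System.out.println("q75    = " + summarizer.getQuartile(3));  // 75th percentile
  }
}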

I had a more detailed report with numbers, but just realized that
given the above I have to start over.

Barbie says: "designing experiments is hard!"


-- 
Lance Norskog
goksron@gmail.com

Re: Measuring randomness

Posted by Hector Yee <he...@gmail.com>.
You'll probably see more of a difference if the data were ordered, no? For the reason you said below: R samples the entire set but starts by keeping the first X.

Sent from my iPad

On Jun 1, 2011, at 12:31 AM, Lance Norskog <go...@gmail.com> wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) vs. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
> [...]

Re: Measuring randomness

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Jun 1, 2011 at 1:17 AM, Sean Owen <sr...@gmail.com> wrote:

> In both cases, every element is picked with probability N/1000. That is the
> purest sense in which these processes can be wrong or right, to me, and
> they
> are both exactly as good as the underlying pseudo-random number generator.
> The difference is not their quality, but the number of elements that are
> chosen.
>

And how that number is specified.  And whether order is preserved.  And
whether you get samples along the way so that you can overlap computation
with I/O.

> I am not sure what distribution the median of the N values should follow
> in theory. I doubt it's Gaussian.


It is asymptotically normal
(http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177728598)
for pretty broad assumptions.  For a normal underlying distribution, it
converges very quickly.  For a whacky underlying distribution like the
Cauchy, less quickly.
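
Concretely, the textbook form of that result (stated from memory, not
quoted from the paper above): for n samples from a distribution with
density f and true median m, with f(m) > 0, the sample median
\hat{m}_n satisfies

    \sqrt{n}\,(\hat{m}_n - m) \to \mathcal{N}\!\left(0, \frac{1}{4 f(m)^2}\right)

For the standard Cauchy, f(m) = 1/\pi, so the asymptotic variance is
\pi^2/(4n); the formula still applies, the approach to normality is
just slower.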


> But that would be your question then --
> how likely is it that the 20 observed values are generated by this
> distribution?
>

But this doesn't really answer an important question because the underlying
data was sampled from the same distribution and a variety of defective
samplers would give similar results.


> This test would not prove all aspects of the sampler work. For example, a
> sampler that never picked 0 or 999 would have the same result (well, if
> N>2)
> as this one, when clearly it has a problem.
>

And I think that this sort of thing is the key question.

Make sure that you use sorted data as one test input.  Do a full median of
the samples because OnlineSummarizer doesn't like ordered data.
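
A full (exact) median is easy enough to inline; a sketch:

import java.util.Arrays;

public class Medians {
  // Exact median: sort a copy (leaving the sample untouched) and
  // average the middle pair for even-length input.
  static double exactMedian(double[] sample) {
    double[] copy = sample.clone();
    Arrays.sort(copy);
    int n = copy.length;
    return n % 2 == 1
        ? copy[n / 2]
        : 0.5 * (copy[n / 2 - 1] + copy[n / 2]);
  }
}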


> But I think this is probably a more complicated question than you need ask
> in practice: what is the phenomenon you are worried will happen or not
> happen here?
>

Since the samplers are equal in quality by design, the only problem I can
imagine is code error.

Re: Measuring randomness

Posted by Sean Owen <sr...@gmail.com>.
In both cases, every element is picked with probability N/1000. That is the
purest sense in which these processes can be wrong or right, to me, and they
are both exactly as good as the underlying pseudo-random number generator.
The difference is not their quality, but the number of elements that are
chosen.

I am not sure what distribution the median of the N values should follow
in theory. I doubt it's Gaussian. But that would be your question then --
how likely is it that the 20 observed values are generated by this
distribution?

This test would not prove all aspects of the sampler work. For example, a
sampler that never picked 0 or 999 would have the same result (well, if N>2)
as this one, when clearly it has a problem.
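
A cheap check for that class of bug: count per-element inclusion
frequencies over many trials and flag outliers against the
Binomial(trials, p) expectation. A sketch (the inline keep/drop test
stands in for whatever sampler is under test, and the 5-sigma
threshold is an arbitrary choice):

import java.util.Random;

public class InclusionCheck {
  public static void main(String[] args) {
    Random rnd = new Random();
    int n = 1000;
    double p = 0.1;          // each element should be kept with probability p
    int trials = 10000;
    int[] counts = new int[n];

    for (int t = 0; t < trials; t++) {
      for (int i = 0; i < n; i++) {
        if (rnd.nextDouble() < p) {    // the sampler under test goes here
          counts[i]++;
        }
      }
    }

    // Each count is Binomial(trials, p); flag anything far from the
    // expected value. A sampler that never picks 0 or 999 shows up
    // immediately as a zero count.
    double expected = trials * p;
    double sd = Math.sqrt(trials * p * (1 - p));
    for (int i = 0; i < n; i++) {
      if (Math.abs(counts[i] - expected) > 5 * sd) {
        System.out.println("element " + i + ": count " + counts[i]
            + ", expected ~" + expected);
      }
    }
  }
}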

But I think this is probably a more complicated question than you need ask
in practice: what is the phenomenon you are worried will happen or not
happen here?



On Wed, Jun 1, 2011 at 8:31 AM, Lance Norskog <go...@gmail.com> wrote:

> I'm trying to do a bake-off between Bernoulli (B) sampling (drop if
> random > percentage) vs. Reservoir (R) sampling (maintain a box of
> randomly chosen samples).
> [...]