Posted to dev@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2012/09/04 08:54:10 UTC

Random generator from random utils

Hello,

I have a question regarding line 344 in SSVDSolver:


Random rnd = RandomUtils.getRandom();

This random generator is used to obtain the initial seed for the random
matrix of SSVD. It used to be just "new Random()" but at some point
was apparently replaced with that util call.

At least in unit tests, this seems to give the test an essentially
deterministic random generator. My guess is that the intent is to keep
unit tests from failing non-deterministically from time to time.

But am I right in assuming that outside of Mahout's unit tests this
will always be non-deterministic and I will get different seeds?
Because if I don't, I think that's a problem.

Thanks.
-Dmitriy

Re: Random generator from random utils

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

Nothing's lost, really -- you can fix the seed on any method or the
entire suite with:

@Seed("deadbeef")

or whatever seed you desire. I understand Ted's argument very well
because I myself was fixing a test the other day that was sampling
from a random distribution of integers and the quiet assumption was
that "picking N values that will be < M is not really likely because
we pick from a range R where R >> M". Turned out not to be true -- we
did hit a distribution in which that assumption was wrong.

In any case, I'm not really advocating for Mahout to use the randomized
testing package, I'm just telling you it's out there and works fairly
well (or so I think :) If you were to reinvent a similar thing then it
may be worth a try, that's all.

Dawid

On Tue, Sep 4, 2012 at 6:46 PM, Ted Dunning <te...@gmail.com> wrote:
> Yeah... if that is true, then the tests aren't well designed and are too
> picky.
>
> On Tue, Sep 4, 2012 at 9:24 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Then it makes fine sense to have the test framework pick a new seed
>> and record it in the logs on every run. Before the above is fixed, it
>> won't do much since those tests will certainly fail on (most?) new
>> seeds.
>>

Re: Random generator from random utils

Posted by Ted Dunning <te...@gmail.com>.

I went ahead and looked at this.  I found two instances (so far).  One is
in the MeanShift tests and another in the OnlineSummarizer tests.  The
second of these, I can fix.  The first may be tricky since tests of
clustering are hard to write well.

On Tue, Sep 4, 2012 at 9:46 AM, Ted Dunning <te...@gmail.com> wrote:

> Yeah... if that is true, then the tests aren't well designed and are too
> picky.
>
>
> On Tue, Sep 4, 2012 at 9:24 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> Then it makes fine sense to have the test framework pick a new seed
>> and record it in the logs on every run. Before the above is fixed, it
>> won't do much since those tests will certainly fail on (most?) new
>> seeds.
>>
>
>

Re: Random generator from random utils

Posted by Ted Dunning <te...@gmail.com>.

Yeah... if that is true, then the tests aren't well designed and are too
picky.

On Tue, Sep 4, 2012 at 9:24 AM, Sean Owen <sr...@gmail.com> wrote:

> Then it makes fine sense to have the test framework pick a new seed
> and record it in the logs on every run. Before the above is fixed, it
> won't do much since those tests will certainly fail on (most?) new
> seeds.
>

Re: Random generator from random utils

Posted by Sean Owen <sr...@gmail.com>.

True that. In fact I know there are several tests that I was not able
to change to use the Mersenne RNG since their outcome depends too much
on the exact output of an RNG. So you still see some "new
Random(1234); // TODO" out there.

This is probably a bigger priority to fix. I would have fixed it, but
don't have the expertise. Relevant authors really should look at these
TODOs and try to fix the tests.

Then it makes fine sense to have the test framework pick a new seed
and record it in the logs on every run. Before the above is fixed, it
won't do much since those tests will certainly fail on (most?) new
seeds.

Sean

On Tue, Sep 4, 2012 at 5:19 PM, Ted Dunning <te...@gmail.com> wrote:
> Tidy, yes.  But better, no.
>
> The Lucene project has made an art out of randomizing configurations for
> tests.  Thus, the many thousands of people out there doing tests will all
> be testing different combinations of things and when a failure happens,
> that seed can be codified into the standard tests.
>
> This is a bit different with some of the randomized tests in Mahout.  For
> these, there is generally a (weak) statistical guarantee about the result.
>  For instance, it might be that the test should succeed 99.9% of the time.
>  To avoid spurious worries, after qualifying the test to fail no more than
> expected, the seed is frozen so that things sit still.  Most of the errors
> that we are after will trigger a hard failure so we don't lose much power
> this way and still have stability.
>
> A good example of this is a random number generator that is supposed to
> sample from a particular distribution.  If you draw 10,000 deviates from
> this generator, you can write a very simple test based on the cumulative
> distribution function.  Simple that is except for the fact that to put
> sharp bounds on the test will cause a non-negligible probability of failure
> for a working version of the software.  On the other hand, putting loose
> bounds will increase the likelihood that the test will succeed if somebody
> breaks the code.  Increasing the number of samples makes the useful bounds
> much tighter and allows lower probability of false success for bad code,
> but it increases the test time.
>
> There is little way around this Heisen-situation.  So we freeze the tests.
>
> There are other types of tests where randomization doesn't change the
> guarantees that the code makes whatsoever.  This often occurs in tinker-toy
> software where you can plug together all kinds of components
> interchangeably.  That is real different from the random number
> distribution problem.
>
> On Tue, Sep 4, 2012 at 12:26 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> I think this approach is even tidier than just recording the RNG seed
>> for later reuse.
>>

Re: Random generator from random utils

Posted by Ted Dunning <te...@gmail.com>.

Tidy, yes.  But better, no.

The Lucene project has made an art out of randomizing configurations for
tests.  Thus, the many thousands of people out there doing tests will all
be testing different combinations of things and when a failure happens,
that seed can be codified into the standard tests.

This is a bit different with some of the randomized tests in Mahout.  For
these, there is generally a (weak) statistical guarantee about the result.
For instance, it might be that the test should succeed 99.9% of the time.
To avoid spurious worries, after qualifying the test to fail no more than
expected, the seed is frozen so that things sit still.  Most of the errors
that we are after will trigger a hard failure so we don't lose much power
this way and still have stability.

A good example of this is a random number generator that is supposed to
sample from a particular distribution.  If you draw 10,000 deviates from
this generator, you can write a very simple test based on the cumulative
distribution function.  Simple, that is, except that putting sharp
bounds on the test will cause a non-negligible probability of failure
for a working version of the software.  On the other hand, putting loose
bounds will increase the likelihood that the test will succeed if somebody
breaks the code.  Increasing the number of samples makes the useful bounds
much tighter and allows lower probability of false success for bad code,
but it increases the test time.

There is little way around this Heisen-situation.  So we freeze the tests.
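
The trade-off described above can be sketched in a few lines of Java. This
is only an illustration under stated assumptions, not code from Mahout: the
class and method names are hypothetical, java.util.Random stands in for the
sampler under test, the target distribution is assumed to be Uniform(0,1),
and the "broken" sampler simply squares each deviate. The test statistic is
the Kolmogorov-Smirnov distance between the empirical and the expected CDF.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of a CDF-based sampler test with a frozen seed.
final class CdfTestSketch {
  private CdfTestSketch() {}

  /** Kolmogorov-Smirnov distance of the samples from the Uniform(0,1) CDF, F(x) = x. */
  static double ksStatisticUniform(double[] samples) {
    double[] sorted = samples.clone();
    Arrays.sort(sorted);
    int n = sorted.length;
    double d = 0.0;
    for (int i = 0; i < n; i++) {
      double cdf = sorted[i]; // F(x) = x for Uniform(0,1)
      // The empirical CDF jumps from i/n to (i+1)/n at sorted[i]; check both sides.
      d = Math.max(d, Math.max((i + 1.0) / n - cdf, cdf - (double) i / n));
    }
    return d;
  }

  /** Draws n deviates; the "broken" variant squares them, skewing them toward 0. */
  static double[] draw(Random rnd, int n, boolean broken) {
    double[] out = new double[n];
    for (int i = 0; i < n; i++) {
      double u = rnd.nextDouble();
      out[i] = broken ? u * u : u;
    }
    return out;
  }

  public static void main(String[] args) {
    int n = 10_000;
    long frozenSeed = 1234L; // frozen so the result sits still between runs
    double good = ksStatisticUniform(draw(new Random(frozenSeed), n, false));
    double bad = ksStatisticUniform(draw(new Random(frozenSeed), n, true));
    // Sharp bound: asymptotic 5% critical value, so a correct sampler would
    // fail roughly 1 run in 20 if the seed were not frozen.
    double sharp = 1.36 / Math.sqrt(n);
    // Loose bound: a correct sampler almost never fails, but subtle bugs pass too.
    double loose = 5.0 / Math.sqrt(n);
    System.out.printf("D(good)=%.4f D(bad)=%.4f sharp=%.4f loose=%.4f%n",
        good, bad, sharp, loose);
  }
}
```

With an unfrozen seed, the sharp bound would reject a correct sampler on a
noticeable fraction of runs, while the loose bound only catches gross
breakage such as the squared deviates. Raising n tightens both bounds at the
cost of test time.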

There are other types of tests where randomization doesn't change the
guarantees that the code makes whatsoever.  This often occurs in tinker-toy
software where you can plug together all kinds of components
interchangeably.  That is real different from the random number
distribution problem.

On Tue, Sep 4, 2012 at 12:26 AM, Sean Owen <sr...@gmail.com> wrote:

> I think this approach is even tidier than just recording the RNG seed
> for later reuse.
>

Re: Random generator from random utils

Posted by Dmitriy Lyubimov <dl...@gmail.com>.

On Tue, Sep 4, 2012 at 12:26 AM, Sean Owen <sr...@gmail.com> wrote:
> No, this always generates a newly-seeded MersenneTwisterRNG. However,
> *if* you call RandomUtils.useTestSeed(), it will cause new instances
> to have a fixed seed, and existing instances to be reset to this seed.
> This is only called in the unit tests,

OK, I think this addresses my concern. Thanks.

> but lets RNGs be reset across
> the JVM to a known state. (You can supply your own test seed to
> useTestSeed() too.) This is desirable, as is using a better RNG than
> java.util.Random.
>
> I think this approach is even tidier than just recording the RNG seed
> for later reuse.
>

Re: Random generator from random utils

Posted by Sean Owen <sr...@gmail.com>.

No, this always generates a newly-seeded MersenneTwisterRNG. However,
*if* you call RandomUtils.useTestSeed(), it will cause new instances
to have a fixed seed, and existing instances to be reset to this seed.
This is only called in the unit tests, but lets RNGs be reset across
the JVM to a known state. (You can supply your own test seed to
useTestSeed() too.) This is desirable, as is using a better RNG than
java.util.Random.

I think this approach is even tidier than just recording the RNG seed
for later reuse.
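
A minimal sketch of the behaviour described above, for readers who want to
see the idea in code. This is not Mahout's RandomUtils source: the class
name, the default seed, and the weak-reference bookkeeping are assumptions
made for illustration, and java.util.Random is used in place of
MersenneTwisterRNG so the example has no dependencies.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; NOT the actual org.apache.mahout.common.RandomUtils.
final class RandomRegistrySketch {
  private static final long DEFAULT_TEST_SEED = 42L; // arbitrary value for this sketch
  private static final List<WeakReference<Random>> INSTANCES = new ArrayList<>();
  private static Long testSeed = null; // null means "not in test mode"

  private RandomRegistrySketch() {}

  /** Returns a newly seeded generator, or a fixed-seed one once test mode is on. */
  static synchronized Random getRandom() {
    Random rnd = (testSeed == null) ? new Random() : new Random(testSeed);
    INSTANCES.add(new WeakReference<>(rnd));
    return rnd;
  }

  /** Turns on test mode and resets every live generator to the given seed. */
  static synchronized void useTestSeed(long seed) {
    testSeed = seed;
    Iterator<WeakReference<Random>> it = INSTANCES.iterator();
    while (it.hasNext()) {
      Random rnd = it.next().get();
      if (rnd == null) {
        it.remove(); // the generator has been garbage collected
      } else {
        rnd.setSeed(seed);
      }
    }
  }

  static void useTestSeed() {
    useTestSeed(DEFAULT_TEST_SEED);
  }

  public static void main(String[] args) {
    Random before = getRandom(); // seeded unpredictably, as in production
    useTestSeed();               // only a test harness would call this
    Random after = getRandom();  // created with the fixed seed
    // Both generators are now in the same known state, so they agree.
    System.out.println(before.nextLong() == after.nextLong());
  }
}
```

Unless something calls useTestSeed(), every getRandom() call in this sketch
returns an independently seeded generator, which is the production behaviour
the original question asked about.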


Re: Random generator from random utils

Posted by Dawid Weiss <da...@cs.put.poznan.pl>.

This looks like a perfect use case for the randomized testing runner deployed
in Lucene that Isabel mentioned a while ago. That runner picks a random seed
but dumps it in stack traces etc. so that test runs can be reproduced
if failing (at least in theory if there are no other data races,
etc.).

http://labs.carrotsearch.com/randomizedtesting.html
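
The "pick a random seed but report it" idea can be sketched without any
library, as below. This is not the RandomizedRunner API: the class, the
system property name example.test.seed, and the message format are all
hypothetical and exist only for this illustration.

```java
import java.util.Random;

// Hypothetical sketch of "random seed per run, logged for reproduction".
final class LoggedSeedSketch {
  // Property name invented for this sketch; not a real runner option.
  static final String SEED_PROPERTY = "example.test.seed";

  private LoggedSeedSketch() {}

  /** Uses the hex seed from the system property if set, else picks a fresh one. */
  static long resolveSeed() {
    String fixed = System.getProperty(SEED_PROPERTY);
    return (fixed != null) ? Long.parseUnsignedLong(fixed, 16) : new Random().nextLong();
  }

  /** Builds the log line that lets a failing run be repeated. */
  static String describe(long seed) {
    String hex = Long.toHexString(seed);
    return "test seed = " + hex + " (rerun with -D" + SEED_PROPERTY + "=" + hex + ")";
  }

  public static void main(String[] args) {
    long seed = resolveSeed();
    System.err.println(describe(seed)); // record the seed on every run
    Random rnd = new Random(seed);      // all test randomness derives from it
    System.out.println(rnd.nextInt(100));
  }
}
```

Running with -Dexample.test.seed=deadbeef would then repeat the same
sequence, which is the same idea as pinning a seed on a test, only done by
hand.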

The default RandomizedRunner comes with somewhat stricter checks for
thread leaks etc., but these can be turned off, leaving just what the
regular JUnit runner does. I can't offer my time for doing the heavy lifting
but I can help and offer guidance if needed.

Just my two cents,
Dawid


Re: Random generator from random utils

Posted by Lance Norskog <go...@gmail.com>.

Yup- it's in the unit test helper class.

core/src --- org.apache.mahout.common.MahoutTestCase:

public abstract class MahoutTestCase extends org.apache.mahout.math.MahoutTestCase {
...

  @Override
  @Before
  public void setUp() throws Exception {
    super.setUp();
    RandomUtils.useTestSeed();  // <-- deterministic random seed
    testTempDirPath = null;
    fs = null;
  }
  ...
}


