Posted to user@mahout.apache.org by Koobas <ko...@gmail.com> on 2013/03/17 04:03:47 UTC

reproducibility

Can anybody shed any light on the issue of reproducibility in Mahout,
with and without Hadoop, specifically in the context of kNN and ALS
recommenders?

Re: reproducibility

Posted by Sebastian Schelter <ss...@apache.org>.

> KNN does not have a stochastic element. I think you would get the same
> results on one platform, unless I'm missing something.

The Hadoop-based recommenders also have a stochastic element, as they
randomly down-sample the interaction histories of power users.
However, this should have only a small impact on the result, and it can
be made deterministic by fixing the seed of the RNG.
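
To illustrate the point about fixing the seed, here is a minimal sketch of
seeded down-sampling. It is not Mahout's actual code: the class
DownSampleDemo, the method downSample and the parameters maxItems and seed
are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; this is not how Mahout implements it.
public class DownSampleDemo {

    // Keeps at most maxItems entries of a user's interaction history.
    // The same seed and the same input always select the same entries.
    public static List<Long> downSample(List<Long> itemIds, int maxItems, long seed) {
        if (itemIds.size() <= maxItems) {
            return new ArrayList<>(itemIds);
        }
        List<Long> copy = new ArrayList<>(itemIds);
        // Shuffle with an explicitly seeded RNG, then keep the first maxItems.
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, maxItems));
    }

    public static void main(String[] args) {
        List<Long> history = new ArrayList<>();
        for (long id = 0; id < 1000; id++) {
            history.add(id);
        }
        List<Long> first = downSample(history, 10, 42L);
        List<Long> second = downSample(history, 10, 42L);
        System.out.println(first.equals(second));
    }
}
```

With an unseeded RNG the two samples would usually differ, which is where
the small run-to-run variation comes from.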

> On Sun, Mar 17, 2013 at 1:43 PM, Koobas <ko...@gmail.com> wrote:
> 
>> I am asking the basic reproducibility question.
>> If I run twice on the same dataset, with the same hardware setup, will I
>> always get the same results?
>> Or is there any chance that on two different runs, the same user will get
>> slightly different suggestions?
>> I am mostly revolving in the space of numerical libraries, where
>> reproducibility is, sort of, a big deal.
>> Maybe it's not much of a concern in machine learning.
>> I am just curious.
>>
>>
>> On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <sr...@gmail.com> wrote:
>>
>>> What's your question? ALS has a random starting point which changes the
>>> results a bit. Not sure about KNN though.
>>>
>>>
>>
>>> On Sun, Mar 17, 2013 at 3:03 AM, Koobas <ko...@gmail.com> wrote:
>>>
>>>> Can anybody shed any light on the issue of reproducibility in Mahout,
>>>> with and without Hadoop, specifically in the context of kNN and ALS
>>>> recommenders?
>>>>
>>>
>>
>

Re: reproducibility

Posted by Koobas <ko...@gmail.com>.

Understood.
Thanks a lot.


On Sun, Mar 17, 2013 at 9:57 AM, Sean Owen <sr...@gmail.com> wrote:

> If an algorithm has a stochastic/random element, no, it won't necessarily
> produce the same result, by design. If you can fix the seed of the random
> number generator, you should get the same result. Except that if the
> process is multi-threaded or distributed, even that doesn't guarantee it --
> the RNG could be accessed in a different order. Even if you can control
> your code it can be hard to control the RNGs in third-party libraries. Even
> in a deterministic single-threaded program Java's floating point results
> are not guaranteed to be the same across platforms (unless you use
> strictfp).
>
> ALS definitely has a random starting point, so reproducibility is not
> guaranteed even from the top. If you fix the random seed in the context of
> this project's unit tests, you *should* get the same result since I think
> it manages to use no third-party RNGs and runs a test from a fixed starting
> point in 1 thread.
>
> KNN does not have a stochastic element. I think you would get the same
> results on one platform, unless I'm missing something.
>
> I don't think exact reproducibility is an issue. Certainly at scale where
> the entire computation is distributed over such a complex cluster
> environment. Most ML is about guessing at what's not known anyway. As long
> as very small differences make only very small differences in the outcome,
> differing FP behavior will make no or vanishingly small difference.
>
> The only place where I think FP reproducibility matters -- of the sort that
> numerical libraries care about -- is in under/overflow issues. But that is
> solved by moving into a log space or something. You would never want to
> depend on the nth significant digit of a float mattering.
>
>
>
>
> On Sun, Mar 17, 2013 at 1:43 PM, Koobas <ko...@gmail.com> wrote:
>
> > I am asking the basic reproducibility question.
> > If I run twice on the same dataset, with the same hardware setup, will I
> > always get the same results?
> > Or is there any chance that on two different runs, the same user will get
> > slightly different suggestions?
> > I am mostly revolving in the space of numerical libraries, where
> > reproducibility is, sort of, a big deal.
> > Maybe it's not much of a concern in machine learning.
> > I am just curious.
> >
> >
> > On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> > > What's your question? ALS has a random starting point which changes the
> > > results a bit. Not sure about KNN though.
> > >
> > >
> >
> > > On Sun, Mar 17, 2013 at 3:03 AM, Koobas <ko...@gmail.com> wrote:
> > >
> > > > Can anybody shed any light on the issue of reproducibility in Mahout,
> > > > with and without Hadoop, specifically in the context of kNN and ALS
> > > > recommenders?
> > > >
> > >
> >
>

Re: reproducibility

Posted by Sean Owen <sr...@gmail.com>.

If an algorithm has a stochastic/random element, no, it won't necessarily
produce the same result, by design. If you can fix the seed of the random
number generator, you should get the same result. Except that if the
process is multi-threaded or distributed, even that doesn't guarantee it --
the RNG could be accessed in a different order. Even if you can control
your code it can be hard to control the RNGs in third-party libraries. Even
in a deterministic single-threaded program Java's floating point results
are not guaranteed to be the same across platforms (unless you use
strictfp).
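
A small sketch of the first two points, using only java.util.Random. The
class and method names (SeededRngDemo, draw, drawInOrder) are made up for
illustration. The second method stands in for two threads sharing one RNG:
it is single-threaded, but it shows that the value each consumer sees
depends on who gets to the RNG first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of seeded RNG behaviour.
public class SeededRngDemo {

    // Draws n doubles from a Random created with the given seed.
    public static List<Double> draw(long seed, int n) {
        Random rng = new Random(seed);
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(rng.nextDouble());
        }
        return out;
    }

    // Two consumers, A and B, share one seeded RNG.
    // Returns {valueSeenByA, valueSeenByB}; the access order is the only
    // thing that changes between the two cases.
    public static double[] drawInOrder(long seed, boolean aFirst) {
        Random shared = new Random(seed);
        double a;
        double b;
        if (aFirst) {
            a = shared.nextDouble();
            b = shared.nextDouble();
        } else {
            b = shared.nextDouble();
            a = shared.nextDouble();
        }
        return new double[] {a, b};
    }

    public static void main(String[] args) {
        // Same seed, same sequence.
        System.out.println(draw(42L, 3).equals(draw(42L, 3)));

        // Same seed, different access order: A and B swap values.
        double[] aFirst = drawInOrder(42L, true);
        double[] bFirst = drawInOrder(42L, false);
        System.out.println(aFirst[0] == bFirst[1]);
        System.out.println(aFirst[0] == bFirst[0]);
    }
}
```

In a real multi-threaded or distributed job the access order is decided by
the scheduler, so a fixed seed alone does not pin down what each worker
sees. Giving each worker its own RNG, seeded from a base seed plus the
worker's index, is one common way around that.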

ALS definitely has a random starting point, so reproducibility is not
guaranteed even from the top. If you fix the random seed in the context of
this project's unit tests, you *should* get the same result since I think
it manages to use no third-party RNGs and runs a test from a fixed starting
point in 1 thread.

KNN does not have a stochastic element. I think you would get the same
results on one platform, unless I'm missing something.

I don't think exact reproducibility is an issue, certainly not at scale,
where the entire computation is distributed over a complex cluster
environment. Most ML is about guessing at what's not known anyway. As long
as very small differences make only very small differences in the outcome,
differing FP behavior will make no or vanishingly small difference.

The only place where I think FP reproducibility matters -- of the sort that
numerical libraries care about -- is in under/overflow issues. But that is
solved by moving into a log space or something. You would never want to
depend on the nth significant digit of a float mattering.
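
As a sketch of what "moving into a log space" can look like, here is the
usual log-sum-exp trick for adding up very small probabilities without
underflow. The class and method names (LogSpaceDemo, logSumExp) are just
for illustration.

```java
// Hypothetical illustration of working in log space.
public class LogSpaceDemo {

    // Computes log(exp(v[0]) + exp(v[1]) + ...) without leaving log space.
    // Subtracting the maximum first keeps every exponent at or below zero,
    // so the largest term is exactly exp(0) = 1 and cannot underflow.
    public static double logSumExp(double[] logValues) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logValues) {
            max = Math.max(max, v);
        }
        if (Double.isInfinite(max)) {
            // Empty input or all terms are log(0); also covers +infinity.
            return max;
        }
        double sum = 0.0;
        for (double v : logValues) {
            sum += Math.exp(v - max);
        }
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        double[] logProbs = {-1000.0, -1000.0};

        // Naive approach: exp(-1000) underflows to 0, so the log is -Infinity.
        double naive = Math.log(Math.exp(logProbs[0]) + Math.exp(logProbs[1]));
        System.out.println(naive);

        // Log-space approach: stays finite, equal to -1000 + log(2).
        System.out.println(logSumExp(logProbs));
    }
}
```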

On Sun, Mar 17, 2013 at 1:43 PM, Koobas <ko...@gmail.com> wrote:

> I am asking the basic reproducibility question.
> If I run twice on the same dataset, with the same hardware setup, will I
> always get the same results?
> Or is there any chance that on two different runs, the same user will get
> slightly different suggestions?
> I am mostly revolving in the space of numerical libraries, where
> reproducibility is, sort of, a big deal.
> Maybe it's not much of a concern in machine learning.
> I am just curious.
>
>
> On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <sr...@gmail.com> wrote:
>
> > What's your question? ALS has a random starting point which changes the
> > results a bit. Not sure about KNN though.
> >
> >
>
> > On Sun, Mar 17, 2013 at 3:03 AM, Koobas <ko...@gmail.com> wrote:
> >
> > > Can anybody shed any light on the issue of reproducibility in Mahout,
> > > with and without Hadoop, specifically in the context of kNN and ALS
> > > recommenders?
> > >
> >
>

Re: reproducibility

Posted by Koobas <ko...@gmail.com>.

I am asking the basic reproducibility question.
If I run twice on the same dataset, with the same hardware setup, will I
always get the same results?
Or is there any chance that on two different runs, the same user will get
slightly different suggestions?
I am mostly revolving in the space of numerical libraries, where
reproducibility is, sort of, a big deal.
Maybe it's not much of a concern in machine learning.
I am just curious.

On Sun, Mar 17, 2013 at 8:46 AM, Sean Owen <sr...@gmail.com> wrote:

> What's your question? ALS has a random starting point which changes the
> results a bit. Not sure about KNN though.
>
>

> On Sun, Mar 17, 2013 at 3:03 AM, Koobas <ko...@gmail.com> wrote:
>
> > Can anybody shed any light on the issue of reproducibility in Mahout,
> > with and without Hadoop, specifically in the context of kNN and ALS
> > recommenders?
> >
>

Re: reproducibility

Posted by Sean Owen <sr...@gmail.com>.

What's your question? ALS has a random starting point which changes the
results a bit. Not sure about KNN though.

On Sun, Mar 17, 2013 at 3:03 AM, Koobas <ko...@gmail.com> wrote:

> Can anybody shed any light on the issue of reproducibility in Mahout,
> with and without Hadoop, specifically in the context of kNN and ALS
> recommenders?
>