Posted to user@mahout.apache.org by gabeweb <ga...@htc.com> on 2010/10/21 11:03:42 UTC

FullRunningAverage possibly inefficient and (very slightly) inaccurate?

I'm a bit concerned by the implementation of FullRunningAverage.  It's
probably not a big deal, but because it performs division each time a new
datum is added, there will be a floating point error that will increase over
time, and if you average together millions of data points, this error may
become significant.  I know that Java doubles have really huge precision,
but still.

In addition, the implementation seems rather inefficient in that it performs
6 math operations every time a datum is added.  Since adding to a running
average is likely to happen much more often than accessing the average,
wouldn't it be a lot more efficient to maintain the *total* and count
internally, rather than the current average and the count, and then when
someone requests the average, calculate the actual average on the fly by
dividing the total by the count?  Then it would just be two addition
operations to add a datum.  This would simultaneously eliminate the
accumulation of floating point errors.

If people agree, I'd be happy to fix this and submit a patch.  It's a simple
thing, but it's sort of bothering me :)  Plus it will allow me to get
familiar with the process.
-- 
View this message in context: http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
Sent from the Mahout User List mailing list archive at Nabble.com.

Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Posted by Lance Norskog <go...@gmail.com>.
Is this Welford's method?

http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/

http://www.johndcook.com/standard_deviation.html

In the second one, you can ditch the standard-deviation part.  As you
can see, this has more computation per update than FullRunningAverage,
so it will be hard to construct a benchmark that elicits a performance
difference.
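For reference, the running-mean part of Welford's update looks roughly like this (a sketch based on the links above, not Mahout code; the class name is mine):

```java
// Sketch of Welford's incremental mean update, with the
// standard-deviation part omitted:
//   mean_k = mean_{k-1} + (x_k - mean_{k-1}) / k
class WelfordMean {
  private double mean = 0.0;
  private int count = 0;

  void addDatum(double x) {
    count++;
    // Note there is no special case for the first datum: with
    // mean == 0.0 and count == 1 this update yields mean = x.
    mean += (x - mean) / count;
  }

  double getMean() {
    return mean;
  }
}
```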


On Fri, Oct 22, 2010 at 10:55 AM, Ted Dunning <te...@gmail.com> wrote:
> If you care about accuracy, then switching to Welford's method (which is
> different from keeping a total) would help as much or more.  It would also
> avoid the conditional for the case of no prior data.
>
> The speedup is, in any case, not likely to exist because this code is
> probably memory bound anyway.  This is typical for modern architectures
> where floating point math is so much faster than memory access that you can
> only saturate the ALU with carefully designed codes.  Virtually no 1-d
> vector oriented code can do this because there are as many memory ops as
> flops  ... it takes a matrix operation to do it because those have n^2
> memory ops and n^3 flops.



-- 
Lance Norskog
goksron@gmail.com

Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Posted by Ted Dunning <te...@gmail.com>.
If you care about accuracy, then switching to Welford's method (which is
different from keeping a total) would help as much or more.  It would also
avoid the conditional for the case of no prior data.

The speedup is, in any case, not likely to exist because this code is
probably memory bound anyway.  This is typical for modern architectures
where floating point math is so much faster than memory access that you can
only saturate the ALU with carefully designed codes.  Virtually no 1-d
vector oriented code can do this because there are as many memory ops as
flops  ... it takes a matrix operation to do it because those have n^2
memory ops and n^3 flops.
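To make the point concrete, here is the sort of rough microbenchmark one would need (a sketch only; the names are mine, and a serious measurement would use a harness like JMH and account for JIT warmup):

```java
// Rough sketch of a microbenchmark comparing the two update styles.
// With both loops memory-bound on the input array, the extra per-datum
// arithmetic in the divide-per-add version is unlikely to dominate.
class AverageBench {
  static double dividePerAdd(double[] data) {
    double mean = 0.0;
    int count = 0;
    for (double x : data) {
      count++;
      mean = mean + (x - mean) / count;  // a divide on every add
    }
    return mean;
  }

  static double totalThenDivide(double[] data) {
    double total = 0.0;
    for (double x : data) {
      total += x;  // additions only
    }
    return total / data.length;
  }

  public static void main(String[] args) {
    double[] data = new double[10_000_000];
    for (int i = 0; i < data.length; i++) {
      data[i] = i % 5;
    }
    long t0 = System.nanoTime();
    double m1 = dividePerAdd(data);
    long t1 = System.nanoTime();
    double m2 = totalThenDivide(data);
    long t2 = System.nanoTime();
    System.out.printf("divide-per-add: %.3f ms, total-then-divide: %.3f ms, means %.6f / %.6f%n",
        (t1 - t0) / 1e6, (t2 - t1) / 1e6, m1, m2);
  }
}
```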

On Thu, Oct 21, 2010 at 11:45 PM, Gabriel Webster
<ga...@htc.com>wrote:

> OK no worries then, thanks for explaining.  I was thinking of using the
> class as part of transforming ratings to z-scores as a preprocessing step
> (or a wrapping recommender), and although the accuracy issue wouldn't come
> up (since there are only up to a few thousand ratings per user), the possible
> speed-up over millions of users might make it worthwhile from a value point
> of view (value = amount of improvement / amount of work to implement).  But
> since it's just averaging, it's trivial to do it myself in a different
> class.
>
> A documenting note at the top of the class explaining the motivation behind
> the implementation might be useful to future code explorers, if it's not too
> much bother.  You could probably largely just copy and paste your post.

Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Posted by Gabriel Webster <ga...@htc.com>.
OK no worries then, thanks for explaining.  I was thinking of using the 
class as part of transforming ratings to z-scores as a preprocessing 
step (or a wrapping recommender), and although the accuracy issue 
wouldn't come up (since there are only up to a few thousand ratings per 
user), the possible speed-up over millions of users might make it 
worthwhile from a value point of view (value = amount of improvement / 
amount of work to implement).  But since it's just averaging, it's 
trivial to do it myself in a different class.
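For what it's worth, the per-user transformation I have in mind is roughly this (a batch sketch with made-up names, rather than anything built on the Mahout classes):

```java
// Sketch of the z-score preprocessing idea: normalize one user's
// ratings by that user's own mean and sample standard deviation.
class ZScoreExample {
  static double[] toZScores(double[] ratings) {
    int n = ratings.length;
    double total = 0.0;
    for (double r : ratings) {
      total += r;
    }
    double mean = total / n;
    double sumSq = 0.0;
    for (double r : ratings) {
      sumSq += (r - mean) * (r - mean);
    }
    double stdDev = Math.sqrt(sumSq / (n - 1));  // sample std dev
    double[] z = new double[n];
    for (int i = 0; i < n; i++) {
      z[i] = (ratings[i] - mean) / stdDev;
    }
    return z;
  }
}
```

(A real version would need to guard against users with a single rating or zero variance.)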

A documenting note at the top of the class explaining the motivation 
behind the implementation might be useful to future code explorers, if 
it's not too much bother.  You could probably largely just copy and 
paste your post.

On 10/21/10 5:26 PM, Sean Owen wrote:
> This class is used mostly in slope-one. The priorities are memory usage,
> then speed, then accuracy in that context.
>
> Because of memory concerns, I don't want to store more than an int and
> double. So yes you can store a total and count. So we're left trading speed
> for accuracy. Your change is a bit more accurate but slower at reads (must
> divide to get the average).
>
> For slope one, which is read-heavy, and where accuracy is not a big deal,
> this isn't a worthwhile tradeoff I think.
>
> What's the use case you have in mind?

Re: FullRunningAverage possibly inefficient and (very slightly) inaccurate?

Posted by Sean Owen <sr...@gmail.com>.
This class is used mostly in slope-one. The priorities are memory usage,
then speed, then accuracy in that context.

Because of memory concerns, I don't want to store more than an int and
double. So yes you can store a total and count. So we're left trading speed
for accuracy. Your change is a bit more accurate but slower at reads (must
divide to get the average).

For slope one, which is read-heavy, and where accuracy is not a big deal,
this isn't a worthwhile tradeoff I think.

What's the use case you have in mind?

On Thu, Oct 21, 2010 at 10:03 AM, gabeweb <ga...@htc.com> wrote:

>
> I'm a bit concerned by the implementation of FullRunningAverage.  It's
> probably not a big deal, but because it performs division each time a new
> datum is added, there will be a floating point error that will increase
> over
> time, and if you average together millions of data points, this error may
> become significant.  I know that Java doubles have really huge precision,
> but still.
>
> In addition, the implementation seems rather inefficient in that it
> performs
> 6 math operations every time a datum is added.  Since adding to a running
> average is likely to happen much more often than accessing the average,
> wouldn't it be a lot more efficient to maintain the *total* and count
> internally, rather than the current average and the count, and then when
> someone requests the average, calculate the actual average on the fly by
> dividing the total by the count?  Then it would just be two addition
> operations to add a datum.  This would simultaneously eliminate the
> accumulation of floating point errors.
>
> If people agree, I'd be happy to fix this and submit a patch.  It's a
> simple
> thing, but it's sort of bothering me :)  Plus it will allow me to get
> familiar with the process.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/FullRunningAverage-possibly-inefficient-and-very-slightly-inaccurate-tp1744425p1744425.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>