Posted to user@spark.apache.org by philipghu <ph...@gmail.com> on 2016/10/02 19:47:40 UTC

statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala

Hi,


I've been struggling to understand the statistical theory behind the piece
of code below (from
/core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala),
especially with respect to how it estimates the size of the population
(total tasks) and the variance of that estimate. I'm also trying to
understand why the variance of the sum is calculated the way it is. I
haven't been able to find a source for these formulas online either.

while (iter.hasNext) {
  val entry = iter.next()
  val counter = entry.getValue
  // Sample mean for this key, and the variance of that mean estimate
  val meanEstimate = counter.mean
  val meanVar = counter.sampleVariance / counter.count
  // Estimated total count for this key, and its variance
  val countEstimate = (counter.count + 1 - p) / p
  val countVar = (counter.count + 1) * (1 - p) / (p * p)
  // Sum estimated as mean * count, with the variance of the product
  val sumEstimate = meanEstimate * countEstimate
  val sumVar = (meanEstimate * meanEstimate * countVar) +
               (countEstimate * countEstimate * meanVar) +
               (meanVar * countVar)
  val sumStdev = math.sqrt(sumVar)
  // Critical value from the Student's t distribution for this sample size
  val confFactor = studentTCacher.get(counter.count)
  val low = sumEstimate - confFactor * sumStdev
  val high = sumEstimate + confFactor * sumStdev
  result.put(entry.getKey,
    new BoundedDouble(sumEstimate, confidence, low, high))
}

Thanks and best regards,
Phil








Re: statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala

Posted by Sean Owen <so...@cloudera.com>.
FWIW, I think there are in any event several small problems with these
classes; I'm tracking them here and have a change almost ready:

https://issues.apache.org/jira/browse/SPARK-17768



Re: statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala

Posted by Sean Owen <so...@cloudera.com>.
+Matei for question about the source of this bit of code

That's a good question; I remember wondering about this once upon a time.

First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
code at this point. GroupedCountEvaluator is still used.

MeanEvaluator is a better example, because it's straightforward. It's
getting a confidence interval on the true mean from the sample stats
using a t-distribution.
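
To make that concrete, here's a minimal sketch of that kind of
t-interval (my own illustration, not the actual MeanEvaluator code; it
assumes commons-math3 on the classpath, and meanInterval is a made-up
name):

import org.apache.commons.math3.distribution.TDistribution

// Two-sided t-interval on the true mean from sample statistics:
// n = sample size, mean = sample mean, sampleVariance = unbiased variance.
def meanInterval(n: Long, mean: Double, sampleVariance: Double,
                 confidence: Double): (Double, Double) = {
  val stdErr = math.sqrt(sampleVariance / n)  // estimated stdev of the sample mean
  val t = new TDistribution(n - 1)            // n - 1 degrees of freedom
    .inverseCumulativeProbability(1 - (1 - confidence) / 2)
  (mean - t * stdErr, mean + t * stdErr)
}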

CountEvaluator, however, I don't quite get ...

val p = outputsMerged.toDouble / totalOutputs
val mean = (sum + 1 - p) / p
val variance = (sum + 1) * (1 - p) / (p * p)
val stdev = math.sqrt(variance)
val confFactor = new NormalDistribution().
  inverseCumulativeProbability(1 - (1 - confidence) / 2)
val low = mean - confFactor * stdev
val high = mean + confFactor * stdev
new BoundedDouble(mean, confidence, low, high)

Given the mean/variance formulas, this looks like it's modeling the
part of the count not yet observed as negative binomial. However, I'd
expect the mean of that remainder to be something like sum * (1-p) / p,
and even then it would be the mean of the rest of the count, not of the
total count. The code also doesn't truncate the interval at 0 (a count
can't be negative), which could likewise be addressed by not using a
normal approximation in the first place.
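
For what it's worth, the two lines are at least internally consistent
with one specific choice (this is just my reading; nothing in the code
says so): if the unseen remainder R is modeled as negative binomial
with r = sum + 1 and success probability p, the standard formulas give

  E[R] = (sum + 1) * (1 - p) / p
  Var(R) = (sum + 1) * (1 - p) / (p * p)

so the estimated total is sum + E[R] = (sum + 1 - p) / p, with variance
equal to Var(R) since the observed sum is a constant. That reproduces
both lines exactly, but it still doesn't explain why r would be
sum + 1 rather than sum.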

Of course, I could be totally wrong about the model then.

This is pretty old code, from years ago. Matei, you at least merged it,
though you may not have written it -- do you have any more info?

I might start writing some tests for this to see how it holds up, since
the result isn't directly checked by the existing unit tests.

Then there's SumEvaluator, which seems to treat the sum as the product
of the count and mean estimates, so related questions apply there too.
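
One part of SumEvaluator does have a clean pedigree, though: the
three-term sumVar expression in the snippet Phil posted is Goodman's
exact formula for the variance of a product of two independent random
variables (Goodman, "On the exact variance of products", JASA 1960):

  Var(X * Y) = E[X]^2 * Var(Y) + E[Y]^2 * Var(X) + Var(X) * Var(Y)

applied to X = meanEstimate and Y = countEstimate, which implicitly
assumes the two estimates are independent. As a sketch (my naming, not
Spark's):

  // Goodman's (1960) exact variance of a product of independent X and Y
  def productVariance(xMean: Double, xVar: Double,
                      yMean: Double, yVar: Double): Double =
    xMean * xMean * yVar + yMean * yMean * xVar + xVar * yVar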

Sean

