You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by philipghu <ph...@gmail.com> on 2016/10/02 19:47:40 UTC
statistical theory behind estimating the number of total tasks in
GroupedSumEvaluator.scala
Hi,
I've been struggling to understand the statistical theory behind this piece
of code (from
/core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
below, especially with respect to estimating the size of the population
(total tasks) and its variance. Also I'm trying to understand how the
variance of the sum is calculated like that. I'm struggling to find the
source online too.
while (iter.hasNext) {
val entry = iter.next()
val counter = entry.getValue
val meanEstimate = counter.mean
val meanVar = counter.sampleVariance / counter.count
val countEstimate = (counter.count + 1 - p) / p
val countVar = (counter.count + 1) * (1 - p) / (p * p)
val sumEstimate = meanEstimate * countEstimate
val sumVar = (meanEstimate * meanEstimate * countVar) +
(countEstimate * countEstimate * meanVar) +
(meanVar * countVar)
val sumStdev = math.sqrt(sumVar)
val confFactor = studentTCacher.get(counter.count)
val low = sumEstimate - confFactor * sumStdev
val high = sumEstimate + confFactor * sumStdev
result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
low, high))
}
Thanks and best regards,
Phil
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala
Posted by Sean Owen <so...@cloudera.com>.
FWIW I think there are in any event several small problems with these
classes; I'm tracking it here and have a change almost ready:
https://issues.apache.org/jira/browse/SPARK-17768
On Mon, Oct 3, 2016 at 9:39 AM, Sean Owen <so...@cloudera.com> wrote:
> +Matei for question about the source of this bit of code
>
> That's a good question; I remember wondering about this once upon a time.
>
> First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
> code at this point. GroupedCountEvaluator is still used.
>
> MeanEvaluator is a better example, because it's straightforward. It's
> getting a confidence interval on the true mean from the sample stats
> using a t-distribution.
>
> CountEvaluator however I don't quite get ...
>
> val p = outputsMerged.toDouble / totalOutputs
> val mean = (sum + 1 - p) / p
> val variance = (sum + 1) * (1 - p) / (p * p)
> val stdev = math.sqrt(variance)
> val confFactor = new NormalDistribution().
> inverseCumulativeProbability(1 - (1 - confidence) / 2)
> val low = mean - confFactor * stdev
> val high = mean + confFactor * stdev
> new BoundedDouble(mean, confidence, low, high)
>
> Given the mean/variance formula, this looks like it's modeling the
> rest of the count not yet observed as negative binomial. However I'd
> expect the mean to be something like sum * (1-p) / p instead, and
> that's not the mean of the total count but the rest of the count then.
> This also doesn't truncate the interval at 0 (count can't be
> negative), which could also be solved by not using a normal
> approximation too.
>
> Of course, I could be totally wrong about the model then.
>
> This is pretty old code, from years ago. Matei you at least merged it
> though may not have written it -- do you have any more info?
>
> I might start writing some tests for this since the result isn't
> directly tested in the unit tests, to see how it holds up.
>
> Then, the question is SumEvaluator, which seems to approach the
> estimation as an estimation of product of count and mean together, so
> it has some related questions.
>
> Sean
>
> On Sun, Oct 2, 2016 at 8:47 PM, philipghu <ph...@gmail.com> wrote:
>> Hi,
>>
>>
>> I've been struggling to understand the statistical theory behind this piece
>> of code (from
>> /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
>> below, especially with respect to estimating the size of the population
>> (total tasks) and its variance. Also I'm trying to understand how the
>> variance of the sum is calculated like that. I'm struggling to find the
>> source online too.
>>
>> while (iter.hasNext) {
>> val entry = iter.next()
>> val counter = entry.getValue
>> val meanEstimate = counter.mean
>> val meanVar = counter.sampleVariance / counter.count
>> val countEstimate = (counter.count + 1 - p) / p
>> val countVar = (counter.count + 1) * (1 - p) / (p * p)
>> val sumEstimate = meanEstimate * countEstimate
>> val sumVar = (meanEstimate * meanEstimate * countVar) +
>> (countEstimate * countEstimate * meanVar) +
>> (meanVar * countVar)
>> val sumStdev = math.sqrt(sumVar)
>> val confFactor = studentTCacher.get(counter.count)
>> val low = sumEstimate - confFactor * sumStdev
>> val high = sumEstimate + confFactor * sumStdev
>> result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
>> low, high))
>> }
>>
>> Thanks and best regards,
>> Phil
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: statistical theory behind estimating the number of total tasks in GroupedSumEvaluator.scala
Posted by Sean Owen <so...@cloudera.com>.
+Matei for question about the source of this bit of code
That's a good question; I remember wondering about this once upon a time.
First, GroupedSumEvaluator and GroupedMeanEvaluator look like dead
code at this point. GroupedCountEvaluator is still used.
MeanEvaluator is a better example, because it's straightforward. It's
getting a confidence interval on the true mean from the sample stats
using a t-distribution.
CountEvaluator however I don't quite get ...
val p = outputsMerged.toDouble / totalOutputs
val mean = (sum + 1 - p) / p
val variance = (sum + 1) * (1 - p) / (p * p)
val stdev = math.sqrt(variance)
val confFactor = new NormalDistribution().
inverseCumulativeProbability(1 - (1 - confidence) / 2)
val low = mean - confFactor * stdev
val high = mean + confFactor * stdev
new BoundedDouble(mean, confidence, low, high)
Given the mean/variance formula, this looks like it's modeling the
rest of the count not yet observed as negative binomial. However I'd
expect the mean to be something like sum * (1-p) / p instead, and
that's not the mean of the total count but the rest of the count then.
This also doesn't truncate the interval at 0 (count can't be
negative), which could also be solved by not using a normal
approximation too.
Of course, I could be totally wrong about the model then.
This is pretty old code, from years ago. Matei you at least merged it
though may not have written it -- do you have any more info?
I might start writing some tests for this since the result isn't
directly tested in the unit tests, to see how it holds up.
Then, the question is SumEvaluator, which seems to approach the
estimation as an estimation of product of count and mean together, so
it has some related questions.
Sean
On Sun, Oct 2, 2016 at 8:47 PM, philipghu <ph...@gmail.com> wrote:
> Hi,
>
>
> I've been struggling to understand the statistical theory behind this piece
> of code (from
> /core/src/main/scala/org/apache/spark/partial/GroupedSumEvaluator.scala)
> below, especially with respect to estimating the size of the population
> (total tasks) and its variance. Also I'm trying to understand how the
> variance of the sum is calculated like that. I'm struggling to find the
> source online too.
>
> while (iter.hasNext) {
> val entry = iter.next()
> val counter = entry.getValue
> val meanEstimate = counter.mean
> val meanVar = counter.sampleVariance / counter.count
> val countEstimate = (counter.count + 1 - p) / p
> val countVar = (counter.count + 1) * (1 - p) / (p * p)
> val sumEstimate = meanEstimate * countEstimate
> val sumVar = (meanEstimate * meanEstimate * countVar) +
> (countEstimate * countEstimate * meanVar) +
> (meanVar * countVar)
> val sumStdev = math.sqrt(sumVar)
> val confFactor = studentTCacher.get(counter.count)
> val low = sumEstimate - confFactor * sumStdev
> val high = sumEstimate + confFactor * sumStdev
> result.put(entry.getKey, new BoundedDouble(sumEstimate, confidence,
> low, high))
> }
>
> Thanks and best regards,
> Phil
>
>
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/statistical-theory-behind-estimating-the-number-of-total-tasks-in-GroupedSumEvaluator-scala-tp27827.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org