Posted to issues@spark.apache.org by "Lovasoa (JIRA)" <ji...@apache.org> on 2017/06/11 16:37:18 UTC

[jira] [Created] (SPARK-21057) Do not use a PascalDistribution in countApprox

Lovasoa created SPARK-21057:
-------------------------------

             Summary: Do not use a PascalDistribution in countApprox
                 Key: SPARK-21057
                 URL: https://issues.apache.org/jira/browse/SPARK-21057
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.1.1
            Reporter: Lovasoa


I was reading the Spark source and found this:
https://github.com/apache/spark/blob/v2.1.1/core/src/main/scala/org/apache/spark/partial/CountEvaluator.scala#L50-L72

This is the function that estimates the probability distribution of the total count of elements in an RDD, given counts from only some of its partitions.

This function does a strange thing: when the number of elements counted so far is less than 10 000, it models the total count with a negative binomial (Pascal) law; otherwise it models it with a Poisson law.
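
For concreteness, here is a condensed sketch of that branch. It is a paraphrase of the linked CountEvaluator.scala (using the Apache Commons Math distributions it relies on), not a verbatim copy; the threshold and the Poisson mean follow my reading of the file:

    import org.apache.commons.math3.distribution.{PascalDistribution, PoissonDistribution}

    // sum = number of elements counted so far
    // p   = fraction of the data already scanned
    // Both branches model the number of elements NOT yet counted.
    val dist =
      if (sum <= 10000) {
        // Negative binomial: failures observed before `sum` successes,
        // each success occurring with probability p.
        new PascalDistribution(sum.toInt, p)
      } else {
        // For large sum, Pascal(sum, p) is approximated by a Poisson law
        // with the same mean, sum * (1 - p) / p.
        new PoissonDistribution(sum * (1 - p) / p)
      }
    // A confidence interval then comes from the inverse CDF, e.g.
    // dist.inverseCumulativeProbability((1 - confidence) / 2).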

Modeling the number of uncounted elements with a negative binomial law amounts to saying that we walked over the elements, counted each one independently with probability p, and stopped as soon as we had counted a fixed, predetermined number of elements.
But that is not what actually happened: our counting was limited in time, not in the number of counted elements.
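
To make the difference concrete (these are standard properties of the two distributions, not taken from the linked code): with r = sum elements already counted and sampling fraction p, both models of the uncounted elements have the same mean but not the same variance:

    Pascal(r, p):   mean = r(1-p)/p,   variance = r(1-p)/p^2
    Poisson(m):     mean = m,          variance = m,   where m = r(1-p)/p

So for p < 1 the Pascal model is more dispersed by a factor of 1/p, and the two branches produce noticeably different confidence intervals for the same observations when p is small.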

I propose to use the Poisson distribution in every case, as it can be justified under the hypothesis that the numbers of elements in the partitions are independent and each follows a Poisson law.
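
A minimal sketch of what that could look like (hypothetical, not an actual patch; it simply applies the existing Poisson branch unconditionally, and returns a plain tuple instead of Spark's BoundedDouble to stay self-contained):

    import org.apache.commons.math3.distribution.PoissonDistribution

    // sum = elements counted so far, p = fraction of the data scanned,
    // confidence = requested confidence level, e.g. 0.95.
    // Returns (estimated total, lower bound, upper bound).
    def bound(confidence: Double, sum: Long, p: Double): (Double, Double, Double) = {
      // Model the uncounted elements as Poisson, scaling the observed rate
      // to the remaining fraction of the data. (A real patch would need to
      // handle the edge cases sum == 0 and p == 1, where the mean is 0.)
      val dist = new PoissonDistribution(sum * (1 - p) / p)
      val low  = dist.inverseCumulativeProbability((1 - confidence) / 2)
      val high = dist.inverseCumulativeProbability((1 + confidence) / 2)
      (sum + dist.getNumericalMean, sum + low, sum + high)
    }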


