Posted to user@spark.apache.org by Stefano Lodi <st...@unibo.it> on 2016/09/15 17:20:34 UTC

countApprox

I am experimenting with countApprox. I created an RDD of 10^8 numbers and ran countApprox with different parameters, but I failed to generate any approximate output. In all runs it returns the exact number of elements. What is the effect of approximation in countApprox supposed to be, and for what inputs and parameters?

>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)], 50)
>>> rdd.countApprox(1, 0.8)
[Stage 12:>                                                        (0 + 0) / 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 12:======================================================> (49 + 1) / 50]100000000
>>> rdd.countApprox(1, 0.01)
16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 13:====================================================>   (47 + 3) / 50]100000000


Re: countApprox

Posted by Stefano Lodi <st...@unibo.it>.
No, the ASCII progress bar grows for a few seconds, with all four cores at 100%, then it returns 100000000, or rarely 99999999. Does the timeout value refer to elapsed time?
________________________________________
From: Sean Owen <so...@cloudera.com>
Sent: Friday, 16 September 2016 10:04
To: Stefano Lodi
Cc: user@spark.apache.org
Subject: Re: countApprox

countApprox gives the best answer within some timeout. Is it possible
that 1 ms is more than enough to count this exactly? Then the
confidence wouldn't matter. Although that seems way too fast, you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to use that fact. Then counting a
partition takes almost no time. Does it return immediately?

On Thu, Sep 15, 2016 at 6:20 PM, Stefano Lodi <st...@unibo.it> wrote:
> I am experimenting with countApprox. I created a RDD of 10^8 numbers and ran
> countApprox with different parameters but I failed to generate any
> approximate output. In all runs it returns the exact number of elements.
> What is the effect of approximation in countApprox supposed to be, and for
> what inputs and parameters?
>
>>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)],
>>>> 50)
>>>> rdd.countApprox(1, 0.8)
> [Stage 12:>                                                        (0 + 0) /
> 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 12:======================================================> (49 + 1) /
> 50]100000000
>>>> rdd.countApprox(1, 0.01)
> 16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 13:====================================================>   (47 + 3) /
> 50]100000000
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: countApprox

Posted by Sean Owen <so...@cloudera.com>.
countApprox gives the best answer within some timeout. Is it possible
that 1 ms is more than enough to count this exactly? Then the
confidence wouldn't matter. Although that seems way too fast, you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to use that fact. Then counting a
partition takes almost no time. Does it return immediately?
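[Editor's note: to make the timeout semantics concrete, here is a toy pure-Python sketch of the idea. This is an illustration only, not Spark's actual implementation; Spark's countApprox returns a confidence interval built from whichever partitions finished before the deadline. If counting all partitions fits inside the timeout, the result is simply exact, which is what the transcript above shows.]

```python
import time

def count_approx(partitions, timeout_ms):
    """Toy model of a timeout-bounded count: count whole partitions
    until the deadline passes, then extrapolate from the fraction seen."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    counted = 0   # elements seen so far
    done = 0      # partitions fully counted
    for part in partitions:
        counted += len(part)
        done += 1
        if time.monotonic() >= deadline:
            break
    # Extrapolate the partial count to the full set of partitions.
    return round(counted * len(partitions) / done)

# 50 partitions of 2000 elements each: 100000 elements total.
parts = [[0] * 2000 for _ in range(50)]
print(count_approx(parts, timeout_ms=1000))  # ample time: exact count, 100000
```

With a generous timeout every partition is counted and the answer is exact; only when the deadline cuts the scan short does the extrapolation (and hence the confidence parameter) come into play.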

On Thu, Sep 15, 2016 at 6:20 PM, Stefano Lodi <st...@unibo.it> wrote:
> I am experimenting with countApprox. I created a RDD of 10^8 numbers and ran
> countApprox with different parameters but I failed to generate any
> approximate output. In all runs it returns the exact number of elements.
> What is the effect of approximation in countApprox supposed to be, and for
> what inputs and parameters?
>
>>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)],
>>>> 50)
>>>> rdd.countApprox(1, 0.8)
> [Stage 12:>                                                        (0 + 0) /
> 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 12:======================================================> (49 + 1) /
> 50]100000000
>>>> rdd.countApprox(1, 0.01)
> 16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 13:====================================================>   (47 + 3) /
> 50]100000000
>
