Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/12/17 07:42:46 UTC

[jira] [Closed] (SPARK-6006) Optimize count distinct in case of high cardinality columns

     [ https://issues.apache.org/jira/browse/SPARK-6006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-6006.
------------------------------
       Resolution: Fixed
         Assignee: Davies Liu
    Fix Version/s: 1.6.0

This is fixed as of Spark 1.6.



> Optimize count distinct in case of high cardinality columns
> -----------------------------------------------------------
>
>                 Key: SPARK-6006
>                 URL: https://issues.apache.org/jira/browse/SPARK-6006
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: Yash Datta
>            Assignee: Davies Liu
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> When a column has many distinct values, count distinct becomes slow because all partial results are hashed into a single map. It can be improved by creating buckets (partial maps) in an intermediate stage, where the same key from the first stage's partial maps always hashes to the same bucket. The total distinct count is then the sum of the bucket sizes.
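
The bucketing scheme described in the issue can be sketched as follows. This is an illustrative plain-Python model of the idea, not Spark's actual implementation; the function name, bucket count, and use of Python sets are all assumptions for demonstration.

```python
def count_distinct(partitions, num_buckets=8):
    """Two-stage distinct count over pre-partitioned data (illustrative sketch)."""
    # Stage 1: each input partition reduces to its local set of distinct values,
    # instead of shipping everything to one global map.
    partial_sets = [set(p) for p in partitions]

    # Stage 2: redistribute by hash so the same value from every partial set
    # lands in the same bucket; duplicates across partitions collapse per bucket.
    buckets = [set() for _ in range(num_buckets)]
    for s in partial_sets:
        for v in s:
            buckets[hash(v) % num_buckets].add(v)

    # Because each distinct value occupies exactly one bucket, summing the
    # bucket sizes yields the global distinct count.
    return sum(len(b) for b in buckets)
```

Since no bucket ever sees a value that hashes elsewhere, the per-bucket work is independent and the expensive merge into a single map is avoided.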



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org