Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/10/07 02:51:17 UTC

multiple count distinct in SQL/DataFrame?

The current implementation of multiple count distinct in a single query
performs poorly and is not robust, and it is also hard to guarantee its
correctness through some of the refactorings for Tungsten. Supporting a
better version of it is possible in the future, but it will take a lot of
engineering effort. Most other Hadoop-based SQL systems (e.g. Hive, Impala)
don't support this feature.

As a result, we are considering removing support for multiple count
distinct in a single query in the next Spark release (1.6). If you use this
feature, please reply to this email. Thanks.

Note that if you don't care about null values, it is relatively easy to
reconstruct a query using joins to support multiple distincts.
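
A minimal sketch of what such a join-based rewrite could look like. It uses
Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs
anywhere; the table foo and the columns colA and colB follow the examples in
this thread, and the sample rows are made up. Each distinct aggregate gets
its own single-row subquery, and the subqueries are then cross joined.

```python
import sqlite3

# In-memory table standing in for a Spark table named foo (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (colA TEXT, colB TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [("x", "1"), ("x", "2"), ("y", "2"), ("y", None), ("z", "2")],
)

# Original form: two distinct aggregates in a single query.
MULTI_DISTINCT = """
SELECT count(DISTINCT colA), count(DISTINCT colB) FROM foo
"""

# Rewritten form: one distinct aggregate per subquery, then a join.
# Each subquery returns exactly one row, so the cross join returns one row.
JOIN_REWRITE = """
SELECT a.cnt, b.cnt
FROM (SELECT count(DISTINCT colA) AS cnt FROM foo) AS a
CROSS JOIN (SELECT count(DISTINCT colB) AS cnt FROM foo) AS b
"""

direct = conn.execute(MULTI_DISTINCT).fetchone()
rewritten = conn.execute(JOIN_REWRITE).fetchone()
print(direct, rewritten)
```

In this ungrouped form both queries skip NULLs inside the counted column, so
the results agree. Whether Spark would plan the rewritten query efficiently
is not something this sketch shows.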

Re: multiple count distinct in SQL/DataFrame?

Posted by Mayank Pradhan <ma...@platfora.com>.

Is this limited to grand (i.e. ungrouped) multiple count distincts, or does
it extend to all kinds of multiple count distincts? More precisely, would
the following multiple count distinct query also be affected?
select a, b, count(distinct x), count(distinct y) from foo group by a,b;

It would be unfortunate to lose that too.
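
For reference, a sketch of how the grouped query above could be rewritten
with a join, again using Python's sqlite3 module as a stand-in for Spark SQL.
The column types and sample rows are assumptions. It also shows one plausible
reading of the null caveat from the original post: an equi-join on the group
keys drops any group whose key is NULL, because NULL = NULL is not true.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a TEXT, b TEXT, x INTEGER, y INTEGER)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?, ?)",
    [
        ("a1", "b1", 1, 10),
        ("a1", "b1", 1, 20),
        ("a1", "b1", 2, 20),
        ("a2", "b1", 3, 30),
        ("a2", "b1", 3, 30),
        (None, "b1", 4, 40),  # group key a is NULL
    ],
)

# Original form: two distinct aggregates per group.
GROUPED = """
SELECT a, b, count(DISTINCT x), count(DISTINCT y)
FROM foo GROUP BY a, b ORDER BY a, b
"""

# Rewritten form: one grouped distinct aggregate per subquery,
# joined back together on the group keys.
GROUPED_JOIN = """
SELECT dx.a, dx.b, dx.cnt_x, dy.cnt_y
FROM (SELECT a, b, count(DISTINCT x) AS cnt_x FROM foo GROUP BY a, b) AS dx
JOIN (SELECT a, b, count(DISTINCT y) AS cnt_y FROM foo GROUP BY a, b) AS dy
  ON dx.a = dy.a AND dx.b = dy.b
ORDER BY dx.a, dx.b
"""

direct = conn.execute(GROUPED).fetchall()
rewritten = conn.execute(GROUPED_JOIN).fetchall()
print(direct)     # includes the group whose key a is NULL
print(rewritten)  # the NULL-keyed group is missing
```

If NULL group keys matter, the join condition would need a null-safe
comparison instead of plain equality.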

-Mayank

On Wed, Oct 7, 2015 at 1:56 PM, Herman van Hövell tot Westerflier <
hvanhovell@questtec.nl> wrote:

> We could also fall back to approximate count distincts when the user
> requests multiple count distincts. This is less invasive than throwing an
> AnalysisException, but it could violate the principle of least surprise.

Re: multiple count distinct in SQL/DataFrame?

Posted by Herman van Hövell tot Westerflier <hv...@questtec.nl>.

We could also fall back to approximate count distincts when the user
requests multiple count distincts. This is less invasive than throwing an
AnalysisException, but it could violate the principle of least surprise.
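
To make the "least surprise" concern concrete, here is a small sketch of an
approximate distinct counter. It is a K-minimum-values estimator written for
illustration only, not Spark's implementation; the function names and the
parameter k are hypothetical. The point is that the result is an estimate
and can differ from the exact count that a user writing count(distinct ...)
would expect.

```python
import hashlib
import heapq


def unit_hash(value):
    """Map a value to a pseudo-random float in [0, 1] via SHA-1."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct non-null values, keeping at most k hashes."""
    heap = []     # negated hashes, so -heap[0] is the largest hash kept
    kept = set()  # the same hashes, for duplicate checks
    for value in values:
        if value is None:
            continue  # mirror count(distinct ...), which skips NULLs
        h = unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes seen: count is exact
    # The k-th smallest hash estimates the density of distinct values.
    return round((k - 1) / -heap[0])


col = [i % 5000 for i in range(20000)]
exact = len(set(col))
estimate = approx_count_distinct(col)
print(exact, estimate)
```

With k = 1024 the relative error is typically a few percent, which is fine
for dashboards but could be surprising if it silently replaced an exact
count.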



Met vriendelijke groet/Kind regards,

Herman van Hövell tot Westerflier

QuestTec B.V.
Torenwacht 98
2353 DC Leiderdorp
hvanhovell@questtec.nl
+599 9 521 4402


2015-10-07 22:43 GMT+02:00 Reynold Xin <rx...@databricks.com>:

> Adding user list too.

Fwd: multiple count distinct in SQL/DataFrame?

Posted by Reynold Xin <rx...@databricks.com>.

Adding user list too.



---------- Forwarded message ----------
From: Reynold Xin <rx...@databricks.com>
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" <de...@spark.apache.org>


To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;



Re: multiple count distinct in SQL/DataFrame?

Posted by Reynold Xin <rx...@databricks.com>.

To provide more context, if we do remove this feature, the following SQL
query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;
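
Note that the last form is not a substitute for the first: it counts distinct
(colA, colB) pairs rather than producing one count per column. A small sketch
of the difference, using Python's sqlite3 module as a stand-in. SQLite does
not accept a multi-column count(distinct ...), so the pair count is emulated
with a subquery, assuming Hive-style semantics in which rows with a NULL in
any listed column are skipped; check this against your Spark version. The
sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (colA TEXT, colB TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [("x", "1"), ("x", "1"), ("x", "2"), ("y", "2"), ("y", None)],
)

# One distinct count per column (the form that would be rejected).
separate = conn.execute(
    "SELECT count(DISTINCT colA), count(DISTINCT colB) FROM foo"
).fetchone()

# Emulation of count(DISTINCT colA, colB): distinct pairs, NULLs skipped.
pairs = conn.execute(
    """
    SELECT count(*) FROM (
        SELECT DISTINCT colA, colB FROM foo
        WHERE colA IS NOT NULL AND colB IS NOT NULL
    )
    """
).fetchone()[0]

print(separate, pairs)
```

Here the per-column counts are 2 and 2, while the pair count is 3.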

