
Posted to user@spark.apache.org by Bojan Kostic <bl...@gmail.com> on 2014/11/03 09:45:20 UTC

Re: SQL COUNT DISTINCT

Hi Michael,
Thanks for the response. I tested with the query that you sent me, and it
runs much faster.
Old query, time per phase:
3.2 min
17 s
Your query, time per phase:
0.3 s
16 s
20 s

But will this improvement also apply when you want to count distinct values
on two or more fields?
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
FROM parquetFile
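
To make the query shape concrete, here is a minimal sketch of one possible
manual rewrite, in which each distinct count becomes COUNT(*) over its own
SELECT DISTINCT subquery. It uses Python's sqlite3 module as a stand-in
engine only to check that both forms return the same numbers. The sample rows
are made up, the helper name distinct_count is hypothetical, and nothing here
says how either form performs in Spark SQL.

```python
import sqlite3

# Hypothetical stand-in rows; the real table is the Parquet-backed parquetFile.
rows = [
    (1, "a", "x", 10),
    (2, "a", "y", 10),
    (3, "b", "y", 20),
    (None, "b", "y", 30),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parquetFile (f1, f2, f3, f4)")
conn.executemany("INSERT INTO parquetFile VALUES (?, ?, ?, ?)", rows)

# The query from the message: several COUNT(DISTINCT ...) aggregates at once.
combined = conn.execute(
    "SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) "
    "FROM parquetFile"
).fetchone()


def distinct_count(column):
    """Count distinct non-NULL values of one column via a DISTINCT subquery."""
    # The IS NOT NULL filter is needed because COUNT(DISTINCT col) ignores NULLs,
    # while SELECT DISTINCT would keep a NULL row.
    sql = (
        "SELECT COUNT(*) FROM "
        f"(SELECT DISTINCT {column} FROM parquetFile WHERE {column} IS NOT NULL)"
    )
    return conn.execute(sql).fetchone()[0]


# Manual rewrite: one small query per aggregate, combined on the client side.
rewritten = (
    conn.execute("SELECT COUNT(f1) FROM parquetFile").fetchone()[0],
    distinct_count("f2"),
    distinct_count("f3"),
    distinct_count("f4"),
)

print(combined, rewritten)
```

The rewrite scans the table once per column, so whether it helps at all
depends on the engine and the data.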

Should I still create a Jira issue/improvement for this?

@Nick
That also makes sense. But should I just bring the count of my data to the
driver node?

I have just started learning Spark (and it is great), so sorry if I ask
stupid questions or anything like that.

Best regards
Bojan




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818p17939.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: SQL COUNT DISTINCT

Posted by Bojan Kostic <bl...@gmail.com>.

Here is the Jira link: https://issues.apache.org/jira/browse/SPARK-4243




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SQL-COUNT-DISTINCT-tp17818p18166.html

Re: SQL COUNT DISTINCT

Posted by Michael Armbrust <mi...@databricks.com>.

On Mon, Nov 3, 2014 at 12:45 AM, Bojan Kostic <bl...@gmail.com> wrote:
>
> But will this improvement also apply when you want to count distinct values
> on two or more fields?
> SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
> FROM parquetFile
>

Unfortunately, I think this case may be harder for us to optimize, though it
could be possible with some work.


> Should I still create a Jira issue/improvement for this?
>

Yes please.