Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2014/11/05 22:55:35 UTC

[jira] [Updated] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization

     [ https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicholas Chammas updated SPARK-4243:
------------------------------------
    Description: 
Spark SQL runs slowly when using this code:
{code}
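// Assumes `sc` is an existing SparkContext, e.g. the one provided by spark-shell.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Load the Parquet data and expose it to SQL as a temporary table.
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
// Slow query: a single COUNT(DISTINCT ...) aggregate.
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
// Print the single result value.
count.map(t => t(0)).collect().foreach(println)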
{code}

But with this query it runs much faster:
{code}
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
{code}
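
One caveat when using the rewrite above as a workaround: the two queries are only guaranteed to agree when f2 contains no NULLs. COUNT(DISTINCT f2) ignores NULLs, while SELECT DISTINCT keeps NULL as one row that COUNT(*) then counts. The following is a minimal sketch of that difference using plain Scala collections rather than Spark; the object name, function names, and sample data are hypothetical, and None stands in for SQL NULL.

```scala
object CountDistinctSemantics {
  // Models SELECT COUNT(DISTINCT col): NULLs (None) are dropped before deduplicating.
  def countDistinct[A](col: Seq[Option[A]]): Int =
    col.flatten.distinct.size

  // Models SELECT COUNT(*) FROM (SELECT DISTINCT col ...) a:
  // NULL (None) survives DISTINCT as a single row and is counted by COUNT(*).
  def countOverDistinct[A](col: Seq[Option[A]]): Int =
    col.distinct.size

  def main(args: Array[String]): Unit = {
    // Hypothetical sample of the f2 column.
    val f2: Seq[Option[String]] = Seq(Some("a"), Some("b"), Some("a"), None, None)
    println(s"COUNT(DISTINCT f2)           = ${countDistinct(f2)}")     // 2
    println(s"COUNT(*) over SELECT DISTINCT = ${countOverDistinct(f2)}") // 3
  }
}
```

If f2 can be NULL, adding WHERE f2 IS NOT NULL inside the subquery should make the rewritten query return the same value as COUNT(DISTINCT f2).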

Old query stats by phase:
3.2 min
17 s
New query stats by phase:
0.3 s
16 s
20 s

This query may also be worth considering for optimization:
{code}
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile 
{code}
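
To make explicit what the multi-column query above is expected to return, here is a small reference sketch in plain Scala collections. It is not how Spark SQL evaluates the query and says nothing about a distributed plan; it only illustrates the intended result (a non-NULL count for f1 and one distinct count each for f2, f3, f4, all computed in one pass). The Rec and Counts types, the column types, and the sample rows are assumptions made for illustration.

```scala
object MultiDistinctSketch {
  // Hypothetical row shape; None stands in for SQL NULL.
  final case class Rec(f1: Option[Int], f2: Option[String], f3: Option[String], f4: Option[String])

  // COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4)
  final case class Counts(f1: Long, f2: Int, f3: Int, f4: Int)

  def aggregate(rows: Seq[Rec]): Counts = {
    // Single pass: a running non-NULL count for f1 plus one set of seen values per DISTINCT column.
    val zero = (0L, Set.empty[String], Set.empty[String], Set.empty[String])
    val (n1, s2, s3, s4) = rows.foldLeft(zero) { case ((n, a, b, c), r) =>
      val inc = if (r.f1.isDefined) 1L else 0L
      // Adding an Option to a Set inserts the value if present and ignores None (NULL).
      (n + inc, a ++ r.f2, b ++ r.f3, c ++ r.f4)
    }
    Counts(n1, s2.size, s3.size, s4.size)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Rec(Some(1), Some("x"), Some("p"), None),
      Rec(None,    Some("x"), Some("q"), None),
      Rec(Some(3), Some("y"), Some("p"), Some("z"))
    )
    println(aggregate(rows)) // Counts(2,2,2,1)
  }
}
```

A sketch like this can serve as a small correctness check for any rewritten form of the query on sample data.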


  was:
Spark SQL runs slow when using this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc) 
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/") 
parquetFile.registerTempTable("parquetFile") 
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile") 
count.map(t => t(0)).collect().foreach(println)

But with this query it runs much faster:
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a

Old queries stats by phases: 
3.2min 
17s 
New query stats by phases: 
0.3 s 
16 s 
20 s

Maybe you should also see this query for optimization:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile 


> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
>                 Key: SPARK-4243
>                 URL: https://issues.apache.org/jira/browse/SPARK-4243
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Bojan Kostić


