Posted to issues@spark.apache.org by "Nicholas Chammas (JIRA)" <ji...@apache.org> on 2014/11/05 22:55:35 UTC
[jira] [Updated] (SPARK-4243) Spark SQL SELECT COUNT DISTINCT optimization
[ https://issues.apache.org/jira/browse/SPARK-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicholas Chammas updated SPARK-4243:
------------------------------------
Description:
Spark SQL runs slowly when using this code:
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)
{code}
But with this query it runs much faster:
{code}
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
{code}
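One caveat when comparing the two forms: in standard SQL they only return the same number when f2 contains no NULLs. COUNT(DISTINCT f2) ignores NULLs, while SELECT DISTINCT f2 keeps a single NULL row that COUNT(*) then counts. A minimal sketch of that difference in plain Scala collections follows. It does not use Spark, and it assumes f2 is a nullable string column, modelled here as Option[String]. The object and method names are hypothetical.

```scala
// Plain-Scala model of the two queries. Option[String] stands in for a
// nullable f2 column (an assumption; the real column type is not given).
object CountDistinctSketch {
  // Models SELECT COUNT(DISTINCT f2): NULLs are dropped before counting.
  def countDistinct(f2: Seq[Option[String]]): Int =
    f2.flatten.distinct.size

  // Models SELECT COUNT(*) FROM (SELECT DISTINCT f2 ...) a:
  // NULL survives as one distinct row and is counted.
  def countOverDistinct(f2: Seq[Option[String]]): Int =
    f2.distinct.size

  def main(args: Array[String]): Unit = {
    val noNulls: Seq[Option[String]] = Seq(Some("a"), Some("b"), Some("a"))
    val withNull: Seq[Option[String]] = noNulls :+ None

    println(countDistinct(noNulls))      // 2
    println(countOverDistinct(noNulls))  // 2
    println(countDistinct(withNull))     // 2
    println(countOverDistinct(withNull)) // 3
  }
}
```

If f2 can be NULL, using COUNT(f2) instead of COUNT(*) in the outer query should keep the two forms equivalent, since COUNT(f2) also skips NULLs.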
Old query stats by phase:
3.2 min
17 s
New query stats by phase:
0.3 s
16 s
20 s
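To put the two sets of figures side by side, the sketch below adds up the phases. It assumes the phases run one after another, so that the total run time is the sum of the phase times. Under that assumption the old query takes about 209 s and the new one about 36.3 s, roughly a 5.8x difference. The object and value names are hypothetical.

```scala
// Sums the phase timings reported above. Assumes phases run sequentially,
// so total time is the sum of the phase times.
object PhaseTimings {
  // Durations in seconds, copied from the figures above (3.2 min = 192 s).
  val oldPhases: Seq[Double] = Seq(3.2 * 60, 17.0)
  val newPhases: Seq[Double] = Seq(0.3, 16.0, 20.0)

  def main(args: Array[String]): Unit = {
    val oldTotal = oldPhases.sum
    val newTotal = newPhases.sum
    println(f"old total: $oldTotal%.1f s")
    println(f"new total: $newTotal%.1f s")
    println(f"speedup:   ${oldTotal / newTotal}%.1fx")
  }
}
```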
This query may also be worth considering for optimization:
{code}
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile
{code}
was:
Spark SQL runs slow when using this code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetFile = sqlContext.parquetFile("/bojan/test/2014-10-20/")
parquetFile.registerTempTable("parquetFile")
val count = sqlContext.sql("SELECT COUNT(DISTINCT f2) FROM parquetFile")
count.map(t => t(0)).collect().foreach(println)
But with this query it runs much faster:
SELECT COUNT(*) FROM (SELECT DISTINCT f2 FROM parquetFile) a
Old queries stats by phases:
3.2min
17s
New query stats by phases:
0.3 s
16 s
20 s
Maybe you should also see this query for optimization:
SELECT COUNT(f1), COUNT(DISTINCT f2), COUNT(DISTINCT f3), COUNT(DISTINCT f4) FROM parquetFile
> Spark SQL SELECT COUNT DISTINCT optimization
> --------------------------------------------
>
> Key: SPARK-4243
> URL: https://issues.apache.org/jira/browse/SPARK-4243
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Bojan Kostić