Posted to issues@spark.apache.org by "Michael Armbrust (JIRA)" <ji...@apache.org> on 2015/04/12 20:30:12 UTC

[jira] [Resolved] (SPARK-4760) "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files

     [ https://issues.apache.org/jira/browse/SPARK-4760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-4760.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.3.0

The native Parquet support (which is used by default for both Spark SQL and Hive DDL) automatically computes table sizes starting with Spark 1.3, so running ANALYZE is no longer needed for automatic broadcast joins. Please reopen if you see any issues with this new feature.
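
For reference, a minimal sketch (not part of the original ticket) of how this can be exercised on Spark 1.3, assuming a Parquet-backed table at the hypothetical path /data/dim_table, a hypothetical fact_table, and the default spark.sql.autoBroadcastJoinThreshold of 10 MB:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))
        val sqlContext = new HiveContext(sc)

        // Tables read through the native Parquet support report their size from the
        // Parquet file metadata, so no ANALYZE TABLE pass is required in 1.3+.
        val dim = sqlContext.parquetFile("/data/dim_table")   // hypothetical path
        dim.registerTempTable("dim_table")

        // Tables whose estimated size falls below this threshold (in bytes) are
        // eligible for an automatic broadcast join; 10 MB is the default.
        sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=10485760")

        // The small dimension table should now be broadcast without running ANALYZE.
        sqlContext.sql(
          "SELECT f.*, d.name FROM fact_table f JOIN dim_table d ON f.dim_id = d.id"
        ).explain(true)   // BroadcastHashJoin should appear in the physical plan
      }
    }

Whether the broadcast actually happens still depends on the threshold, which remains configurable per session.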

> "ANALYZE TABLE table COMPUTE STATISTICS noscan" failed estimating table size for tables created from Parquet files
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4760
>                 URL: https://issues.apache.org/jira/browse/SPARK-4760
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>            Reporter: Jianshi Huang
>            Priority: Critical
>             Fix For: 1.3.0
>
>
> In an older Spark version built around Oct. 12, I was able to use
>   ANALYZE TABLE table COMPUTE STATISTICS noscan
> to get an estimated table size, which is important for optimizing joins. (I'm joining 15 small dimension tables, so this is crucial for me.)
> In the more recent Spark builds, it fails to estimate the table size unless I remove "noscan".
> Here are the statistics I got using DESC EXTENDED:
> old:
> parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1417763591, totalSize=56166}
> new:
> parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417763892, COLUMN_STATS_ACCURATE=false, totalSize=0, numRows=-1, rawDataSize=-1}
> I've also tried turning off spark.sql.hive.convertMetastoreParquet in my spark-defaults.conf, and the result is unaffected (in both versions).
> It looks like the Parquet support in the newer Hive (0.13.1) is broken?
> Jianshi
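
For completeness, a minimal sketch (not taken from the report above) of the pre-1.3 workflow Jianshi describes, assuming a HiveContext and a hypothetical Hive table named dim_table created from Parquet files:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object AnalyzeNoscanSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("analyze-noscan-sketch"))
        val hc = new HiveContext(sc)

        // Metadata-only statistics pass; on the affected 1.2 builds this left
        // totalSize=0 for Parquet-backed tables, as shown in the "new" parameters above.
        hc.sql("ANALYZE TABLE dim_table COMPUTE STATISTICS noscan")

        // totalSize in the table parameters is what the planner compares against
        // spark.sql.autoBroadcastJoinThreshold when deciding whether to broadcast.
        hc.sql("DESCRIBE EXTENDED dim_table").collect().foreach(println)
      }
    }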



