You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:35:47 UTC

[jira] [Resolved] (SPARK-14588) Consider getting column stats from files (wherever feasible) to get better stats for joins

     [ https://issues.apache.org/jira/browse/SPARK-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-14588.
----------------------------------
    Resolution: Incomplete

> Consider getting column stats from files (wherever feasible) to get better stats for joins
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14588
>                 URL: https://issues.apache.org/jira/browse/SPARK-14588
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: bulk-closed
>
> Broadcast join is determined by "spark.sql.autoBroadcastJoinThreshold". Stats for this is determined from the files and by determining the projected columns (internally it assumes 20 bytes for string columns). However, estimated stats could be invalid if the dataset contains greater than 20 bytes for string columns . In such instances, broadcast join would not be invoked. 
> File formats like ORC would be able to provide the raw data size for the projected columns. It might be good to consider those (whenever available) to determine the accurate stats for broadcast threshold.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org