Posted to issues@spark.apache.org by "Jason Guo (JIRA)" <ji...@apache.org> on 2018/07/24 12:27:00 UTC

[jira] [Updated] (SPARK-24906) Enlarge split size for columnar file to ensure the task read enough data

     [ https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Guo updated SPARK-24906:
------------------------------
    Attachment: image-2018-07-24-20-26-32-441.png

> Enlarge split size for columnar file to ensure the task read enough data
> ------------------------------------------------------------------------
>
>                 Key: SPARK-24906
>                 URL: https://issues.apache.org/jira/browse/SPARK-24906
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Jason Guo
>            Priority: Critical
>         Attachments: image-2018-07-24-20-26-32-441.png
>
>
> For columnar file formats such as Parquet, when Spark SQL reads a table, each split is 128 MB by default because spark.sql.files.maxPartitionBytes defaults to 128 MB. Even when the user sets it to a larger value, such as 512 MB, a task may end up reading only a few MB or even a few hundred KB of data, because the table may consist of dozens of columns while the query needs only a few of them.
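> 
> A minimal sketch of the current behavior (the table path and column names below are hypothetical):
> {code:scala}
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder().appName("wide-parquet-scan").getOrCreate()
> 
> // Raise the split size to 512 MB instead of the 128 MB default.
> // Even so, a task that scans only 2 of the table's 40 columns reads
> // far less than 512 MB of actual column data per split.
> spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)
> 
> val df = spark.read.parquet("/path/to/wide_table").select("int_col", "long_col")
> {code}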
>  
> In this case, Spark's DataSourceScanExec could enlarge maxPartitionBytes adaptively.
> For example, suppose a table has 40 columns: 20 of type integer and 20 of type long. When a query reads one integer column and one long column, maxPartitionBytes should be enlarged by a factor of 20, since (20*4 + 20*8) / (4 + 8) = 20. A rough sketch of this calculation follows.
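> 
> The snippet below is an illustration of how such a scale factor could be derived from the table schema and the columns required by the query; it is not the actual DataSourceScanExec change, and it uses each type's default in-memory size as a rough per-column weight:
> {code:scala}
> import org.apache.spark.sql.types._
> 
> // Ratio of the full row width to the width of the columns the query reads.
> def scaleFactor(tableSchema: StructType, requiredSchema: StructType): Double = {
>   val fullRowSize  = tableSchema.map(_.dataType.defaultSize.toLong).sum
>   val requiredSize = math.max(requiredSchema.map(_.dataType.defaultSize.toLong).sum, 1L)
>   fullRowSize.toDouble / requiredSize
> }
> 
> // 40 columns: 20 IntegerType (4 bytes) + 20 LongType (8 bytes) => 240 bytes per row.
> // A query touching one int column and one long column needs 12 bytes per row,
> // so the effective split size could be scaled by 240 / 12 = 20.
> {code}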
>  
> With this optimization, the number of tasks will be smaller and the job will run faster. More importantly, for a very large cluster (more than 10 thousand nodes), it will relieve the resource manager's scheduling pressure.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org