Posted to issues@spark.apache.org by "Chao Sun (Jira)" <ji...@apache.org> on 2023/01/06 22:28:00 UTC

[jira] [Updated] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

     [ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chao Sun updated SPARK-36529:
-----------------------------
        Parent:     (was: SPARK-35743)
    Issue Type: Bug  (was: Sub-task)

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
>                 Key: SPARK-36529
>                 URL: https://issues.apache.org/jira/browse/SPARK-36529
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently the vectorized Parquet reader appears to do almost everything sequentially:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvement is possible by decoupling the IO-intensive and CPU-intensive work. In addition, we could parallelize row-group loading and column decoding, utilizing all the cores available to a Spark task.
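
The decoupling described above could be sketched as a simple prefetch pipeline: while the current row group is being decoded on the task's thread, the IO for the next row group proceeds on a background thread. This is a hypothetical illustration, not Spark's actual reader code; `readRowGroup` and `decode` are stand-ins for the real file-system read and decompress/decode steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: overlap IO and CPU work by prefetching the next
// row group on a background thread while the current one is decoded.
public class PrefetchingReader {
    private final ExecutorService io = Executors.newSingleThreadExecutor();

    // Simulated IO step: in a real reader this would fetch the row-group
    // bytes via the file system API (possibly from remote storage like S3).
    static byte[] readRowGroup(int index) {
        byte[] bytes = new byte[4];
        Arrays.fill(bytes, (byte) index);
        return bytes;
    }

    // Simulated CPU step: stands in for decompressing the data pages and
    // decoding the columns into an in-memory batch.
    static int decode(byte[] rowGroup) {
        int sum = 0;
        for (byte b : rowGroup) sum += b;
        return sum;
    }

    public List<Integer> readAll(int numRowGroups) throws Exception {
        List<Integer> decoded = new ArrayList<>();
        // Kick off IO for the first row group before the decode loop starts.
        Future<byte[]> pending = io.submit(() -> readRowGroup(0));
        for (int i = 0; i < numRowGroups; i++) {
            byte[] current = pending.get();       // wait for IO of group i
            if (i + 1 < numRowGroups) {
                final int next = i + 1;           // start IO for group i+1 ...
                pending = io.submit(() -> readRowGroup(next));
            }
            decoded.add(decode(current));         // ... while decoding group i
        }
        io.shutdown();
        return decoded;
    }
}
```

With a thread pool instead of a single IO thread, the same shape extends to loading several row groups (or decoding several columns) in parallel, which is the second half of the proposal.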



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
