Posted to issues@spark.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/10/08 12:39:00 UTC

[jira] [Commented] (SPARK-36529) Decouple CPU with IO work in vectorized Parquet reader

    [ https://issues.apache.org/jira/browse/SPARK-36529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426158#comment-17426158 ] 

Steve Loughran commented on SPARK-36529:
----------------------------------------

If you look at HADOOP-11867 / https://github.com/apache/hadoop/pull/3499 

We are adding a vectored read API to FSDataInputStream with (a sketch of caller-side usage follows this list):

* async fetch of different blocks
* order of return == "when the data comes back"
* read into bytebuffer
* caller provides their own bytebuffer factory
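
A rough sketch of what driving that API could look like from the caller's side; the names here (FileRange, readVectored, getData) follow the current PR and may still shift before it merges, and the offsets/lengths are purely illustrative:

{code:java}
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadSketch {

  // Ranges the caller wants, e.g. column chunk offsets pulled from a Parquet footer.
  public static void readRanges(FileSystem fs, Path file) throws Exception {
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(4L, 16_000),
        FileRange.createFileRange(1_048_576L, 8_000_000));

    try (FSDataInputStream in = fs.open(file)) {
      // Kick off the async fetch of all ranges; the caller supplies the buffer
      // factory -- plain heap buffers here, but a pooled/direct allocator plugs
      // in the same way.
      in.readVectored(ranges, ByteBuffer::allocate);

      // Each range completes whenever its data comes back; block only when a
      // particular range is actually needed.
      for (FileRange range : ranges) {
        ByteBuffer buffer = range.getData().join();
        process(range.getOffset(), buffer);
      }
    }
  }

  private static void process(long offset, ByteBuffer buffer) {
    // hand the buffer off to the decompress/decode stages
  }
}
{code}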

Will initially ship with:
* base implementation to reorder/coalesce reads
* local FS to use native IO byte buffer reads

For the s3a and abfs object stores, our plan is to coalesce nearby ranges into aggregate ones, then issue multiple ranged GET requests in parallel. If/when the stores support multiple ranges in a GET, we could be even more efficient.
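
To make the coalescing step concrete, a minimal sketch of the idea (illustrative only, not the actual Hadoop base implementation): sort the ranges by offset, merge neighbours whose gap is under some threshold, and then issue one ranged GET per merged range in parallel.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative range coalescing; not the actual Hadoop implementation. */
final class RangeCoalescer {

  /** A simple (offset, length) pair standing in for a file range. */
  static final class Range {
    final long offset;
    final long length;
    Range(long offset, long length) { this.offset = offset; this.length = length; }
    long end() { return offset + length; }
  }

  /**
   * Merge ranges whose gap is at most maxGap bytes, so that a single ranged GET
   * can cover several nearby column chunks at the cost of a few wasted bytes.
   */
  static List<Range> coalesce(List<Range> input, long maxGap) {
    List<Range> sorted = new ArrayList<>(input);
    sorted.sort(Comparator.comparingLong(r -> r.offset));

    List<Range> merged = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current == null) {
        current = r;
      } else if (r.offset - current.end() <= maxGap) {
        // Close enough: extend the current aggregate range to cover both.
        long end = Math.max(current.end(), r.end());
        current = new Range(current.offset, end - current.offset);
      } else {
        merged.add(current);
        current = r;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;   // each merged range becomes one GET, issued in parallel
  }
}
{code}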

Please have a look at the API and:
1. See whether it will work with your code. Owen clearly wrote it knowing how ORC would make use of it.
2. Try to make whatever you add now able to support the API once Spark is built against a version of Hadoop that has it.

> Decouple CPU with IO work in vectorized Parquet reader
> ------------------------------------------------------
>
>                 Key: SPARK-36529
>                 URL: https://issues.apache.org/jira/browse/SPARK-36529
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently it seems the vectorized Parquet reader does almost everything in a sequential manner:
> 1. read the row group using the file system API (perhaps from remote storage like S3)
> 2. allocate buffers and store those row group bytes into them
> 3. decompress the data pages
> 4. in Spark, decode all the read columns one by one
> 5. read the next row group and repeat from 1.
> A lot of improvements can be done to decouple the IO- and CPU-intensive work. In addition, we could parallelize the row group loading and column decoding, and utilize all the cores available for a Spark task.
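
To make that decoupling concrete: steps 1-2 above are IO-bound and steps 3-4 are CPU-bound, so one simple shape is to prefetch row group N+1 on an IO thread while the task's CPU decodes row group N. A minimal sketch, with hypothetical RowGroupIO/RowGroupDecoder interfaces standing in for the real reader internals:

{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Illustrative pipelining of row-group IO and decoding: while the CPU decodes
 * row group N, an IO thread is already fetching row group N + 1.
 */
final class PrefetchingReader {

  interface RowGroupIO {
    byte[] readRowGroup(int index) throws IOException;   // blocking IO (steps 1-2)
    int rowGroupCount();
  }

  interface RowGroupDecoder {
    void decompressAndDecode(byte[] rowGroupBytes);       // CPU work (steps 3-4)
  }

  static void readAll(RowGroupIO io, RowGroupDecoder decoder) throws Exception {
    ExecutorService ioPool = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<byte[]> next =
          CompletableFuture.supplyAsync(() -> readQuietly(io, 0), ioPool);

      for (int i = 0; i < io.rowGroupCount(); i++) {
        byte[] current = next.join();                     // wait for the fetched bytes
        if (i + 1 < io.rowGroupCount()) {
          final int nextIndex = i + 1;
          // Start fetching the next row group before burning CPU on this one.
          next = CompletableFuture.supplyAsync(() -> readQuietly(io, nextIndex), ioPool);
        }
        decoder.decompressAndDecode(current);             // overlaps with the fetch above
      }
    } finally {
      ioPool.shutdown();
    }
  }

  private static byte[] readQuietly(RowGroupIO io, int index) {
    try {
      return io.readRowGroup(index);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}
{code}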



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org