Posted to dev@parquet.apache.org by Mukund Madhav Thakur <mt...@cloudera.com.INVALID> on 2022/09/27 16:29:13 UTC

Vectored IO in Parquet ( https://issues.apache.org/jira/browse/PARQUET-2171)

Hi Team,
We in the Hadoop project recently added a new feature, Vectored IO, which
will be released in the upcoming Hadoop 3.3.5 release.
It is a high-performance scatter/gather extension of the PositionedReadable
API, optimized for reading columnar data in cloud storage.
https://issues.apache.org/jira/browse/HADOOP-18103
We observed substantial performance improvements in the Hive TPC-H and
TPC-DS benchmarks for ORC data stored in S3.
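
For readers new to the idea, here is a minimal sketch of the scatter/gather
read pattern: the caller hands over a list of (offset, length) ranges, ranges
that sit close together are merged into one positioned read, and each caller
gets back a slice of the merged buffer. This is not the Hadoop API. It uses
only java.nio, it is synchronous, and the names VectoredReadSketch, Range,
readRanges and maxGap are hypothetical, chosen for illustration. See
HADOOP-18103 for the actual interface.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class VectoredReadSketch {

    /** A requested byte range (hypothetical type for this sketch). */
    public static final class Range {
        final long offset;
        final int length;
        public Range(long offset, int length) { this.offset = offset; this.length = length; }
        long end() { return offset + length; }
    }

    /**
     * Reads every range, merging ranges whose gap is at most maxGap bytes
     * into a single positioned read. Buffers come back in request order.
     */
    public static List<ByteBuffer> readRanges(FileChannel ch, List<Range> ranges, int maxGap)
            throws IOException {
        // Visit the ranges in offset order, remembering each original index.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) order.add(i);
        order.sort(Comparator.comparingLong(idx -> ranges.get(idx).offset));

        ByteBuffer[] out = new ByteBuffer[ranges.size()];
        int i = 0;
        while (i < order.size()) {
            long start = ranges.get(order.get(i)).offset;
            long end = ranges.get(order.get(i)).end();
            int j = i + 1;
            // Grow the merged range while the next range starts within maxGap of it.
            while (j < order.size() && ranges.get(order.get(j)).offset - end <= maxGap) {
                end = Math.max(end, ranges.get(order.get(j)).end());
                j++;
            }
            // One positioned read covers the whole merged range.
            ByteBuffer merged = ByteBuffer.allocate((int) (end - start));
            long pos = start;
            while (merged.hasRemaining()) {
                int n = ch.read(merged, pos);
                if (n < 0) throw new IOException("range extends past end of file");
                pos += n;
            }
            // Hand each original request a view over its part of the merged buffer.
            for (int k = i; k < j; k++) {
                Range r = ranges.get(order.get(k));
                int rel = (int) (r.offset - start);
                ByteBuffer view = merged.duplicate();
                view.position(rel);
                view.limit(rel + r.length);
                out[order.get(k)] = view.slice();
            }
            i = j;
        }
        return Arrays.asList(out);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("vectored", ".bin");
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        Files.write(p, data);
        try (FileChannel ch = FileChannel.open(p, StandardOpenOption.READ)) {
            List<ByteBuffer> out = readRanges(ch,
                Arrays.asList(new Range(10, 5), new Range(50, 3), new Range(17, 2)), 4);
            for (ByteBuffer b : out) {
                System.out.println(b.remaining() + " bytes, first byte " + b.get(0));
            }
        } finally {
            Files.deleteIfExists(p);
        }
    }
}
```

In the demo, the ranges at offsets 10 and 17 are two bytes apart, so with
maxGap set to 4 they are served by one read, while the range at offset 50
gets its own. On an object store, where each request carries high fixed
latency, fewer and larger reads are the point of the technique.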

We are now looking at Parquet integration as well.
https://issues.apache.org/jira/browse/PARQUET-2171
I have a draft patch which works locally through Spark's file reader.
https://github.com/apache/parquet-mr/pull/999

We know Parquet likes to support builds against older versions of
Hadoop, so we are working on a solution to offer the API through a
shim library.
As I have never contributed to the Parquet codebase and it is totally new
to me, I would really appreciate some help in implementing, testing and
releasing this feature in the best possible way.
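
To make the shim idea concrete: one common way to support both old and new
library versions is to probe for the new method at runtime and fall back
when it is missing. The sketch below assumes a reflection-based probe. That
is an assumption about how such a shim could work, not a description of the
planned library, and OldStream, NewStream and chooseReadPath are hypothetical
stand-ins rather than Hadoop or Parquet classes.

```java
import java.lang.reflect.Method;
import java.util.List;

public class VectoredShimSketch {

    /** Stand-in for a stream from an older release: positioned reads only. */
    public static class OldStream {
        public int read(long position, byte[] buffer) { return 0; }
    }

    /** Stand-in for a stream from a newer release that adds a vectored read. */
    public static class NewStream extends OldStream {
        public void readVectored(List<long[]> ranges) { }
    }

    /** Returns true if the class exposes a public method with the given name. */
    public static boolean hasMethod(Class<?> cls, String name) {
        for (Method m : cls.getMethods()) {
            if (m.getName().equals(name)) return true;
        }
        return false;
    }

    /** Picks the read path from what the runtime stream class offers. */
    public static String chooseReadPath(Object stream) {
        return hasMethod(stream.getClass(), "readVectored") ? "vectored" : "sequential";
    }

    public static void main(String[] args) {
        System.out.println(chooseReadPath(new OldStream()));
        System.out.println(chooseReadPath(new NewStream()));
    }
}
```

Because the probe happens at runtime, code written this way compiles against
the older dependency and still takes the faster path when a newer one is on
the classpath.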

I will be talking about all of this at the upcoming ApacheCon NA next
week, on Tuesday, October 04, at 4:10 PM CDT. It would be really great to
meet anyone who is interested in getting involved in this.



Thanks,
Mukund

Re: Vectored IO in Parquet ( https://issues.apache.org/jira/browse/PARQUET-2171)

Posted by Mukund Madhav Thakur <mt...@cloudera.com.INVALID>.

++ Adding Xinli and Gidon. We discussed this during ApacheCon NA.
I will be putting the slide deck on SlideShare soon, after making some changes.



Re: Vectored IO in Parquet ( https://issues.apache.org/jira/browse/PARQUET-2171)

Posted by Xinli shang <sh...@uber.com.INVALID>.

Thanks, Mukund! As discussed at the conference, this is a great feature! I
look forward to reviewing the changes!



-- 
Xinli Shang