Posted to user@arrow.apache.org by Jack Chan <j4...@gmail.com> on 2021/02/12 22:36:33 UTC

[Rust] [DataFusion] Reading remote parquet files in S3?

Hi. I'm interested in reading parquet files stored in S3. I would like to
be able to do the following:
1. read a single S3 file;
2. read all files in an S3 directory; and
3. read files matching a pattern in an S3 directory (a listing sketch
follows below).

Currently, parquet.rs only supports local disk files. Potentially, this could
be done using the rusoto crate, which provides an S3 client. What would be a
good way to do this?
1. create a remote parquet reader (potentially duplicating lots of code)
2. create an interface that abstracts away reading from local/remote files
(I'm not sure about performance if the reader blocks on every operation)
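
A minimal sketch of the listing half of patterns 2 and 3, assuming the
rusoto_s3 and tokio crates and a made-up bucket/prefix (note that
ListObjectsV2 returns at most 1000 keys per call, so a real implementation
would have to paginate):

use rusoto_core::Region;
use rusoto_s3::{ListObjectsV2Request, S3, S3Client};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = S3Client::new(Region::UsEast1);

    // List everything under a "directory" (prefix) and keep the keys that
    // look like Parquet files. Bucket and prefix are hypothetical.
    let listing = client
        .list_objects_v2(ListObjectsV2Request {
            bucket: "my-bucket".to_string(),
            prefix: Some("warehouse/events/".to_string()),
            ..Default::default()
        })
        .await?;

    let parquet_keys: Vec<String> = listing
        .contents
        .unwrap_or_default()
        .into_iter()
        .filter_map(|obj| obj.key)
        .filter(|key| key.ends_with(".parquet"))
        .collect();

    println!("{} parquet files found", parquet_keys.len());
    Ok(())
}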

Jack

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

Posted by Andrew Lamb <al...@influxdata.com>.
I don't know of any examples in the DataFusion codebase that take a
ChunkReader directly.

The cloudfuse-io code implements the ChunkReader trait for its `CachedFile`
type here:
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/clients/cached_file.rs#L10
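
For reference, a minimal sketch of the same idea, not the cloudfuse-io code:
`S3ObjectBytes` is an invented type that buffers a whole S3 object in memory
(say, collected from a rusoto GetObject body) and serves byte ranges back to
the parquet crate. The trait signatures follow the parquet 3.0.0 docs linked
elsewhere in this thread; check the docs for the version you actually use.

use std::io::Cursor;
use std::sync::Arc;

use parquet::errors::Result;
use parquet::file::reader::{ChunkReader, Length};

struct S3ObjectBytes {
    // Full contents of the object, downloaded up front.
    data: Arc<Vec<u8>>,
}

impl Length for S3ObjectBytes {
    fn len(&self) -> u64 {
        self.data.len() as u64
    }
}

impl ChunkReader for S3ObjectBytes {
    type T = Cursor<Vec<u8>>;

    // The reader asks for byte ranges (footer first, then column chunks);
    // here they are simply copied out of the in-memory buffer. A real
    // implementation should bounds-check start/length instead of panicking.
    fn get_read(&self, start: u64, length: usize) -> Result<Self::T> {
        let start = start as usize;
        Ok(Cursor::new(self.data[start..start + length].to_vec()))
    }
}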

On Mon, Feb 15, 2021 at 4:19 AM Jack Chan <j4...@gmail.com> wrote:

> Thanks Andrew.
>
> As you mentioned, the ChunkReader trait is flexible enough. So, what is
> missing is a way to provide a parquet reader implementation backed by a
> custom ChunkReader. Are there any examples within DataFusion where people
> can change the execution plan like this?
>
> If I understand correctly, the steps cloudfuse-io took were: 1. define an
> S3 parquet table provider [1], and 2. define an S3 parquet reader [2]. This
> does confirm my understanding that creating your own remote parquet reader
> requires lots of duplication.
>
> [1]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
> [2]
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> Jack
>
> Andrew Lamb <al...@influxdata.com> wrote on Sunday, February 14, 2021 at 2:14 AM:
>
>> The Buzz project is one example I know of that reads parquet files from
>> S3 using the Rust implementation:
>>
>>
>> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>>
>> The SerializedFileReader [1] from the Rust parquet crate, despite its
>> somewhat misleading name, doesn't have to read from files; instead, it reads
>> from something that implements the ChunkReader [2] trait. I am not sure how
>> well this matches what you are looking for.
>>
>> Hope that helps,
>> Andrew
>>
>> [1]
>> https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
>> [2]
>> https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>>
>>
>>
>> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <ch...@gmail.com> wrote:
>>
>>> > Currently, parquet.rs only supports local disk files. Potentially,
>>> this could be done using the rusoto crate, which provides an S3 client.
>>> What would be a good way to do this?
>>> > 1. create a remote parquet reader (potentially duplicating lots of code)
>>> > 2. create an interface that abstracts away reading from local/remote
>>> files (I'm not sure about performance if the reader blocks on every operation)
>>>
>>> This is a great question.
>>>
>>> I think that approach (2) is superior, although it requires more work
>>> than approach (1) to design an interface that works well across
>>> multiple file stores that have different performance characteristics.
>>> To accommodate storage-specific performance optimizations, I expect
>>> that the common interface will have to be more elaborate than the
>>> current reader API.
>>>
>>> Is it possible for the Rust reader to use the C++ implementation
>>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>>> If this reuse of implementation is feasible, then we could focus
>>> efforts on improving the C++ implementation and get the benefits in
>>> Python, Rust, etc.
>>>
>>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>>> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
>>> and not well specialized for read patterns that are typical for
>>> Parquet files. We can learn from these mistakes to create a superior
>>> reader interface in the Arrow/Parquet project.
>>>
>>> Steve
>>>
>>

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

Posted by Jack Chan <j4...@gmail.com>.
Thanks Andrew.

As you mentioned, the ChunkReader trait is flexible enough. So, what is
missing is a way to provide a parquet reader implementation backed by a
custom ChunkReader. Are there any examples within DataFusion where people
can change the execution plan like this?

If I understand correctly, the steps cloudfuse-io took were: 1. define an S3
parquet table provider [1], and 2. define an S3 parquet reader [2]. This does
confirm my understanding that creating your own remote parquet reader
requires lots of duplication.

[1]
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/datasource/hbee/s3_parquet.rs
[2]
https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

Jack

Andrew Lamb <al...@influxdata.com> wrote on Sunday, February 14, 2021 at 2:14 AM:

> The Buzz project is one example I know of that reads parquet files from S3
> using the Rust implementation:
>
>
> https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs
>
> The SerializedFileReader [1] from the Rust parquet crate, despite its
> somewhat misleading name, doesn't have to read from files; instead, it reads
> from something that implements the ChunkReader [2] trait. I am not sure how
> well this matches what you are looking for.
>
> Hope that helps,
> Andrew
>
> [1]
> https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
> [2]
> https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
>
>
>
> On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <ch...@gmail.com> wrote:
>
>> > Currently, parquet.rs only supports local disk files. Potentially,
>> this could be done using the rusoto crate, which provides an S3 client.
>> What would be a good way to do this?
>> > 1. create a remote parquet reader (potentially duplicating lots of code)
>> > 2. create an interface that abstracts away reading from local/remote files
>> (I'm not sure about performance if the reader blocks on every operation)
>>
>> This is a great question.
>>
>> I think that approach (2) is superior, although it requires more work
>> than approach (1) to design an interface that works well across
>> multiple file stores that have different performance characteristics.
>> To accommodate storage-specific performance optimizations, I expect
>> that the common interface will have to be more elaborate than the
>> current reader API.
>>
>> Is it possible for the Rust reader to use the C++ implementation
>> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
>> If this reuse of implementation is feasible, then we could focus
>> efforts on improving the C++ implementation and get the benefits in
>> Python, Rust, etc.
>>
>> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
>> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
>> and not well specialized for read patterns that are typical for
>> Parquet files. We can learn from these mistakes to create a superior
>> reader interface in the Arrow/Parquet project.
>>
>> Steve
>>
>

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

Posted by Andrew Lamb <al...@influxdata.com>.
The Buzz project is one example I know of that reads parquet files from S3
using the Rust implementation:

https://github.com/cloudfuse-io/buzz-rust/blob/13175a7c5cdd298415889da710a254a218be0a01/code/src/execution_plan/parquet.rs

The SerializedFileReader [1] from the Rust parquet crate, despite its
somewhat misleading name, doesn't have to read from files; instead, it reads
from something that implements the ChunkReader [2] trait. I am not sure how
well this matches what you are looking for.

Hope that helps,
Andrew

[1]
https://docs.rs/parquet/3.0.0/parquet/file/serialized_reader/struct.SerializedFileReader.html
[2] https://docs.rs/parquet/3.0.0/parquet/file/reader/trait.ChunkReader.html
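
To make that concrete, here is an illustrative usage sketch based on the
parquet 3.0.0 API linked above; the function name and its generic parameter
are made up for this example:

use parquet::errors::Result;
use parquet::file::reader::{ChunkReader, FileReader};
use parquet::file::serialized_reader::SerializedFileReader;

// Any ChunkReader works here: a std::fs::File, or a custom S3-backed type.
fn row_count<R: ChunkReader + 'static>(source: R) -> Result<i64> {
    let reader = SerializedFileReader::new(source)?;
    Ok(reader.metadata().file_metadata().num_rows())
}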



On Sat, Feb 13, 2021 at 10:17 AM Steve Kim <ch...@gmail.com> wrote:

> > Currently, parquet.rs only supports local disk files. Potentially, this
> could be done using the rusoto crate, which provides an S3 client. What
> would be a good way to do this?
> > 1. create a remote parquet reader (potentially duplicating lots of code)
> > 2. create an interface that abstracts away reading from local/remote files
> (I'm not sure about performance if the reader blocks on every operation)
>
> This is a great question.
>
> I think that approach (2) is superior, although it requires more work
> than approach (1) to design an interface that works well across
> multiple file stores that have different performance characteristics.
> To accommodate storage-specific performance optimizations, I expect
> that the common interface will have to be more elaborate than the
> current reader API.
>
> Is it possible for the Rust reader to use the C++ implementation
> (https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
> If this reuse of implementation is feasible, then we could focus
> efforts on improving the C++ implementation and get the benefits in
> Python, Rust, etc.
>
> In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
> the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
> and not well specialized for read patterns that are typical for
> Parquet files. We can learn from these mistakes to create a superior
> reader interface in the Arrow/Parquet project.
>
> Steve
>

Re: [Rust] [DataFusion] Reading remote parquet files in S3?

Posted by Steve Kim <ch...@gmail.com>.
> Currently, parquet.rs only supports local disk files. Potentially, this could be done using the rusoto crate, which provides an S3 client. What would be a good way to do this?
> 1. create a remote parquet reader (potentially duplicating lots of code)
> 2. create an interface that abstracts away reading from local/remote files (I'm not sure about performance if the reader blocks on every operation)

This is a great question.

I think that approach (2) is superior, although it requires more work
than approach (1) to design an interface that works well across
multiple file stores that have different performance characteristics.
To accommodate storage-specific performance optimizations, I expect
that the common interface will have to be more elaborate than the
current reader API.
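
Purely as an illustration of what such an interface could look like (nothing
like this exists in the crates discussed here; every name below is invented),
the minimum a Parquet reader needs from any store is the object length plus
ranged reads:

use std::io;

trait ObjectReader: Send + Sync {
    /// Total size of the object in bytes (needed to locate the footer).
    fn length(&self) -> io::Result<u64>;

    /// Read `len` bytes starting at `offset`, ideally issued as a single
    /// request (a local pread, or an HTTP Range GET against S3).
    fn read_range(&self, offset: u64, len: usize) -> io::Result<Vec<u8>>;
}

Storage-specific optimizations (request coalescing, prefetching, caching)
could then live behind such a trait rather than inside the reader itself.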

Is it possible for the Rust reader to use the C++ implementation
(https://github.com/apache/arrow/tree/master/cpp/src/arrow/filesystem)?
If this reuse of implementation is feasible, then we could focus
efforts on improving the C++ implementation and get the benefits in
Python, Rust, etc.

In the Java ecosystem, the (non-Arrow, row-wise) Parquet reader uses
the Hadoop FileSystem abstraction. This abstraction is complex, leaky,
and not well specialized for read patterns that are typical for
Parquet files. We can learn from these mistakes to create a superior
reader interface in the Arrow/Parquet project.

Steve