Posted to user@arrow.apache.org by annsshadow <cr...@163.com> on 2019/10/31 02:27:40 UTC

[C++] How can I read streaming parquet file in v0.15.0


Hi,
I have a question about reading a Parquet file.
The official example reads the whole file from local disk.
I can't hold the whole Parquet file in memory; I can only fetch it slice by slice from the network. How can I use Arrow to read the Parquet file in that case?
Thank you.

Re: Re: [C++] How can I read streaming parquet file in v0.15.0

Posted by Micah Kornfield <em...@gmail.com>.
I'm not sure what is meant by "streaming" in this context. My
understanding is that Parquet file reading needs random access. In that
case, if you are trying to fetch from S3, you could open a RandomAccessFile
object using the S3FileSystem
https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.h#L110
and then create a Parquet file reader with that object. I'm not sure if this
code path has been well tested.
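
To illustrate why random access matters here: per the Parquet format, a file ends with the serialized file metadata (which carries the schema and the row group offsets), followed by a 4-byte little-endian metadata length and the magic bytes "PAR1". A reader therefore has to look at the end of the file first. The sketch below is not Arrow code; RangeFetcher, FooterLocation and LocateParquetFooter are hypothetical names made up for this example, and it assumes the total file size is already known (for example from the object store) and that the footer is not encrypted. It only locates the metadata; decoding it is left to a real Parquet reader.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical callback (not an Arrow API): returns `length` bytes starting at
// `offset`, e.g. backed by a ranged GET against an object store.
using RangeFetcher =
    std::function<std::vector<uint8_t>(uint64_t offset, uint64_t length)>;

// Hypothetical result type: where the serialized file metadata lives.
struct FooterLocation {
  uint64_t offset;  // byte offset at which the metadata begins
  uint32_t length;  // size of the metadata in bytes
};

// Fetches only the last 8 bytes of the file and derives the metadata range.
inline FooterLocation LocateParquetFooter(const RangeFetcher& fetch,
                                          uint64_t file_size) {
  assert(fetch != nullptr && "fetch callback must be set");
  constexpr uint64_t kMagicSize = 4;    // "PAR1" at the start of the file
  constexpr uint64_t kTrailerSize = 8;  // 4-byte length + "PAR1" at the end
  if (file_size < kMagicSize + kTrailerSize) {
    throw std::runtime_error("file too small to be Parquet");
  }
  const std::vector<uint8_t> tail =
      fetch(file_size - kTrailerSize, kTrailerSize);
  if (tail.size() != kTrailerSize ||
      std::memcmp(tail.data() + 4, "PAR1", 4) != 0) {
    throw std::runtime_error("missing PAR1 magic at end of file");
  }
  // The metadata length is stored little-endian in the first 4 trailer bytes.
  const uint32_t length = static_cast<uint32_t>(tail[0]) |
                          (static_cast<uint32_t>(tail[1]) << 8) |
                          (static_cast<uint32_t>(tail[2]) << 16) |
                          (static_cast<uint32_t>(tail[3]) << 24);
  if (length > file_size - kTrailerSize - kMagicSize) {
    throw std::runtime_error("metadata length exceeds file size");
  }
  return FooterLocation{file_size - kTrailerSize - length, length};
}
```

With the metadata range known, one more ranged fetch retrieves the metadata itself, and the column chunk offsets recorded there allow the data to be fetched range by range rather than as one whole file. This is the access pattern a RandomAccessFile implementation would serve for the Parquet reader.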

On Fri, Nov 1, 2019 at 12:56 AM annsshadow <cr...@163.com> wrote:

> The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector,
> which needs the Schema. It seems that I can't get the schema first and then
> read the streaming Parquet file with Arrow.
>
> In my situation, the Parquet file is in an object store like S3. I can fetch
> it from the network slice by slice, regardless of file size, but I can't
> hold the whole file in memory or on disk.
>
> Your reply indicates that the C++ library can't read streaming Parquet yet,
> so what should I try next, with Arrow or anything else?
>
> Thank you for your work.
>
> At 2019-11-01 01:46:32, "Wes McKinney" <we...@gmail.com> wrote:
> >You will want to use the GetRecordBatchReader C++ API here
> >
> >https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152
> >
> >It may not be optimal for your use case. Support for streaming reads
> >is not yet exposed in Python or other bindings as far as I know.
> >
> >There is work happening in the C++ Datasets project to better support
> >this use case.
> >
> >On Wed, Oct 30, 2019 at 9:28 PM annsshadow <cr...@163.com> wrote:
> >>
> >>
> >> Hi,
> >> I have a question about reading a Parquet file.
> >> The official example reads the whole file from local disk.
> >> I can't hold the whole Parquet file in memory; I can only fetch it slice by slice from the network. How can I use Arrow to read the Parquet file in that case?
> >> Thank you.
>

Re: Re: [C++] How can I read streaming parquet file in v0.15.0

Posted by annsshadow <cr...@163.com>.
The arrow::RecordBatchReader needs an arrow::dataset::RecordBatchProjector, which needs the Schema. It seems that I can't get the schema first and then read the streaming Parquet file with Arrow.

In my situation, the Parquet file is in an object store like S3. I can fetch it from the network slice by slice, regardless of file size, but I can't hold the whole file in memory or on disk.

Your reply indicates that the C++ library can't read streaming Parquet yet, so what should I try next, with Arrow or anything else?

Thank you for your work.

At 2019-11-01 01:46:32, "Wes McKinney" <we...@gmail.com> wrote:
>You will want to use the GetRecordBatchReader C++ API here
>
>https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152
>
>It may not be optimal for your use case. Support for streaming reads
>is not yet exposed in Python or other bindings as far as I know.
>
>There is work happening in the C++ Datasets project to better support
>this use case.
>
>On Wed, Oct 30, 2019 at 9:28 PM annsshadow <cr...@163.com> wrote:
>>
>>
>> Hi,
>> I have a question about reading a Parquet file.
>> The official example reads the whole file from local disk.
>> I can't hold the whole Parquet file in memory; I can only fetch it slice by slice from the network. How can I use Arrow to read the Parquet file in that case?
>> Thank you.

Re: [C++] How can I read streaming parquet file in v0.15.0

Posted by Wes McKinney <we...@gmail.com>.
You will want to use the GetRecordBatchReader C++ API here

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.h#L152

It may not be optimal for your use case. Support for streaming reads
is not yet exposed in Python or other bindings as far as I know.

There is work happening in the C++ Datasets project to better support
this use case.

On Wed, Oct 30, 2019 at 9:28 PM annsshadow <cr...@163.com> wrote:
>
>
> Hi,
> I have a question about reading a Parquet file.
> The official example reads the whole file from local disk.
> I can't hold the whole Parquet file in memory; I can only fetch it slice by slice from the network. How can I use Arrow to read the Parquet file in that case?
> Thank you.