Posted to user@arrow.apache.org by Albert Nadal <al...@nuclia.com> on 2022/11/02 13:11:55 UTC

Stream Record Batches from an Arrow file stored in GCS storage

Hi team. I recently started playing with the Python port of Apache Arrow,
first to learn how it works and then to use it in our ML platform.
Currently we need to give our users a way to upload their datasets to our
storage platform (mainly GCS and S3). Once a user has uploaded a dataset,
we need to download it in order to convert each of its records (rows) to a
specific format we use internally in our platform (protobuf models).

Our main concern is doing this in a performant way, without downloading
the entire dataset. We are really interested to know whether it is possible
to fetch each RecordBatch of a dataset (Arrow file) stored in a GCS bucket
via streaming, using, for instance, RecordBatchStreamReader. I'm not sure
this is possible without downloading the entire dataset first.

I ran some small tests with GcsFileSystem, open_input_stream and
ipc.open_stream:



import pyarrow as pa
from pyarrow import fs

gcs = fs.GcsFileSystem(anonymous=True)
with gcs.open_input_stream("bucket/bigfile.arrow") as source:
    reader: pa.ipc.RecordBatchStreamReader = pa.ipc.open_stream(source)

I'm not sure if I'm missing some important detail here, but I always get
the same error:

pyarrow.lib.ArrowInvalid: Expected to read 1330795073 metadata bytes, but
only read 40168302

I hope you can give me some pointers on how to handle streaming Record
Batches from a dataset stored in an external storage filesystem.

Thank you in advance!

Albert,

Re: Stream Record Batches from an Arrow file stored in GCS storage

Posted by Lubomir Slivka <lu...@gooddata.com>.
Hi Albert,

I think you are running into this error due to a mismatch of IPC formats.
I'm not 100% sure, but I tried locally and I get a very similar error when
I intentionally mismatch them.

It seems the file on GCS is in the IPC File format, and you are trying to
read it with a reader designed for the IPC Stream format (open_stream,
RecordBatchStreamReader).
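
For reference, the two formats come from different writer APIs, so the
mismatch is easy to reproduce locally. A minimal sketch (file names and
data are made up for illustration):

import pyarrow as pa

batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])

# IPC File format (random-access, has a footer); read back with pa.ipc.open_file()
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, batch.schema) as writer:
        writer.write_batch(batch)

# IPC Stream format (sequential); read back with pa.ipc.open_stream()
with pa.OSFile("/tmp/example.arrows", "wb") as sink:
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)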

Check out the pa.ipc.open_file() function. It returns a
RecordBatchFileReader, and you should be able to iterate over / read the
file batch by batch.
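
Something along these lines should work (a rough, untested sketch using the
path from your example; note that the IPC File format needs a random-access
file, so open_input_file() rather than open_input_stream()):

import pyarrow as pa
from pyarrow import fs

gcs = fs.GcsFileSystem(anonymous=True)

# open_input_file() returns a random-access file, which the File-format
# reader uses to seek to the footer and to individual record batches
with gcs.open_input_file("bucket/bigfile.arrow") as source:
    reader = pa.ipc.open_file(source)
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        # ... convert the rows of this batch to your protobuf models ...

As far as I know, the reader only fetches the footer and then each batch's
bytes on demand, so you should not have to download the whole file up
front.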

Hope this helps,
Lubo
