Posted to dev@parquet.apache.org by Vaishal Shah <va...@gmail.com> on 2017/06/07 07:09:07 UTC

Writing numpy arrays on disk using pyarrow-parquet

This is Vaishal from D. E. Shaw and Co.



We are interested in using pyarrow/parquet for one of our projects, which
deals with numpy arrays.

Pyarrow provides an API to store pandas DataFrames on disk in Parquet
format, but I could not find any support for storing numpy arrays directly.

Since numpy arrays are such a basic way to store data, I was surprised to
find no function to write them in Parquet format. Is there a way to store a
numpy array in Parquet that I may have missed?

Or can we expect this support in a newer version of parquet?


Pyarrow does provide one route using Tensors, but read_tensor requires the
file to be opened in writeable mode, which forces the use of memory-mapped
files. Needing a file to be writeable just in order to read it seems like a
bug! Can you please look into this?



-- 
*Regards*

*Vaishal Shah,*
*Third year Undergraduate student,*
*Department of Computer Science and Engineering,*
*IIT Kharagpur*

Re: Writing numpy arrays on disk using pyarrow-parquet

Posted by Wes McKinney <we...@gmail.com>.
hi Vaishal,

I already replied to you about this on the mailing list on June 1; could
you reply to that thread?

I see that you opened ARROW-1097 about the tensor issue. If you could
add a standalone reproduction of the problem, that would help us debug
and fix it faster.

Thanks
Wes

On Wed, Jun 7, 2017 at 3:09 AM, Vaishal Shah <va...@gmail.com> wrote:
> This is Vaishal from D. E. Shaw and Co.
>
>
>
> We are interested to use py-arrow/parquet for one of our projects, that
> deals with numpy arrays.
>
> Parquet provides API to store pandas dataframes on disk, but I could not
> find any support for storing numpy arrays.
>
>
> Since numpy is a trivial form to store data, I was surprised to find no
> function to store them in parquet format. Is there any way to store numpy
> array in parquet format, that I probably missed?
>
> Or can we expect this support in newer version of parquet?
>
>
> Pyarrow provides one using Tensors(but read_tensor requires file to be
> opened in writeable mode, so that compels to use mem_mapped files) and in
> order to read a file, it needs to be in writeable mode, that is kind of a
> bug! Can you please look into this?
>
>
>
> --
> *Regards*
>
> *Vaishal Shah,*
> *Third year Undergraduate student,*
> *Department of Computer Science and Engineering,*
> *IIT Kharagpur*
