Posted to user@arrow.apache.org by Amine Boubezari <bo...@gmail.com> on 2020/11/25 03:41:05 UTC

How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

Hello, I have a question regarding best practices with Apache Arrow. I have a very large dataset (tens of millions of rows) stored in a partitioned Parquet dataset on disk. I load this dataset into memory as a pyarrow.Table and drop all columns except one, which is of type MapType mapping integers to floats. This column represents sparse feature-vector data to be used in an ML context. Call the number of rows "num_rows". My job is to transform this column into a 2D numpy array of shape ("num_rows" x "num_cols"), where both dimensions are known beforehand. If one of my pyarrow.Table rows looks like [(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)] and "num_cols" = 10, then that row in the numpy array would look like [0, 3.4, 4.4, 0, 5.4, 0, 6.4, 0, 0, 0], where unmapped positions are just 0. My 2D numpy array would be the collection of rows from the pyarrow.Table transformed in this way. What is the best, most efficient way to accomplish this, considering I have tens of millions of rows? Assume I have enough memory to fit the entire dataset.

Note that I can use table.to_pandas() to get a pandas DataFrame and then map functions over the resulting Series, if that would help in a solution. So far I have been stumped, however; df.to_numpy() has not been helpful here.
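
For concreteness, here is a minimal construction of the kind of column I mean (the column name and sizes are illustrative):

import numpy as np
import pyarrow as pa

# One map row of (key, value) pairs; keys are column indices, values are floats.
features = pa.array(
    [[(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)]],
    type=pa.map_(pa.int32(), pa.float64()),
)
table = pa.table({"features": features})

# With num_cols = 10, the dense counterpart of that row would be:
expected = np.array([0, 3.4, 4.4, 0, 5.4, 0, 6.4, 0, 0, 0])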

Re: How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

Posted by Ishan Anand <an...@outlook.com>.
Hi Amine,

I haven't worked with the map type directly, but the underlying storage is probably a set of byte buffers representing offsets and data.
You could read those buffers as numpy arrays and use Numba to build the 2D numpy array.

There is a helpful tutorial here: https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html
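
For what it's worth, here is a rough, untested sketch of that idea (assuming the column is named "features", the map keys are valid column indices, and there are no nulls). It scatters the flattened keys/items children of each MapArray chunk into a preallocated dense array; the offset values index directly into those children, so it should also hold for sliced chunks.

import numpy as np
from numba import njit

@njit
def fill_rows(out, offsets, keys, values):
    # offsets has one more entry than out has rows; offsets[i]:offsets[i+1]
    # delimits row i's map entries within the flattened keys/values arrays.
    for i in range(len(offsets) - 1):
        for j in range(offsets[i], offsets[i + 1]):
            out[i, keys[j]] = values[j]

def map_column_to_dense(table, num_cols, column="features"):
    dense = np.zeros((table.num_rows, num_cols), dtype=np.float64)
    row = 0
    for chunk in table.column(column).chunks:  # each chunk is a pyarrow.MapArray
        fill_rows(
            dense[row:row + len(chunk)],
            chunk.offsets.to_numpy(),
            chunk.keys.to_numpy(),
            chunk.items.to_numpy(),
        )
        row += len(chunk)
    return dense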

Best,
Ishan

Re: How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

Posted by Micah Kornfield <em...@gmail.com>.
Hi Amine,
I don't think there is anything in the core Arrow library that helps with this at the moment. The most efficient way to do something like this would probably be custom C/C++ code for the conversion, but I'm not an expert in numpy.

-Micah
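
That said, if you want to stay in Python, a vectorized numpy scatter over the flattened child arrays of the map column might get you most of the way there. A rough, untested sketch (assuming a single non-null pyarrow.MapArray "arr" and a known "num_cols"):

import numpy as np

def maps_to_dense(arr, num_cols):
    # arr is a pyarrow.MapArray; its offsets delimit each row's entries
    # within the flattened keys/items children.
    offsets = arr.offsets.to_numpy()
    keys = arr.keys.to_numpy()[offsets[0]:offsets[-1]]
    values = arr.items.to_numpy()[offsets[0]:offsets[-1]]
    lengths = np.diff(offsets)                      # entries per row
    rows = np.repeat(np.arange(len(arr)), lengths)  # row index for each entry
    out = np.zeros((len(arr), num_cols), dtype=np.float64)
    out[rows, keys] = values                        # one-shot scatter
    return out

Note that with this kind of fancy-index assignment, if a row repeats a key the last value wins.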
