Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/25 03:34:32 UTC

[GitHub] [arrow] amine000 opened a new issue #8767: How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

amine000 opened a new issue #8767:
URL: https://github.com/apache/arrow/issues/8767


   Hello, I have a very large dataset (tens of millions of rows) stored on disk as a partitioned Parquet dataset. I load it into memory as a pyarrow.Table and drop all columns except one, which is of type MapType, mapping integers to floats. This column holds sparse feature vectors to be used in an ML context. My job is to transform this column into a 2D numpy array of shape ("num_rows" x "num_cols"), where "num_rows" is the number of rows in the table and both dimensions are known beforehand. If one of my pyarrow.Table rows looks like `[(1, 3.4), (2, 4.4), (4, 5.4), (6, 6.4)]` and "num_cols" = 10, then that row in the numpy array would look like `[0, 3.4, 4.4, 0, 5.4, 0, 6.4, 0, 0, 0]`, where unmapped positions are just 0. What is the most efficient way to accomplish this, given that I have tens of millions of rows? Assume I have enough memory to fit the entire dataset.
   
   Note that I can use `table.to_pandas()` to get a pandas DataFrame and then map functions over the resulting pandas Series, if that would help. So far I have been stumped, however; `df.to_numpy()` has not been helpful here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm closed issue #8767: How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

Posted by GitBox <gi...@apache.org>.
wesm closed issue #8767:
URL: https://github.com/apache/arrow/issues/8767


   





[GitHub] [arrow] wesm commented on issue #8767: How to best get data from pyarrow.Table column of MapType to a 2D numpy array?

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #8767:
URL: https://github.com/apache/arrow/issues/8767#issuecomment-735894436


   This question was sent to the mailing list.

