Posted to dev@arrow.apache.org by Alex Buchanan <bu...@ohsu.edu> on 2018/07/06 17:05:46 UTC

Intro to pandas + pyarrow integration?

Hello all.

I'm confused about the current level of integration between pandas and pyarrow. Am I correct in understanding that, currently, I'll need to convert pyarrow Tables to pandas DataFrames in order to use most of the pandas features? By "pandas features" I mean everyday slicing and dicing of data: merge, filtering, melt, spread, etc.
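
For concreteness, the round trip I'm describing looks roughly like this (a minimal sketch; the column names are made up):

    import pyarrow as pa

    table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

    # Today: hop into pandas for the everyday operations, then back.
    df = table.to_pandas()              # Arrow Table -> pandas DataFrame
    df = df[df["score"] > 0.1]          # pandas filtering
    table = pa.Table.from_pandas(df)    # pandas DataFrame -> Arrow Table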

I have a dataset that starts out as small files (< 1 GB) but quickly explodes into dozens of gigabytes of memory once loaded into a pandas DataFrame. I'm interested in whether Arrow can provide a better-optimized dataframe.

Thanks.


Re: Intro to pandas + pyarrow integration?

Posted by Wes McKinney <we...@gmail.com>.
In case it's interesting, I gave a talk a little over 3 years ago
about this theme ("we all have data frames, but they're all different
inside"): https://www.slideshare.net/wesm/dataframes-the-good-bad-and-ugly.
I mentioned the desire for an "Apache-licensed, community-standard
C/C++ data frame that we can all use".


Re: Intro to pandas + pyarrow integration?

Posted by Alex Buchanan <bu...@ohsu.edu>.
OK, interesting. Thanks, Wes, that does make it clear.


For other readers, this GitHub issue is related: https://github.com/apache/arrow/issues/2189#issuecomment-402874836




Re: Intro to pandas + pyarrow integration?

Posted by Wes McKinney <we...@gmail.com>.
hi Alex,

One of the goals of Apache Arrow is to define an open standard for
in-memory columnar data (which may be called "tables" or "data frames"
in some domains). Among other things, the Arrow columnar format is
optimized for memory efficiency and analytical processing performance
on very large (even larger-than-RAM) data sets.
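
To give a feel for the larger-than-RAM part, here is a minimal
sketch of memory-mapping the Arrow IPC file format (the file name
is made up, and the pa.ipc calls are from current pyarrow):

    import pyarrow as pa

    table = pa.table({"x": list(range(1_000))})

    # Write the table in the Arrow IPC file format.
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map it back: record batches reference the mapped file
    # directly, so a file larger than RAM can be opened and scanned.
    with pa.memory_map("data.arrow", "r") as source:
        mapped = pa.ipc.open_file(source).read_all()
        print(mapped.num_rows)  # data is referenced in place, not copied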

The way to think about it is that pandas has its own in-memory
representation for columnar data, but it is "proprietary" to pandas.
To make use of pandas's analytical facilities, you must convert data
to pandas's memory representation. As an example, pandas represents
strings as NumPy arrays of Python string objects, which is very
wasteful. Uwe Korn recently demonstrated an approach to using Arrow
inside pandas, but this would require a lot of work to port algorithms
to run against Arrow: https://github.com/xhochy/fletcher
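
A rough way to see the difference (a sketch; exact numbers vary by
platform and versions, and it forces object dtype explicitly):

    import pandas as pd
    import pyarrow as pa

    # One million short strings the classic pandas way: a NumPy
    # object array holding a million boxed Python str objects.
    s = pd.Series(["apache arrow"] * 1_000_000, dtype=object)
    print(s.memory_usage(deep=True))   # counts every Python object

    # The same data as an Arrow string array: two contiguous buffers
    # (offsets + UTF-8 bytes), no per-value Python objects.
    arr = pa.array(s)
    print(arr.nbytes)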

We are working to develop the standard data-frame operations as
reusable libraries within this project, and these will run natively
against the Arrow columnar format. This is a big project; we would
love to have you involved in the effort. One of the reasons I have
spent so much of my time on this project over the last few years is
that I believe it is the best path to a faster, more efficient
pandas-like library for data scientists.
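
As a minimal sketch of what such native operations can look like
(an assumption here: this uses the pyarrow.compute module and
function names from later pyarrow releases, with toy data):

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({"city": ["PDX", "NYC", "PDX"], "n": [1, 2, 3]})

    # Filter and aggregate directly on Arrow memory, without ever
    # converting to pandas.
    pdx = table.filter(pc.equal(table["city"], "PDX"))
    print(pc.sum(pdx["n"]))   # -> 4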

best,
Wes
