Posted to dev@arrow.apache.org by Adam Lippai <ad...@rigo.sk> on 2020/06/17 11:07:31 UTC

Pandas string type

Hi,

I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/ where
Wes writes

> "string or binary data would come with additional overhead while pandas
> continues to use Python objects in its memory representation"


Pandas 1.0 introduced StringDtype, which I thought could help with the
issue (I didn't check the internals; I assume it still uses Python objects,
just not NumPy, but I had nothing to lose).

My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
"bbbbb"]*100000000) and call .to_pandas(), the dtype of the dataframe is
still "object". I tried to add a types_mapper function (the docs are not
really helpful, so I simply created def mapper(t): return pd.StringDtype),
but it didn't work.

Is this a future feature? Would it help anything? For now I'm happy to use
category/dictionary data, as the column is low cardinality and it makes the
conversion 5x faster, but I was hoping for a simpler solution. I don't know
the internals, but if "aaaaa" and "bbbbb" are immutable strings it shouldn't
really differ from using the Category type (even if it's creating Python
objects for them, as it can be done with 2 immutable objects). Converting
compressed parquet -> pyarrow is fast (less than 10 seconds); it's pyarrow
-> pandas which is slow, running for 7 minutes (so I think pyarrow already
has a nice implementation).
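
(For reference, a minimal sketch of what I mean by the dictionary route,
with the array shrunk for illustration; dictionary_encode() is PyArrow's
built-in dictionary encoding, and dictionary arrays convert to a pandas
Categorical:)

>>> import pyarrow as pa
>>> a = pa.array(["aaaaa", "bbbbb"] * 1000)
>>> a.to_pandas().dtype                      # plain conversion: Python objects
dtype('O')
>>> a.dictionary_encode().to_pandas().dtype  # the category route
CategoricalDtype(categories=['aaaaa', 'bbbbb'], ordered=False)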

Best regards,
Adam Lippai

Re: Pandas string type

Posted by Adam Lippai <ad...@rigo.sk>.
Thanks for the detailed answer.
It's indeed 5-10% faster with the correct arguments you provided, but the
performance is still far from that of the categorical-type-based solution.
I'll track the linked pandas issue. I'm not a C++ dev, but I'll be happy to
test, benchmark, or add docs.

Best regards,
Adam Lippai

On Thu, Jun 18, 2020 at 10:08 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi Adam,
>
> On Wed, 17 Jun 2020 at 13:07, Adam Lippai <ad...@rigo.sk> wrote:
>
> > Hi,
> >
> > I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/
> > where Wes writes
> >
> > > "string or binary data would come with additional overhead while pandas
> > > continues to use Python objects in its memory representation"
> >
> >
> > Pandas 1.0 introduced StringDtype, which I thought could help with the
> > issue (I didn't check the internals; I assume it still uses Python
> > objects, just not NumPy, but I had nothing to lose).
> >
> > My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
> > "bbbbb"]*100000000) and call .to_pandas(), the dtype of the dataframe is
> > still "object". I tried to add a types_mapper function (the docs are not
> > really helpful, so I simply created def mapper(t): return pd.StringDtype),
> > but it didn't work.
> >
>
> Two caveats here: 1) the function needs to return an *instance* and not a
> class (so `return pd.StringDtype()`), and 2) this keyword only works for
> Table.to_pandas right now (this is certainly something that should either
> be fixed or clarified in the docs).
>
> So taking your example array, and putting it in a Table, and then
> converting to pandas, the types_mapper keyword works:
>
> >>> table = pa.table({'a': a})
> >>> df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
> >>> df.dtypes
> a    string
> dtype: object
>
> Now, the pandas string dtype is currently still using Python objects to
> store the strings (similar to using an object dtype). There are plans to
> store the strings more efficiently (e.g. using Arrow's string array memory
> layout); see https://github.com/pandas-dev/pandas/issues/8640/.
>
> So right now, if you have many repeated strings, I would still go for
> the category/dictionary type, as that will be a lot more efficient for
> further processing in pandas.
>
>
>
> >
> > Is this a future feature? Would it help anything? For now I'm happy to
> > use category/dictionary data, as the column is low cardinality and it
> > makes the conversion 5x faster, but I was hoping for a simpler solution.
> > I don't know the internals, but if "aaaaa" and "bbbbb" are immutable
> > strings it shouldn't really differ from using the Category type (even if
> > it's creating Python objects for them, as it can be done with 2 immutable
> > objects). Converting compressed parquet -> pyarrow is fast (less than 10
> > seconds); it's pyarrow -> pandas which is slow, running for 7 minutes (so
> > I think pyarrow already has a nice implementation).
> >
>
> There is a `deduplicate_objects` keyword in to_pandas exactly for this (to
> avoid creating multiple Python objects for identical strings).
> However, as indicated above, and depending on what your further processing
> steps are in pandas, using a categorical/dictionary type might still be the
> better option.
>
> Joris
>
>
> >
> > Best regards,
> > Adam Lippai
> >
>

Re: Pandas string type

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Adam,

On Wed, 17 Jun 2020 at 13:07, Adam Lippai <ad...@rigo.sk> wrote:

> Hi,
>
> I was reading https://wesmckinney.com/blog/high-perf-arrow-to-pandas/
> where Wes writes
>
> > "string or binary data would come with additional overhead while pandas
> > continues to use Python objects in its memory representation"
>
>
> Pandas 1.0 introduced StringDtype, which I thought could help with the
> issue (I didn't check the internals; I assume it still uses Python objects,
> just not NumPy, but I had nothing to lose).
>
> My issue is that if I create a PyArrow array with a = pa.array(["aaaaa",
> "bbbbb"]*100000000) and call .to_pandas(), the dtype of the dataframe is
> still "object". I tried to add a types_mapper function (the docs are not
> really helpful, so I simply created def mapper(t): return pd.StringDtype),
> but it didn't work.
>

Two caveats here: 1) the function needs to return an *instance* and not a
class (so `return pd.StringDtype()`), and 2) this keyword only works for
Table.to_pandas right now (this is certainly something that should either
be fixed or clarified in the docs).

So taking your example array, and putting it in a Table, and then
converting to pandas, the types_mapper keyword works:

>>> table = pa.table({'a': a})
>>> df = table.to_pandas(types_mapper={pa.string(): pd.StringDtype()}.get)
>>> df.dtypes
a    string
dtype: object
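
A function-style mapper works as well, as long as it returns a dtype
*instance* (or None, to fall back to the default conversion); a minimal
sketch:

>>> def mapper(t):
...     if t == pa.string():
...         return pd.StringDtype()
...     return None  # other types use the default conversion
...
>>> table.to_pandas(types_mapper=mapper).dtypes
a    string
dtype: object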

Now, the pandas string dtype is currently still using Python objects to
store the strings (similar to using an object dtype). There are plans to
store the strings more efficiently (e.g. using Arrow's string array memory
layout); see https://github.com/pandas-dev/pandas/issues/8640/.

So right now, if you have many repeated strings, I would still go for
the category/dictionary type, as that will be a lot more efficient for
further processing in pandas.
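
For example, a minimal sketch (dictionary-typed columns convert to a pandas
Categorical by default):

>>> dict_table = pa.table({'a': a.dictionary_encode()})
>>> dict_table.to_pandas().dtypes
a    category
dtype: object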



>
> Is this a future feature? Would it help anything? For now I'm happy to use
> category/dictionary data, as the column is low cardinality and it makes
> the conversion 5x faster, but I was hoping for a simpler solution. I don't
> know the internals, but if "aaaaa" and "bbbbb" are immutable strings it
> shouldn't really differ from using the Category type (even if it's creating
> Python objects for them, as it can be done with 2 immutable objects).
> Converting compressed parquet -> pyarrow is fast (less than 10 seconds);
> it's pyarrow -> pandas which is slow, running for 7 minutes (so I think
> pyarrow already has a nice implementation).
>

There is a `deduplicate_objects` keyword in to_pandas exactly for this (to
avoid creating multiple Python objects for identical strings).
However, as indicated above, and depending on what your further processing
steps are in pandas, using a categorical/dictionary type might still be the
better option.
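
A minimal sketch, reusing the table from above (with deduplication enabled,
identical strings end up as one shared Python object):

>>> df = table.to_pandas(deduplicate_objects=True)
>>> df['a'][0] is df['a'][2]  # same value, same Python object
True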

Joris


>
> Best regards,
> Adam Lippai
>