Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/01/04 19:02:00 UTC

[jira] [Created] (ARROW-15246) [Python] Automatic conversion of low-cardinality string array to Dictionary Array

Will Jones created ARROW-15246:
----------------------------------

             Summary: [Python] Automatic conversion of low-cardinality string array to Dictionary Array
                 Key: ARROW-15246
                 URL: https://issues.apache.org/jira/browse/ARROW-15246
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 6.0.1
            Reporter: Will Jones


Users who convert Pandas string arrays to Arrow arrays may be surprised to see that the Arrow arrays use far more memory when the cardinality is low. The current workaround is to first convert the column to a Pandas Categorical, but it could save some headaches if we automatically (or possibly behind an option) detected when it's appropriate to use a Dictionary type instead of a String type.

Here's an example of what I'm talking about:

{code:python}
import pyarrow as pa
import pandas as pd

# %memit requires the memory_profiler IPython extension
# (%load_ext memory_profiler)
x_str = "x" * 30
df = pd.DataFrame({"col": [x_str] * 1_000_000})

%memit tab1 = pa.Table.from_pandas(df)
# peak memory: 269.44 MiB, increment: 121.62 MiB

df['col'] = df['col'].astype('category')
%memit tab2 = pa.Table.from_pandas(df)
# peak memory: 286.14 MiB, increment: 1.20 MiB
{code}

One bad consequence of inferring this automatically is that when a sequence of Pandas DataFrames is converted, the resulting tables may end up with differing schemas (dictionary type for some, string type for others), breaking operations like concatenation. For that reason, this behavior should probably be opt-in.
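For comparison, the same memory savings can already be achieved on the Arrow side after conversion via {{Array.dictionary_encode()}}, which an automatic or opt-in path could apply internally. A minimal sketch (the array contents here are illustrative, matching the example above):

{code:python}
import pyarrow as pa

# Low-cardinality string array: one million copies of a single 30-char value.
arr = pa.array(["x" * 30] * 1_000_000)

# dictionary_encode() produces a DictionaryArray: int32 indices plus a
# small dictionary of the unique values.
dict_arr = arr.dictionary_encode()

assert pa.types.is_dictionary(dict_arr.type)
assert len(dict_arr.dictionary) == 1       # only one unique value
assert dict_arr.nbytes < arr.nbytes        # indices are far smaller than the strings
{code}

The open question is when the inference should trigger, e.g. a cardinality threshold relative to array length; that heuristic (and whether it is on by default) is what this issue proposes to decide.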



--
This message was sent by Atlassian Jira
(v8.20.1#820001)