Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2021/07/16 13:44:00 UTC

[jira] [Commented] (ARROW-13342) [Python] Categorical boolean column saved as regular boolean in parquet

    [ https://issues.apache.org/jira/browse/ARROW-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382081#comment-17382081 ] 

Alessandro Molina commented on ARROW-13342:
-------------------------------------------

FYI, this also seems to happen for numeric types: the same behaviour can be reproduced with {{int}} and {{double}}.

{code}
import pyarrow as pa
import pyarrow.parquet

# Build a dictionary-encoded int64 column and save it to disk
table = pa.Table.from_arrays([
    pa.DictionaryArray.from_arrays(
        indices=pa.array([0, 1, 1, 0, 1, 1, 0], type=pa.int8()),
        dictionary=pa.array([10000000000000000, 20000000000000000]))
], names=["data"])
print(table)
pa.parquet.write_table(table, 'test.parquet')

# Reload the data: the column comes back as plain int64
table_rel = pa.parquet.read_table('test.parquet')
print(table_rel)
{code}

The exception seems to be {{unicode}}, which preserves the dictionary form. From my understanding, Arrow currently writes only the binary types in dictionary form ( https://github.com/apache/arrow/blob/14b75ee71d770ba86999e0e7a0e0b94629b91968/cpp/src/parquet/column_writer.cc#L1008 )

This seems fairly reasonable behaviour to me, because binary data/text is where dictionary encoding saves the most, due to the size delta between the values and the indices.


> [Python] Categorical boolean column saved as regular boolean in parquet
> -----------------------------------------------------------------------
>
>                 Key: ARROW-13342
>                 URL: https://issues.apache.org/jira/browse/ARROW-13342
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 4.0.1
>            Reporter: Joao Moreira
>            Priority: Major
>
> When saving a pandas dataframe to parquet, if there is a categorical column whose categories are boolean, the column is saved as a regular boolean column.
> This causes an issue because, when reading the parquet file back, I expect the column to still be categorical.
>  
> Reproducible example:
> {code:python}
> import pandas as pd
> import pyarrow
> import pyarrow.parquet
> # Create dataframe with boolean column that is then converted to categorical
> df = pd.DataFrame({'a': [True, True, False, True, False]})
> df['a'] = df['a'].astype('category')
> # Convert to arrow Table and save to disk
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_table(table, 'test.parquet')
> # Reload data and convert back to pandas
> table_rel = pyarrow.parquet.read_table('test.parquet')
> df_rel = table_rel.to_pandas()
> {code}
> The arrow {{table}} variable correctly holds the column as an arrow {{DICTIONARY}} type:
> {noformat}
> >>> df['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: category
> Categories (2, object): [False, True]
> >>>
> >>> table
> pyarrow.Table
> a: dictionary<values=bool, indices=int8, ordered=0>
> {noformat}
> However, the reloaded column is now a regular boolean:
> {noformat}
> >>> table_rel
> pyarrow.Table
> a: bool
> >>>
> >>> df_rel['a']
> 0     True
> 1     True
> 2    False
> 3     True
> 4    False
> Name: a, dtype: bool
> {noformat}
> I would have expected the column to be read back as categorical.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)