
Posted to issues@arrow.apache.org by "Daniel Gafni (Jira)" <ji...@apache.org> on 2022/06/21 09:16:00 UTC

[jira] [Created] (ARROW-16866) Partition column type is modified after write/read

Daniel Gafni created ARROW-16866:
------------------------------------

             Summary: Partition column type is modified after write/read
                 Key: ARROW-16866
                 URL: https://issues.apache.org/jira/browse/ARROW-16866
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 8.0.0
         Environment: Linux, Python 3.8
            Reporter: Daniel Gafni


Example:

 
{code:python}
import numpy as np
import pandas as pd
import pyarrow
import pyarrow.parquet as pq

s = 100000  # number of rows
f = 10      # number of feature columns and of distinct partition values
data = pd.DataFrame(
    np.random.rand(s * f).reshape(s, f),
    columns=[f"f_{i}" for i in range(f)]
)
data['partition_col'] = np.random.randint(0, f, s)
data = pyarrow.Table.from_pandas(data)
pq.write_to_dataset(data, root_path='test_pyarrow', partition_cols=['partition_col'])
data.schema
{code}
outputs:
{code:java}
data.schema
f_0: double
f_1: double
f_2: double
f_3: double
f_4: double
f_5: double
f_6: double
f_7: double
f_8: double
f_9: double
partition_col: int64
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1456 {code}
After writing the dataset and reading it back, the type of the partition column changes from int64 to a dictionary type:
{code:java}
pq.ParquetDataset('test_pyarrow').read().schema
f_0: double
f_1: double
f_2: double
f_3: double
f_4: double
f_5: double
f_6: double
f_7: double
f_8: double
f_9: double
partition_col: dictionary<values=int64, indices=int32, ordered=0> {code}
 


