Posted to issues@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/02/17 22:29:00 UTC

[jira] [Created] (ARROW-15725) [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned

Will Jones created ARROW-15725:
----------------------------------

             Summary: [Python] Legacy dataset can't roundtrip Int64 with nulls if partitioned
                 Key: ARROW-15725
                 URL: https://issues.apache.org/jira/browse/ARROW-15725
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0, 4.0.0
            Reporter: Will Jones


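When a dataset is partitioned and an Int64 column contains nulls, that column may not round-trip correctly through the legacy dataset implementation. In the reproduction below, the value 7753285016841556620 is read back as 7753285016841556992.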

Simple reproduction:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import tempfile

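# Int64 column 'x' with one null and one large value; 'y' is the partition key.
table = pa.table({
    'x': pa.array([None, 7753285016841556620]),
    'y': pa.array(['a', 'b'])
})

# Write a dataset partitioned on 'y'.
ds_dir = tempfile.mkdtemp()
pq.write_to_dataset(table, ds_dir, partition_cols=['y'])

# Read the dataset back and compare column 'x' with the original.
table_after = ds.dataset(ds_dir).to_table()
print(table['x'])
print(table_after['x'])
assert table['x'] == table_after['x']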
{code}

Output (the original column, followed by the column read back from the dataset):

{code}
[
  [
    null,
    7753285016841556620
  ]
]
[
  [
    null
  ],
  [
    7753285016841556992
  ]
]
{code}
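
The altered value is consistent with the column having passed through a 64-bit float at some point. This is an assumption about the cause, not something the report confirms. One plausible route is a conversion to pandas, where an int64 column containing nulls is represented as float64. The sketch below uses only plain Python and the two integers from the output above to check that a float64 round trip yields exactly the observed value.

```python
# Values taken from the output above.
original = 7753285016841556620
observed = 7753285016841556992

# Round-trip the original integer through a 64-bit float.
# Between 2**62 and 2**63, float64 can only represent multiples of 1024,
# so the integer is rounded to the nearest such multiple.
via_float = int(float(original))

print(via_float)
print(via_float == observed)
print(observed % 1024 == 0)
```

If both comparisons print True, the precision loss matches what a float64 conversion would produce, which suggests looking for an intermediate float conversion in the partitioned write path.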


