Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/12/09 13:58:00 UTC
[jira] [Updated] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data
[ https://issues.apache.org/jira/browse/ARROW-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-7345:
-----------------------------------------
Labels: parquet (was: )
> [Python] Writing partitions with NaNs silently drops data
> ---------------------------------------------------------
>
> Key: ARROW-7345
> URL: https://issues.apache.org/jira/browse/ARROW-7345
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.1
> Reporter: Karl Dunkle Werner
> Priority: Minor
> Labels: parquet
>
> When writing a partitioned table, rows whose partitioning column is null (NA/NaN) are silently dropped. I think it would be helpful if there were a warning. Even better, from my perspective, would be writing out those partitions with a directory name like {{partition_col=NaN}}.
> Here's a small example where only the {{b = 2}} group is written out and the {{b = NaN}} group is dropped.
> {code:python}
> import os
> import tempfile
> import pyarrow.json
> import pyarrow.parquet
> from pathlib import Path
> # Create a dataset with NaN:
> json_str = """
> {"a": 1, "b": 2}
> {"a": 2, "b": null}
> """
> with tempfile.NamedTemporaryFile() as tf:
>     tf = Path(tf.name)
>     tf.write_text(json_str)
>     table = pyarrow.json.read_json(tf)
> # Write out a partitioned dataset, using the NaN-containing column
> with tempfile.TemporaryDirectory() as out_dir:
>     pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
>     print(os.listdir(out_dir))
>     read_table = pyarrow.parquet.read_table(out_dir)
> print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
> # Output:
> #> ['b=2.0']
> #> Wrote out 2 rows, read back 1 row
> {code}
>
> It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
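
The report asks for two things: a warning when the partition column contains nulls, and a dedicated directory such as partition_col=NaN for those rows. The following is a minimal standard-library sketch of that behaviour, not pyarrow's implementation. The helper names (is_null, group_for_partitioning) and the "NaN" placeholder label are assumptions made for illustration, and the rows are plain dicts mirroring the JSON lines in the example rather than an Arrow table.

```python
import math
import warnings
from collections import defaultdict


def is_null(value):
    """True for None or a float NaN, the keys a null-dropping groupby skips."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def group_for_partitioning(rows, partition_col, null_label="NaN"):
    """Group rows by partition value, keeping null keys under a placeholder.

    Hypothetical helper: `rows` is a list of dicts, and `null_label` is an
    assumed directory-name placeholder, not something pyarrow defines.
    """
    groups = defaultdict(list)
    null_count = 0
    for row in rows:
        key = row.get(partition_col)
        if is_null(key):
            null_count += 1
            label = null_label
        else:
            label = str(key)
        # Directory-style name, e.g. "b=2" or "b=NaN".
        groups[f"{partition_col}={label}"].append(row)
    if null_count:
        warnings.warn(
            f"{null_count} row(s) have a null value in "
            f"partition column {partition_col!r}"
        )
    return dict(groups)


# Same two rows as in the report: one with b=2, one with b=null.
rows = [{"a": 1, "b": 2}, {"a": 2, "b": None}]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    groups = group_for_partitioning(rows, "b")

print(sorted(groups))
print(sum(len(g) for g in groups.values()), "rows kept;", len(caught), "warning(s)")
```

In this sketch every input row ends up in some group, so the row count before and after grouping matches, and the caller is told when a null partition was created. Whether "NaN" is a sensible directory label for a real dataset reader is left open here.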