Posted to jira@arrow.apache.org by "Karl Dunkle Werner (Jira)" <ji...@apache.org> on 2020/09/15 21:18:00 UTC

[jira] [Commented] (ARROW-7345) [Python] Writing partitions with NaNs silently drops data

    [ https://issues.apache.org/jira/browse/ARROW-7345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196556#comment-17196556 ] 

Karl Dunkle Werner commented on ARROW-7345:
-------------------------------------------

I noticed pandas version 1.1.0 added the {{dropna}} argument to {{DataFrame.groupby()}}. One way for pyarrow to avoid the issue I highlighted might be to pass {{dropna=False}} in that call.
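
For reference, a minimal sketch of that behavior (assuming pandas >= 1.1.0; the column names are just illustrative):

{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2.0, np.nan]})

# The default dropna=True silently drops the NaN key -- only the b=2.0 group remains.
print(df.groupby("b").size())                  # b=2.0 -> 1

# dropna=False keeps a NaN group, so no rows would be lost.
print(df.groupby("b", dropna=False).size())    # b=2.0 -> 1, b=NaN -> 1
{code}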

> [Python] Writing partitions with NaNs silently drops data
> ---------------------------------------------------------
>
>                 Key: ARROW-7345
>                 URL: https://issues.apache.org/jira/browse/ARROW-7345
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>              Labels: dataset, dataset-parquet-write, parquet
>
> When writing a partitioned table, if the partitioning column has NA values, they're silently dropped. I think it would be helpful if there were at least a warning. Even better, from my perspective, would be writing out those partitions with a directory name like {{partition_col=NaN}}.
> Here's a small example where only the {{b = 2}} group is written out and the {{b = NaN}} group is dropped.
> {code:python}
> import os
> import tempfile
> import pyarrow.json
> import pyarrow.parquet
> from pathlib import Path
> # Create a dataset with NaN:
> json_str = """
> {"a": 1, "b": 2}
> {"a": 2, "b": null}
> """
> with tempfile.NamedTemporaryFile() as tf:
>     tf = Path(tf.name)
>     tf.write_text(json_str)
>     table = pyarrow.json.read_json(tf)
> # Write out a partitioned dataset, using the NaN-containing column
> with tempfile.TemporaryDirectory() as out_dir:
>     pyarrow.parquet.write_to_dataset(table, out_dir, partition_cols=["b"])
>     print(os.listdir(out_dir))
>     read_table = pyarrow.parquet.read_table(out_dir)
> print(f"Wrote out {table.shape[0]} rows, read back {read_table.shape[0]} row")
> # Output:
> #> ['b=2.0']
> #> Wrote out 2 rows, read back 1 row
> {code}
>  
> It looks like this is caused by pandas dropping NaNs when doing [the {{groupby}} here|https://github.com/apache/arrow/blob/b16a3b53092ccfbc67e5a4e5c90be5913a67c8a5/python/pyarrow/parquet.py#L1434].
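> Until that changes, one user-side workaround is to fill the partition column's nulls with a placeholder string before writing, so those rows land in a real directory. A rough sketch (assuming a recent pyarrow; it reuses {{table}} from the example above, and the placeholder string is arbitrary):
> {code:python}
> import pyarrow as pa
> 
> out_dir = tempfile.mkdtemp()
> # Replace nulls in the partition column with a sentinel string so no rows are dropped.
> b_filled = pa.array(["NaN" if v is None else str(v) for v in table.column("b").to_pylist()])
> patched = table.set_column(table.schema.get_field_index("b"), "b", b_filled)
> pyarrow.parquet.write_to_dataset(patched, out_dir, partition_cols=["b"])
> print(os.listdir(out_dir))  # shows both 'b=2.0' and 'b=NaN' -- all rows preserved
> {code}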



--
This message was sent by Atlassian Jira
(v8.3.4#803005)