Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/20 13:20:00 UTC
[jira] [Comment Edited] (ARROW-10438) [C++][Dataset] Partitioning::Format on nulls

    [ https://issues.apache.org/jira/browse/ARROW-10438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268559#comment-17268559 ] 

Joris Van den Bossche edited comment on ARROW-10438 at 1/20/21, 1:19 PM:
-------------------------------------------------------------------------

I am not sure we should follow Hive's (potentially non-ideal) behaviour exactly here. At the very least, it would be nice to have an option for different behaviour that preserves the actual values (or to make that the default, and offer the Hive behaviour as an option). Many people will also use Arrow datasets to write Hive-like datastores without ever actually interacting with Hive.

Another source about the topic: https://kb.databricks.com/data/null-empty-strings.html, which concludes with "This is the expected behavior. It is inherited from Apache Hive." and "Solution: In general, you shouldn’t use both null and empty strings as values in a partitioned column."

Some random other first thoughts:

- A default could also be to error? (so users will at least be aware of the problem, and of the fact that empty strings would otherwise be lost)
- We also need to think about how to do this for directory partitioning, not only for hive partitioning (and using a Hive-specific name for a partitioning scheme that is not compatible with Hive might make less sense?)
- We currently already read empty string partition values from {{/key=/}} directory names just fine, although this is probably not tested and might only work accidentally (and might also not work for other readers like Spark?)
- This might also interact with the discussion on whether to include the partition fields in the actual data files or not (because when they are not left out, the actual file could still hold the real value to distinguish empty vs null)
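
To make the options above a bit more concrete, here is a minimal Python sketch of how a partition path segment could be formatted. This is not Arrow's implementation: the three function names are hypothetical, and the use of {{__HIVE_DEFAULT_PARTITION__}} as the fallback directory name is an assumption based on what Hive itself writes for null and empty partition values.

```python
from typing import Optional

# Directory name Hive conventionally uses for null (and empty) partition values.
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def format_hive_like(key: str, value: Optional[str]) -> str:
    """Hive-like formatting: null and "" collapse into the same segment."""
    if value is None or value == "":
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"


def format_preserving(key: str, value: Optional[str]) -> str:
    """Hypothetical alternative: write "" as 'key=' and keep the fallback for null only."""
    if value is None:
        return f"{key}={HIVE_DEFAULT_PARTITION}"
    return f"{key}={value}"


def format_strict(key: str, value: Optional[str]) -> str:
    """Hypothetical 'error by default' option: refuse values that cannot round-trip."""
    if value is None or value == "":
        raise ValueError(f"partition key {key!r} has a null or empty value")
    return f"{key}={value}"
```

With the Hive-like variant, a null and an empty string end up in the same directory and can no longer be told apart on read; the preserving variant keeps them distinct at the cost of no longer matching what Hive would write.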

As another observation: Dask silently drops rows with missing values in the partition column, but I think that is simply inherited from pandas' groupby implementation, which drops missing values by default, and is not necessarily an intentional design choice.
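
The pandas default mentioned above can be seen in a small example. This is only a sketch of the pandas side (it does not call Dask), and the column names are made up for illustration; the {{dropna}} keyword of {{groupby}} is available since pandas 1.1.

```python
import pandas as pd

# One row has a missing partition key, another has an empty-string key.
df = pd.DataFrame({"key": ["a", None, "a", ""], "value": [1, 2, 3, 4]})

# Default (dropna=True): the row with the missing key is silently excluded.
default_sizes = df.groupby("key").size()

# dropna=False keeps the missing key as its own group.
all_sizes = df.groupby("key", dropna=False).size()
```

In the default case the group sizes only account for three of the four rows, while the empty string still gets a group of its own, so the two kinds of "empty" values are treated differently here as well.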



> [C++][Dataset] Partitioning::Format on nulls
> --------------------------------------------
>
>                 Key: ARROW-10438
>                 URL: https://issues.apache.org/jira/browse/ARROW-10438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Weston Pace
>            Priority: Major
>             Fix For: 4.0.0
>
>
> Writing a dataset with null partition keys is currently untested. Ensure the behavior is documented and correct


