Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/03/10 23:42:00 UTC

[jira] [Assigned] (ARROW-11260) [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning

     [ https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Li reassigned ARROW-11260:
--------------------------------

    Assignee: David Li  (was: Ben Kietzman)

> [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning
> --------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11260
>                 URL: https://issues.apache.org/jira/browse/ARROW-11260
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: David Li
>            Priority: Major
>              Labels: dataset
>             Fix For: 4.0.0
>
>
> As a follow-up on ARROW-10247 (see also https://github.com/apache/arrow/pull/9130#issuecomment-760801124): we currently require the user to pass manually specified dictionary values when reading a dataset with a Partitioning based on a schema with dictionary-typed fields.
> In practice, this means the user has to, for example, parse the file paths to collect all the values the partition field can take, after which Arrow parses the same paths again to construct the dataset object.
> _Naively_, it seems that it should be possible to let Arrow infer the dictionary _values_, even when providing an explicit schema with a dictionary field for the Partitioning (i.e. when the partitioning schema itself is not inferred from the file paths).
> An example use case is a Partitioning schema with both dictionary and non-dictionary fields. When the schema is discovered, it is all or nothing: either all fields are dictionary-typed or none are.
> cc [~bkietz]
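
To illustrate the manual step described above, here is a minimal sketch of how a user might collect the dictionary values by parsing file paths before handing them to the Partitioning. It uses only the Python standard library and does not call Arrow. The helper name `collect_partition_values`, the field names `year` and `key`, and the example paths are all hypothetical. The sketch also assumes directory-style partitioning, with one path segment per partition field, given relative to the dataset root.

```python
from pathlib import PurePosixPath


def collect_partition_values(paths, field_names):
    """Collect the distinct values each partition field takes.

    Hypothetical helper: assumes each path is relative to the dataset
    root and that its leading directory segments map, in order, onto
    ``field_names`` (directory-style partitioning).
    """
    values = {name: [] for name in field_names}
    for path in paths:
        # Drop the file name, then keep one directory segment per field.
        segments = PurePosixPath(path).parent.parts[: len(field_names)]
        for name, segment in zip(field_names, segments):
            # Preserve first-seen order and skip duplicates.
            if segment not in values[name]:
                values[name].append(segment)
    return values


# Hypothetical dataset layout: <year>/<key>/<file>
paths = [
    "2020/a/part-0.parquet",
    "2020/b/part-0.parquet",
    "2021/a/part-0.parquet",
]
dictionaries = collect_partition_values(paths, ["year", "key"])
print(dictionaries)
```

This is the same walk over the file paths that Arrow repeats when it builds the dataset, which is the duplication this issue proposes to avoid.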



--
This message was sent by Atlassian Jira
(v8.3.4#803005)