Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/15 00:44:42 UTC

[GitHub] [arrow] eyevz opened a new issue, #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset

eyevz opened a new issue, #14426:
URL: https://github.com/apache/arrow/issues/14426

   I would like to create a dataset over a number of CSV files, specify the schema both for the files and for the partitioning, and have the dataset infer the partition dictionary.
   
   Is there anything obviously wrong with what I'm doing below?
   
   This approach _almost_ works:
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   With the above, `dataset.partitioning.dictionaries` is appropriately populated. However, I'm not happy with the schema inferred for the CSV files.
   
   If I specify the dataset schema as below, it breaks the partition dictionary inference:
   ```python
   ds_schema = pa.schema([
       pa.field('csv_field', pa.int8()),
       pa.field('partition_label', pa.string()),
   ])
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       schema=ds_schema,
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   At this point the schema for the dataset is what I want, but `dataset.partitioning.dictionaries` is `[None]`.
   
   If I attempt to specify that `partition_label` is a dictionary field in the dataset schema, as below...
   ```python
   ds_schema = pa.schema([
       pa.field('csv_field', pa.int8()),
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   part_schema = pa.schema([
       pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
   ])
   
   partitioning = ds.partitioning(
       schema=part_schema,
       dictionaries='infer',
   )
   
   dataset = ds.dataset(
       list_of_csv_files,
       format='csv',
       schema=ds_schema,
       partitioning=partitioning,
       partition_base_dir=appropriate_root_path,
   )
   ```
   ... then I get an `ArrowInvalid` error indicating that I have not provided a dictionary for field `partition_label`.
   
   Any suggestions for how I can specify the schema of a partitioned dataset over a large number of CSV files and have the dataset infer the partition dictionary?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] eyevz closed issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset

Posted by GitBox <gi...@apache.org>.

eyevz closed issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset
URL: https://github.com/apache/arrow/issues/14426


[GitHub] [arrow] eyevz commented on issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset

Posted by GitBox <gi...@apache.org>.

eyevz commented on issue #14426:
URL: https://github.com/apache/arrow/issues/14426#issuecomment-1279617631

   Naturally, I found a solution minutes after posting the question. I am using the first form from my initial comment, but with an appropriately constructed `pyarrow.csv.ConvertOptions` instance instead of simply `format='csv'`.

