Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/15 00:44:42 UTC
[GitHub] [arrow] eyevz opened a new issue, #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset
eyevz opened a new issue, #14426:
URL: https://github.com/apache/arrow/issues/14426
I would like to create a dataset over a number of CSV files, specify the schema for both the files and the partitioning, and have the dataset infer the partition dictionary.
Is there anything obviously wrong with what I'm doing below?
This approach _almost_ works:
```python
import pyarrow as pa
import pyarrow.dataset as ds

# The partition field is dictionary-encoded; its values are discovered from the paths.
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
With the above, `dataset.partitioning.dictionaries` is appropriately populated. However, I'm not happy with the schema inferred for the CSV files.
If I specify the dataset schema as below, it breaks the partition dictionary inference:
```python
ds_schema = pa.schema([
    pa.field('csv_field', pa.int8()),
    pa.field('partition_label', pa.string()),
])
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    schema=ds_schema,
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
At this point the schema for the dataset is what I want, but `dataset.partitioning.dictionaries` is `[None]`.
If I attempt to specify that `partition_label` is a dictionary field in the dataset schema, as shown below...
```python
ds_schema = pa.schema([
    pa.field('csv_field', pa.int8()),
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
part_schema = pa.schema([
    pa.field('partition_label', pa.dictionary(pa.int16(), pa.string())),
])
partitioning = ds.partitioning(
    schema=part_schema,
    dictionaries='infer',
)
dataset = ds.dataset(
    list_of_csv_files,
    format='csv',
    schema=ds_schema,
    partitioning=partitioning,
    partition_base_dir=appropriate_root_path,
)
```
... then I get an `ArrowInvalid` error indicating that I have not provided a dictionary for field `partition_label`.
Any suggestions for how I can specify the schema of a partitioned dataset over a large number of CSV files and have the dataset infer the partition dictionary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] eyevz closed issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset
Posted by GitBox <gi...@apache.org>.
eyevz closed issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset
URL: https://github.com/apache/arrow/issues/14426
[GitHub] [arrow] eyevz commented on issue #14426: I am unable to specify a dataset schema and also infer partitioning dictionary for a CSV dataset
Posted by GitBox <gi...@apache.org>.
eyevz commented on issue #14426:
URL: https://github.com/apache/arrow/issues/14426#issuecomment-1279617631
Naturally I found a solution minutes after posting the question. I am using the first form in my initial comment, with an appropriately constructed `pyarrow.csv.ConvertOptions` instance instead of simply `format='csv'`.