Posted to issues@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2022/05/02 15:06:00 UTC
[jira] [Created] (ARROW-16436) [C++] Datasets ignores CSV autogenerate_column_names during discovery
David Li created ARROW-16436:
--------------------------------
Summary: [C++] Datasets ignores CSV autogenerate_column_names during discovery
Key: ARROW-16436
URL: https://issues.apache.org/jira/browse/ARROW-16436
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 7.0.0
Reporter: David Li
Reproduction:
{code:python}
import tempfile
from pathlib import Path
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.dataset as ds
print("PyArrow version:", pa.__version__)
ro = csv.ReadOptions(autogenerate_column_names=True)
po = csv.ParseOptions()
co = csv.ConvertOptions()
file_format = ds.CsvFileFormat(read_options=ro, parse_options=po, convert_options=co)
with tempfile.TemporaryDirectory() as td:
    td = Path(td).resolve()
    with (td / "test.csv").open("w") as sink:
        sink.write("1,a,true,1\n")
    dataset = ds.dataset(str(td), format=file_format)
    print(dataset.to_table())
{code}
Result:
{noformat}
PyArrow version: 7.0.0
Traceback (most recent call last):
  File "/home/lidavidm/csvdemo.py", line 20, in <module>
    dataset = ds.dataset(str(td), format=file_format)
  File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 667, in dataset
    return _filesystem_dataset(source, **kwargs)
  File "/home/lidavidm/miniconda3/envs/arrow/lib/python3.10/site-packages/pyarrow/dataset.py", line 422, in _filesystem_dataset
    return factory.finish(schema)
  File "pyarrow/_dataset.pyx", line 1680, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/tmp5rz0ipmm/test.csv': Could not open CSV input source '/tmp/tmp5rz0ipmm/test.csv': Invalid: CSV file contained multiple columns named 1. Is this a 'csv' file?
{noformat}
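The "multiple columns named 1" message is consistent with the first data row being used as the header during discovery. The following is a minimal sketch of that reading using only the Python standard library (not PyArrow's actual code path). It assumes that when autogenerate_column_names is not honored, the values of the first row become the column names. The "f0", "f1", ... pattern is the naming PyArrow documents for autogenerated columns. The variable names are illustrative only.

```python
import csv  # standard-library csv module, not pyarrow.csv
import io
from collections import Counter

# The single data row written by the reproduction above.
data = "1,a,true,1\n"

# Assumption: with autogenerate_column_names ignored, the first row is
# treated as the header, so its values become the column names.
presumed_header = next(csv.reader(io.StringIO(data)))

# Names that occur more than once in that presumed header.
duplicates = [name for name, count in Counter(presumed_header).items() if count > 1]

# What autogenerated names would look like instead ("f0", "f1", ...).
autogenerated = [f"f{i}" for i in range(len(presumed_header))]

print("presumed header:", presumed_header)
print("duplicate names:", duplicates)
print("autogenerated names:", autogenerated)
```

Under that assumption the value "1" appears twice in the presumed header, which matches the duplicate column name reported in the error.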