Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/08/04 08:05:00 UTC
[jira] [Created] (ARROW-9644) [C++][Dataset] Do not check for ignore_prefixes in the base path
Joris Van den Bossche created ARROW-9644:
--------------------------------------------
Summary: [C++][Dataset] Do not check for ignore_prefixes in the base path
Key: ARROW-9644
URL: https://issues.apache.org/jira/browse/ARROW-9644
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python, R
Reporter: Joris Van den Bossche
Somewhat related to ARROW-8427, and from https://github.com/apache/arrow/issues/7857
I am not sure we should check the {{ignore_prefixes}} against the base path provided by the user: if a directory in that path starts with, e.g., an underscore, discovery simply skips the full dataset, resulting in an empty dataset.
{code:python}
import tempfile
import pathlib
path = tempfile.mkdtemp()
tmpdir = pathlib.Path(path)
# base path with a directory with an underscore
datadir = tmpdir / "_data" / "dataset"
datadir.mkdir(parents=True, exist_ok=True)
# create a parquet file at that location
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, datadir / "data.parquet")
# reading the dataset skips everything
import pyarrow.dataset as ds
dataset = ds.dataset(datadir)  # a FileSystemDataset is created without error
print(dataset.files)  # observed output: [] (data.parquet is not discovered)
{code}
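To illustrate the behaviour suggested above, here is a minimal pure-Python sketch (not the actual C++ discovery code) that applies the ignore prefixes only to the path segments below the user-provided base path. The helper name {{is_ignored}} is hypothetical, and the default prefixes {{.}} and {{_}} are an assumption made for the example.

```python
from pathlib import PurePosixPath


# Hypothetical helper, not part of pyarrow: decide whether a discovered file
# should be skipped, looking only at the part of the path below the base path.
def is_ignored(file_path, base_path, ignore_prefixes=(".", "_")):
    # Drop the base path so that its own segments are never checked.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    return any(
        segment.startswith(prefix)
        for segment in relative.parts
        for prefix in ignore_prefixes
    )


# Assumed example paths mirroring the reproducer: the base path contains "_data".
base = "/tmp/_data/dataset"
print(is_ignored("/tmp/_data/dataset/data.parquet", base))          # False
print(is_ignored("/tmp/_data/dataset/_metadata", base))             # True
print(is_ignored("/tmp/_data/dataset/.hidden/data.parquet", base))  # True
```

With this approach the {{_data}} directory in the base path no longer hides {{data.parquet}}, while files or subdirectories inside the dataset that start with an ignored prefix are still skipped.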
cc [~bkietz] [~npr]