Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/08/04 08:05:00 UTC
[jira] [Created] (ARROW-9644) [C++][Dataset] Do not check for ignore_prefixes in the base path

Joris Van den Bossche created ARROW-9644:
--------------------------------------------

             Summary: [C++][Dataset] Do not check for ignore_prefixes in the base path
                 Key: ARROW-9644
                 URL: https://issues.apache.org/jira/browse/ARROW-9644
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python, R
            Reporter: Joris Van den Bossche


Somewhat related to ARROW-8427, and from https://github.com/apache/arrow/issues/7857

I am not sure we should check {{ignore_prefixes}} against the base path provided by the user. If that base path contains, e.g., a directory whose name starts with an underscore, the whole dataset is skipped and the result is an empty dataset.

{code:python}
import tempfile
import pathlib

path = tempfile.mkdtemp()
tmpdir = pathlib.Path(path)

# base path containing a directory that starts with an underscore
datadir = tmpdir / "_data" / "dataset"
datadir.mkdir(parents=True, exist_ok=True)

# create a parquet file at that location
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, datadir / "data.parquet")

# reading dataset skips everything
import pyarrow.dataset as ds

In [26]: ds.dataset(datadir)
Out[26]: <pyarrow._dataset.FileSystemDataset at 0x7fbfd8779bb0>

In [27]: ds.dataset(datadir).files
Out[27]: []
{code}
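
To illustrate the behaviour proposed above, here is a minimal pure-Python sketch, not the actual C++ discovery code. It applies the prefix check only to the path segments below the user-provided base path. The helper name {{is_ignored}} and the constant {{DEFAULT_IGNORE_PREFIXES}} are hypothetical, and the prefixes "." and "_" are assumed here for illustration.

```python
from pathlib import PurePosixPath

# Assumed for illustration: prefixes that mark a file or directory as ignored.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_ignored(file_path, base_path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if a segment of file_path below base_path has an ignored prefix.

    Segments that belong to base_path itself are never checked, so a base
    path such as "/tmp/_data/dataset" does not cause its files to be skipped.
    """
    # Hypothetical helper: strip the base path first, then check what remains.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    prefixes = tuple(ignore_prefixes)
    return any(part.startswith(prefixes) for part in relative.parts)
```

With this approach, {{is_ignored("/tmp/_data/dataset/data.parquet", "/tmp/_data/dataset")}} is False even though the base path contains {{_data}}, while a file such as {{_metadata}} inside the dataset directory is still ignored.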

cc [~bkietz] [~npr]


