Posted to issues@arrow.apache.org by "Francois Saint-Jacques (Jira)" <ji...@apache.org> on 2019/11/05 13:13:00 UTC

[jira] [Comment Edited] (ARROW-7061) [C++][Dataset] FileSystemDiscovery with ParquetFileFormat should ignore files that aren't Parquet

    [ https://issues.apache.org/jira/browse/ARROW-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967516#comment-16967516 ] 

Francois Saint-Jacques edited comment on ARROW-7061 at 11/5/19 1:12 PM:
------------------------------------------------------------------------

I started adding features to fs::Selector, notably a maximum recursion depth. I intended to add a filter function option to the selector, but [~apitrou] objected lightly, arguing that a user who wants this can filter the explicit list of FileStats returned by the selector. My intention was for `fs::Selector` to mimic the powerful and ubiquitous [find(1)|http://man7.org/linux/man-pages/man1/find.1.html] selection interface.

This is why FileSystemDataSourceDiscovery supports both options (an explicit list of FileStats or a Selector).
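
To illustrate the "filter the explicit list yourself" approach, here is a minimal, self-contained sketch. `FileEntry`, `FilterEntries` and `HasSuffix` are hypothetical names invented for this example; they only stand in for the FileStats list a selector would return and are not part of Arrow's API.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry returned by a selector.
// This is NOT Arrow's FileStats class.
struct FileEntry {
  std::string path;
  bool is_file;
};

// Keep only the entries accepted by the user-supplied predicate.
std::vector<FileEntry> FilterEntries(
    const std::vector<FileEntry>& entries,
    const std::function<bool(const FileEntry&)>& pred) {
  std::vector<FileEntry> kept;
  std::copy_if(entries.begin(), entries.end(), std::back_inserter(kept), pred);
  return kept;
}

// Example predicate: a regular file whose path ends with `suffix`.
bool HasSuffix(const FileEntry& entry, const std::string& suffix) {
  const std::string& p = entry.path;
  return entry.is_file && p.size() >= suffix.size() &&
         p.compare(p.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With something like this, the caller would list the files, drop whatever they do not want (for example everything that does not end in ".parquet"), and hand the remaining explicit list to the discovery step.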

Ideally, we want the "it-just-works" feeling. Some suggestions:
 * Detect failures early: in `FileSystemBasedDataSource::Make`, should we scan all files and check whether the format driver can parse them? How should we handle a failure: ignore the file and warn, or abort with an error? The parse failure is detected implicitly by the `Inspect` call.
 * Should we filter by file extension by default (when the user passes a Selector rather than an explicit list of FileStats)? At first this seems very convenient, but it can silently ignore important files just because of an implicit naming convention.
 * Should we settle that the `Selector` constructor is the it-just-works route, and the explicit vector<FileStats> is the power-user route?
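
The two failure-handling options in the first bullet could look roughly like the sketch below. Everything in it is hypothetical (`OnInvalidFile`, `ProbeResult`, `ProbeFiles`); `can_parse` stands in for a format-specific check such as an `Inspect`-like call, and an exception is used only to keep the example short, where real code would more likely return an error status.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical policy names; nothing here is part of Arrow's API.
enum class OnInvalidFile { kSkipAndWarn, kAbort };

struct ProbeResult {
  std::vector<std::string> accepted;  // files the format could parse
  std::vector<std::string> warnings;  // one message per skipped file
};

// Check every path up front and apply the chosen policy to failures.
ProbeResult ProbeFiles(const std::vector<std::string>& paths,
                       const std::function<bool(const std::string&)>& can_parse,
                       OnInvalidFile policy) {
  ProbeResult result;
  for (const auto& path : paths) {
    if (can_parse(path)) {
      result.accepted.push_back(path);
    } else if (policy == OnInvalidFile::kAbort) {
      // Strict route: the first unparseable file fails the whole operation.
      throw std::runtime_error("cannot parse file: " + path);
    } else {
      // Lenient route: remember a warning and carry on.
      result.warnings.push_back("ignoring unparseable file: " + path);
    }
  }
  return result;
}
```

Under this sketch, a stray `.DS_Store` file would either produce a warning and be dropped, or abort with a message naming the offending file, depending on the policy the caller picks.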


> [C++][Dataset] FileSystemDiscovery with ParquetFileFormat should ignore files that aren't Parquet
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7061
>                 URL: https://issues.apache.org/jira/browse/ARROW-7061
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++ - Dataset
>            Reporter: Neal Richardson
>            Priority: Major
>
> I got {{Invalid parquet file. Corrupt footer.}} trying to read real data. Turned out it was because I had opened the directory in macOS Finder and it had added the junk .DS_Store files. Once I deleted them, the Dataset created fine. 
> If we're creating a DataSource with Parquet files, we should ignore any non-Parquet files we encounter when scanning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)