
Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/04/01 09:14:00 UTC

[jira] [Commented] (ARROW-8245) [Python][Parquet] Skip hidden directories when reading partitioned parquet files

    [ https://issues.apache.org/jira/browse/ARROW-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072548#comment-17072548 ] 

Joris Van den Bossche commented on ARROW-8245:
----------------------------------------------

I just checked, and the C++ Datasets API already handles this fine. Dataset discovery has an option that specifies which prefixes to ignore (the default is {{['.', '_']}}), and it is applied to every part of the path, so to both file names and directory names.
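To illustrate the rule described above, here is a minimal pure-Python sketch of prefix-based filtering. It is not Arrow's implementation: the helper name {{is_ignored}} and the example paths are hypothetical, and it assumes the check is applied to each segment of a path relative to the dataset root.

```python
from pathlib import PurePosixPath

# Default prefixes as quoted in the comment above.
DEFAULT_IGNORE_PREFIXES = (".", "_")


def is_ignored(relative_path, ignore_prefixes=DEFAULT_IGNORE_PREFIXES):
    """Return True if any path segment starts with an ignored prefix.

    Hypothetical helper for illustration only; `relative_path` is assumed
    to be relative to the dataset root.
    """
    parts = PurePosixPath(relative_path).parts
    # str.startswith accepts a tuple and matches any of its entries.
    return any(part.startswith(tuple(ignore_prefixes)) for part in parts)


# Hypothetical paths, relative to the dataset root.
paths = [
    "A=1/data.parquet",             # regular partition directory and file
    ".staging/data.parquet",        # hidden directory
    "A=1/_SUCCESS",                 # underscore-prefixed file
    "_temporary/A=2/data.parquet",  # underscore-prefixed directory
]

kept = [p for p in paths if not is_ignored(p)]
print(kept)  # ['A=1/data.parquet']
```

Because every segment is checked, a file is skipped whether the prefix appears on the file name itself or on any of its parent directories.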

Reproducer (the dataset part needs pyarrow master):

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

import pathlib

# Create a small toy dataset
basedir = pathlib.Path(".")
case2 = basedir / "ignored_prefix_dot"
case2.mkdir(exist_ok=True)

(case2 / "A=1").mkdir(exist_ok=True)
(case2 / ".staging").mkdir(exist_ok=True)
pq.write_table(pa.table({'B': [1, 2, 3]}), case2 / "A=1" / "data.parquet")
pq.write_table(pa.table({'B': [4, 5, 6]}), case2 / ".staging" / "data.parquet")

# this fails with "ValueError: Directory name did not appear to be a partition: .staging"
pq.read_table(str(case2))

# this works fine
dataset = ds.dataset(str(case2), format='parquet', partitioning="hive")
# lists only the file 'ignored_prefix_dot/A=1/data.parquet'
dataset.files
dataset.to_table().to_pandas()
{code}

> [Python][Parquet] Skip hidden directories when reading partitioned parquet files
> --------------------------------------------------------------------------------
>
>                 Key: ARROW-8245
>                 URL: https://issues.apache.org/jira/browse/ARROW-8245
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Caleb Overman
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.17.0
>
>
> When writing a partitioned parquet file, Spark can create a temporary hidden {{.spark-staging}} directory within the parquet file. Because it is a directory and not a file, it is not skipped when trying to read the parquet file. Pyarrow currently only skips directories prefixed with {{_}}.


