Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 02:14:39 UTC

[GitHub] [arrow] westonpace commented on issue #11826: Partition in dataset

westonpace commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-984228962


   The `read_parquet` operation needs to know more about the partitioning.  At the moment it sees files like `{output_path}/1234/chunk_0_0.parquet` and it doesn't know whether `1234` is meant to be a partition column value (and, if so, what the column name should be).  So instead it just does a recursive search of all the files and pretends the inner directories don't exist.
   
   You have two options.  First, you can specify the partitioning on the read:
   
   `pd.read_parquet(path, partitioning=["code"], filters=[('code', '=', '1234')])`
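
   If you'd rather use `pyarrow.dataset` directly for the read, it looks roughly like this (a minimal sketch; `output_path` and the `1234` value come from the example above, and the exact filter value depends on how the partition type is inferred):
   
   ```
   import pyarrow.dataset as ds
   
   # Declare that the single directory level under output_path is a
   # partition column named "code" (directory partitioning).
   dataset = ds.dataset(output_path, format="parquet", partitioning=["code"])
   
   # The partition value type is inferred from the directory names, so
   # "1234" typically comes back as an integer column.
   table = dataset.to_table(filter=ds.field("code") == 1234)
   ```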
   
   Or, if you don't want to have to keep track of it, you can use the `hive` partitioning flavor when you write:
   
   ```
   pa.dataset.write_dataset(
       table,
       output_path,
       basename_template=f"chunk_{y}_{{i}}",
       format="parquet",
       partitioning=["code"],
       partitioning_flavor="hive",
       existing_data_behavior="overwrite_or_ignore",
   )
   ```
   
   This will create paths like `{output_path}/code=1234/chunk_0_0.parquet`.  The `code=1234` form is unambiguous enough for pyarrow's inference to recognize it as a partitioning directory for a column named `code`, so you can then use the `read_parquet` call you have as-is.
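
   For completeness, after a hive-flavored write the read needs no extra arguments (a minimal sketch, assuming `output_path` is the directory written above):
   
   ```
   import pandas as pd
   
   # With code=1234 style directories, pyarrow discovers the "code" partition
   # column on its own, so no `partitioning` argument is needed.  The partition
   # value type is inferred from the directory names, hence the integer 1234
   # in the filter rather than the string '1234'.
   df = pd.read_parquet(output_path, filters=[("code", "=", 1234)])
   ```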

