Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/01 12:38:13 UTC

[GitHub] [arrow] martgra opened a new issue #11826: Partition in dataset

martgra opened a new issue #11826:
URL: https://github.com/apache/arrow/issues/11826


   Hi, 
   
   I'm wondering how partitioning works in the new Datasets API.
   
   This is the part of my code where the data is written:
   ```python
   pa.dataset.write_dataset(
               table,
               output_path,
               basename_template=f"chunk_{y}_{{i}}",
               format="parquet",
               partitioning=["code"],
               existing_data_behavior="overwrite_or_ignore",
           )
   ```
   However, reading the data back results in:
   ```python
   >>> pd.read_parquet(path, filters=[('code', '=', "1234")])
   
   Trace:
   ArrowInvalid: No match for FieldRef.Name(code)
   ```
   Is this expected, that partition columns disappear from the table? I have also tried reading directly with pyarrow, and when I inspect the table columns, "code" is missing.
   
   FYI: 
   
   Thank you!
   


[GitHub] [arrow] martgra commented on issue #11826: Partition in dataset

martgra commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-985459064


   @westonpace Perfect :-) Many thanks


[GitHub] [arrow] westonpace closed issue #11826: Partition in dataset

westonpace closed issue #11826:
URL: https://github.com/apache/arrow/issues/11826


   


[GitHub] [arrow] martgra commented on issue #11826: Partition in dataset

martgra commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-984384535


   @westonpace many thanks :-) I'm new to this parquet thing, so it wasn't intuitive (at least to me). Can I suggest adding this to https://arrow.apache.org/docs/python/dataset.html#writing-partitioned-data ?


[GitHub] [arrow] westonpace commented on issue #11826: Partition in dataset

westonpace commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-984228962


   The `read_parquet` operation needs to know more about the partitioning.  At the moment it sees files like `{output_path}/1234/chunk_0_0.parquet` and it doesn't know whether `1234` is meant to be a partition value (and, if so, what the column should be named).  So instead it just does a recursive search of all the files and pretends the inner directories don't exist.
   
   You have two options.  First, you can specify the partitioning on the read:
   
   `pd.read_parquet(path, partitioning=["code"], filters=[('code', '=', '1234')])`
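   
   For completeness, a minimal sketch of the same read through the `pyarrow.dataset` API directly; the `pa.string()` type for `code` is an assumption for illustration (adjust it if the column is really an integer):
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Passing a schema (and no flavor) to ds.partitioning() yields a
   # DirectoryPartitioning, which tells the reader that the first directory
   # level under output_path holds the values of the "code" column.
   # The string type for "code" is an assumption for illustration.
   dataset = ds.dataset(
       output_path,
       format="parquet",
       partitioning=ds.partitioning(pa.schema([("code", pa.string())])),
   )
   table = dataset.to_table(filter=ds.field("code") == "1234")
   ```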
   
   Or, if you don't want to have to keep track of it, you can use the `hive` partitioning flavor when you write:
   
   ```python
   pa.dataset.write_dataset(
               table,
               output_path,
               basename_template=f"chunk_{y}_{{i}}",
               format="parquet",
               partitioning=["code"],
               partitioning_flavor="hive",
               existing_data_behavior="overwrite_or_ignore",
           )
   ```
   
   This will create paths like `{output_path}/code=1234/chunk_0_0.parquet`.  The `code=1234` form is clear enough for pyarrow's inference to assume it is a partitioning directory with a column named `code`, so you can then use the `read_parquet` call you have as-is.
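   
   A runnable round-trip sketch of the hive approach, using a small throwaway table and a temporary directory in place of the original `table` and `output_path`:
   
   ```python
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   output_path = tempfile.mkdtemp()  # stand-in for the real output path

   # Throwaway table for illustration only.
   table = pa.table({"code": ["1234", "5678"], "value": [1, 2]})

   ds.write_dataset(
       table,
       output_path,
       format="parquet",
       partitioning=["code"],
       partitioning_flavor="hive",
   )

   # Directories are now named code=1234/ and code=5678/.  Passing
   # partitioning="hive" on the read restores "code" as a column.
   readback = ds.dataset(output_path, format="parquet", partitioning="hive").to_table()
   print(readback.column_names)  # "code" is present again
   ```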


[GitHub] [arrow] westonpace commented on issue #11826: Partition in dataset

westonpace commented on issue #11826:
URL: https://github.com/apache/arrow/issues/11826#issuecomment-985030715


   @martgra How does the content added in #11844 look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org