Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/02 21:52:27 UTC

[GitHub] [arrow] westonpace opened a new pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

westonpace opened a new pull request #11844:
URL: https://github.com/apache/arrow/pull/11844


   See #11826 for motivation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r763167433



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       I think this will actually not work, because the default is "no partitioning"? (I think this was actually the confusion in https://github.com/apache/arrow/issues/11826 that no partition inference is done out of the box)







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r763168499



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       To add to the confusion, on the issue you mentioned `pd.read_parquet`, which goes through `parquet.read_table`/`ParquetDataset`, and those paths do infer hive partitioning. `ds.dataset(..)`, however, does not.
   
   The default here is set to `partitioning="hive"`:
   
   https://github.com/apache/arrow/blob/ad1fd0495bc543107a8d882506263906f330f2c5/python/pyarrow/parquet.py#L1935-L1938
   







[GitHub] [arrow] westonpace commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r764083271



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       I think "read with hive" is functionally an "auto" option.  If the write was directory partitioned it is harmless and no partitions are found.  If the write was hive partitioned it will detect it.
   
   So I think we can change read without changing the write.  The only negative case will be the case where someone is using directory partitioning and their partition values actually have `=` inside of them.
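
The point about `=` can be illustrated with a small standalone sketch. It assumes a simplified rule (a directory segment counts as a hive field only if it has the form `key=value`); `parse_hive_segments` is a hypothetical helper written for this illustration and is not pyarrow's actual parser.

```python
from pathlib import PurePosixPath


def parse_hive_segments(path):
    """Collect key=value pairs from the directory part of a path.

    Hypothetical helper for illustration only; this is not pyarrow's parser.
    """
    found = {}
    # Skip the final component, which is the file name.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        # Only segments that contain "=" and a non-empty key count as fields.
        if sep and key:
            found[key] = value
    return found


# Hive-style layout: the partition field is detected.
hive_result = parse_hive_segments("type=a/part-0.parquet")

# Directory-style layout: no "=" in the segment, so nothing is detected.
directory_result = parse_hive_segments("a/part-0.parquet")

# Directory-style layout whose value contains "=": misread as a hive field.
ambiguous_result = parse_hive_segments("x=y/part-0.parquet")
```

Under this simplified rule, only the last case is problematic: a directory-partitioned value such as `x=y` would be read as a field `x` with value `y`.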







[GitHub] [arrow] github-actions[bot] commented on pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#issuecomment-985030461









[GitHub] [arrow] westonpace commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r789909668



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       OK, I opened ARROW-15406 and ARROW-15407, which I think address what you are describing here.







[GitHub] [arrow] westonpace commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r763312562



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       That is confusing.  But you are correct:
   
   ```
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
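   # Write a small table partitioned on 'type' using the hive flavor (type=a/, type=b/).
   table = pa.Table.from_pydict({'type': ['a', 'a', 'b', 'b'], 'value': [1, 2, 3, 4]})
   ds.write_dataset(table, '/tmp/my_dataset', format='parquet', partitioning=['type'], partitioning_flavor='hive', existing_data_behavior='overwrite_or_ignore')
   
   # No partitioning given: ds.dataset does not infer the 'type' column from the paths.
   print(ds.dataset('/tmp/my_dataset').to_table().column_names)
   # Explicit hive partitioning: 'type' is recovered as a column.
   print(ds.dataset('/tmp/my_dataset', partitioning='hive').to_table().column_names)
   # pq.read_table defaults to hive partitioning, so 'type' is recovered as well.
   print(pq.read_table('/tmp/my_dataset').column_names)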
   ```
   
   Can we make this more intuitive somehow?  Maybe have `hive` be the default partitioning for a dataset similar to how it is for pq.read_table?







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r764011810



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       But then also for writing? (which currently defaults to directory partitioning)
   
   I think that will certainly give a smoother roundtrip experience. I am a bit unsure about the change in behaviour (for reading it seems harmless, since it will leave alone directories that don't match a hive scheme, but changing writing from directory to hive might be a bigger change).
   I can't immediately remember why we initially chose directory partitioning as the default ..







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r783280265



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       @AlenkaF and I ran into some similar issues again, reminding me that we should have better defaults (see also https://issues.apache.org/jira/browse/ARROW-15310 that I opened)
   
   Agreed that for the read side, we can certainly already switch the default, as that is quite harmless. 
   
   It would still be nice to have the write side be consistent, but that's a bigger change. In theory we could do it with a deprecation cycle, though, now that we have the `partitioning_flavor` keyword. Currently that defaults to None, and you can set it to "hive" to get hive-style instead of directory partitioning. So we could raise a warning if it is None (and a partitioning is specified), saying that the default will change to "hive" in the future, and people can specify "directory" or "hive" explicitly to silence the warning.
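
A minimal sketch of the deprecation cycle proposed above, using only the standard library. The function name `resolve_partitioning_flavor`, the choice of `FutureWarning`, and the warning text are assumptions made for illustration; this is not pyarrow code.

```python
import warnings


def resolve_partitioning_flavor(partitioning=None, partitioning_flavor=None):
    """Return the flavor to use when writing, warning about a future default.

    Hypothetical helper; the name and the exact message are assumptions.
    """
    if partitioning_flavor is None:
        # Warn only when a partitioning is actually requested.
        if partitioning is not None:
            warnings.warn(
                "The default partitioning_flavor will change to 'hive' in the "
                "future; pass 'directory' or 'hive' explicitly to silence "
                "this warning.",
                FutureWarning,
                stacklevel=2,
            )
        # Current default described in the thread: directory partitioning.
        return "directory"
    if partitioning_flavor not in ("directory", "hive"):
        raise ValueError(f"unknown partitioning_flavor: {partitioning_flavor!r}")
    return partitioning_flavor
```

With this shape, callers who pass `partitioning_flavor="directory"` or `"hive"` never see the warning, and callers who write without any partitioning are not warned either.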







[GitHub] [arrow] westonpace closed pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #11844:
URL: https://github.com/apache/arrow/pull/11844


   

