Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/06 19:23:57 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #11844: ARROW-14972: [Python][Doc] Document automatic partitioning discovery

westonpace commented on a change in pull request #11844:
URL: https://github.com/apache/arrow/pull/11844#discussion_r763312562



##########
File path: docs/source/python/dataset.rst
##########
@@ -340,6 +340,30 @@ when constructing a directory partitioning:
 Directory partitioning also supports providing a full schema rather than inferring
 types from file paths.
 
+Automatic partitioning detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If the directory is partitioned using the hive partitioning scheme (see above)
+then pyarrow will be able to automatically recognize the partitioning and include
+the partitioning information as a column in the returned table.  There is no
+need to specify the partitioning unless you need to override the inferred data
+types of the partitioning columns:
+
+.. code-block:: python
+
+    dataset = ds.dataset("hive_partitioned", format="parquet")

Review comment:
       That is confusing.  But you are correct:
   
   ```
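   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Write a small dataset partitioned by `type` into hive-style directories (type=a/, type=b/)
   table = pa.Table.from_pydict({'type': ['a', 'a', 'b', 'b'], 'value': [1, 2, 3, 4]})
   ds.write_dataset(table, '/tmp/my_dataset', format='parquet',
                    partitioning=['type'], partitioning_flavor='hive',
                    existing_data_behavior='overwrite_or_ignore')
   
   # No partitioning given: ds.dataset does not recover the `type` column
   print(ds.dataset('/tmp/my_dataset').to_table().column_names)
   # partitioning='hive': `type` is parsed from the directory names
   print(ds.dataset('/tmp/my_dataset', partitioning='hive').to_table().column_names)
   # pq.read_table defaults to hive partitioning, so `type` is included
   print(pq.read_table('/tmp/my_dataset').column_names)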
   ```
   
   Can we make this more intuitive somehow?  Maybe `hive` could be the default partitioning for `ds.dataset`, as it already is for `pq.read_table`?



