Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/13 16:33:29 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5576: Minor: Add more documentation about table_partition_columns

alamb opened a new pull request, #5576:
URL: https://github.com/apache/arrow-datafusion/pull/5576

   # Which issue does this PR close?
   
   Related to https://github.com/apache/arrow-datafusion/pull/5545
   
   # Rationale for these changes
   
   I found this feature was not well documented, so I wanted to rectify that while I was in the right headspace
   
   # What changes are included in this PR?
   
   Add docstrings
   
   # Are these changes tested?
   
   Yes
   
   # Are there any user-facing changes?
   Better docs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] crepererum commented on a diff in pull request #5576: Minor: Add more documentation about table_partition_columns

Posted by "crepererum (via GitHub)" <gi...@apache.org>.
crepererum commented on code in PR #5576:
URL: https://github.com/apache/arrow-datafusion/pull/5576#discussion_r1135255416


##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -298,14 +295,49 @@ impl ListingOptions {
         self
     }
 
-    /// Set table partition column names on [`ListingOptions`] and returns self.
+    /// Set `table partition columns` on [`ListingOptions`] and returns self.
+    ///
+    /// "partition columns," used to support [Hive Partitioning], are
+    /// columns added to the data that is read, based on the folder
+    /// structure where the data resides.
+    ///
+    /// For example, given the following files in your filesystem:
+    ///
+    /// ```text
+    /// /mnt/nyctaxi/year=2022/month=01/tripdata.parquet
+    /// /mnt/nyctaxi/year=2021/month=12/tripdata.parquet
+    /// /mnt/nyctaxi/year=2021/month=11/tripdata.parquet
+    /// ```
+    ///
+    /// A [`ListingTable`] created at `/mnt/nyctaxi/` with partition
+    /// columns "year" and "month" will include new `year` and `month`
+    /// columns while reading the files. The `year` column would have
+    /// value `2022` and the `month` column would have value `01` for
+    /// the rows read from
+    /// `/mnt/nyctaxi/year=2022/month=01/tripdata.parquet`
+    ///
+    /// # Notes
+    ///
+    /// - If only one level (e.g. `year` in the example above) is specified, the other levels are ignored
+    /// but the files are still read.
+    ///
+    /// - Files that don't follow this partitioning scheme will be
+    /// ignored.
+    ///
+    /// - Since the columns have the same value for all rows read from
+    /// each individual file (such as dates), they are typically
+    /// dictionary encoded for efficiency.
+    ///

Review Comment:
   Because this may be unexpected to some users who didn't read the `Hive Partitioning` link:
   
   ```suggestion
       ///
       /// - The partition columns are extracted solely from the file path. In particular, they are NOT part of the Parquet files themselves.
       ///
   ```
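
   To illustrate the point that partition values come from the path alone, here is a minimal, standard-library-only sketch. It is not DataFusion's implementation: `parse_partition_values` and its parameters are hypothetical names chosen for this example, and it assumes `/`-separated paths with `<column>=<value>` directory names directly below the table root.

   ```rust
   /// Hypothetical helper (NOT part of DataFusion): returns the values of the
   /// requested partition columns, read only from the directory names of
   /// `file_path`. Returns `None` when the path does not follow the
   /// `<column>=<value>` scheme for every requested column.
   fn parse_partition_values<'a>(
       table_root: &str,
       file_path: &'a str,
       partition_cols: &[&str],
   ) -> Option<Vec<&'a str>> {
       // Only the part of the path below the table root is inspected.
       let relative = file_path.strip_prefix(table_root)?;
       let mut segments = relative.trim_start_matches('/').split('/');

       let mut values = Vec::with_capacity(partition_cols.len());
       for col in partition_cols {
           // Each leading directory must look like `<col>=<value>`,
           // in the same order as the requested columns.
           let segment = segments.next()?;
           let (name, value) = segment.split_once('=')?;
           if name != *col {
               return None;
           }
           values.push(value);
       }
       // Any remaining segments (deeper levels, the file name) are not examined.
       Some(values)
   }
   ```

   With the example layout from the docstring, calling this sketch with columns `year` and `month` on `/mnt/nyctaxi/year=2022/month=01/tripdata.parquet` yields `2022` and `01`. Requesting only `year` yields `2022` and leaves the `month=01` level unexamined. A file placed directly under `/mnt/nyctaxi/` yields `None`, mirroring the note that non-conforming files are ignored. The file contents are never opened.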





[GitHub] [arrow-datafusion] alamb merged pull request #5576: Minor: Add more documentation about table_partition_columns

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5576:
URL: https://github.com/apache/arrow-datafusion/pull/5576

