Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/16 01:47:41 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #11970: [Docs][Minor] Add guidance on partitioning datasets [WIP]

westonpace commented on a change in pull request #11970:
URL: https://github.com/apache/arrow/pull/11970#discussion_r770165344



##########
File path: docs/source/cpp/dataset.rst
##########
@@ -334,6 +334,25 @@ altogether if they do not match the filter:
    :linenos:
    :lineno-match:
 
+Partitioning performance considerations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Partitioning datasets can improve performance when reading them, but it has several
+potential costs when reading and writing:
+
+#. It can significantly increase the number of files to write. The number of partitions is a
+   floor for the number of files in a dataset. If you partition a dataset by date with a 
+   year of data, you will have at least 365 files. If you further partition by another
+   dimension with 1,000 unique values, you will have 365,000 files. This can make writing
+   slower and increase the size of the overall dataset, because each file has some fixed
+   overhead. For example, each file in a Parquet dataset contains the schema.
+#. Multiple partitioning columns can produce deeply nested folder structures which are slow
+   to navigate because they require many recursive "list directory" calls to discover files.
+   These operations may be particularly expensive if you are using an object store 
+   filesystem such as S3. One workaround is to combine multiple columns into one for
+   partitioning. For example, instead of a layout like /year/month/day/ use /YYYY-MM-DD/.
+ 
+

Review comment:
       Both of these are in the "cons" section.  It might be worth adding a bit more body to "can improve the performance when reading datasets".
   
   There are two advantages (but really only one):
   
    * Multiple files let us read in parallel.
    * Smaller partitions allow for more selective queries, e.g. we can load less data from disk.
   
   We should also mention (here or elsewhere) that everything that applies here to the # of files also applies to the # of record batches (or # of row groups in Parquet).  It's possible to have 1 file with way too many row groups and get similar performance issues.
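   As a rough illustration of the two costs being discussed (plain Python with hypothetical partition values, not the pyarrow API), the file-count floor and the flattened-key workaround can be sketched as:

   ```python
   from itertools import product

   # Hypothetical partition values; the real dataset's columns are unknown here.
   dates = [f"2021-{m:02d}-{d:02d}" for m, d in product(range(1, 13), range(1, 31))]
   categories = [f"cat_{i}" for i in range(1000)]

   # The number of distinct partition-value combinations is a floor on the
   # number of files the dataset will contain: at least one file per combination.
   min_files = len(dates) * len(categories)
   print(min_files)  # 360000 for 360 dates x 1000 categories

   # Nested layout: one directory level per partition column.
   nested_path = "year=2021/month=06/day=15/cat=cat_42"
   # Flattened layout: combining year/month/day into one key cuts the number
   # of recursive "list directory" calls needed to discover files.
   flat_path = "date=2021-06-15/cat=cat_42"

   print(nested_path.count("/") + 1)  # 4 directory levels to traverse
   print(flat_path.count("/") + 1)    # 2 directory levels to traverse
   ```

   The same counting argument applies to record batches and Parquet row groups: a single file with one row group per tiny batch pays the per-unit overhead just as a dataset with too many files does.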




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org