Posted to commits@arrow.apache.org by jo...@apache.org on 2021/08/24 13:05:14 UTC

[arrow-cookbook] branch main updated: Writing Partitioned Datasets recipe for Python (#47)

This is an automated email from the ASF dual-hosted git repository.

jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git


The following commit(s) were added to refs/heads/main by this push:
     new a3e01f7  Writing Partitioned Datasets recipe for Python (#47)
a3e01f7 is described below

commit a3e01f762b70f964650e5fb5999d2fa5893b4352
Author: Alessandro Molina <am...@turbogears.org>
AuthorDate: Tue Aug 24 15:05:05 2021 +0200

    Writing Partitioned Datasets recipe for Python (#47)
---
 python/source/io.rst | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/python/source/io.rst b/python/source/io.rst
index b3eab15..717d1db 100644
--- a/python/source/io.rst
+++ b/python/source/io.rst
@@ -217,6 +217,65 @@ provided to :func:`pyarrow.csv.read_csv` to drive
     col1: int64
     ChunkedArray = 0 .. 99
 
+Writing Partitioned Datasets
+============================
+
+When your dataset is big it usually makes sense to split it into
+multiple separate files. You can do this manually or use
+:func:`pyarrow.dataset.write_dataset` to let Arrow do the work
+of splitting the data into chunks for you.
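+
+For example, a minimal sketch (assuming a pyarrow version recent enough to
+support the ``max_rows_per_file`` keyword of :func:`pyarrow.dataset.write_dataset`,
+and using ``./chunked`` purely as an example output directory) that splits a
+table by row count alone could look like:
+
+.. testcode::
+
+    import pyarrow as pa
+    import pyarrow.dataset as ds
+
+    table = pa.table({"col1": list(range(100))})
+
+    # Ask Arrow to write at most 10 rows per file (and per row group),
+    # producing 10 separate Parquet files under ./chunked
+    ds.write_dataset(table, "./chunked", format="parquet",
+                     max_rows_per_file=10, max_rows_per_group=10)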
+
+The ``partitioning`` argument allows you to tell :func:`pyarrow.dataset.write_dataset`
+by which columns the data should be split.
+
+For example, given 100 birthdays between 2000 and 2009:
+
+.. testcode::
+
+    import pyarrow as pa
+    import numpy.random
+
+    data = pa.table({"day": numpy.random.randint(1, 32, size=100),
+                     "month": numpy.random.randint(1, 13, size=100),
+                     "year": [2000 + x // 10 for x in range(100)]})
+
+Then we could partition the data by the year column so that it
+gets saved in 10 different files:
+
+.. testcode::
+
+    import pyarrow as pa
+    import pyarrow.dataset as ds
+
+    ds.write_dataset(data, "./partitioned", format="parquet",
+                     partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
+
+Arrow partitions datasets into subdirectories by default, which in this
+case results in 10 different directories, each named after a value of the
+partitioning column and containing a file with the subset of the data for
+that partition:
+
+.. testcode::
+
+    from pyarrow import fs
+
+    localfs = fs.LocalFileSystem()
+    partitioned_dir_content = localfs.get_file_info(fs.FileSelector("./partitioned", recursive=True))
+    files = sorted((f.path for f in partitioned_dir_content if f.type == fs.FileType.File))
+
+    for file in files:
+        print(file)
+
+.. testoutput::
+
+    ./partitioned/2000/part-0.parquet
+    ./partitioned/2001/part-1.parquet
+    ./partitioned/2002/part-2.parquet
+    ./partitioned/2003/part-3.parquet
+    ./partitioned/2004/part-4.parquet
+    ./partitioned/2005/part-6.parquet
+    ./partitioned/2006/part-5.parquet
+    ./partitioned/2007/part-7.parquet
+    ./partitioned/2008/part-8.parquet
+    ./partitioned/2009/part-9.parquet
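+
+The layout above is Arrow's default directory partitioning, where each
+directory is named only after the value of the partitioning column. As a
+sketch of a possible variation (``./partitioned_hive`` is just an example
+output directory name), passing ``flavor="hive"`` to
+:func:`pyarrow.dataset.partitioning` instead produces Hive-style directory
+names such as ``year=2000``:
+
+.. testcode::
+
+    # Same data and schema as before, but with Hive-style partitioning,
+    # which encodes the column name in each directory name (year=2000, ...)
+    ds.write_dataset(data, "./partitioned_hive", format="parquet",
+                     partitioning=ds.partitioning(pa.schema([("year", pa.int16())]),
+                                                  flavor="hive"))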
+
 Reading Partitioned data
 ========================