Posted to commits@arrow.apache.org by jo...@apache.org on 2021/08/24 13:05:14 UTC
[arrow-cookbook] branch main updated: Writing Partitioned Datasets
recipe for Python (#47)
This is an automated email from the ASF dual-hosted git repository.
jorisvandenbossche pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-cookbook.git
The following commit(s) were added to refs/heads/main by this push:
new a3e01f7 Writing Partitioned Datasets recipe for Python (#47)
a3e01f7 is described below
commit a3e01f762b70f964650e5fb5999d2fa5893b4352
Author: Alessandro Molina <am...@turbogears.org>
AuthorDate: Tue Aug 24 15:05:05 2021 +0200
Writing Partitioned Datasets recipe for Python (#47)
---
python/source/io.rst | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/python/source/io.rst b/python/source/io.rst
index b3eab15..717d1db 100644
--- a/python/source/io.rst
+++ b/python/source/io.rst
@@ -217,6 +217,65 @@ provided to :func:`pyarrow.csv.read_csv` to drive
col1: int64
ChunkedArray = 0 .. 99
+Writing Partitioned Datasets
+============================
+
+When your dataset is big, it usually makes sense to split it into
+multiple separate files. You can do this manually, or use
+:func:`pyarrow.dataset.write_dataset` to let Arrow do the work
+of splitting the data into chunks for you.
+
+The ``partitioning`` argument allows you to tell :func:`pyarrow.dataset.write_dataset`
+which columns the data should be split by.
+
+For example, given 100 birthdays between 2000 and 2009:
+
+.. testcode::
+
+ import pyarrow as pa
+ import numpy.random
+
+ data = pa.table({"day": numpy.random.randint(1, 31, size=100),
+ "month": numpy.random.randint(1, 13, size=100),
+ "year": [2000 + x // 10 for x in range(100)]})
+
+Then we could partition the data by the year column so that it
+gets saved in 10 different files:
+
+.. testcode::
+
+ import pyarrow as pa
+ import pyarrow.dataset as ds
+
+ ds.write_dataset(data, "./partitioned", format="parquet",
+ partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
+
+By default, Arrow partitions datasets into subdirectories. This results
+in 10 directories, each named after a value of the partitioning column
+and each containing a file with the subset of the data for that partition:
+
+.. testcode::
+
+ from pyarrow import fs
+
+ localfs = fs.LocalFileSystem()
+ partitioned_dir_content = localfs.get_file_info(fs.FileSelector("./partitioned", recursive=True))
+ files = sorted((f.path for f in partitioned_dir_content if f.type == fs.FileType.File))
+
+ for file in files:
+ print(file)
+
+.. testoutput::
+
+ ./partitioned/2000/part-0.parquet
+ ./partitioned/2001/part-1.parquet
+ ./partitioned/2002/part-2.parquet
+ ./partitioned/2003/part-3.parquet
+ ./partitioned/2004/part-4.parquet
+ ./partitioned/2005/part-6.parquet
+ ./partitioned/2006/part-5.parquet
+ ./partitioned/2007/part-7.parquet
+ ./partitioned/2008/part-8.parquet
+ ./partitioned/2009/part-9.parquet
+
Reading Partitioned data
========================