Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/19 12:23:44 UTC

[GitHub] [arrow-cookbook] jorisvandenbossche commented on a change in pull request #47: Writing Partitioned Datasets recipe for Python

jorisvandenbossche commented on a change in pull request #47:
URL: https://github.com/apache/arrow-cookbook/pull/47#discussion_r692057775



##########
File path: python/source/io.rst
##########
@@ -217,6 +217,63 @@ provided to :func:`pyarrow.csv.read_csv` to drive
     col1: int64
     ChunkedArray = 0 .. 99
 
+Writing Partitioned Datasets 
+============================
+
+When your dataset is big it usually makes sense to split it into
+multiple separate files. You can do this manually or use 
+:func:`pyarrow.dataset.write_dataset` to let Arrow do the effort
+of splitting the data in chunks for you.
+
+The ``partitioning`` argument allows to tell :func:`pyarrow.dataset.write_dataset`
+for which columns the data should be split. For example given a table
+with 100 numbers, we could add a ``chunk`` column that groups numbers
+in chunks of 10 numbers each.
+
+.. testcode::
+
+    data = pa.table({"numbers": range(100), 
+                     "chunk": [x // 10 for x in range(100)]})
+
+Then we could partition the data by the chunk column so that it
+gets saved in 10 different files:
+
+.. testcode::
+
+    import pyarrow as pa
+    import pyarrow.dataset as ds
+
+    ds.write_dataset(data, "./partitioned", format="parquet",
+                     partitioning=ds.partitioning(pa.schema([("chunk", pa.int8())])))
+
+Arrow will partition datasets in subdirectories by default, which will
+result in 10 different directories named with the value of the partitioning
+column and with file containing the data partition inside:
+
+.. testcode::
+
+    from pyarrow import fs
+
+    s3 = fs.LocalFileSystem()

Review comment:
   ```suggestion
       localfs = fs.LocalFileSystem()
   ```

   Maybe rename this variable, since it holds a local filesystem and not S3? (Or don't import `fs`, and use `fs` as the name for the variable.)



