Posted to commits@arrow.apache.org by ap...@apache.org on 2021/09/01 10:22:58 UTC
[arrow] branch master updated: ARROW-13404: [Doc][Python] Improve
PyArrow documentation for new users
This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 3da09e5 ARROW-13404: [Doc][Python] Improve PyArrow documentation for new users
3da09e5 is described below
commit 3da09e51e3f94ca427915c98ac59d72701f728bc
Author: Alessandro Molina <am...@turbogears.org>
AuthorDate: Wed Sep 1 12:20:23 2021 +0200
ARROW-13404: [Doc][Python] Improve PyArrow documentation for new users
* Add link to the cookbooks
* Improve a bit landing page for PyArrow for people that don't already know Arrow
* Add a Getting Started section to introduce people to Arrow and PyArrow
Closes #10999 from amol-/ARROW-13404
Authored-by: Alessandro Molina <am...@turbogears.org>
Signed-off-by: Antoine Pitrou <an...@python.org>
---
docs/source/index.rst | 10 +++
docs/source/python/getstarted.rst | 145 ++++++++++++++++++++++++++++++++++++++
docs/source/python/index.rst | 17 +++--
3 files changed, 166 insertions(+), 6 deletions(-)
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 65aeb47..5579e8c 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -55,6 +55,16 @@ target environment.**
Rust <https://docs.rs/crate/arrow/>
status
+.. _toc.cookbook:
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Cookbooks
+
+ C++ <https://arrow.apache.org/cookbook/cpp/>
+ Python <https://arrow.apache.org/cookbook/py/>
+ R <https://arrow.apache.org/cookbook/r/>
+
.. _toc.columnar:
.. toctree::
diff --git a/docs/source/python/getstarted.rst b/docs/source/python/getstarted.rst
new file mode 100644
index 0000000..36e4707
--- /dev/null
+++ b/docs/source/python/getstarted.rst
@@ -0,0 +1,145 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. _getstarted:
+
+Getting Started
+===============
+
+Arrow manages data in arrays (:class:`pyarrow.Array`), which can be
+grouped into tables (:class:`pyarrow.Table`), where each array
+represents one column of tabular data.
+
+Arrow also supports various formats for moving tabular data to and
+from disk and across networks. The most commonly used formats are
+Parquet (:ref:`parquet`) and the IPC format (:ref:`ipc`).
+
+Creating Arrays and Tables
+--------------------------
+
+Arrays in Arrow are collections of data of uniform type. That allows
+Arrow to use the best performing implementation to store the data and
+perform computations on it. Each array therefore has both data and
+a type:
+
+.. ipython:: python
+
+ import pyarrow as pa
+
+ days = pa.array([1, 12, 17, 23, 28], type=pa.int8())
+
+Multiple arrays can be combined into a table, where each array
+becomes a column once it is paired with a column name:
+
+.. ipython:: python
+
+ months = pa.array([1, 3, 5, 7, 1], type=pa.int8())
+ years = pa.array([1990, 2000, 1995, 2000, 1995], type=pa.int16())
+
+ birthdays_table = pa.table([days, months, years],
+ names=["days", "months", "years"])
+
+ birthdays_table
+
+See :ref:`data` for more details.
+
+Saving and Loading Tables
+-------------------------
+
+Once you have tabular data, Arrow provides out-of-the-box
+features to save and restore that data in common formats
+like Parquet:
+
+.. ipython:: python
+
+ import pyarrow.parquet as pq
+
+ pq.write_table(birthdays_table, 'birthdays.parquet')
+
+Once your data is on disk, loading it back is a single function call.
+Arrow is heavily optimized for memory use and speed, so loading
+the data is fast:
+
+.. ipython:: python
+
+ reloaded_birthdays = pq.read_table('birthdays.parquet')
+
+ reloaded_birthdays
+
+Saving and loading data in Arrow is usually done through the
+:ref:`Parquet <parquet>`, :ref:`IPC <ipc>` (:ref:`feather`),
+:ref:`CSV <csv>` or :ref:`Line-Delimited JSON <json>` formats.
+
+Performing Computations
+-----------------------
+
+Arrow ships with a set of compute functions that can be applied
+to its arrays and tables. These functions make it possible
+to apply transformations to the data:
+
+.. ipython:: python
+
+ import pyarrow.compute as pc
+
+ pc.value_counts(birthdays_table["years"])
+
+See :ref:`compute` for a list of available compute functions and
+how to use them.
+
+Working with Large Data
+-----------------------
+
+Arrow also provides the :mod:`pyarrow.dataset` API to work with
+large data. It handles partitioning your data into
+smaller chunks for you:
+
+.. ipython:: python
+
+ import pyarrow.dataset as ds
+
+ ds.write_dataset(birthdays_table, "savedir", format="parquet",
+ partitioning=ds.partitioning(
+ pa.schema([birthdays_table.schema.field("years")])
+ ))
+
+Loading the partitioned dataset back will detect the chunks:
+
+.. ipython:: python
+
+ birthdays_dataset = ds.dataset("savedir", format="parquet", partitioning=["years"])
+
+ birthdays_dataset.files
+
+The dataset lazily loads chunks of data only when iterating over them:
+
+.. ipython:: python
+
+ import datetime
+
+ current_year = datetime.datetime.now(datetime.timezone.utc).year
+ for table_chunk in birthdays_dataset.to_batches():
+ print("AGES", pc.subtract(current_year, table_chunk["years"]))
+
+For further details on how to work with big datasets, how to filter them,
+how to project them, etc., refer to the :ref:`dataset` documentation.
+
+Continuing from Here
+--------------------
+
+To dig further into Arrow, you might want to read the
+:doc:`PyArrow Documentation <./index>` itself or the
+`Arrow Python Cookbook <https://arrow.apache.org/cookbook/py/>`_.
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
index cc73830..0ffa405 100644
--- a/docs/source/python/index.rst
+++ b/docs/source/python/index.rst
@@ -15,12 +15,16 @@
.. specific language governing permissions and limitations
.. under the License.
-Python bindings
-===============
+PyArrow - Apache Arrow Python bindings
+======================================
-This is the documentation of the Python API of Apache Arrow. For more details
-on the Arrow format and other language bindings see the
-:doc:`parent documentation <../index>`.
+This is the documentation of the Python API of Apache Arrow.
+
+Apache Arrow is a development platform for in-memory analytics.
+It contains a set of technologies that enable big data systems to store, process and move data fast.
+
+See the :doc:`parent documentation <../index>` for additional details on
+the Arrow Project itself, on the Arrow format and the other language bindings.
The Arrow Python bindings (also named "PyArrow") have first-class integration
with NumPy, pandas, and built-in Python objects. They are based on the C++
@@ -34,9 +38,10 @@ files into Arrow structures.
:maxdepth: 2
install
- memory
+ getstarted
data
compute
+ memory
ipc
filesystems
filesystems_deprecated