Posted to commits@arrow.apache.org by we...@apache.org on 2016/11/11 19:18:18 UTC
arrow git commit: ARROW-356: Add documentation about reading Parquet
Repository: arrow
Updated Branches:
refs/heads/master 4fa7ac4f6 -> 7f048a4b8
ARROW-356: Add documentation about reading Parquet
Assumes #192.
Author: Uwe L. Korn <uw...@xhochy.com>
Closes #193 from xhochy/ARROW-356 and squashes the following commits:
530484f [Uwe L. Korn] Mention new setup instructions
06b2f9c [Uwe L. Korn] Add tables describing dtype support
0467e0e [Uwe L. Korn] Move installation instructions into Sphinx docs
744202a [Uwe L. Korn] Document Pandas<->Arrow conversion
b5b4df5 [Uwe L. Korn] ARROW-356: Add documentation about reading Parquet
Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/7f048a4b
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/7f048a4b
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/7f048a4b
Branch: refs/heads/master
Commit: 7f048a4b8bdc6a20cd8f6eeca928ecbb6db7dd96
Parents: 4fa7ac4
Author: Uwe L. Korn <uw...@xhochy.com>
Authored: Fri Nov 11 14:18:09 2016 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Fri Nov 11 14:18:09 2016 -0500
----------------------------------------------------------------------
python/doc/INSTALL.md | 101 ----------------------------
python/doc/index.rst | 16 +++--
python/doc/install.rst | 151 ++++++++++++++++++++++++++++++++++++++++++
python/doc/pandas.rst | 114 +++++++++++++++++++++++++++++++
python/doc/parquet.rst | 66 ++++++++++++++++++
python/pyarrow/table.pyx | 15 +++++
6 files changed, 355 insertions(+), 108 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/INSTALL.md
----------------------------------------------------------------------
diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md
deleted file mode 100644
index 81eed56..0000000
--- a/python/doc/INSTALL.md
+++ /dev/null
@@ -1,101 +0,0 @@
-<!---
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License. See accompanying LICENSE file.
--->
-
-## Building pyarrow (Apache Arrow Python library)
-
-First, clone the master git repository:
-
-```bash
-git clone https://github.com/apache/arrow.git arrow
-```
-
-#### System requirements
-
-Building pyarrow requires:
-
-* A C++11 compiler
-
- * Linux: gcc >= 4.8 or clang >= 3.5
- * OS X: XCode 6.4 or higher preferred
-
-* [cmake][1]
-
-#### Python requirements
-
-You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and
-are not being targeted.
-
-> This library targets CPython only due to an emphasis on interoperability with
-> pandas and NumPy, which are only available for CPython.
-
-The build requires NumPy, Cython, and a few other Python dependencies:
-
-```bash
-pip install cython
-cd arrow/python
-pip install -r requirements.txt
-```
-
-#### Installing Arrow C++ library
-
-First, you should choose an installation location for Arrow C++. In the future
-using the default system install location will work, but for now we are being
-explicit:
-
-```bash
-export ARROW_HOME=$HOME/local
-```
-
-Now, we build Arrow:
-
-```bash
-cd arrow/cpp
-
-mkdir dev-build
-cd dev-build
-
-cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
-
-make
-
-# Use sudo here if $ARROW_HOME requires it
-make install
-```
-
-#### Install `pyarrow`
-
-```bash
-cd arrow/python
-
-python setup.py install
-```
-
-> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are
-> unable to import pyarrow, upgrading XCode may be the solution.
-
-
-```python
-In [1]: import pyarrow
-
-In [2]: pyarrow.from_pylist([1,2,3])
-Out[2]:
-<pyarrow.array.Int64Array object at 0x7f899f3e60e8>
-[
- 1,
- 2,
- 3
-]
-```
-
-[1]: https://cmake.org/
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 88725ba..6725ae7 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -31,14 +31,16 @@ additional functionality such as reading Apache Parquet files into Arrow
structures.
.. toctree::
- :maxdepth: 4
- :hidden:
+ :maxdepth: 2
+ :caption: Getting Started
+ Installing pyarrow <install.rst>
+ Pandas <pandas.rst>
Module Reference <modules.rst>
-Indices and tables
-==================
+.. toctree::
+ :maxdepth: 2
+ :caption: Additional Features
+
+ Parquet format <parquet.rst>
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/install.rst
----------------------------------------------------------------------
diff --git a/python/doc/install.rst b/python/doc/install.rst
new file mode 100644
index 0000000..1bab017
--- /dev/null
+++ b/python/doc/install.rst
@@ -0,0 +1,151 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Install PyArrow
+===============
+
+Conda
+-----
+
+To install the latest version of PyArrow from conda-forge using conda:
+
+.. code-block:: bash
+
+ conda install -c conda-forge pyarrow
+
+Pip
+---
+
+Install the latest version from PyPI:
+
+.. code-block:: bash
+
+ pip install pyarrow
+
+.. note::
+    Currently there are only binary artifacts available for Linux and macOS.
+    On other platforms this will only pull the Python sources and assumes an
+    existing installation of the C++ part of Arrow.
+ To retrieve the binary artifacts, you'll need a recent ``pip`` version that
+ supports features like the ``manylinux1`` tag.
+
+Building from source
+--------------------
+
+First, clone the master git repository:
+
+.. code-block:: bash
+
+ git clone https://github.com/apache/arrow.git arrow
+
+System requirements
+~~~~~~~~~~~~~~~~~~~
+
+Building pyarrow requires:
+
+* A C++11 compiler
+
+ * Linux: gcc >= 4.8 or clang >= 3.5
+ * OS X: XCode 6.4 or higher preferred
+
+* `CMake <https://cmake.org/>`_
+
+Python requirements
+~~~~~~~~~~~~~~~~~~~
+
+You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases
+are not being targeted.
+
+.. note::
+ This library targets CPython only due to an emphasis on interoperability with
+ pandas and NumPy, which are only available for CPython.
+
+The build requires NumPy, Cython, and a few other Python dependencies:
+
+.. code-block:: bash
+
+ pip install cython
+ cd arrow/python
+ pip install -r requirements.txt
+
+Installing Arrow C++ library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, you should choose an installation location for Arrow C++. In the future
+using the default system install location will work, but for now we are being
+explicit:
+
+.. code-block:: bash
+
+ export ARROW_HOME=$HOME/local
+
+Now, we build Arrow:
+
+.. code-block:: bash
+
+ cd arrow/cpp
+
+ mkdir dev-build
+ cd dev-build
+
+ cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
+
+ make
+
+ # Use sudo here if $ARROW_HOME requires it
+ make install
+
+To get the optional Parquet support, you should also build and install
+`parquet-cpp <https://github.com/apache/parquet-cpp/blob/master/README.md>`_.
+
+Install ``pyarrow``
+~~~~~~~~~~~~~~~~~~~
+
+
+.. code-block:: bash
+
+ cd arrow/python
+
+  # --with-parquet enables the Apache Parquet support in PyArrow
+ # --build-type=release disables debugging information and turns on
+ # compiler optimizations for native code
+  python setup.py build_ext --with-parquet --build-type=release
+  python setup.py install
+
+.. warning::
+    On XCode 6 and prior there are some known OS X ``@rpath`` issues. If you
+    are unable to import pyarrow, upgrading XCode may be the solution.
+
+.. note::
+ In development installations, you will also need to set a correct
+    ``LD_LIBRARY_PATH``. This is typically done with
+    ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``.
+
+
+.. code-block:: python
+
+ In [1]: import pyarrow
+
+ In [2]: pyarrow.from_pylist([1,2,3])
+ Out[2]:
+ <pyarrow.array.Int64Array object at 0x7f899f3e60e8>
+ [
+ 1,
+ 2,
+ 3
+ ]
+
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
new file mode 100644
index 0000000..7c70074
--- /dev/null
+++ b/python/doc/pandas.rst
@@ -0,0 +1,114 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Pandas Interface
+================
+
+To interface with Pandas, PyArrow provides various conversion routines to
+consume Pandas structures and convert back to them.
+
+DataFrames
+----------
+
+The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
+Both consist of a set of named columns of equal length. While Pandas only
+supports flat columns, the Table also provides nested columns, so it can
+represent data that a DataFrame cannot; a full conversion is therefore not always possible.
+
+Conversion from a Table to a DataFrame is done by calling
+:meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using
+:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the
+convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps
+of different resolutions, Pandas only supports nanosecond timestamps while most
+other systems (e.g. Parquet) only work with millisecond timestamps. This
+parameter can be used to perform the time conversion already during the
+Pandas-to-Arrow conversion.
+
+.. code-block:: python
+
+ import pyarrow as pa
+ import pandas as pd
+
+ df = pd.DataFrame({"a": [1, 2, 3]})
+ # Convert from Pandas to Arrow
+ table = pa.from_pandas_dataframe(df)
+ # Convert back to Pandas
+ df_new = table.to_pandas()
+
+
+Series
+------
+
+In Arrow, the most similar structure to a Pandas Series is an Array.
+It is a vector that contains data of a single type stored in contiguous,
+linear memory. You can convert a Pandas Series to an Arrow Array using
+:meth:`pyarrow.array.from_pandas_series`. As Arrow Arrays are always nullable,
+you can supply an optional ``mask`` parameter to mark all null entries.
+
+Type differences
+----------------
+
+With the current design of Pandas and Arrow, it is not possible to convert all
+column types unmodified. One of the main issues here is that Pandas has no
+support for nullable columns of arbitrary type. Also, ``datetime64`` is
+currently fixed to nanosecond resolution. On the other hand, Arrow may still
+be missing support for some types.
+
+Pandas -> Arrow Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++------------------------+--------------------------+
+| Source Type (Pandas) | Destination Type (Arrow) |
++========================+==========================+
+| ``bool`` | ``BOOL`` |
++------------------------+--------------------------+
+| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` |
++------------------------+--------------------------+
+| ``float32`` | ``FLOAT`` |
++------------------------+--------------------------+
+| ``float64`` | ``DOUBLE`` |
++------------------------+--------------------------+
+| ``str`` / ``unicode`` | ``STRING`` |
++------------------------+--------------------------+
+| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` |
++------------------------+--------------------------+
+| ``pd.Categorical`` | *not supported* |
++------------------------+--------------------------+
+
+Arrow -> Pandas Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------------------------------------+--------------------------------------------------------+
+| Source Type (Arrow) | Destination Type (Pandas) |
++=====================================+========================================================+
+| ``BOOL`` | ``bool`` |
++-------------------------------------+--------------------------------------------------------+
+| ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` |
++-------------------------------------+--------------------------------------------------------+
+| ``FLOAT`` | ``float32`` |
++-------------------------------------+--------------------------------------------------------+
+| ``DOUBLE`` | ``float64`` |
++-------------------------------------+--------------------------------------------------------+
+| ``STRING`` | ``str`` |
++-------------------------------------+--------------------------------------------------------+
+| ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) |
++-------------------------------------+--------------------------------------------------------+
+
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
new file mode 100644
index 0000000..674ed80
--- /dev/null
+++ b/python/doc/parquet.rst
@@ -0,0 +1,66 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Reading/Writing Parquet files
+=============================
+
+If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was
+found during the build, you can read files in the Parquet format to/from Arrow
+memory structures. The Parquet support code is located in the
+:mod:`pyarrow.parquet` module and your package needs to be built with the
+``--with-parquet`` flag for ``build_ext``.
+
+Reading Parquet
+---------------
+
+To read a Parquet file into Arrow memory, you can use the following code
+snippet. It will read the whole Parquet file into memory as a
+:class:`pyarrow.table.Table`.
+
+.. code-block:: python
+
+ import pyarrow
+ import pyarrow.parquet
+
+ A = pyarrow
+
+ table = A.parquet.read_table('<filename>')
+
+Writing Parquet
+---------------
+
+Given an instance of :class:`pyarrow.table.Table`, the simplest way to
+persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table`
+method.
+
+.. code-block:: python
+
+ import pyarrow
+ import pyarrow.parquet
+
+ A = pyarrow
+
+ table = A.Table(..)
+ A.parquet.write_table(table, '<filename>')
+
+By default this will write the Table as a single RowGroup using ``DICTIONARY``
+encoding. To increase the potential for parallelism when a query engine
+processes the Parquet file, set ``chunk_size`` to a fraction of the total number of rows.
+
+If you also want to compress the columns, you can select a compression
+method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
+you want to minimize size and ``SNAPPY`` if you want to maximize performance.
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/pyarrow/table.pyx
----------------------------------------------------------------------
diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx
index c71bc71..5459f26 100644
--- a/python/pyarrow/table.pyx
+++ b/python/pyarrow/table.pyx
@@ -298,6 +298,8 @@ cdef class RecordBatch:
cdef class Table:
'''
+    A collection of top-level named, equal-length Arrow arrays.
+
Do not call this class's constructor directly.
'''
@@ -335,6 +337,19 @@ cdef class Table:
@staticmethod
def from_arrays(names, arrays, name=None):
+        """
+        Construct a Table from Arrow arrays.
+
+        Parameters
+        ----------
+
+        names : list of str
+            Names for the table columns
+        arrays : list of pyarrow.array.Array
+            Equal-length arrays that should form the table
+        name : str, optional
+            Name for the Table
+        """
cdef:
Array arr
c_string c_name