Posted to commits@arrow.apache.org by we...@apache.org on 2016/11/11 19:18:18 UTC

arrow git commit: ARROW-356: Add documentation about reading Parquet

Repository: arrow
Updated Branches:
  refs/heads/master 4fa7ac4f6 -> 7f048a4b8


ARROW-356: Add documentation about reading Parquet

Assumes #192.

Author: Uwe L. Korn <uw...@xhochy.com>

Closes #193 from xhochy/ARROW-356 and squashes the following commits:

530484f [Uwe L. Korn] Mention new setup instructions
06b2f9c [Uwe L. Korn] Add tables describing dtype support
0467e0e [Uwe L. Korn] Move installation instructions into Sphinx docs
744202a [Uwe L. Korn] Document Pandas<->Arrow conversion
b5b4df5 [Uwe L. Korn] ARROW-356: Add documentation about reading Parquet


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/7f048a4b
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/7f048a4b
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/7f048a4b

Branch: refs/heads/master
Commit: 7f048a4b8bdc6a20cd8f6eeca928ecbb6db7dd96
Parents: 4fa7ac4
Author: Uwe L. Korn <uw...@xhochy.com>
Authored: Fri Nov 11 14:18:09 2016 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Fri Nov 11 14:18:09 2016 -0500

----------------------------------------------------------------------
 python/doc/INSTALL.md    | 101 ----------------------------
 python/doc/index.rst     |  16 +++--
 python/doc/install.rst   | 151 ++++++++++++++++++++++++++++++++++++++++++
 python/doc/pandas.rst    | 114 +++++++++++++++++++++++++++++++
 python/doc/parquet.rst   |  66 ++++++++++++++++++
 python/pyarrow/table.pyx |  15 +++++
 6 files changed, 355 insertions(+), 108 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/INSTALL.md
----------------------------------------------------------------------
diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md
deleted file mode 100644
index 81eed56..0000000
--- a/python/doc/INSTALL.md
+++ /dev/null
@@ -1,101 +0,0 @@
-<!---
-  Licensed under the Apache License, Version 2.0 (the "License");
-  you may not use this file except in compliance with the License.
-  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License. See accompanying LICENSE file.
--->
-
-## Building pyarrow (Apache Arrow Python library)
-
-First, clone the master git repository:
-
-```bash
-git clone https://github.com/apache/arrow.git arrow
-```
-
-#### System requirements
-
-Building pyarrow requires:
-
-* A C++11 compiler
-
-  * Linux: gcc >= 4.8 or clang >= 3.5
-  * OS X: XCode 6.4 or higher preferred
-
-* [cmake][1]
-
-#### Python requirements
-
-You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and
-are not being targeted.
-
-> This library targets CPython only due to an emphasis on interoperability with
-> pandas and NumPy, which are only available for CPython.
-
-The build requires NumPy, Cython, and a few other Python dependencies:
-
-```bash
-pip install cython
-cd arrow/python
-pip install -r requirements.txt
-```
-
-#### Installing Arrow C++ library
-
-First, you should choose an installation location for Arrow C++. In the future
-using the default system install location will work, but for now we are being
-explicit:
-
-```bash
-export ARROW_HOME=$HOME/local
-```
-
-Now, we build Arrow:
-
-```bash
-cd arrow/cpp
-
-mkdir dev-build
-cd dev-build
-
-cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
-
-make
-
-# Use sudo here if $ARROW_HOME requires it
-make install
-```
-
-#### Install `pyarrow`
-
-```bash
-cd arrow/python
-
-python setup.py install
-```
-
-> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are
-> unable to import pyarrow, upgrading XCode may be the solution.
-
-
-```python
-In [1]: import pyarrow
-
-In [2]: pyarrow.from_pylist([1,2,3])
-Out[2]:
-<pyarrow.array.Int64Array object at 0x7f899f3e60e8>
-[
-  1,
-  2,
-  3
-]
-```
-
-[1]: https://cmake.org/

http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 88725ba..6725ae7 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -31,14 +31,16 @@ additional functionality such as reading Apache Parquet files into Arrow
 structures.
 
 .. toctree::
-   :maxdepth: 4
-   :hidden:
+   :maxdepth: 2
+   :caption: Getting Started
 
+   Installing pyarrow <install.rst>
+   Pandas <pandas.rst>
    Module Reference <modules.rst>
 
-Indices and tables
-==================
+.. toctree::
+   :maxdepth: 2
+   :caption: Additional Features
+
+   Parquet format <parquet.rst>
 
-* :ref:`genindex`
-* :ref:`modindex`
-* :ref:`search`

http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/install.rst
----------------------------------------------------------------------
diff --git a/python/doc/install.rst b/python/doc/install.rst
new file mode 100644
index 0000000..1bab017
--- /dev/null
+++ b/python/doc/install.rst
@@ -0,0 +1,151 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Install PyArrow
+===============
+
+Conda
+-----
+
+To install the latest version of PyArrow from conda-forge using conda:
+
+.. code-block:: bash
+
+    conda install -c conda-forge pyarrow
+
+Pip
+---
+
+Install the latest version from PyPI:
+
+.. code-block:: bash
+
+    pip install pyarrow
+
+.. note::
+    Currently there are only binary artifacts available for Linux and MacOS.
+    On other platforms this will only pull the Python sources and assumes an
+    existing installation of the C++ part of Arrow.
+    To retrieve the binary artifacts, you'll need a recent ``pip`` version that
+    supports features like the ``manylinux1`` tag.
+
+Building from source
+--------------------
+
+First, clone the master git repository:
+
+.. code-block:: bash
+
+    git clone https://github.com/apache/arrow.git arrow
+
+System requirements
+~~~~~~~~~~~~~~~~~~~
+
+Building pyarrow requires:
+
+* A C++11 compiler
+
+  * Linux: gcc >= 4.8 or clang >= 3.5
+  * OS X: XCode 6.4 or higher preferred
+
+* `CMake <https://cmake.org/>`_
+
+Python requirements
+~~~~~~~~~~~~~~~~~~~
+
+You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases
+are not being targeted.
+
+.. note::
+    This library targets CPython only due to an emphasis on interoperability with
+    pandas and NumPy, which are only available for CPython.
+
+The build requires NumPy, Cython, and a few other Python dependencies:
+
+.. code-block:: bash
+
+    pip install cython
+    cd arrow/python
+    pip install -r requirements.txt
+
+Installing Arrow C++ library
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+First, you should choose an installation location for Arrow C++. In the future
+using the default system install location will work, but for now we are being
+explicit:
+
+.. code-block:: bash
+    
+    export ARROW_HOME=$HOME/local
+
+Now, we build Arrow:
+
+.. code-block:: bash
+
+    cd arrow/cpp
+    
+    mkdir dev-build
+    cd dev-build
+    
+    cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME ..
+    
+    make
+    
+    # Use sudo here if $ARROW_HOME requires it
+    make install
+
+To get the optional Parquet support, you should also build and install 
+`parquet-cpp <https://github.com/apache/parquet-cpp/blob/master/README.md>`_.
+
+Install `pyarrow`
+~~~~~~~~~~~~~~~~~
+
+
+.. code-block:: bash
+
+    cd arrow/python
+
+    # --with-parquet enables the Apache Parquet support in PyArrow
+    # --build-type=release disables debugging information and turns on
+    #       compiler optimizations for native code
+    python setup.py build_ext --with-parquet --build-type=release install
+    python setup.py install
+
+.. warning::
+    On XCode 6 and prior there are some known OS X `@rpath` issues. If you are
+    unable to import pyarrow, upgrading XCode may be the solution.
+
+.. note::
+    In development installations, you will also need to set a correct
+    ``LD_LIBRARY_PATH``. This can typically be done with
+    ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``.
+
+
+.. code-block:: python
+    
+    In [1]: import pyarrow
+
+    In [2]: pyarrow.from_pylist([1,2,3])
+    Out[2]:
+    <pyarrow.array.Int64Array object at 0x7f899f3e60e8>
+    [
+      1,
+      2,
+      3
+    ]
+
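
The development-install note in install.rst above asks for ``$ARROW_HOME/lib`` to be on ``LD_LIBRARY_PATH``. As a rough illustration only, the Python sketch below checks that condition against an environment mapping. The helper name ``arrow_lib_on_path`` is hypothetical and is not part of pyarrow.

```python
import os


def arrow_lib_on_path(env):
    """Return True if $ARROW_HOME/lib is listed in LD_LIBRARY_PATH.

    Hypothetical helper for illustration only; not part of pyarrow.
    `env` is any mapping of environment variable names to values.
    """
    arrow_home = env.get("ARROW_HOME")
    if not arrow_home:
        return False
    wanted = os.path.normpath(os.path.join(arrow_home, "lib"))
    # LD_LIBRARY_PATH is a colon-separated list of directories.
    entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return any(os.path.normpath(entry) == wanted for entry in entries if entry)
```

Calling ``arrow_lib_on_path(os.environ)`` before ``import pyarrow`` gives a quick hint whether a failed import might be a library path problem.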

http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
new file mode 100644
index 0000000..7c70074
--- /dev/null
+++ b/python/doc/pandas.rst
@@ -0,0 +1,114 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Pandas Interface
+================
+
+To interface with Pandas, PyArrow provides various conversion routines to
+consume Pandas structures and convert back to them.
+
+DataFrames
+----------
+
+The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
+Both consist of a set of named columns of equal length. While Pandas only
+supports flat columns, a Table can also hold nested columns. It can therefore
+represent more data than a DataFrame, so a full conversion is not always possible.
+
+Conversion from a Table to a DataFrame is done by calling
+:meth:`pyarrow.table.Table.to_pandas`. The inverse is achieved by using
+:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the
+convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of
+different resolutions, Pandas only supports nanosecond timestamps and most
+other systems (e.g. Parquet) only work on millisecond timestamps. Setting this
+parameter converts timestamps to millisecond resolution as part of the Pandas to
+Arrow conversion.
+
+.. code-block:: python
+
+    import pyarrow as pa
+    import pandas as pd
+
+    df = pd.DataFrame({"a": [1, 2, 3]})
+    # Convert from Pandas to Arrow
+    table = pa.from_pandas_dataframe(df)
+    # Convert back to Pandas
+    df_new = table.to_pandas()
+
+
+Series
+------
+
+In Arrow, the most similar structure to a Pandas Series is an Array.
+It is a vector that stores data of a single type in linear memory. You can
+convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`.
+As Arrow Arrays are always nullable, you can supply an optional mask using
+the ``mask`` parameter to mark all null entries.
+
+Type differences
+----------------
+
+With the current design of Pandas and Arrow, it is not possible to convert all
+column types unmodified. One of the main issues here is that Pandas has no
+support for nullable columns of arbitrary type. Also, ``datetime64`` is currently
+fixed to nanosecond resolution. Conversely, Arrow might still be missing
+support for some types.
+
+Pandas -> Arrow Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++------------------------+--------------------------+
+| Source Type (Pandas)   | Destination Type (Arrow) |
++========================+==========================+
+| ``bool``               | ``BOOL``                 |
++------------------------+--------------------------+
+| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}``   |
++------------------------+--------------------------+
+| ``float32``            | ``FLOAT``                |
++------------------------+--------------------------+
+| ``float64``            | ``DOUBLE``               |
++------------------------+--------------------------+
+| ``str`` / ``unicode``  | ``STRING``               |
++------------------------+--------------------------+
+| ``pd.Timestamp``       | ``TIMESTAMP(unit=ns)``   |
++------------------------+--------------------------+
+| ``pd.Categorical``     | *not supported*          |
++------------------------+--------------------------+
+
+Arrow -> Pandas Conversion
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------------------------------------+--------------------------------------------------------+
+| Source Type (Arrow)                 | Destination Type (Pandas)                              |
++=====================================+========================================================+
+| ``BOOL``                            | ``bool``                                               |
++-------------------------------------+--------------------------------------------------------+
+| ``BOOL`` *with nulls*               | ``object`` (with values ``True``, ``False``, ``None``) |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}``              | ``(u)int{8,16,32,64}``                                 |
++-------------------------------------+--------------------------------------------------------+
+| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``FLOAT``                           | ``float32``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``DOUBLE``                          | ``float64``                                            |
++-------------------------------------+--------------------------------------------------------+
+| ``STRING``                          | ``str``                                                |
++-------------------------------------+--------------------------------------------------------+
+| ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
++-------------------------------------+--------------------------------------------------------+
+
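
The Arrow -> Pandas table in pandas.rst above can be read as a small lookup rule. The sketch below encodes it in plain Python, with type names as strings. The function ``pandas_dtype_for`` is hypothetical and is not a pyarrow API. The table lists null-specific rows only for ``BOOL`` and the integer types, so for all other types this sketch ignores ``has_nulls``, which is an assumption.

```python
# Integer type names covered by the ``(U)INT{8,16,32,64}`` rows of the table.
INTEGER_TYPES = {
    "{}INT{}".format(prefix, bits)
    for prefix in ("", "U")
    for bits in (8, 16, 32, 64)
}

# Types whose table row does not depend on the presence of nulls.
SIMPLE_TYPES = {
    "FLOAT": "float32",
    "DOUBLE": "float64",
    "STRING": "str",
    "TIMESTAMP": "datetime64[ns]",
}


def pandas_dtype_for(arrow_type, has_nulls=False):
    """Return the Pandas dtype name listed in the table for an Arrow type.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if arrow_type == "BOOL":
        # Nullable booleans fall back to object (True, False, None).
        return "object" if has_nulls else "bool"
    if arrow_type in INTEGER_TYPES:
        # Integers with nulls are widened to float64.
        return "float64" if has_nulls else arrow_type.lower()
    if arrow_type in SIMPLE_TYPES:
        return SIMPLE_TYPES[arrow_type]
    raise KeyError("no conversion listed for {!r}".format(arrow_type))
```

The point to notice is that the result dtype for integer and boolean columns depends on the data, not only on the Arrow type.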

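The ``timestamps_to_ms`` discussion in pandas.rst above comes down to changing the unit of an integer offset from the epoch. The sketch below shows only that arithmetic on plain Python integers. Whether pyarrow truncates or rounds is not stated in the documentation, so the floor division here is an assumption, and ``ns_to_ms`` is a hypothetical name.

```python
NANOS_PER_MILLI = 1000000


def ns_to_ms(timestamps_ns):
    """Convert nanosecond epoch offsets to milliseconds, keeping nulls.

    Hypothetical helper for illustration only; uses floor division, which
    drops sub-millisecond precision.
    """
    return [
        None if value is None else value // NANOS_PER_MILLI
        for value in timestamps_ns
    ]
```

Any precision below one millisecond is lost in this step, so a round trip back to Pandas yields timestamps that may differ from the originals.
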
http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/doc/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
new file mode 100644
index 0000000..674ed80
--- /dev/null
+++ b/python/doc/parquet.rst
@@ -0,0 +1,66 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Reading/Writing Parquet files
+=============================
+
+If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was
+found during the build, you can read Parquet files into Arrow memory structures
+and write them back out. The Parquet support code is located in the
+:mod:`pyarrow.parquet` module and your package needs to be built with the
+``--with-parquet`` flag for ``build_ext``.
+
+Reading Parquet
+---------------
+
+To read a Parquet file into Arrow memory, you can use the following code
+snippet. It will read the whole Parquet file into memory as an
+:class:`pyarrow.table.Table`.
+
+.. code-block:: python
+
+    import pyarrow
+    import pyarrow.parquet
+
+    A = pyarrow
+
+    table = A.parquet.read_table('<filename>')
+
+Writing Parquet
+---------------
+
+Given an instance of :class:`pyarrow.table.Table`, the simplest way to
+persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table`
+method.
+
+.. code-block:: python
+
+    import pyarrow
+    import pyarrow.parquet
+
+    A = pyarrow
+
+    table = A.Table.from_arrays(['a'], [A.from_pylist([1, 2, 3])])
+    A.parquet.write_table(table, '<filename>')
+
+By default this will write the Table as a single RowGroup using ``DICTIONARY``
+encoding. To give a query engine more opportunity to process a Parquet file in
+parallel, set ``chunk_size`` to a fraction of the total number of rows.
+
+If you also want to compress the columns, you can select a compression
+method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
+you want to minimize size and ``SNAPPY`` for performance.
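
To make the ``chunk_size`` advice in parquet.rst above concrete, the sketch below computes how a table's rows would be divided if each RowGroup held at most ``chunk_size`` rows. It is plain Python and does not call pyarrow. The exact splitting strategy of the writer is an assumption, and ``row_group_bounds`` is a hypothetical name.

```python
def row_group_bounds(num_rows, chunk_size):
    """Return (start, stop) row ranges holding at most chunk_size rows each.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_rows))
        for start in range(0, num_rows, chunk_size)
    ]
```

With 10 rows, a ``chunk_size`` of 10 gives one range, while a ``chunk_size`` of 4 gives three ranges that a reader could process independently.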

http://git-wip-us.apache.org/repos/asf/arrow/blob/7f048a4b/python/pyarrow/table.pyx
----------------------------------------------------------------------
diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx
index c71bc71..5459f26 100644
--- a/python/pyarrow/table.pyx
+++ b/python/pyarrow/table.pyx
@@ -298,6 +298,8 @@ cdef class RecordBatch:
 
 cdef class Table:
     '''
+    A collection of top-level named, equal length Arrow arrays.
+
     Do not call this class's constructor directly.
     '''
 
@@ -335,6 +337,19 @@ cdef class Table:
 
     @staticmethod
     def from_arrays(names, arrays, name=None):
+        """
+        Construct a Table from Arrow Arrays
+
+        Parameters
+        ----------
+
+        names: list of str
+            Names for the table columns
+        arrays: list of pyarrow.array.Array
+            Equal-length arrays that should form the table.
+        name: str
+            (optional) name for the Table
+        """
         cdef:
             Array arr
             c_string c_name
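
The new ``from_arrays`` docstring describes two input conditions: one name per array, and arrays of equal length. The sketch below states those conditions as plain Python over lists. It does not reflect how ``table.pyx`` validates its arguments, and ``check_table_columns`` is a hypothetical name.

```python
def check_table_columns(names, arrays):
    """Check the documented shape of from_arrays inputs and pair them up.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    if len(names) != len(arrays):
        raise ValueError("need exactly one name per array")
    lengths = {len(array) for array in arrays}
    if len(lengths) > 1:
        raise ValueError("arrays must all have the same length")
    # Map each column name to its array, mirroring "named, equal length".
    return dict(zip(names, arrays))
```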