Posted to commits@arrow.apache.org by we...@apache.org on 2017/02/07 16:17:43 UTC

arrow git commit: ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved

Repository: arrow
Updated Branches:
  refs/heads/master 4c3481ea5 -> e97fbe640


ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved

Author: Uwe L. Korn <uw...@xhochy.com>

Closes #321 from xhochy/ARROW-531 and squashes the following commits:

55da9dc [Uwe L. Korn] ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved


Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/e97fbe64
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/e97fbe64
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/e97fbe64

Branch: refs/heads/master
Commit: e97fbe6407e8b15c6d3ef745f7a728e01d499a23
Parents: 4c3481e
Author: Uwe L. Korn <uw...@xhochy.com>
Authored: Tue Feb 7 11:17:28 2017 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Tue Feb 7 11:17:28 2017 -0500

----------------------------------------------------------------------
 python/doc/getting_involved.rst | 37 +++++++++++++++++++++++++
 python/doc/index.rst            |  2 ++
 python/doc/install.rst          |  5 ++--
 python/doc/jemalloc.rst         | 52 ++++++++++++++++++++++++++++++++++++
 python/doc/pandas.rst           |  8 +++++-
 python/doc/parquet.rst          | 47 ++++++++++++++++++++++++--------
 6 files changed, 137 insertions(+), 14 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/getting_involved.rst
----------------------------------------------------------------------
diff --git a/python/doc/getting_involved.rst b/python/doc/getting_involved.rst
new file mode 100644
index 0000000..90fa3e4
--- /dev/null
+++ b/python/doc/getting_involved.rst
@@ -0,0 +1,37 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Getting Involved
+================
+
+Right now the primary audience for Apache Arrow is the developers of data
+systems; most people will use Apache Arrow indirectly through systems that use
+it for internal data handling and interoperating with other Arrow-enabled
+systems.
+
+Even if you do not plan to contribute to Apache Arrow itself or Arrow
+integrations in other projects, we'd be happy to have you involved:
+
+ * Join the mailing list: send an email to 
+   `dev-subscribe@arrow.apache.org <ma...@arrow.apache.org>`_.
+   Share your ideas and use cases for the project or read through the
+   `Archive <http://mail-archives.apache.org/mod_mbox/arrow-dev/>`_.
+ * Follow our activity on `JIRA <https://issues.apache.org/jira/browse/ARROW>`_
+ * Learn the `Format / Specification
+   <https://github.com/apache/arrow/tree/master/format>`_
+ * Chat with us on `Slack <https://apachearrowslackin.herokuapp.com/>`_
+

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 6725ae7..d64354b 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -37,10 +37,12 @@ structures.
    Installing pyarrow <install.rst>
    Pandas <pandas.rst>
    Module Reference <modules.rst>
+   Getting Involved <getting_involved.rst>
 
 .. toctree::
    :maxdepth: 2
    :caption: Additional Features
 
    Parquet format <parquet.rst>
+   jemalloc MemoryPool <jemalloc.rst>
 

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/install.rst
----------------------------------------------------------------------
diff --git a/python/doc/install.rst b/python/doc/install.rst
index 1bab017..4d99fa0 100644
--- a/python/doc/install.rst
+++ b/python/doc/install.rst
@@ -120,10 +120,11 @@ Install `pyarrow`
 
     cd arrow/python
 
-    # --with-parquet enable the Apache Parquet support in PyArrow
+    # --with-parquet enables the Apache Parquet support in PyArrow
+    # --with-jemalloc enables the jemalloc allocator support in PyArrow
     # --build-type=release disables debugging information and turns on
     #       compiler optimizations for native code
-    python setup.py build_ext --with-parquet --build-type=release install
+    python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
     python setup.py install
 
 .. warning::

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/jemalloc.rst
----------------------------------------------------------------------
diff --git a/python/doc/jemalloc.rst b/python/doc/jemalloc.rst
new file mode 100644
index 0000000..33fe617
--- /dev/null
+++ b/python/doc/jemalloc.rst
@@ -0,0 +1,52 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+jemalloc MemoryPool
+===================
+
+Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator
+through the POSIX APIs. Although this already provides aligned allocation, the
+POSIX interface doesn't support aligned reallocation. The default reallocation
+strategy is to allocate a new region, copy over the old data and free the
+previous region. Using `jemalloc <http://jemalloc.net/>`_ we can simply extend
+the existing memory allocation to the requested size. While this may still be
+linear in the size of allocated memory, it is orders of magnitude faster, as
+only the page mapping in the kernel is touched, not the actual data.
+
+The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the
+use of the system allocator and/or other allocators like ``tcmalloc``. You can
+either explicitly make it the default allocator or pass it only to single
+operations.
+
+.. code:: python
+
+    import pyarrow as pa
+    import pyarrow.jemalloc
+    import pyarrow.memory
+
+    jemalloc_pool = pyarrow.jemalloc.default_pool()
+
+    # Explicitly use jemalloc for allocating memory for an Arrow Array object
+    array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool)
+
+    # Set the global pool
+    pyarrow.memory.set_default_pool(jemalloc_pool)
+    # This operation has no explicit MemoryPool specified and will thus
+    # also use jemalloc for its allocations.
+    array = pa.Array.from_pylist([1, 2, 3])
+
+
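
The copy-based reallocation strategy described in jemalloc.rst above can be modeled in a few lines of plain Python. This is only a toy sketch under stated assumptions: ``realloc_by_copy`` is a hypothetical helper, a ``bytearray`` stands in for a raw memory region, and nothing here calls pyarrow or jemalloc.

```python
def realloc_by_copy(old: bytearray, new_size: int) -> bytearray:
    """Model of 'allocate a new region, copy the old data, free the old one'."""
    new = bytearray(new_size)      # allocate a fresh, zero-filled region
    n = min(len(old), new_size)    # only this many bytes survive the resize
    new[:n] = old[:n]              # the copy: cost grows with the data size
    return new                     # the caller drops `old`, which frees it


buf = bytearray(b"arrow")
grown = realloc_by_copy(buf, 8)    # a distinct region holding the old bytes
```

The copy step is the part an in-place extension avoids, which is where the speedup described in the text would come from.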

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
index c225d13..34445ae 100644
--- a/python/doc/pandas.rst
+++ b/python/doc/pandas.rst
@@ -84,9 +84,11 @@ Pandas -> Arrow Conversion
 +------------------------+--------------------------+
 | ``str`` / ``unicode``  | ``STRING``               |
 +------------------------+--------------------------+
+| ``pd.Categorical``     | ``DICTIONARY``           |
++------------------------+--------------------------+
 | ``pd.Timestamp``       | ``TIMESTAMP(unit=ns)``   |
 +------------------------+--------------------------+
-| ``pd.Categorical``     | *not supported*          |
+| ``datetime.date``      | ``DATE``                 |
 +------------------------+--------------------------+
 
 Arrow -> Pandas Conversion
@@ -109,5 +111,9 @@ Arrow -> Pandas Conversion
 +-------------------------------------+--------------------------------------------------------+
 | ``STRING``                          | ``str``                                                |
 +-------------------------------------+--------------------------------------------------------+
+| ``DICTIONARY``                      | ``pd.Categorical``                                     |
++-------------------------------------+--------------------------------------------------------+
 | ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
 +-------------------------------------+--------------------------------------------------------+
+| ``DATE``                            | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
++-------------------------------------+--------------------------------------------------------+
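
The new ``pd.Categorical`` <-> ``DICTIONARY`` rows rest on dictionary encoding: each value is stored once, and the column keeps only integer indices into that list. A minimal sketch in plain Python, assuming a hypothetical ``dictionary_encode`` helper that orders the dictionary by first appearance (pandas and Arrow may order categories differently):

```python
def dictionary_encode(values):
    """Split values into (indices, dictionary), hypothetical helper."""
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


indices, dictionary = dictionary_encode(["a", "b", "a", "c", "b"])
# Decoding is a plain lookup of each index in the dictionary.
decoded = [dictionary[i] for i in indices]
```

Decoding reproduces the input, which is why the conversion can round-trip in both directions.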

http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
index 674ed80..8e011e4 100644
--- a/python/doc/parquet.rst
+++ b/python/doc/parquet.rst
@@ -29,16 +29,30 @@ Reading Parquet
 
 To read a Parquet file into Arrow memory, you can use the following code
 snippet. It will read the whole Parquet file into memory as an
-:class:`pyarrow.table.Table`.
+:class:`~pyarrow.table.Table`.
 
 .. code-block:: python
 
-    import pyarrow
-    import pyarrow.parquet
+    import pyarrow.parquet as pq
 
-    A = pyarrow
+    table = pq.read_table('<filename>')
 
-    table = A.parquet.read_table('<filename>')
+As DataFrames stored as Parquet are often stored in multiple files, a
+convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided.
+
+If you already have the Parquet data in memory or get it from a non-file
+source, you can use :class:`pyarrow.io.BufferReader` to read it from
+memory. As input to the :class:`~pyarrow.io.BufferReader` you can either supply
+a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`.
+
+.. code:: python
+
+    import pyarrow.io as paio
+    import pyarrow.parquet as pq
+
+    buf = ... # either bytes or paio.Buffer
+    reader = paio.BufferReader(buf)
+    table = pq.read_table(reader)
 
 Writing Parquet
 ---------------
@@ -49,13 +63,11 @@ method.
 
 .. code-block:: python
 
-    import pyarrow
-    import pyarrow.parquet
-
-    A = pyarrow
+    import pyarrow as pa
+    import pyarrow.parquet as pq
 
-    table = A.Table(..)
-    A.parquet.write_table(table, '<filename>')
+    table = pa.Table(..)
+    pq.write_table(table, '<filename>')
 
 By default this will write the Table as a single RowGroup using ``DICTIONARY``
 encoding. To increase the potential of parallelism a query engine can process
@@ -64,3 +76,16 @@ a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows
 If you also want to compress the columns, you can select a compression
 method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
 you want to minimize size and ``SNAPPY`` for performance.
+
+Instead of writing to a file, you can also write to Python ``bytes`` by
+using a :class:`pyarrow.io.InMemoryOutputStream`:
+
+.. code:: python
+
+    import pyarrow.io as paio
+    import pyarrow.parquet as pq
+
+    table = ...
+    output = paio.InMemoryOutputStream()
+    pq.write_table(table, output)
+    pybytes = output.get_result().to_pybytes()
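
The ``chunk_size`` advice in parquet.rst can be made concrete by computing which rows would land in which RowGroup. This is a sketch under assumptions: ``row_group_bounds`` is a hypothetical helper, and it assumes the writer cuts the table into consecutive runs of at most ``chunk_size`` rows, which the text above does not spell out.

```python
def row_group_bounds(num_rows, chunk_size):
    """Return (start, stop) row ranges, one per assumed RowGroup."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_rows))
        for start in range(0, num_rows, chunk_size)
    ]


# 10 rows with chunk_size=4 gives three groups; the last one is shorter.
bounds = row_group_bounds(10, 4)
```

Under this assumption, a smaller ``chunk_size`` yields more RowGroups and therefore more units a query engine could process in parallel.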