Posted to commits@arrow.apache.org by we...@apache.org on 2017/02/07 16:17:43 UTC
arrow git commit: ARROW-531: Python: Document jemalloc,
extend Pandas section, add Getting Involved
Repository: arrow
Updated Branches:
refs/heads/master 4c3481ea5 -> e97fbe640
ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved
Author: Uwe L. Korn <uw...@xhochy.com>
Closes #321 from xhochy/ARROW-531 and squashes the following commits:
55da9dc [Uwe L. Korn] ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved
Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/e97fbe64
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/e97fbe64
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/e97fbe64
Branch: refs/heads/master
Commit: e97fbe6407e8b15c6d3ef745f7a728e01d499a23
Parents: 4c3481e
Author: Uwe L. Korn <uw...@xhochy.com>
Authored: Tue Feb 7 11:17:28 2017 -0500
Committer: Wes McKinney <we...@twosigma.com>
Committed: Tue Feb 7 11:17:28 2017 -0500
----------------------------------------------------------------------
python/doc/getting_involved.rst | 37 +++++++++++++++++++++++++
python/doc/index.rst | 2 ++
python/doc/install.rst | 5 ++--
python/doc/jemalloc.rst | 52 ++++++++++++++++++++++++++++++++++++
python/doc/pandas.rst | 8 +++++-
python/doc/parquet.rst | 47 ++++++++++++++++++++++++--------
6 files changed, 137 insertions(+), 14 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/getting_involved.rst
----------------------------------------------------------------------
diff --git a/python/doc/getting_involved.rst b/python/doc/getting_involved.rst
new file mode 100644
index 0000000..90fa3e4
--- /dev/null
+++ b/python/doc/getting_involved.rst
@@ -0,0 +1,37 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+Getting Involved
+================
+
+Right now the primary audience for Apache Arrow is the developers of data
+systems; most people will use Apache Arrow indirectly through systems that use
+it for internal data handling and interoperating with other Arrow-enabled
+systems.
+
+Even if you do not plan to contribute to Apache Arrow itself or Arrow
+integrations in other projects, we'd be happy to have you involved:
+
+ * Join the mailing list: send an email to
+ `dev-subscribe@arrow.apache.org <ma...@arrow.apache.org>`_.
+ Share your ideas and use cases for the project or read through the
+ `Archive <http://mail-archives.apache.org/mod_mbox/arrow-dev/>`_.
+ * Follow our activity on `JIRA <https://issues.apache.org/jira/browse/ARROW>`_
+ * Learn the `Format / Specification
+ <https://github.com/apache/arrow/tree/master/format>`_
+ * Chat with us on `Slack <https://apachearrowslackin.herokuapp.com/>`_
+
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/index.rst
----------------------------------------------------------------------
diff --git a/python/doc/index.rst b/python/doc/index.rst
index 6725ae7..d64354b 100644
--- a/python/doc/index.rst
+++ b/python/doc/index.rst
@@ -37,10 +37,12 @@ structures.
Installing pyarrow <install.rst>
Pandas <pandas.rst>
Module Reference <modules.rst>
+ Getting Involved <getting_involved.rst>
.. toctree::
:maxdepth: 2
:caption: Additional Features
Parquet format <parquet.rst>
+ jemalloc MemoryPool <jemalloc.rst>
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/install.rst
----------------------------------------------------------------------
diff --git a/python/doc/install.rst b/python/doc/install.rst
index 1bab017..4d99fa0 100644
--- a/python/doc/install.rst
+++ b/python/doc/install.rst
@@ -120,10 +120,11 @@ Install `pyarrow`
cd arrow/python
- # --with-parquet enable the Apache Parquet support in PyArrow
+ # --with-parquet enables the Apache Parquet support in PyArrow
+ # --with-jemalloc enables the jemalloc allocator support in PyArrow
# --build-type=release disables debugging information and turns on
# compiler optimizations for native code
- python setup.py build_ext --with-parquet --build-type=release install
+ python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
python setup.py install
.. warning::
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/jemalloc.rst
----------------------------------------------------------------------
diff --git a/python/doc/jemalloc.rst b/python/doc/jemalloc.rst
new file mode 100644
index 0000000..33fe617
--- /dev/null
+++ b/python/doc/jemalloc.rst
@@ -0,0 +1,52 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+jemalloc MemoryPool
+===================
+
+Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator
+through the POSIX APIs. Although this already provides aligned allocation, the
+POSIX interface doesn't support aligned reallocation. The default reallocation
+strategy is to allocate a new region, copy over the old data and free the
+previous region. Using `jemalloc <http://jemalloc.net/>`_ we can simply extend
+the existing memory allocation to the requested size. While this may still be
+linear in the size of allocated memory, it is orders of magnitude faster as only the page
+mapping in the kernel is touched, not the actual data.
+
+The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the
+use of the system allocator and/or other allocators like ``tcmalloc``. You can
+either explicitly make it the default allocator or pass it only to single
+operations.
+
+.. code:: python
+
+ import pyarrow as pa
+ import pyarrow.jemalloc
+ import pyarrow.memory
+
+ jemalloc_pool = pyarrow.jemalloc.default_pool()
+
+ # Explicitly use jemalloc for allocating memory for an Arrow Array object
+ array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool)
+
+ # Set the global pool
+ pyarrow.memory.set_default_pool(jemalloc_pool)
+ # This operation has no explicit MemoryPool specified and will thus
+ # also use jemalloc for its allocations.
+ array = pa.Array.from_pylist([1, 2, 3])
+
+
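The default reallocation strategy described in jemalloc.rst above (allocate a new region, copy the old data, free the previous region) can be sketched in plain Python. This is only an illustration under assumptions: a ``bytearray`` stands in for a raw memory region, and ``realloc_by_copy`` is a hypothetical helper, not part of pyarrow or jemalloc.

```python
def realloc_by_copy(old: bytearray, new_size: int) -> bytearray:
    """Resize a buffer by allocating a new region, copying, and releasing the old one."""
    new = bytearray(new_size)      # allocate a new zero-filled region
    n = min(len(old), new_size)
    new[:n] = old[:n]              # copy the old data; cost grows with the data size
    del old[:]                     # release the contents of the previous region
    return new


buf = bytearray(b"abc")
grown = realloc_by_copy(buf, 8)
```

The copy step is what an in-place extension avoids: when the allocator can grow the existing region, the data itself does not have to be touched.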
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/pandas.rst
----------------------------------------------------------------------
diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst
index c225d13..34445ae 100644
--- a/python/doc/pandas.rst
+++ b/python/doc/pandas.rst
@@ -84,9 +84,11 @@ Pandas -> Arrow Conversion
+------------------------+--------------------------+
| ``str`` / ``unicode`` | ``STRING`` |
+------------------------+--------------------------+
+| ``pd.Categorical`` | ``DICTIONARY`` |
++------------------------+--------------------------+
| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` |
+------------------------+--------------------------+
-| ``pd.Categorical`` | *not supported* |
+| ``datetime.date`` | ``DATE`` |
+------------------------+--------------------------+
Arrow -> Pandas Conversion
@@ -109,5 +111,9 @@ Arrow -> Pandas Conversion
+-------------------------------------+--------------------------------------------------------+
| ``STRING`` | ``str`` |
+-------------------------------------+--------------------------------------------------------+
+| ``DICTIONARY`` | ``pd.Categorical`` |
++-------------------------------------+--------------------------------------------------------+
| ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) |
+-------------------------------------+--------------------------------------------------------+
+| ``DATE`` | ``pd.Timestamp`` (``np.datetime64[ns]``) |
++-------------------------------------+--------------------------------------------------------+
http://git-wip-us.apache.org/repos/asf/arrow/blob/e97fbe64/python/doc/parquet.rst
----------------------------------------------------------------------
diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst
index 674ed80..8e011e4 100644
--- a/python/doc/parquet.rst
+++ b/python/doc/parquet.rst
@@ -29,16 +29,30 @@ Reading Parquet
To read a Parquet file into Arrow memory, you can use the following code
snippet. It will read the whole Parquet file into memory as an
-:class:`pyarrow.table.Table`.
+:class:`~pyarrow.table.Table`.
.. code-block:: python
- import pyarrow
- import pyarrow.parquet
+ import pyarrow.parquet as pq
- A = pyarrow
+ table = pq.read_table('<filename>')
- table = A.parquet.read_table('<filename>')
+As DataFrames saved as Parquet are often split across multiple files, a
+convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided.
+
+If you already have the Parquet data available in memory or get it via a non-file
+source, you can utilize :class:`pyarrow.io.BufferReader` to read it from
+memory. As input to the :class:`~pyarrow.io.BufferReader` you can either supply
+a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`.
+
+.. code:: python
+
+ import pyarrow.io as paio
+ import pyarrow.parquet as pq
+
+ buf = ... # either bytes or paio.Buffer
+ reader = paio.BufferReader(buf)
+ table = pq.read_table(reader)
Writing Parquet
---------------
@@ -49,13 +63,11 @@ method.
.. code-block:: python
- import pyarrow
- import pyarrow.parquet
-
- A = pyarrow
+ import pyarrow as pa
+ import pyarrow.parquet as pq
- table = A.Table(..)
- A.parquet.write_table(table, '<filename>')
+ table = pa.Table(..)
+ pq.write_table(table, '<filename>')
By default this will write the Table as a single RowGroup using ``DICTIONARY``
encoding. To increase the potential of parallelism a query engine can process
@@ -64,3 +76,16 @@ a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows
If you also want to compress the columns, you can select a compression
method using the ``compression`` argument. Typically, ``GZIP`` is the choice if
you want to minimize size and ``SNAPPY`` for performance.
+
+Instead of writing to a file, you can also write to Python ``bytes`` by
+utilizing a :class:`pyarrow.io.InMemoryOutputStream`:
+
+.. code:: python
+
+ import pyarrow.io as paio
+ import pyarrow.parquet as pq
+
+ table = ...
+ output = paio.InMemoryOutputStream()
+ pq.write_table(table, output)
+ pybytes = output.get_result().to_pybytes()