Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/27 20:55:20 UTC
[arrow] branch master updated: ARROW-5138: [Python] Add
documentation about pandas preserve_index option
This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new e12d52f ARROW-5138: [Python] Add documentation about pandas preserve_index option
e12d52f is described below
commit e12d52fbede8e22c83d167b91b958360ac662747
Author: Wes McKinney <we...@apache.org>
AuthorDate: Thu Jun 27 15:55:09 2019 -0500
ARROW-5138: [Python] Add documentation about pandas preserve_index option
The underlying issue reported in ARROW-5138 can now be addressed by passing `preserve_index=True` when using `Table.from_pandas`.
Author: Wes McKinney <we...@apache.org>
Closes #4728 from wesm/ARROW-5138 and squashes the following commits:
27451dbd4 <Wes McKinney> Add documentation about pandas preserve_index option
---
docs/source/python/pandas.rst | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/docs/source/python/pandas.rst b/docs/source/python/pandas.rst
index aafbf57..83d997c 100644
--- a/docs/source/python/pandas.rst
+++ b/docs/source/python/pandas.rst
@@ -62,6 +62,10 @@ Conversion from a Table to a DataFrame is done by calling
# Infer Arrow schema from pandas
schema = pa.Schema.from_pandas(df)
+By default ``pyarrow`` tries to preserve and restore the ``.index``
+data as accurately as possible. See the section below for more about
+this, and how to disable this logic.
+
Series
------
@@ -71,6 +75,29 @@ convert a pandas Series to an Arrow Array using :meth:`pyarrow.Array.from_pandas
As Arrow Arrays are always nullable, you can supply an optional mask using
the ``mask`` parameter to mark all null-entries.
+Handling pandas Indexes
+-----------------------
+
+Methods like :meth:`pyarrow.Table.from_pandas` have a
+``preserve_index`` option which controls whether the data in the
+``index`` member of the corresponding pandas object is stored, and
+how. This data is tracked using schema-level metadata in the
+internal ``arrow::Schema`` object.
+
+The default of ``preserve_index`` is ``None``, which behaves as
+follows:
+
+* ``RangeIndex`` is stored as metadata-only, not requiring any extra
+ storage.
+* Other index types are stored as one or more physical data columns in
+ the resulting :class:`Table`.
+
+To not store the index at all, pass ``preserve_index=False``. Storing
+a ``RangeIndex`` as metadata only can cause issues in some limited
+scenarios (such as storing multiple DataFrame objects in a Parquet
+file). To force all index data to be serialized in the resulting
+table, pass ``preserve_index=True``.
+
Type differences
----------------