Posted to commits@arrow.apache.org by we...@apache.org on 2019/06/27 20:55:20 UTC
[arrow] branch master updated: ARROW-5138: [Python] Add documentation about pandas preserve_index option

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new e12d52f  ARROW-5138: [Python] Add documentation about pandas preserve_index option
e12d52f is described below

commit e12d52fbede8e22c83d167b91b958360ac662747
Author: Wes McKinney <we...@apache.org>
AuthorDate: Thu Jun 27 15:55:09 2019 -0500

    ARROW-5138: [Python] Add documentation about pandas preserve_index option
    
    The underlying issue reported in ARROW-5138 can now be addressed by passing `preserve_index=True` when using `Table.from_pandas`.
    
    Author: Wes McKinney <we...@apache.org>
    
    Closes #4728 from wesm/ARROW-5138 and squashes the following commits:
    
    27451dbd4 <Wes McKinney> Add documentation about pandas preserve_index option
---
 docs/source/python/pandas.rst | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/docs/source/python/pandas.rst b/docs/source/python/pandas.rst
index aafbf57..83d997c 100644
--- a/docs/source/python/pandas.rst
+++ b/docs/source/python/pandas.rst
@@ -62,6 +62,10 @@ Conversion from a Table to a DataFrame is done by calling
     # Infer Arrow schema from pandas
     schema = pa.Schema.from_pandas(df)
 
+By default, ``pyarrow`` tries to preserve and restore the ``.index``
+data as accurately as possible. See "Handling pandas Indexes" below for
+details, including how to disable this behavior.
+
 Series
 ------
 
@@ -71,6 +75,29 @@ convert a pandas Series to an Arrow Array using :meth:`pyarrow.Array.from_pandas
 As Arrow Arrays are always nullable, you can supply an optional mask using
 the ``mask`` parameter to mark all null-entries.
 
+Handling pandas Indexes
+-----------------------
+
+Methods like :meth:`pyarrow.Table.from_pandas` have a
+``preserve_index`` option that controls whether, and how, the data in
+the ``index`` member of the corresponding pandas object is stored.
+This data is tracked using schema-level metadata in the internal
+``arrow::Schema`` object.
+
+The default of ``preserve_index`` is ``None``, which behaves as
+follows:
+
+* A ``RangeIndex`` is stored as metadata only, requiring no extra
+  storage.
+* Other index types are stored as one or more physical data columns in
+  the resulting :class:`Table`.
+
+To not store the index at all, pass ``preserve_index=False``. Storing
+a ``RangeIndex`` as metadata can cause issues in some limited
+scenarios (such as storing multiple DataFrame objects in a Parquet
+file); to force all index data to be serialized in the resulting
+table, pass ``preserve_index=True``.
+
 Type differences
 ----------------