Posted to commits@arrow.apache.org by we...@apache.org on 2018/07/17 03:34:30 UTC

[arrow] branch master updated: ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 3fd913e  ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet
3fd913e is described below

commit 3fd913e4b279554dc948411ccf56cbb4dc09dae3
Author: Daniel Compton <de...@danielcompton.net>
AuthorDate: Mon Jul 16 23:34:25 2018 -0400

    ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet
    
    Close #2248
    
    Also resolves ARROW-2862
    
    Author: Daniel Compton <de...@danielcompton.net>
    Author: Wes McKinney <we...@twosigma.com>
    
    Closes #2275 from wesm/ARROW-2861-edit and squashes the following commits:
    
    b0013dfc <Wes McKinney> Incorporate code review comments
    84a4cbb8 <Daniel Compton> Update pyarrow docs to remove index when creating Parquet
---
 cpp/thirdparty/download_dependencies.sh |  2 ++
 python/doc/source/parquet.rst           | 29 ++++++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/cpp/thirdparty/download_dependencies.sh b/cpp/thirdparty/download_dependencies.sh
index 2d8bee4..ab8c0b2 100755
--- a/cpp/thirdparty/download_dependencies.sh
+++ b/cpp/thirdparty/download_dependencies.sh
@@ -34,6 +34,8 @@ _DST=$1
 # To change toolchain versions, edit versions.txt
 source $SOURCE_DIR/versions.txt
 
+mkdir -p $_DST
+
 BOOST_UNDERSCORE_VERSION=`echo $BOOST_VERSION | sed 's/\./_/g'`
 wget -c -O $_DST/boost.tar.gz https://dl.bintray.com/boostorg/release/$BOOST_VERSION/source/boost_$BOOST_UNDERSCORE_VERSION.tar.gz
 
diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index dbc009e..dfbb318 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -112,6 +112,33 @@ In general, a Python file object will have the worst read performance, while a
 string file path or an instance of :class:`~.NativeFile` (especially memory
 maps) will perform the best.
 
+Omitting the DataFrame index
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using ``pa.Table.from_pandas`` to convert to an Arrow table, by default
+one or more special columns are added to keep track of the index (row
+labels). Storing the index takes extra space, so if your index is not valuable,
+you may choose to omit it by passing ``preserve_index=False``:
+
+.. ipython:: python
+
+   df = pd.DataFrame({'one': [-1, np.nan, 2.5],
+                      'two': ['foo', 'bar', 'baz'],
+                      'three': [True, False, True]},
+                      index=list('abc'))
+   df
+   table = pa.Table.from_pandas(df, preserve_index=False)
+
+Then we have:
+
+.. ipython:: python
+
+   pq.write_table(table, 'example_noindex.parquet')
+   t = pq.read_table('example_noindex.parquet')
+   t.to_pandas()
+
+Here you see the index did not survive the round trip.
+
 Finer-grained Reading and Writing
 ---------------------------------
 
@@ -159,6 +186,7 @@ Alternatively python ``with`` syntax can also be use:
    :suppress:
 
    !rm example.parquet
+   !rm example_noindex.parquet
    !rm example2.parquet
    !rm example3.parquet
 
@@ -337,4 +365,3 @@ Notes:
 * The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for a given container
 * The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
 * The parquet file was Blob Type = Block blob
-
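
For reference, the round trip documented above can also be reproduced as a standalone script; the following is a minimal sketch, assuming pandas and pyarrow are installed locally (the output file name is arbitrary):

    # Minimal sketch of the documented behavior: convert a DataFrame without its
    # index, write it to Parquet, and read it back.
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]},
                      index=list('abc'))

    # With preserve_index=False the 'a', 'b', 'c' row labels are not stored.
    table = pa.Table.from_pandas(df, preserve_index=False)
    print(table.schema)    # no index column appears in the schema

    pq.write_table(table, 'example_noindex.parquet')
    restored = pq.read_table('example_noindex.parquet').to_pandas()
    print(restored.index)  # a default RangeIndex, not the original labels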