Posted to commits@arrow.apache.org by we...@apache.org on 2018/07/17 03:34:30 UTC
[arrow] branch master updated: ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet
This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 3fd913e ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet
3fd913e is described below
commit 3fd913e4b279554dc948411ccf56cbb4dc09dae3
Author: Daniel Compton <de...@danielcompton.net>
AuthorDate: Mon Jul 16 23:34:25 2018 -0400
ARROW-2861: [Python] Add note about how to not write DataFrame index to Parquet
Closes #2248
Also resolves ARROW-2862
Author: Daniel Compton <de...@danielcompton.net>
Author: Wes McKinney <we...@twosigma.com>
Closes #2275 from wesm/ARROW-2861-edit and squashes the following commits:
b0013dfc <Wes McKinney> Incorporate code review comments
84a4cbb8 <Daniel Compton> Update pyarrow docs to remove index when creating Parquet
---
cpp/thirdparty/download_dependencies.sh | 2 ++
python/doc/source/parquet.rst | 29 ++++++++++++++++++++++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/cpp/thirdparty/download_dependencies.sh b/cpp/thirdparty/download_dependencies.sh
index 2d8bee4..ab8c0b2 100755
--- a/cpp/thirdparty/download_dependencies.sh
+++ b/cpp/thirdparty/download_dependencies.sh
@@ -34,6 +34,8 @@ _DST=$1
# To change toolchain versions, edit versions.txt
source $SOURCE_DIR/versions.txt
+mkdir -p $_DST
+
BOOST_UNDERSCORE_VERSION=`echo $BOOST_VERSION | sed 's/\./_/g'`
wget -c -O $_DST/boost.tar.gz https://dl.bintray.com/boostorg/release/$BOOST_VERSION/source/boost_$BOOST_UNDERSCORE_VERSION.tar.gz
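The `mkdir -p $_DST` line above fixes a real failure mode: `wget -O` cannot create a file in a directory that does not exist. A minimal standalone sketch of the fix (paths and the stand-in download step are mine, not from the script):

```shell
# Sketch: why the script must create the destination directory first.
# wget -O "$_DST/boost.tar.gz" fails if "$_DST" does not exist yet.
_DST="$(mktemp -d)/downloads"   # hypothetical destination, like the script's $1

mkdir -p "$_DST"                # the fix: ensure the directory exists

# Stand-in for: wget -c -O "$_DST/boost.tar.gz" <url>
: > "$_DST/boost.tar.gz"

test -f "$_DST/boost.tar.gz" && echo "destination ready"
```

Without the `mkdir -p`, the redirect (or wget's `-O`) would fail with "No such file or directory" on a fresh checkout.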
diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst
index dbc009e..dfbb318 100644
--- a/python/doc/source/parquet.rst
+++ b/python/doc/source/parquet.rst
@@ -112,6 +112,33 @@ In general, a Python file object will have the worst read performance, while a
string file path or an instance of :class:`~.NativeFile` (especially memory
maps) will perform the best.
+Omitting the DataFrame index
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When using ``pa.Table.from_pandas`` to convert to an Arrow table, by default
+one or more special columns are added to keep track of the index (row
+labels). Storing the index takes extra space, so if your index is not valuable,
+you may choose to omit it by passing ``preserve_index=False``.
+
+.. ipython:: python
+
+ df = pd.DataFrame({'one': [-1, np.nan, 2.5],
+ 'two': ['foo', 'bar', 'baz'],
+ 'three': [True, False, True]},
+ index=list('abc'))
+ df
+ table = pa.Table.from_pandas(df, preserve_index=False)
+
+Then we have:
+
+.. ipython:: python
+
+ pq.write_table(table, 'example_noindex.parquet')
+ t = pq.read_table('example_noindex.parquet')
+ t.to_pandas()
+
+Here you see that the index did not survive the round trip.
+
Finer-grained Reading and Writing
---------------------------------
@@ -159,6 +186,7 @@ Alternatively python ``with`` syntax can also be use:
:suppress:
!rm example.parquet
+ !rm example_noindex.parquet
!rm example2.parquet
!rm example3.parquet
@@ -337,4 +365,3 @@ Notes:
* The ``account_key`` can be found under ``Settings -> Access keys`` in the Microsoft Azure portal for a given container
* The code above works for a container with private access, Lease State = Available, Lease Status = Unlocked
* The parquet file was Blob Type = Block blob
-