Posted to commits@arrow.apache.org by we...@apache.org on 2018/12/27 18:18:01 UTC

[arrow] branch master updated: ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays

This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 9b03947  ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
9b03947 is described below

commit 9b03947c4369cb1b4d82022df00629baf2b6eb00
Author: Wes McKinney <we...@apache.org>
AuthorDate: Thu Dec 27 12:17:50 2018 -0600

    ARROW-3928: [Python] Deduplicate Python objects when converting binary, string, date, time types to object arrays
    
    This adds a `deduplicate_objects` option to all of the `to_pandas` methods. It works with binary and string types, date types (when `date_as_object=True`), and time types.
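
    For example, a minimal sketch of the new option on an `Array` (the values here are illustrative):

    ```
    In [1]: import pyarrow as pa

    In [2]: arr = pa.array(['foo', 'bar', 'foo'])

    In [3]: result = arr.to_pandas()  # deduplication enabled by default

    In [4]: result = arr.to_pandas(deduplicate_objects=False)  # prior behavior
    ```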
    
    I also made it so that `ScalarMemoTable` can be used with `string_view`, for more efficient memoization in this case.
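
    Conceptually, the memoized conversion works like the following Python sketch (the real implementation is the C++ `ConvertAsPyObjects` / `WrapMemoized` path below; `convert_with_dedup` and `wrap` are illustrative names):

    ```
    def convert_with_dedup(values, wrap=str):
        # memo table: value -> index of its already-wrapped object
        memo_table = {}
        unique_objects = []
        out = []
        for v in values:
            idx = memo_table.setdefault(v, len(unique_objects))
            if idx == len(unique_objects):
                # new entry: wrap the value exactly once
                unique_objects.append(wrap(v))
            # duplicates reuse the same wrapped object
            out.append(unique_objects[idx])
        return out
    ```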
    
    I made the default for `deduplicate_objects` True. When the ratio of unique strings to the length of the array is low, not only does this use drastically less memory, it is also faster. I will write some benchmarks to show where the "crossover point" is, i.e. where the overhead of hashing makes things slower.
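
    A rough way to probe that crossover interactively (mirroring the asv benchmark added in this patch; `make_arr` and the 10-character strings are arbitrary choices):

    ```
    In [1]: import pyarrow as pa

    In [2]: import pandas.util.testing as tm

    In [3]: def make_arr(uniqueness, total=1000000):
       ...:     nunique = int(total * uniqueness)
       ...:     unique_values = [tm.rands(10) for _ in range(nunique)]
       ...:     return pa.array(unique_values * (total // nunique))

    In [4]: arr = make_arr(0.5)

    In [5]: timeit arr.to_pandas()

    In [6]: timeit arr.to_pandas(deduplicate_objects=False)
    ```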
    
    Let's consider a simple case where we have 10,000,000 strings of length 10, but only 1000 unique values:
    
    ```
    In [50]: import pandas.util.testing as tm
    
    In [51]: unique_values = [tm.rands(10) for i in range(1000)]
    
    In [52]: values = unique_values * 10000
    
    In [53]: arr = pa.array(values)
    
    In [54]: timeit arr.to_pandas()
    236 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [55]: timeit arr.to_pandas(deduplicate_objects=False)
    730 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    ```
    
    Almost 3 times faster in this case. The difference in memory use is even more drastic:
    
    ```
    In [44]: unique_values = [tm.rands(10) for i in range(1000)]
    
    In [45]: values = unique_values * 10000
    
    In [46]: arr = pa.array(values)
    
    In [49]: %memit result11 = arr.to_pandas()
    peak memory: 1505.89 MiB, increment: 76.27 MiB
    
    In [50]: %memit result12 = arr.to_pandas(deduplicate_objects=False)
    peak memory: 2202.29 MiB, increment: 696.11 MiB
    ```
    
    As you can see, this is a huge problem. If our bug reports about Parquet memory use problems are any indication, users have been suffering from this issue for a long time.
    
    When the strings are mostly unique, things are slower as expected, and the peak memory use is higher because of the hash table:
    
    ```
    In [17]: unique_values = [tm.rands(10) for i in range(500000)]
    
    In [18]: values = unique_values * 2
    
    In [19]: arr = pa.array(values)
    
    In [20]: timeit result = arr.to_pandas()
    177 ms ± 574 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [21]: timeit result = arr.to_pandas(deduplicate_objects=False)
    70.1 ms ± 783 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [42]: %memit result8 = arr.to_pandas()
    peak memory: 644.39 MiB, increment: 92.23 MiB
    
    In [43]: %memit result9 = arr.to_pandas(deduplicate_objects=False)
    peak memory: 610.85 MiB, increment: 58.41 MiB
    ```
    
    In real-world workloads, many duplicated strings is the most common case. Given the massive memory savings and moderate performance improvements, it makes sense to have this enabled by default.
    
    Author: Wes McKinney <we...@apache.org>
    
    Closes #3257 from wesm/ARROW-3928 and squashes the following commits:
    
    d9a88700 <Wes McKinney> Prettier output
    a00b51c7 <Wes McKinney> Add benchmarks for object deduplication
    ca88b963 <Wes McKinney> Add Python unit tests, deduplicate for date and time types also when converting to Python objects
    7a7873b8 <Wes McKinney> First working iteration of string deduplication when calling to_pandas
---
 cpp/src/arrow/python/arrow_to_pandas.cc     | 286 +++++++++++++++-------------
 cpp/src/arrow/python/arrow_to_pandas.h      |  41 ++--
 cpp/src/arrow/type.cc                       |   7 +-
 cpp/src/arrow/type.h                        |   9 +-
 cpp/src/arrow/type_traits.h                 |   5 +
 cpp/src/arrow/util/hashing.h                |  21 +-
 python/benchmarks/convert_pandas.py         |  22 +++
 python/pyarrow/array.pxi                    |  91 +++++----
 python/pyarrow/compat.py                    |   6 +-
 python/pyarrow/includes/libarrow.pxd        |   9 +-
 python/pyarrow/lib.pxd                      |  14 +-
 python/pyarrow/pandas_compat.py             |   5 +-
 python/pyarrow/table.pxi                    | 160 ++--------------
 python/pyarrow/tests/test_convert_pandas.py |  85 +++++++++
 14 files changed, 409 insertions(+), 352 deletions(-)

diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc b/cpp/src/arrow/python/arrow_to_pandas.cc
index 29d6435..b532bfb 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.cc
+++ b/cpp/src/arrow/python/arrow_to_pandas.cc
@@ -36,9 +36,11 @@
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/checked_cast.h"
+#include "arrow/util/hashing.h"
 #include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/parallel.h"
+#include "arrow/util/string_view.h"
 #include "arrow/visitor_inline.h"
 
 #include "arrow/compute/api.h"
@@ -75,21 +77,21 @@ template <typename T>
 struct WrapBytes {};
 
 template <>
-struct WrapBytes<StringArray> {
+struct WrapBytes<StringType> {
   static inline PyObject* Wrap(const char* data, int64_t length) {
     return PyUnicode_FromStringAndSize(data, length);
   }
 };
 
 template <>
-struct WrapBytes<BinaryArray> {
+struct WrapBytes<BinaryType> {
   static inline PyObject* Wrap(const char* data, int64_t length) {
     return PyBytes_FromStringAndSize(data, length);
   }
 };
 
 template <>
-struct WrapBytes<FixedSizeBinaryArray> {
+struct WrapBytes<FixedSizeBinaryType> {
   static inline PyObject* Wrap(const char* data, int64_t length) {
     return PyBytes_FromStringAndSize(data, length);
   }
@@ -216,7 +218,7 @@ class PandasBlock {
     CATEGORICAL
   };
 
-  PandasBlock(PandasOptions options, int64_t num_rows, int num_columns)
+  PandasBlock(const PandasOptions& options, int64_t num_rows, int num_columns)
       : num_rows_(num_rows), num_columns_(num_columns), options_(options) {}
   virtual ~PandasBlock() {}
 
@@ -301,8 +303,8 @@ inline const T* GetPrimitiveValues(const Array& arr) {
 }
 
 template <typename T>
-inline void ConvertIntegerWithNulls(PandasOptions options, const ChunkedArray& data,
-                                    double* out_values) {
+inline void ConvertIntegerWithNulls(const PandasOptions& options,
+                                    const ChunkedArray& data, double* out_values) {
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = *data.chunk(c);
     const T* in_values = GetPrimitiveValues<T>(arr);
@@ -315,8 +317,8 @@ inline void ConvertIntegerWithNulls(PandasOptions options, const ChunkedArray& d
 }
 
 template <typename T>
-inline void ConvertIntegerNoNullsSameType(PandasOptions options, const ChunkedArray& data,
-                                          T* out_values) {
+inline void ConvertIntegerNoNullsSameType(const PandasOptions& options,
+                                          const ChunkedArray& data, T* out_values) {
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = *data.chunk(c);
     if (arr.length() > 0) {
@@ -328,8 +330,8 @@ inline void ConvertIntegerNoNullsSameType(PandasOptions options, const ChunkedAr
 }
 
 template <typename InType, typename OutType>
-inline void ConvertIntegerNoNullsCast(PandasOptions options, const ChunkedArray& data,
-                                      OutType* out_values) {
+inline void ConvertIntegerNoNullsCast(const PandasOptions& options,
+                                      const ChunkedArray& data, OutType* out_values) {
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = *data.chunk(c);
     const InType* in_values = GetPrimitiveValues<InType>(arr);
@@ -339,8 +341,8 @@ inline void ConvertIntegerNoNullsCast(PandasOptions options, const ChunkedArray&
   }
 }
 
-static Status ConvertBooleanWithNulls(PandasOptions options, const ChunkedArray& data,
-                                      PyObject** out_values) {
+static Status ConvertBooleanWithNulls(const PandasOptions& options,
+                                      const ChunkedArray& data, PyObject** out_values) {
   PyAcquireGIL lock;
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = checked_cast<const BooleanArray&>(*data.chunk(c));
@@ -363,7 +365,7 @@ static Status ConvertBooleanWithNulls(PandasOptions options, const ChunkedArray&
   return Status::OK();
 }
 
-static void ConvertBooleanNoNulls(PandasOptions options, const ChunkedArray& data,
+static void ConvertBooleanNoNulls(const PandasOptions& options, const ChunkedArray& data,
                                   uint8_t* out_values) {
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = checked_cast<const BooleanArray&>(*data.chunk(c));
@@ -373,57 +375,106 @@ static void ConvertBooleanNoNulls(PandasOptions options, const ChunkedArray& dat
   }
 }
 
-template <typename T>
-static Status ConvertIntegerObjects(PandasOptions options, const ChunkedArray& data,
-                                    PyObject** out_values) {
-  PyAcquireGIL lock;
-  constexpr bool is_signed = std::is_signed<T>::value;
-  for (int c = 0; c < data.num_chunks(); c++) {
-    const auto& arr = *data.chunk(c);
-    const auto* in_values = GetPrimitiveValues<T>(arr);
-
-    for (int i = 0; i < arr.length(); ++i) {
-      if (arr.IsNull(i)) {
-        Py_INCREF(Py_None);
-        *out_values++ = Py_None;
-      } else {
-        *out_values++ = is_signed ? PyLong_FromLongLong(in_values[i])
-                                  : PyLong_FromUnsignedLongLong(in_values[i]);
-        RETURN_IF_PYERROR();
-      }
+// Generic Array -> PyObject** converter that handles object deduplication, if
+// requested
+template <typename ArrayType, typename WriteValue>
+inline Status WriteArrayObjects(const ArrayType& arr, WriteValue&& write_func,
+                                PyObject** out_values) {
+  const bool has_nulls = arr.null_count() > 0;
+  for (int64_t i = 0; i < arr.length(); ++i) {
+    if (has_nulls && arr.IsNull(i)) {
+      Py_INCREF(Py_None);
+      *out_values = Py_None;
+    } else {
+      RETURN_NOT_OK(write_func(arr.GetView(i), out_values));
     }
+    ++out_values;
   }
   return Status::OK();
 }
 
-template <typename Type>
-inline Status ConvertBinaryLike(PandasOptions options, const ChunkedArray& data,
-                                PyObject** out_values) {
+template <typename T, typename Enable = void>
+struct MemoizationTraits {
+  using Scalar = typename T::c_type;
+};
+
+template <typename T>
+struct MemoizationTraits<T, enable_if_binary_like<T>> {
+  // For binary, we memoize string_view as a scalar value to avoid having to
+  // unnecessarily copy the memory into the memo table data structure
+  using Scalar = util::string_view;
+};
+
+template <typename Type, typename WrapFunction>
+inline Status ConvertAsPyObjects(const PandasOptions& options, const ChunkedArray& data,
+                                 WrapFunction&& wrap_func, PyObject** out_values) {
   using ArrayType = typename TypeTraits<Type>::ArrayType;
+  using Scalar = typename MemoizationTraits<Type>::Scalar;
+
   PyAcquireGIL lock;
+  ::arrow::internal::ScalarMemoTable<Scalar> memo_table;
+  std::vector<PyObject*> unique_values;
+  int32_t memo_size = 0;
+
+  auto WrapMemoized = [&](const Scalar& value, PyObject** out_values) {
+    int32_t memo_index = memo_table.GetOrInsert(value);
+    if (memo_index == memo_size) {
+      // New entry
+      RETURN_NOT_OK(wrap_func(value, out_values));
+      unique_values.push_back(*out_values);
+      ++memo_size;
+    } else {
+      // Duplicate entry
+      Py_INCREF(unique_values[memo_index]);
+      *out_values = unique_values[memo_index];
+    }
+    return Status::OK();
+  };
+
+  auto WrapUnmemoized = [&](const Scalar& value, PyObject** out_values) {
+    return wrap_func(value, out_values);
+  };
+
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = checked_cast<const ArrayType&>(*data.chunk(c));
-
-    const bool has_nulls = data.null_count() > 0;
-    for (int64_t i = 0; i < arr.length(); ++i) {
-      if (has_nulls && arr.IsNull(i)) {
-        Py_INCREF(Py_None);
-        *out_values = Py_None;
-      } else {
-        auto view = arr.GetView(i);
-        *out_values = WrapBytes<ArrayType>::Wrap(view.data(), view.length());
-        if (*out_values == nullptr) {
-          PyErr_Clear();
-          return Status::UnknownError("Wrapping ", view, " failed");
-        }
-      }
-      ++out_values;
+    if (options.deduplicate_objects) {
+      RETURN_NOT_OK(WriteArrayObjects(arr, WrapMemoized, out_values));
+    } else {
+      RETURN_NOT_OK(WriteArrayObjects(arr, WrapUnmemoized, out_values));
     }
+    out_values += arr.length();
   }
   return Status::OK();
 }
 
-inline Status ConvertNulls(PandasOptions options, const ChunkedArray& data,
+template <typename Type>
+static Status ConvertIntegerObjects(const PandasOptions& options,
+                                    const ChunkedArray& data, PyObject** out_values) {
+  using T = typename Type::c_type;
+  auto WrapValue = [](T value, PyObject** out) {
+    *out = std::is_signed<T>::value ? PyLong_FromLongLong(value)
+                                    : PyLong_FromUnsignedLongLong(value);
+    RETURN_IF_PYERROR();
+    return Status::OK();
+  };
+  return ConvertAsPyObjects<Type>(options, data, WrapValue, out_values);
+}
+
+template <typename Type>
+inline Status ConvertBinaryLike(const PandasOptions& options, const ChunkedArray& data,
+                                PyObject** out_values) {
+  auto WrapValue = [](const util::string_view& view, PyObject** out) {
+    *out = WrapBytes<Type>::Wrap(view.data(), view.length());
+    if (*out == nullptr) {
+      PyErr_Clear();
+      return Status::UnknownError("Wrapping ", view, " failed");
+    }
+    return Status::OK();
+  };
+  return ConvertAsPyObjects<Type>(options, data, WrapValue, out_values);
+}
+
+inline Status ConvertNulls(const PandasOptions& options, const ChunkedArray& data,
                            PyObject** out_values) {
   PyAcquireGIL lock;
   for (int c = 0; c < data.num_chunks(); c++) {
@@ -439,7 +490,7 @@ inline Status ConvertNulls(PandasOptions options, const ChunkedArray& data,
   return Status::OK();
 }
 
-inline Status ConvertStruct(PandasOptions options, const ChunkedArray& data,
+inline Status ConvertStruct(const PandasOptions& options, const ChunkedArray& data,
                             PyObject** out_values) {
   PyAcquireGIL lock;
   if (data.num_chunks() <= 0) {
@@ -503,7 +554,8 @@ inline Status ConvertStruct(PandasOptions options, const ChunkedArray& data,
 }
 
 template <typename ArrowType>
-inline Status ConvertListsLike(PandasOptions options, const std::shared_ptr<Column>& col,
+inline Status ConvertListsLike(const PandasOptions& options,
+                               const std::shared_ptr<Column>& col,
                                PyObject** out_values) {
   const ChunkedArray& data = *col->data().get();
   const auto& list_type = checked_cast<const ListType&>(*col->type());
@@ -604,69 +656,40 @@ inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values)
   }
 }
 
-template <typename TYPE>
-static Status ConvertDates(PandasOptions options, const ChunkedArray& data,
+template <typename Type>
+static Status ConvertDates(const PandasOptions& options, const ChunkedArray& data,
                            PyObject** out_values) {
-  using ArrayType = typename TypeTraits<TYPE>::ArrayType;
-
-  PyAcquireGIL lock;
-  OwnedRef date_ref;
-
-  PyDateTime_IMPORT;
-
-  for (int c = 0; c < data.num_chunks(); c++) {
-    const auto& arr = checked_cast<const ArrayType&>(*data.chunk(c));
-    auto type = std::dynamic_pointer_cast<TYPE>(arr.type());
-    DCHECK(type);
-
-    const DateUnit unit = type->unit();
-
-    for (int64_t i = 0; i < arr.length(); ++i) {
-      if (arr.IsNull(i)) {
-        Py_INCREF(Py_None);
-        *out_values++ = Py_None;
-      } else {
-        RETURN_NOT_OK(PyDate_from_int(arr.Value(i), unit, out_values++));
-        RETURN_IF_PYERROR();
-      }
-    }
+  {
+    PyAcquireGIL lock;
+    PyDateTime_IMPORT;
   }
-
-  return Status::OK();
+  auto WrapValue = [](typename Type::c_type value, PyObject** out) {
+    RETURN_NOT_OK(PyDate_from_int(value, Type::UNIT, out));
+    RETURN_IF_PYERROR();
+    return Status::OK();
+  };
+  return ConvertAsPyObjects<Type>(options, data, WrapValue, out_values);
 }
 
-template <typename TYPE>
-static Status ConvertTimes(PandasOptions options, const ChunkedArray& data,
+template <typename Type>
+static Status ConvertTimes(const PandasOptions& options, const ChunkedArray& data,
                            PyObject** out_values) {
-  using ArrayType = typename TypeTraits<TYPE>::ArrayType;
-
-  PyAcquireGIL lock;
-  OwnedRef time_ref;
-
-  PyDateTime_IMPORT;
-
-  for (int c = 0; c < data.num_chunks(); c++) {
-    const auto& arr = checked_cast<const ArrayType&>(*data.chunk(c));
-    auto type = std::dynamic_pointer_cast<TYPE>(arr.type());
-    DCHECK(type);
-
-    const TimeUnit::type unit = type->unit();
-
-    for (int64_t i = 0; i < arr.length(); ++i) {
-      if (arr.IsNull(i)) {
-        Py_INCREF(Py_None);
-        *out_values++ = Py_None;
-      } else {
-        RETURN_NOT_OK(PyTime_from_int(arr.Value(i), unit, out_values++));
-        RETURN_IF_PYERROR();
-      }
-    }
+  {
+    PyAcquireGIL lock;
+    PyDateTime_IMPORT;
   }
 
-  return Status::OK();
+  const TimeUnit::type unit = checked_cast<const Type&>(*data.type()).unit();
+
+  auto WrapValue = [unit](typename Type::c_type value, PyObject** out) {
+    RETURN_NOT_OK(PyTime_from_int(value, unit, out));
+    RETURN_IF_PYERROR();
+    return Status::OK();
+  };
+  return ConvertAsPyObjects<Type>(options, data, WrapValue, out_values);
 }
 
-static Status ConvertDecimals(PandasOptions options, const ChunkedArray& data,
+static Status ConvertDecimals(const PandasOptions& options, const ChunkedArray& data,
                               PyObject** out_values) {
   PyAcquireGIL lock;
   OwnedRef decimal;
@@ -715,21 +738,21 @@ class ObjectBlock : public PandasBlock {
     if (type == Type::BOOL) {
       RETURN_NOT_OK(ConvertBooleanWithNulls(options_, data, out_buffer));
     } else if (type == Type::UINT8) {
-      RETURN_NOT_OK(ConvertIntegerObjects<uint8_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<UInt8Type>(options_, data, out_buffer));
     } else if (type == Type::INT8) {
-      RETURN_NOT_OK(ConvertIntegerObjects<int8_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<Int8Type>(options_, data, out_buffer));
     } else if (type == Type::UINT16) {
-      RETURN_NOT_OK(ConvertIntegerObjects<uint16_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<UInt16Type>(options_, data, out_buffer));
     } else if (type == Type::INT16) {
-      RETURN_NOT_OK(ConvertIntegerObjects<int16_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<Int16Type>(options_, data, out_buffer));
     } else if (type == Type::UINT32) {
-      RETURN_NOT_OK(ConvertIntegerObjects<uint32_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<UInt32Type>(options_, data, out_buffer));
     } else if (type == Type::INT32) {
-      RETURN_NOT_OK(ConvertIntegerObjects<int32_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<Int32Type>(options_, data, out_buffer));
     } else if (type == Type::UINT64) {
-      RETURN_NOT_OK(ConvertIntegerObjects<uint64_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<UInt64Type>(options_, data, out_buffer));
     } else if (type == Type::INT64) {
-      RETURN_NOT_OK(ConvertIntegerObjects<int64_t>(options_, data, out_buffer));
+      RETURN_NOT_OK(ConvertIntegerObjects<Int64Type>(options_, data, out_buffer));
     } else if (type == Type::BINARY) {
       RETURN_NOT_OK(ConvertBinaryLike<BinaryType>(options_, data, out_buffer));
     } else if (type == Type::STRING) {
@@ -1009,7 +1032,8 @@ class DatetimeBlock : public PandasBlock {
 
 class DatetimeTZBlock : public DatetimeBlock {
  public:
-  DatetimeTZBlock(PandasOptions options, const std::string& timezone, int64_t num_rows)
+  DatetimeTZBlock(const PandasOptions& options, const std::string& timezone,
+                  int64_t num_rows)
       : DatetimeBlock(options, num_rows, 1), timezone_(timezone) {}
 
   // Like Categorical, the internal ndarray is 1-dimensional
@@ -1038,7 +1062,8 @@ class DatetimeTZBlock : public DatetimeBlock {
 
 class CategoricalBlock : public PandasBlock {
  public:
-  explicit CategoricalBlock(PandasOptions options, MemoryPool* pool, int64_t num_rows)
+  explicit CategoricalBlock(const PandasOptions& options, MemoryPool* pool,
+                            int64_t num_rows)
       : PandasBlock(options, num_rows, 1), pool_(pool), needs_copy_(false) {}
 
   Status Allocate() override {
@@ -1235,7 +1260,7 @@ class CategoricalBlock : public PandasBlock {
   bool needs_copy_;
 };
 
-Status MakeBlock(PandasOptions options, PandasBlock::type type, int64_t num_rows,
+Status MakeBlock(const PandasOptions& options, PandasBlock::type type, int64_t num_rows,
                  int num_columns, std::shared_ptr<PandasBlock>* block) {
 #define BLOCK_CASE(NAME, TYPE)                                       \
   case PandasBlock::NAME:                                            \
@@ -1518,7 +1543,7 @@ class DataFrameBlockCreator {
 
 class ArrowDeserializer {
  public:
-  ArrowDeserializer(PandasOptions options, const std::shared_ptr<Column>& col,
+  ArrowDeserializer(const PandasOptions& options, const std::shared_ptr<Column>& col,
                     PyObject* py_ref)
       : col_(col), data_(*col->data().get()), options_(options), py_ref_(py_ref) {}
 
@@ -1532,7 +1557,7 @@ class ArrowDeserializer {
   }
 
   template <int TYPE>
-  Status ConvertValuesZeroCopy(PandasOptions options, int npy_type,
+  Status ConvertValuesZeroCopy(const PandasOptions& options, int npy_type,
                                const std::shared_ptr<Array>& arr) {
     typedef typename internal::arrow_traits<TYPE>::T T;
 
@@ -1738,9 +1763,7 @@ class ArrowDeserializer {
 
     if (data_.null_count() > 0) {
       if (options_.integer_object_nulls) {
-        using c_type = typename Type::c_type;
-
-        return VisitObjects(ConvertIntegerObjects<c_type>);
+        return VisitObjects(ConvertIntegerObjects<Type>);
       } else {
         RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64));
         auto out_values = reinterpret_cast<double*>(PyArray_DATA(arr_));
@@ -1878,15 +1901,16 @@ class ArrowDeserializer {
   PyObject* result_;
 };
 
-Status ConvertArrayToPandas(PandasOptions options, const std::shared_ptr<Array>& arr,
-                            PyObject* py_ref, PyObject** out) {
+Status ConvertArrayToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Array>& arr, PyObject* py_ref,
+                            PyObject** out) {
   static std::string dummy_name = "dummy";
   auto field = std::make_shared<Field>(dummy_name, arr->type());
   auto col = std::make_shared<Column>(field, arr);
   return ConvertColumnToPandas(options, col, py_ref, out);
 }
 
-Status ConvertChunkedArrayToPandas(PandasOptions options,
+Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                    const std::shared_ptr<ChunkedArray>& ca,
                                    PyObject* py_ref, PyObject** out) {
   static std::string dummy_name = "dummy";
@@ -1895,19 +1919,21 @@ Status ConvertChunkedArrayToPandas(PandasOptions options,
   return ConvertColumnToPandas(options, col, py_ref, out);
 }
 
-Status ConvertColumnToPandas(PandasOptions options, const std::shared_ptr<Column>& col,
-                             PyObject* py_ref, PyObject** out) {
+Status ConvertColumnToPandas(const PandasOptions& options,
+                             const std::shared_ptr<Column>& col, PyObject* py_ref,
+                             PyObject** out) {
   ArrowDeserializer converter(options, col, py_ref);
   return converter.Convert(out);
 }
 
-Status ConvertTableToPandas(PandasOptions options, const std::shared_ptr<Table>& table,
-                            MemoryPool* pool, PyObject** out) {
+Status ConvertTableToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Table>& table, MemoryPool* pool,
+                            PyObject** out) {
   return ConvertTableToPandas(options, std::unordered_set<std::string>(), table, pool,
                               out);
 }
 
-Status ConvertTableToPandas(PandasOptions options,
+Status ConvertTableToPandas(const PandasOptions& options,
                             const std::unordered_set<std::string>& categorical_columns,
                             const std::shared_ptr<Table>& table, MemoryPool* pool,
                             PyObject** out) {
diff --git a/cpp/src/arrow/python/arrow_to_pandas.h b/cpp/src/arrow/python/arrow_to_pandas.h
index 753bf48..20bad40 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.h
+++ b/cpp/src/arrow/python/arrow_to_pandas.h
@@ -43,32 +43,32 @@ namespace py {
 
 struct PandasOptions {
   /// If true, we will convert all string columns to categoricals
-  bool strings_to_categorical;
-  bool zero_copy_only;
-  bool integer_object_nulls;
-  bool date_as_object;
-  bool use_threads;
-
-  PandasOptions()
-      : strings_to_categorical(false),
-        zero_copy_only(false),
-        integer_object_nulls(false),
-        date_as_object(false),
-        use_threads(false) {}
+  bool strings_to_categorical = false;
+  bool zero_copy_only = false;
+  bool integer_object_nulls = false;
+  bool date_as_object = false;
+  bool use_threads = false;
+
+  /// \brief If true, do not create duplicate PyObject versions of equal
+  /// objects. This only applies to immutable objects like strings or datetime
+  /// objects
+  bool deduplicate_objects = false;
 };
 
 ARROW_PYTHON_EXPORT
-Status ConvertArrayToPandas(PandasOptions options, const std::shared_ptr<Array>& arr,
-                            PyObject* py_ref, PyObject** out);
+Status ConvertArrayToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Array>& arr, PyObject* py_ref,
+                            PyObject** out);
 
 ARROW_PYTHON_EXPORT
-Status ConvertChunkedArrayToPandas(PandasOptions options,
+Status ConvertChunkedArrayToPandas(const PandasOptions& options,
                                    const std::shared_ptr<ChunkedArray>& col,
                                    PyObject* py_ref, PyObject** out);
 
 ARROW_PYTHON_EXPORT
-Status ConvertColumnToPandas(PandasOptions options, const std::shared_ptr<Column>& col,
-                             PyObject* py_ref, PyObject** out);
+Status ConvertColumnToPandas(const PandasOptions& options,
+                             const std::shared_ptr<Column>& col, PyObject* py_ref,
+                             PyObject** out);
 
 // Convert a whole table as efficiently as possible to a pandas.DataFrame.
 //
@@ -77,15 +77,16 @@ Status ConvertColumnToPandas(PandasOptions options, const std::shared_ptr<Column
 //
 // tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2])
 ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(PandasOptions options, const std::shared_ptr<Table>& table,
-                            MemoryPool* pool, PyObject** out);
+Status ConvertTableToPandas(const PandasOptions& options,
+                            const std::shared_ptr<Table>& table, MemoryPool* pool,
+                            PyObject** out);
 
 /// Convert a whole table as efficiently as possible to a pandas.DataFrame.
 ///
 /// Explicitly name columns that should be a categorical
 /// This option is only used on conversions that are applied to a table.
 ARROW_PYTHON_EXPORT
-Status ConvertTableToPandas(PandasOptions options,
+Status ConvertTableToPandas(const PandasOptions& options,
                             const std::unordered_set<std::string>& categorical_columns,
                             const std::shared_ptr<Table>& table, MemoryPool* pool,
                             PyObject** out);
diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc
index a8372b9..cd57e2d 100644
--- a/cpp/src/arrow/type.cc
+++ b/cpp/src/arrow/type.cc
@@ -137,12 +137,11 @@ std::string FixedSizeBinaryType::ToString() const {
 // ----------------------------------------------------------------------
 // Date types
 
-DateType::DateType(Type::type type_id, DateUnit unit)
-    : FixedWidthType(type_id), unit_(unit) {}
+DateType::DateType(Type::type type_id) : FixedWidthType(type_id) {}
 
-Date32Type::Date32Type() : DateType(Type::DATE32, DateUnit::DAY) {}
+Date32Type::Date32Type() : DateType(Type::DATE32) {}
 
-Date64Type::Date64Type() : DateType(Type::DATE64, DateUnit::MILLI) {}
+Date64Type::Date64Type() : DateType(Type::DATE64) {}
 
 std::string Date64Type::ToString() const { return std::string("date64[ms]"); }
 
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index 0758ced..6c3643c 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -600,17 +600,17 @@ enum class DateUnit : char { DAY = 0, MILLI = 1 };
 /// \brief Base type class for date data
 class ARROW_EXPORT DateType : public FixedWidthType {
  public:
-  DateUnit unit() const { return unit_; }
+  virtual DateUnit unit() const = 0;
 
  protected:
-  DateType(Type::type type_id, DateUnit unit);
-  DateUnit unit_;
+  explicit DateType(Type::type type_id);
 };
 
 /// Concrete type class for 32-bit date data (as number of days since UNIX epoch)
 class ARROW_EXPORT Date32Type : public DateType {
  public:
   static constexpr Type::type type_id = Type::DATE32;
+  static constexpr DateUnit UNIT = DateUnit::DAY;
 
   using c_type = int32_t;
 
@@ -622,12 +622,14 @@ class ARROW_EXPORT Date32Type : public DateType {
   std::string ToString() const override;
 
   std::string name() const override { return "date32"; }
+  DateUnit unit() const override { return UNIT; }
 };
 
 /// Concrete type class for 64-bit date data (as number of milliseconds since UNIX epoch)
 class ARROW_EXPORT Date64Type : public DateType {
  public:
   static constexpr Type::type type_id = Type::DATE64;
+  static constexpr DateUnit UNIT = DateUnit::MILLI;
 
   using c_type = int64_t;
 
@@ -639,6 +641,7 @@ class ARROW_EXPORT Date64Type : public DateType {
   std::string ToString() const override;
 
   std::string name() const override { return "date64"; }
+  DateUnit unit() const override { return UNIT; }
 };
 
 struct TimeUnit {
diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h
index da5cf25..b89f52f 100644
--- a/cpp/src/arrow/type_traits.h
+++ b/cpp/src/arrow/type_traits.h
@@ -372,6 +372,11 @@ using enable_if_boolean =
     typename std::enable_if<std::is_same<BooleanType, T>::value>::type;
 
 template <typename T>
+using enable_if_binary_like =
+    typename std::enable_if<std::is_base_of<BinaryType, T>::value ||
+                            std::is_base_of<FixedSizeBinaryType, T>::value>::type;
+
+template <typename T>
 using enable_if_fixed_size_binary =
     typename std::enable_if<std::is_base_of<FixedSizeBinaryType, T>::value>::type;
 
diff --git a/cpp/src/arrow/util/hashing.h b/cpp/src/arrow/util/hashing.h
index 76724b2..3dde0be 100644
--- a/cpp/src/arrow/util/hashing.h
+++ b/cpp/src/arrow/util/hashing.h
@@ -103,6 +103,18 @@ struct ScalarHelper<Scalar, AlgNum,
 };
 
 template <typename Scalar, uint64_t AlgNum>
+struct ScalarHelper<
+    Scalar, AlgNum,
+    typename std::enable_if<std::is_same<util::string_view, Scalar>::value>::type>
+    : public ScalarHelperBase<Scalar, AlgNum> {
+  // ScalarHelper specialization for util::string_view
+
+  static hash_t ComputeHash(const util::string_view& value) {
+    return ComputeStringHash<AlgNum>(value.data(), static_cast<int64_t>(value.size()));
+  }
+};
+
+template <typename Scalar, uint64_t AlgNum>
 struct ScalarHelper<Scalar, AlgNum,
                     typename std::enable_if<std::is_floating_point<Scalar>::value>::type>
     : public ScalarHelperBase<Scalar, AlgNum> {
@@ -332,7 +344,7 @@ class ScalarMemoTable {
   explicit ScalarMemoTable(int64_t entries = 0)
       : hash_table_(static_cast<uint64_t>(entries)) {}
 
-  int32_t Get(const Scalar value) const {
+  int32_t Get(const Scalar& value) const {
     auto cmp_func = [value](const Payload* payload) -> bool {
       return ScalarHelper<Scalar, 0>::CompareScalars(payload->value, value);
     };
@@ -346,7 +358,7 @@ class ScalarMemoTable {
   }
 
   template <typename Func1, typename Func2>
-  int32_t GetOrInsert(const Scalar value, Func1&& on_found, Func2&& on_not_found) {
+  int32_t GetOrInsert(const Scalar& value, Func1&& on_found, Func2&& on_not_found) {
     auto cmp_func = [value](const Payload* payload) -> bool {
       return ScalarHelper<Scalar, 0>::CompareScalars(value, payload->value);
     };
@@ -364,7 +376,7 @@ class ScalarMemoTable {
     return memo_index;
   }
 
-  int32_t GetOrInsert(const Scalar value) {
+  int32_t GetOrInsert(const Scalar& value) {
     return GetOrInsert(value, [](int32_t i) {}, [](int32_t i) {});
   }
 
@@ -389,6 +401,7 @@ class ScalarMemoTable {
     Scalar value;
     int32_t memo_index;
   };
+
   using HashTableType = HashTableTemplateType<Payload>;
   using HashTableEntry = typename HashTableType::Entry;
   HashTableType hash_table_;
@@ -621,9 +634,11 @@ class BinaryMemoTable {
   struct Payload {
     int32_t memo_index;
   };
+
   using HashTableType = HashTable<Payload>;
   using HashTableEntry = typename HashTable<Payload>::Entry;
   HashTableType hash_table_;
+
   std::vector<int32_t> offsets_;
   std::string values_;
 
diff --git a/python/benchmarks/convert_pandas.py b/python/benchmarks/convert_pandas.py
index 244b3dc..bb8d710 100644
--- a/python/benchmarks/convert_pandas.py
+++ b/python/benchmarks/convert_pandas.py
@@ -17,6 +17,8 @@
 
 import numpy as np
 import pandas as pd
+import pandas.util.testing as tm
+
 import pyarrow as pa
 
 
@@ -50,6 +52,26 @@ class PandasConversionsFromArrow(PandasConversionsBase):
         self.arrow_data.to_pandas()
 
 
+class ToPandasStrings(object):
+
+    param_names = ('uniqueness', 'total')
+    params = ((0.001, 0.01, 0.1, 0.5), (1000000,))
+    string_length = 25
+
+    def setup(self, uniqueness, total):
+        nunique = int(total * uniqueness)
+        unique_values = [tm.rands(self.string_length) for i in range(nunique)]
+        values = unique_values * (total // nunique)
+        self.arr = pa.array(values, type=pa.string())
+        self.table = pa.Table.from_arrays([self.arr], ['f0'])
+
+    def time_to_pandas_dedup(self, *args):
+        self.arr.to_pandas()
+
+    def time_to_pandas_no_dedup(self, *args):
+        self.arr.to_pandas(deduplicate_objects=False)
+
+
 class ZeroCopyPandasRead(object):
 
     def setup(self):
diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index b86872f..ef95efe 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -339,7 +339,61 @@ def _restore_array(data):
     return pyarrow_wrap_array(MakeArray(ad))
 
 
-cdef class Array:
+cdef class _PandasConvertible:
+
+    def to_pandas(self, categories=None, bint strings_to_categorical=False,
+                  bint zero_copy_only=False, bint integer_object_nulls=False,
+                  bint date_as_object=False,
+                  bint use_threads=True,
+                  bint deduplicate_objects=True,
+                  bint ignore_metadata=False):
+        """
+        Convert to a pandas-compatible NumPy array or DataFrame, as appropriate
+
+        Parameters
+        ----------
+        strings_to_categorical : boolean, default False
+            Encode string (UTF8) and binary types to pandas.Categorical
+        categories: list, default empty
+            List of fields that should be returned as pandas.Categorical. Only
+            applies to table-like data structures
+        zero_copy_only : boolean, default False
+            Raise an ArrowException if this function call would require copying
+            the underlying data
+        integer_object_nulls : boolean, default False
+            Cast integers with nulls to objects
+        date_as_object : boolean, default False
+            Cast dates to objects
+        use_threads: boolean, default True
+            Whether to parallelize the conversion using multiple threads
+        deduplicate_objects : boolean, default True
+            Do not create multiple copies of equal Python objects, to save
+            on memory use. Conversion may be slower when values are unique
+        ignore_metadata : boolean, default False
+            If True, do not use the 'pandas' metadata to reconstruct the
+            DataFrame index, if present
+
+        Returns
+        -------
+        NumPy array or DataFrame depending on type of object
+        """
+        cdef:
+            PyObject* out
+            PandasOptions options
+
+        options = PandasOptions(
+            strings_to_categorical=strings_to_categorical,
+            zero_copy_only=zero_copy_only,
+            integer_object_nulls=integer_object_nulls,
+            date_as_object=date_as_object,
+            use_threads=use_threads,
+            deduplicate_objects=deduplicate_objects)
+
+        return self._to_pandas(options, categories=categories,
+                               ignore_metadata=ignore_metadata)
+
+
+cdef class Array(_PandasConvertible):
 
     def __init__(self):
         raise TypeError("Do not call {}'s constructor directly, use one of "
@@ -602,42 +656,13 @@ cdef class Array:
 
         return pyarrow_wrap_array(result)
 
-    def to_pandas(self, bint strings_to_categorical=False,
-                  bint zero_copy_only=False, bint integer_object_nulls=False,
-                  bint date_as_object=False):
-        """
-        Convert to a NumPy array object suitable for use in pandas.
-
-        Parameters
-        ----------
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-
-        See also
-        --------
-        Column.to_pandas
-        Table.to_pandas
-        RecordBatch.to_pandas
-        """
+    def _to_pandas(self, options, **kwargs):
         cdef:
             PyObject* out
-            PandasOptions options
+            PandasOptions c_options = options
 
-        options = PandasOptions(
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            integer_object_nulls=integer_object_nulls,
-            date_as_object=date_as_object,
-            use_threads=False)
         with nogil:
-            check_status(ConvertArrayToPandas(options, self.sp_array,
+            check_status(ConvertArrayToPandas(c_options, self.sp_array,
                                               self, &out))
         return wrap_array_output(out)
 
diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py
index 068d560..ee924ed 100644
--- a/python/pyarrow/compat.py
+++ b/python/pyarrow/compat.py
@@ -192,11 +192,15 @@ def _iterate_python_module_paths(package_name):
             for finder in sys.meta_path:
                 try:
                     spec = finder.find_spec(absolute_name, None)
-                except AttributeError:
+                except (AttributeError, TypeError):
                     # On Travis (Python 3.5) the above produced:
                     # AttributeError: 'VendorImporter' object has no
                     # attribute 'find_spec'
+                    #
+                    # ARROW-4117: When running "asv dev", TypeError is raised
+                    # due to the meta-importer
                     spec = None
+
                 if spec is not None:
                     break
 
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index 7ce03bf..cc77ff4 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -1064,20 +1064,20 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil:
     CStatus TensorToNdarray(const shared_ptr[CTensor]& tensor, object base,
                             PyObject** out)
 
-    CStatus ConvertArrayToPandas(PandasOptions options,
+    CStatus ConvertArrayToPandas(const PandasOptions& options,
                                  const shared_ptr[CArray]& arr,
                                  object py_ref, PyObject** out)
 
-    CStatus ConvertChunkedArrayToPandas(PandasOptions options,
+    CStatus ConvertChunkedArrayToPandas(const PandasOptions& options,
                                         const shared_ptr[CChunkedArray]& arr,
                                         object py_ref, PyObject** out)
 
-    CStatus ConvertColumnToPandas(PandasOptions options,
+    CStatus ConvertColumnToPandas(const PandasOptions& options,
                                   const shared_ptr[CColumn]& arr,
                                   object py_ref, PyObject** out)
 
     CStatus ConvertTableToPandas(
-        PandasOptions options,
+        const PandasOptions& options,
         const unordered_set[c_string]& categorical_columns,
         const shared_ptr[CTable]& table,
         CMemoryPool* pool,
@@ -1110,6 +1110,7 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil:
         c_bool integer_object_nulls
         c_bool date_as_object
         c_bool use_threads
+        c_bool deduplicate_objects
 
 cdef extern from "arrow/python/api.h" namespace 'arrow::py' nogil:
 
diff --git a/python/pyarrow/lib.pxd b/python/pyarrow/lib.pxd
index d829d6a..8cd8f40 100644
--- a/python/pyarrow/lib.pxd
+++ b/python/pyarrow/lib.pxd
@@ -179,7 +179,11 @@ cdef class FixedSizeBinaryValue(ArrayValue):
     pass
 
 
-cdef class Array:
+cdef class _PandasConvertible:
+    pass
+
+
+cdef class Array(_PandasConvertible):
     cdef:
         shared_ptr[CArray] sp_array
         CArray* ap
@@ -306,7 +310,7 @@ cdef object box_scalar(DataType type,
                        int64_t index)
 
 
-cdef class ChunkedArray:
+cdef class ChunkedArray(_PandasConvertible):
     cdef:
         shared_ptr[CChunkedArray] sp_chunked_array
         CChunkedArray* chunked_array
@@ -315,7 +319,7 @@ cdef class ChunkedArray:
     cdef getitem(self, int64_t i)
 
 
-cdef class Column:
+cdef class Column(_PandasConvertible):
     cdef:
         shared_ptr[CColumn] sp_column
         CColumn* column
@@ -323,7 +327,7 @@ cdef class Column:
     cdef void init(self, const shared_ptr[CColumn]& column)
 
 
-cdef class Table:
+cdef class Table(_PandasConvertible):
     cdef:
         shared_ptr[CTable] sp_table
         CTable* table
@@ -331,7 +335,7 @@ cdef class Table:
     cdef void init(self, const shared_ptr[CTable]& table)
 
 
-cdef class RecordBatch:
+cdef class RecordBatch(_PandasConvertible):
     cdef:
         shared_ptr[CRecordBatch] sp_batch
         CRecordBatch* batch
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index 6acca0c..a5d8621 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -548,7 +548,7 @@ def _make_datetimetz(tz):
 # Converting pyarrow.Table efficiently to pandas.DataFrame
 
 
-def table_to_blockmanager(options, table, memory_pool, categories=None,
+def table_to_blockmanager(options, table, categories=None,
                           ignore_metadata=False):
     from pyarrow.compat import DatetimeTZDtype
 
@@ -624,7 +624,8 @@ def table_to_blockmanager(options, table, memory_pool, categories=None,
                 block_table.schema.get_field_index(raw_name)
             )
 
-    blocks = _table_to_blocks(options, block_table, memory_pool, categories)
+    blocks = _table_to_blocks(options, block_table, pa.default_memory_pool(),
+                              categories)
 
     # Construct the row index
     if len(index_arrays) > 1:
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index 29a784d..59680ed 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -28,7 +28,7 @@ else:
     import pyarrow.pandas_compat as pdcompat
 
 
-cdef class ChunkedArray:
+cdef class ChunkedArray(_PandasConvertible):
     """
     Array backed via one or more memory chunks.
 
@@ -145,43 +145,14 @@ cdef class ChunkedArray:
 
         return result
 
-    def to_pandas(self, bint strings_to_categorical=False,
-                  bint zero_copy_only=False, bint integer_object_nulls=False,
-                  bint date_as_object=False):
-        """
-        Convert the arrow::ChunkedArray to an array object suitable for use
-        in pandas
-
-        Parameters
-        ----------
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-
-        See also
-        --------
-        Column.to_pandas
-        """
+    def _to_pandas(self, options, **kwargs):
         cdef:
             PyObject* out
-            PandasOptions options
-
-        options = PandasOptions(
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            integer_object_nulls=integer_object_nulls,
-            date_as_object=date_as_object,
-            use_threads=False)
+            PandasOptions c_options = options
 
         with nogil:
             check_status(libarrow.ConvertChunkedArrayToPandas(
-                options,
+                c_options,
                 self.sp_chunked_array,
                 self, &out))
 
@@ -385,7 +356,7 @@ def column(object field_or_name, arr):
     return pyarrow_wrap_column(sp_column)
 
 
-cdef class Column:
+cdef class Column(_PandasConvertible):
     """
     Named vector of elements of equal type.
 
@@ -497,33 +468,8 @@ cdef class Column:
 
         return [pyarrow_wrap_column(col) for col in flattened]
 
-    def to_pandas(self, bint strings_to_categorical=False,
-                  bint zero_copy_only=False, bint integer_object_nulls=False,
-                  bint date_as_object=False):
-        """
-        Convert the arrow::Column to a pandas.Series
-
-        Parameters
-        ----------
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-
-        Returns
-        -------
-        pandas.Series
-        """
-        values = self.data.to_pandas(
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            date_as_object=date_as_object,
-            integer_object_nulls=integer_object_nulls)
+    def _to_pandas(self, options, **kwargs):
+        values = self.data._to_pandas(options)
         result = pd.Series(values, name=self.name)
 
         if isinstance(self.type, TimestampType):
@@ -685,7 +631,7 @@ cdef _schema_from_arrays(arrays, names, metadata, shared_ptr[CSchema]* schema):
     schema.reset(new CSchema(c_fields, c_meta))
 
 
-cdef class RecordBatch:
+cdef class RecordBatch(_PandasConvertible):
     """
     Batch of rows of columns of equal length
 
@@ -887,46 +833,8 @@ cdef class RecordBatch:
             entries.append((name, column))
         return OrderedDict(entries)
 
-    def to_pandas(self, MemoryPool memory_pool=None, categories=None,
-                  bint strings_to_categorical=False, bint zero_copy_only=False,
-                  bint integer_object_nulls=False, bint date_as_object=False,
-                  bint use_threads=True, bint ignore_metadata=False):
-        """
-        Convert the arrow::RecordBatch to a pandas DataFrame
-
-        Parameters
-        ----------
-        memory_pool: MemoryPool, optional
-            Specific memory pool to use to allocate casted columns
-        categories: list, default empty
-            List of columns that should be returned as pandas.Categorical
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-        use_threads: boolean, default True
-            Whether to parallelize the conversion using multiple threads
-        ignore_metadata : boolean, default False
-            If True, do not use the 'pandas' metadata to reconstruct the
-            DataFrame index, if present
-
-        Returns
-        -------
-        pandas.DataFrame
-        """
-        return Table.from_batches([self]).to_pandas(
-            memory_pool=memory_pool, categories=categories,
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            integer_object_nulls=integer_object_nulls,
-            date_as_object=date_as_object, use_threads=use_threads,
-            ignore_metadata=ignore_metadata
-        )
+    def _to_pandas(self, options, **kwargs):
+        return Table.from_batches([self])._to_pandas(options, **kwargs)
 
     @classmethod
     def from_pandas(cls, df, Schema schema=None, bint preserve_index=True,
@@ -1031,7 +939,7 @@ def table_to_blocks(PandasOptions options, Table table,
     return PyObject_to_object(result_obj)
 
 
-cdef class Table:
+cdef class Table(_PandasConvertible):
     """
     A collection of top-level named, equal length Arrow arrays.
 
@@ -1386,50 +1294,8 @@ cdef class Table:
 
         return result
 
-    def to_pandas(self, MemoryPool memory_pool=None, categories=None,
-                  bint strings_to_categorical=False, bint zero_copy_only=False,
-                  bint integer_object_nulls=False, bint date_as_object=False,
-                  bint use_threads=True, bint ignore_metadata=False):
-        """
-        Convert the arrow::Table to a pandas DataFrame
-
-        Parameters
-        ----------
-        memory_pool: MemoryPool, optional
-            Specific memory pool to use to allocate casted columns
-        categories: list, default empty
-            List of columns that should be returned as pandas.Categorical
-        strings_to_categorical : boolean, default False
-            Encode string (UTF8) and binary types to pandas.Categorical
-        zero_copy_only : boolean, default False
-            Raise an ArrowException if this function call would require copying
-            the underlying data
-        integer_object_nulls : boolean, default False
-            Cast integers with nulls to objects
-        date_as_object : boolean, default False
-            Cast dates to objects
-        use_threads: boolean, default True
-            Whether to parallelize the conversion using multiple threads
-        ignore_metadata : boolean, default False
-            If True, do not use the 'pandas' metadata to reconstruct the
-            DataFrame index, if present
-
-        Returns
-        -------
-        pandas.DataFrame
-        """
-        cdef:
-            PandasOptions options
-
-        options = PandasOptions(
-            strings_to_categorical=strings_to_categorical,
-            zero_copy_only=zero_copy_only,
-            integer_object_nulls=integer_object_nulls,
-            date_as_object=date_as_object,
-            use_threads=use_threads)
-
-        mgr = pdcompat.table_to_blockmanager(options, self, memory_pool,
-                                             categories,
+    def _to_pandas(self, options, categories=None, ignore_metadata=False):
+        mgr = pdcompat.table_to_blockmanager(options, self, categories,
                                              ignore_metadata=ignore_metadata)
         return pd.DataFrame(mgr)
 
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index 1221484..8d8b65b 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -2316,6 +2316,91 @@ def test_convert_unsupported_type_error_message():
         pa.Table.from_pandas(df)
 
 
+# ----------------------------------------------------------------------
+# Test object deduplication in to_pandas
+
+
+def _generate_dedup_example(nunique, repeats):
+    unique_values = [tm.rands(10) for i in range(nunique)]
+    return unique_values * repeats
+
+
+def _assert_nunique(obj, expected):
+    assert len({id(x) for x in obj}) == expected
+
+
+def test_to_pandas_deduplicate_strings_array_types():
+    nunique = 100
+    repeats = 10
+    values = _generate_dedup_example(nunique, repeats)
+
+    for arr in [pa.array(values, type=pa.binary()),
+                pa.array(values, type=pa.utf8()),
+                pa.chunked_array([values, values]),
+                pa.column('foo', [values, values])]:
+        _assert_nunique(arr.to_pandas(), nunique)
+        _assert_nunique(arr.to_pandas(deduplicate_objects=False), len(arr))
+
+
+def test_to_pandas_deduplicate_strings_table_types():
+    nunique = 100
+    repeats = 10
+    values = _generate_dedup_example(nunique, repeats)
+
+    arr = pa.array(values)
+    rb = pa.RecordBatch.from_arrays([arr], ['foo'])
+    tbl = pa.Table.from_batches([rb])
+
+    for obj in [rb, tbl]:
+        _assert_nunique(obj.to_pandas()['foo'], nunique)
+        _assert_nunique(obj.to_pandas(deduplicate_objects=False)['foo'],
+                        len(obj))
+
+
+def test_to_pandas_deduplicate_integers_as_objects():
+    nunique = 100
+    repeats = 10
+
+    # Python automatically interns smaller integers
+    unique_values = list(np.random.randint(10000000, 1000000000, size=nunique))
+    unique_values[nunique // 2] = None
+
+    arr = pa.array(unique_values * repeats)
+
+    _assert_nunique(arr.to_pandas(integer_object_nulls=True), nunique)
+    _assert_nunique(arr.to_pandas(integer_object_nulls=True,
+                                  deduplicate_objects=False),
+                    # Account for None
+                    (nunique - 1) * repeats + 1)
+
+
+def test_to_pandas_deduplicate_date_time():
+    nunique = 100
+    repeats = 10
+
+    unique_values = list(range(nunique))
+
+    cases = [
+        # raw type, array type, to_pandas options
+        ('int32', 'date32', {'date_as_object': True}),
+        ('int64', 'date64', {'date_as_object': True}),
+        ('int32', 'time32[ms]', {}),
+        ('int64', 'time64[us]', {})
+    ]
+
+    for raw_type, array_type, pandas_options in cases:
+        raw_arr = pa.array(unique_values * repeats, type=raw_type)
+        casted_arr = raw_arr.cast(array_type)
+
+        _assert_nunique(casted_arr.to_pandas(**pandas_options),
+                        nunique)
+        _assert_nunique(casted_arr.to_pandas(deduplicate_objects=False,
+                                             **pandas_options),
+                        len(casted_arr))
+
+
+# ---------------------------------------------------------------------
+
 def test_table_from_pandas_keeps_column_order_of_dataframe():
     df1 = pd.DataFrame(OrderedDict([
         ('partition', [0, 0, 1, 1]),