Posted to commits@arrow.apache.org by ks...@apache.org on 2018/12/10 15:43:02 UTC

[arrow] branch master updated: ARROW-2624: [Python] Random schema generator for Arrow conversion and Parquet testing

This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 9da4584  ARROW-2624: [Python] Random schema generator for Arrow conversion and Parquet testing
9da4584 is described below

commit 9da458437162574f3e0d82e4a51dc6c1589b9f94
Author: Krisztián Szűcs <sz...@gmail.com>
AuthorDate: Mon Dec 10 16:42:53 2018 +0100

    ARROW-2624: [Python] Random schema generator for Arrow conversion and Parquet testing
    
    - introduce hypothesis strategies to generate pyarrow types, fields and schemas
    - add test cases highlighting the functionality provided by hypothesis
    - leave the hypothesis tests disabled by default
    - represent key-value metadata as OrderedDict on the Python side instead of plain dicts (pickling was nondeterministic; hypothesis surfaced this bug)
    - unify the multiple metadata conversion paths into a single pair (pyarrow_wrap_metadata, pyarrow_unwrap_metadata)
    
    Also resolves: [ARROW-3901: [Python] Make Schema hashable](https://issues.apache.org/jira/browse/ARROW-3901)
    Follow-up issue: [ARROW-3903: [Python] Random data generator for ... testing](https://issues.apache.org/jira/browse/ARROW-3903)
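
    A minimal sketch of what the new strategies enable in a property-based
    test (module path as added in this patch; the test name and body are
    illustrative):

        import pickle

        import hypothesis as h
        import pyarrow.tests.strategies as past

        @h.given(past.all_schemas)
        def test_schema_pickle_roundtrip(schema):
            # hypothesis draws schemas over primitive, list, struct and
            # nested column types; pickling must round-trip all of them
            assert pickle.loads(pickle.dumps(schema)) == schema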
    
    Author: Krisztián Szűcs <sz...@gmail.com>
    
    Closes #3046 from kszucs/ARROW-2624 and squashes the following commits:
    
    3e27ad15 <Krisztián Szűcs> hypo profiles
    88b107bb <Krisztián Szűcs> install hypothesis for msvc wheel test
    8fb6d0bc <Krisztián Szűcs> make pyarrow_wrap_metadata private
    80a276be <Krisztián Szűcs> manylinux
    26e6ecd6 <Krisztián Szűcs> manylinux
    e385d243 <Krisztián Szűcs> manylinux
    b6fe7576 <Krisztián Szűcs> append in unwrap
    0e28e5df <Krisztián Szűcs> ci fixes
    efeb65ee <Krisztián Szűcs> use conda_env_python.yml in travis
    1f7ad6b6 <Krisztián Szűcs> don't validate metadata type pyarrow_wrap_metadata
    14e444d9 <Krisztián Szűcs> introduce requirements-test.txt
    11b020c0 <Krisztián Szűcs> install hypothesis on appveyor and travis
    6bd5b21e <Krisztián Szűcs> license header
    a8fae546 <Krisztián Szűcs> remove unbox_metadata
    e8c0f3f5 <Krisztián Szűcs> add hypo as test dependency; hashing test
    e7bab691 <Krisztián Szűcs> remove box_metadata
    f1ae290e <Krisztián Szűcs> hypothesis strategies for pyarrow types; deterministic key-value metadata conversions
---
 ci/appveyor-cpp-build.bat                      |   2 +-
 ci/conda_env_python.yml                        |   2 +
 ci/cpp-msvc-build-main.bat                     |   2 +-
 ci/travis_script_python.sh                     |  10 +-
 dev/release/rat_exclude_files.txt              |   1 +
 dev/release/verify-release-candidate.sh        |   2 +-
 python/manylinux1/build_arrow.sh               |   5 +-
 python/manylinux1/scripts/build_virtualenvs.sh |   2 +-
 python/pyarrow/includes/libarrow.pxd           |   8 +-
 python/pyarrow/lib.pxd                         |   6 +-
 python/pyarrow/public-api.pxi                  |  25 +++++
 python/pyarrow/table.pxi                       |  60 ++++++-----
 python/pyarrow/tests/conftest.py               |  34 ++++--
 python/pyarrow/tests/strategies.py             | 138 +++++++++++++++++++++++++
 python/pyarrow/tests/test_types.py             |  50 +++++++++
 python/pyarrow/types.pxi                       |  88 +++++++---------
 python/requirements-test.txt                   |   5 +
 python/requirements.txt                        |   9 +-
 python/setup.py                                |   3 +-
 19 files changed, 348 insertions(+), 104 deletions(-)

diff --git a/ci/appveyor-cpp-build.bat b/ci/appveyor-cpp-build.bat
index 91212a6..b8e4316 100644
--- a/ci/appveyor-cpp-build.bat
+++ b/ci/appveyor-cpp-build.bat
@@ -91,7 +91,7 @@ if "%JOB%" == "Build_Debug" (
 
 conda create -n arrow -q -y ^
       python=%PYTHON% ^
-      six pytest setuptools numpy pandas cython ^
+      six pytest setuptools numpy pandas cython hypothesis ^
       thrift-cpp=0.11.0 boost-cpp ^
       -c conda-forge
 
diff --git a/ci/conda_env_python.yml b/ci/conda_env_python.yml
index 429851e..c187155 100644
--- a/ci/conda_env_python.yml
+++ b/ci/conda_env_python.yml
@@ -16,6 +16,8 @@
 # under the License.
 
 cython
+cloudpickle
+hypothesis
 nomkl
 numpy
 pandas
diff --git a/ci/cpp-msvc-build-main.bat b/ci/cpp-msvc-build-main.bat
index ef961b2..7349f8d 100644
--- a/ci/cpp-msvc-build-main.bat
+++ b/ci/cpp-msvc-build-main.bat
@@ -112,6 +112,6 @@ pip install %WHEEL_PATH% || exit /B
 python -c "import pyarrow" || exit /B
 python -c "import pyarrow.parquet" || exit /B
 
-pip install pandas pickle5 pytest pytest-faulthandler || exit /B
+pip install pandas pickle5 pytest pytest-faulthandler hypothesis || exit /B
 
 py.test -r sxX --durations=15 --pyargs pyarrow.tests || exit /B
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index e4290ed..b316c81 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -51,13 +51,11 @@ if [ $ARROW_TRAVIS_PYTHON_JVM == "1" ]; then
   CONDA_JVM_DEPS="jpype1"
 fi
 
-conda install -y -q pip \
-      nomkl \
-      cloudpickle \
+conda install -y -q \
+      --file $TRAVIS_BUILD_DIR/ci/conda_env_python.yml \
+      pip \
       numpy=1.13.1 \
-      ${CONDA_JVM_DEPS} \
-      pandas \
-      cython
+      ${CONDA_JVM_DEPS}
 
 if [ "$ARROW_TRAVIS_PYTHON_DOCS" == "1" ] && [ "$PYTHON_VERSION" == "3.6" ]; then
   # Install documentation dependencies
diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt
index 0baf29e..e274d97 100644
--- a/dev/release/rat_exclude_files.txt
+++ b/dev/release/rat_exclude_files.txt
@@ -129,6 +129,7 @@ python/MANIFEST.in
 python/pyarrow/includes/__init__.pxd
 python/pyarrow/tests/__init__.py
 python/requirements.txt
+python/requirements-test.txt
 pax_global_header
 MANIFEST.in
 __init__.pxd
diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh
index 5b66663..57b1850 100755
--- a/dev/release/verify-release-candidate.sh
+++ b/dev/release/verify-release-candidate.sh
@@ -189,7 +189,7 @@ test_and_install_cpp() {
 test_python() {
   pushd python
 
-  pip install -r requirements.txt
+  pip install -r requirements-test.txt
 
   python setup.py build_ext --inplace --with-parquet --with-plasma
   py.test pyarrow -v --pdb
diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh
index 4481652..9042973 100755
--- a/python/manylinux1/build_arrow.sh
+++ b/python/manylinux1/build_arrow.sh
@@ -107,7 +107,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
     PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py bdist_wheel
     PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py sdist
 
-    echo "=== (${PYTHON}) Test the existence of optional modules ==="
+    echo "=== (${PYTHON}) Ensure the existence of mandatory modules ==="
     $PIP install -r requirements.txt
 
     echo "=== (${PYTHON}) Tag the wheel with manylinux1 ==="
@@ -122,6 +122,9 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
     PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
     PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.plasma"
 
+    echo "=== (${PYTHON}) Install modules required for testing ==="
+    pip install -r requirements-test.txt
+
     # The TensorFlow test will be skipped here, since TensorFlow is not
     # manylinux1 compatible; however, the wheels will support TensorFlow on
     # a TensorFlow compatible system
diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh
index 18f3b0d..1410031 100755
--- a/python/manylinux1/scripts/build_virtualenvs.sh
+++ b/python/manylinux1/scripts/build_virtualenvs.sh
@@ -41,7 +41,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
     echo "=== (${PYTHON}, ${U_WIDTH}) Preparing virtualenv for tests ==="
     "$(cpython_path $PYTHON ${U_WIDTH})/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download /venv-test-${PYTHON}-${U_WIDTH}
     source /venv-test-${PYTHON}-${U_WIDTH}/bin/activate
-    pip install pytest 'numpy==1.14.5' 'pandas==0.23.4'
+    pip install pytest hypothesis 'numpy==1.14.5' 'pandas==0.23.4'
     deactivate
 done
 
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index c5e7457..61517e4 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -23,9 +23,15 @@ cdef extern from "arrow/util/key_value_metadata.h" namespace "arrow" nogil:
     cdef cppclass CKeyValueMetadata" arrow::KeyValueMetadata":
         CKeyValueMetadata()
         CKeyValueMetadata(const unordered_map[c_string, c_string]&)
+        CKeyValueMetadata(const vector[c_string]& keys,
+                          const vector[c_string]& values)
 
-        c_bool Equals(const CKeyValueMetadata& other)
+        void reserve(int64_t n)
+        int64_t size() const
+        c_string key(int64_t i) const
+        c_string value(int64_t i) const
 
+        c_bool Equals(const CKeyValueMetadata& other)
         void Append(const c_string& key, const c_string& value)
         void ToUnorderedMap(unordered_map[c_string, c_string]*) const
 
diff --git a/python/pyarrow/lib.pxd b/python/pyarrow/lib.pxd
index 098ae62..745a049 100644
--- a/python/pyarrow/lib.pxd
+++ b/python/pyarrow/lib.pxd
@@ -384,11 +384,13 @@ cdef get_reader(object source, c_bool use_memory_map,
                 shared_ptr[RandomAccessFile]* reader)
 cdef get_writer(object source, shared_ptr[OutputStream]* writer)
 
-cdef dict box_metadata(const CKeyValueMetadata* sp_metadata)
-
 # Default is allow_none=False
 cdef DataType ensure_type(object type, c_bool allow_none=*)
 
+cdef shared_ptr[CKeyValueMetadata] pyarrow_unwrap_metadata(object meta)
+cdef object pyarrow_wrap_metadata(
+    const shared_ptr[const CKeyValueMetadata]& meta)
+
 #
 # Public Cython API for 3rd party code
 #
diff --git a/python/pyarrow/public-api.pxi b/python/pyarrow/public-api.pxi
index e8798c5..ef54c7a 100644
--- a/python/pyarrow/public-api.pxi
+++ b/python/pyarrow/public-api.pxi
@@ -92,6 +92,31 @@ cdef public api object pyarrow_wrap_data_type(
     return out
 
 
+cdef object pyarrow_wrap_metadata(
+        const shared_ptr[const CKeyValueMetadata]& meta):
+    cdef const CKeyValueMetadata* cmeta = meta.get()
+
+    if cmeta == nullptr:
+        return None
+
+    result = OrderedDict()
+    for i in range(cmeta.size()):
+        result[cmeta.key(i)] = cmeta.value(i)
+
+    return result
+
+
+cdef shared_ptr[CKeyValueMetadata] pyarrow_unwrap_metadata(object meta):
+    cdef vector[c_string] keys, values
+
+    if isinstance(meta, dict):
+        keys = map(tobytes, meta.keys())
+        values = map(tobytes, meta.values())
+        return make_shared[CKeyValueMetadata](keys, values)
+
+    return shared_ptr[CKeyValueMetadata]()
+
+
 cdef public api bint pyarrow_is_field(object field):
     return isinstance(field, Field)
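
A note on the user-visible effect of pyarrow_wrap_metadata and
pyarrow_unwrap_metadata above: key-value metadata now round-trips through an
OrderedDict, so insertion order (and therefore pickling) is deterministic.
A quick sketch, assuming pyarrow built from this commit:

    import pyarrow as pa
    from collections import OrderedDict

    f = pa.field('f0', pa.int32()).add_metadata({'key': 'value'})
    # keys and values come back as byte strings, in insertion order
    assert isinstance(f.metadata, OrderedDict)
    assert f.metadata == OrderedDict([(b'key', b'value')])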
 
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index 0d529d3..fd565af 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -634,26 +634,22 @@ cdef class Column:
         return pyarrow_wrap_chunked_array(self.column.data())
 
 
-cdef shared_ptr[const CKeyValueMetadata] unbox_metadata(dict metadata):
-    if metadata is None:
-        return <shared_ptr[const CKeyValueMetadata]> nullptr
-    cdef:
-        unordered_map[c_string, c_string] unordered_metadata = metadata
-    return (<shared_ptr[const CKeyValueMetadata]>
-            make_shared[CKeyValueMetadata](unordered_metadata))
-
-
-cdef _schema_from_arrays(arrays, names, dict metadata,
-                         shared_ptr[CSchema]* schema):
+cdef _schema_from_arrays(arrays, names, metadata, shared_ptr[CSchema]* schema):
     cdef:
         Column col
         c_string c_name
         vector[shared_ptr[CField]] fields
         shared_ptr[CDataType] type_
         Py_ssize_t K = len(arrays)
+        shared_ptr[CKeyValueMetadata] c_meta
+
+    if metadata is not None:
+        if not isinstance(metadata, dict):
+            raise TypeError('Metadata must be an instance of dict')
+        c_meta = pyarrow_unwrap_metadata(metadata)
 
     if K == 0:
-        schema.reset(new CSchema(fields, unbox_metadata(metadata)))
+        schema.reset(new CSchema(fields, c_meta))
         return
 
     fields.resize(K)
@@ -684,7 +680,7 @@ cdef _schema_from_arrays(arrays, names, dict metadata,
                 c_name = tobytes(names[i])
             fields[i].reset(new CField(c_name, type_, True))
 
-    schema.reset(new CSchema(fields, unbox_metadata(metadata)))
+    schema.reset(new CSchema(fields, c_meta))
 
 
 cdef class RecordBatch:
@@ -715,7 +711,7 @@ cdef class RecordBatch:
     def __len__(self):
         return self.batch.num_rows()
 
-    def replace_schema_metadata(self, dict metadata=None):
+    def replace_schema_metadata(self, metadata=None):
         """
         EXPERIMENTAL: Create shallow copy of record batch by replacing schema
         key-value metadata with the indicated new metadata (which may be None,
@@ -729,15 +725,19 @@ cdef class RecordBatch:
         -------
         shallow_copy : RecordBatch
         """
-        cdef shared_ptr[CKeyValueMetadata] c_meta
+        cdef:
+            shared_ptr[CKeyValueMetadata] c_meta
+            shared_ptr[CRecordBatch] c_batch
+
         if metadata is not None:
-            convert_metadata(metadata, &c_meta)
+            if not isinstance(metadata, dict):
+                raise TypeError('Metadata must be an instance of dict')
+            c_meta = pyarrow_unwrap_metadata(metadata)
 
-        cdef shared_ptr[CRecordBatch] new_batch
         with nogil:
-            new_batch = self.batch.ReplaceSchemaMetadata(c_meta)
+            c_batch = self.batch.ReplaceSchemaMetadata(c_meta)
 
-        return pyarrow_wrap_batch(new_batch)
+        return pyarrow_wrap_batch(c_batch)
 
     @property
     def num_columns(self):
@@ -953,7 +953,7 @@ cdef class RecordBatch:
         return cls.from_arrays(arrays, names, metadata)
 
     @staticmethod
-    def from_arrays(list arrays, names, dict metadata=None):
+    def from_arrays(list arrays, names, metadata=None):
         """
         Construct a RecordBatch from multiple pyarrow.Arrays
 
@@ -1062,7 +1062,7 @@ cdef class Table:
         columns = [col.data for col in self.columns]
         return _reconstruct_table, (columns, self.schema)
 
-    def replace_schema_metadata(self, dict metadata=None):
+    def replace_schema_metadata(self, metadata=None):
         """
         EXPERIMENTAL: Create shallow copy of table by replacing schema
         key-value metadata with the indicated new metadata (which may be None,
@@ -1076,15 +1076,19 @@ cdef class Table:
         -------
         shallow_copy : Table
         """
-        cdef shared_ptr[CKeyValueMetadata] c_meta
+        cdef:
+            shared_ptr[CKeyValueMetadata] c_meta
+            shared_ptr[CTable] c_table
+
         if metadata is not None:
-            convert_metadata(metadata, &c_meta)
+            if not isinstance(metadata, dict):
+                raise TypeError('Metadata must be an instance of dict')
+            c_meta = pyarrow_unwrap_metadata(metadata)
 
-        cdef shared_ptr[CTable] new_table
         with nogil:
-            new_table = self.table.ReplaceSchemaMetadata(c_meta)
+            c_table = self.table.ReplaceSchemaMetadata(c_meta)
 
-        return pyarrow_wrap_table(new_table)
+        return pyarrow_wrap_table(c_table)
 
     def flatten(self, MemoryPool memory_pool=None):
         """
@@ -1225,7 +1229,7 @@ cdef class Table:
         return cls.from_arrays(arrays, names=names, metadata=metadata)
 
     @staticmethod
-    def from_arrays(arrays, names=None, schema=None, dict metadata=None):
+    def from_arrays(arrays, names=None, schema=None, metadata=None):
         """
         Construct a Table from Arrow arrays or columns
 
@@ -1236,6 +1240,8 @@ cdef class Table:
         names: list of str, optional
             Names for the table columns. If Columns passed, will be
             inferred. If Arrays passed, this argument is required
+        schema : Schema, default None
+            If not passed, will be inferred from the arrays
 
         Returns
         -------
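
With the table.pxi changes above, replace_schema_metadata and the from_arrays
constructors accept any dict subclass (such as OrderedDict) and raise
TypeError for non-dicts, instead of restricting the argument type in the
signature. A usage sketch, assuming this build:

    import pyarrow as pa

    t = pa.Table.from_arrays([pa.array([1, 2])], names=['c0'],
                             metadata={'origin': 'sketch'})
    assert t.schema.metadata == {b'origin': b'sketch'}
    # replacing with no metadata drops the existing key-value pairs
    assert t.replace_schema_metadata().schema.metadata is None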
diff --git a/python/pyarrow/tests/conftest.py b/python/pyarrow/tests/conftest.py
index 6cdedbb..69e8e82 100644
--- a/python/pyarrow/tests/conftest.py
+++ b/python/pyarrow/tests/conftest.py
@@ -15,7 +15,9 @@
 # specific language governing permissions and limitations
 # under the License.
 
+import os
 import pytest
+import hypothesis as h
 
 try:
     import pathlib
@@ -23,7 +25,20 @@ except ImportError:
     import pathlib2 as pathlib  # py2 compat
 
 
+# setup hypothesis profiles
+h.settings.register_profile('ci', max_examples=1000)
+h.settings.register_profile('dev', max_examples=10)
+h.settings.register_profile('debug', max_examples=10,
+                            verbosity=h.Verbosity.verbose)
+
+# load default hypothesis profile, either set HYPOTHESIS_PROFILE environment
+# variable or pass --hypothesis-profile option to pytest, to see the generated
+# examples try: pytest pyarrow -sv --only-hypothesis --hypothesis-profile=debug
+h.settings.load_profile(os.environ.get('HYPOTHESIS_PROFILE', 'default'))
+
+
 groups = [
+    'hypothesis',
     'gandiva',
     'hdfs',
     'large_memory',
@@ -36,6 +51,7 @@ groups = [
 
 
 defaults = {
+    'hypothesis': False,
     'gandiva': False,
     'hdfs': False,
     'large_memory': False,
@@ -84,16 +100,15 @@ def pytest_configure(config):
 
 def pytest_addoption(parser):
     for group in groups:
-        parser.addoption('--{0}'.format(group), action='store_true',
-                         default=defaults[group],
-                         help=('Enable the {0} test group'.format(group)))
+        for flag in ['--{0}', '--enable-{0}']:
+            parser.addoption(flag.format(group), action='store_true',
+                             default=defaults[group],
+                             help=('Enable the {0} test group'.format(group)))
 
-    for group in groups:
         parser.addoption('--disable-{0}'.format(group), action='store_true',
                          default=False,
                          help=('Disable the {0} test group'.format(group)))
 
-    for group in groups:
         parser.addoption('--only-{0}'.format(group), action='store_true',
                          default=False,
                          help=('Run only the {0} test group'.format(group)))
@@ -115,15 +130,18 @@ def pytest_runtest_setup(item):
     only_set = False
 
     for group in groups:
+        flag = '--{0}'.format(group)
         only_flag = '--only-{0}'.format(group)
+        enable_flag = '--enable-{0}'.format(group)
         disable_flag = '--disable-{0}'.format(group)
-        flag = '--{0}'.format(group)
 
         if item.config.getoption(only_flag):
             only_set = True
         elif getattr(item.obj, group, None):
-            if (item.config.getoption(disable_flag) or
-                    not item.config.getoption(flag)):
+            is_enabled = (item.config.getoption(flag) or
+                          item.config.getoption(enable_flag))
+            is_disabled = item.config.getoption(disable_flag)
+            if is_disabled or not is_enabled:
                 pytest.skip('{0} NOT enabled'.format(flag))
 
     if only_set:
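
The reworked flag handling above adds an --enable-{group} alias for each bare
--{group} option. A usage sketch for the new, default-off hypothesis group
(run from the python/ directory):

    # opt in to the hypothesis test group; either spelling works
    pytest pyarrow --hypothesis
    pytest pyarrow --enable-hypothesis

    # run only the hypothesis tests, showing the generated examples
    HYPOTHESIS_PROFILE=debug pytest pyarrow -sv --only-hypothesis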
diff --git a/python/pyarrow/tests/strategies.py b/python/pyarrow/tests/strategies.py
new file mode 100644
index 0000000..bc8ded2
--- /dev/null
+++ b/python/pyarrow/tests/strategies.py
@@ -0,0 +1,138 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pyarrow as pa
+import hypothesis.strategies as st
+
+
+# TODO(kszucs): alphanum_text, surrogate_text
+custom_text = st.text(
+    alphabet=st.characters(
+        min_codepoint=0x41,
+        max_codepoint=0x7E
+    )
+)
+
+null_type = st.just(pa.null())
+bool_type = st.just(pa.bool_())
+
+binary_type = st.just(pa.binary())
+string_type = st.just(pa.string())
+
+signed_integer_types = st.sampled_from([
+    pa.int8(),
+    pa.int16(),
+    pa.int32(),
+    pa.int64()
+])
+unsigned_integer_types = st.sampled_from([
+    pa.uint8(),
+    pa.uint16(),
+    pa.uint32(),
+    pa.uint64()
+])
+integer_types = st.one_of(signed_integer_types, unsigned_integer_types)
+
+floating_types = st.sampled_from([
+    pa.float16(),
+    pa.float32(),
+    pa.float64()
+])
+decimal_type = st.builds(
+    pa.decimal128,
+    precision=st.integers(min_value=0, max_value=38),
+    scale=st.integers(min_value=0, max_value=38)
+)
+numeric_types = st.one_of(integer_types, floating_types, decimal_type)
+
+date_types = st.sampled_from([
+    pa.date32(),
+    pa.date64()
+])
+time_types = st.sampled_from([
+    pa.time32('s'),
+    pa.time32('ms'),
+    pa.time64('us'),
+    pa.time64('ns')
+])
+timestamp_types = st.sampled_from([
+    pa.timestamp('s'),
+    pa.timestamp('ms'),
+    pa.timestamp('us'),
+    pa.timestamp('ns')
+])
+temporal_types = st.one_of(date_types, time_types, timestamp_types)
+
+primitive_types = st.one_of(
+    null_type,
+    bool_type,
+    binary_type,
+    string_type,
+    numeric_types,
+    temporal_types
+)
+
+metadata = st.dictionaries(st.text(), st.text())
+
+
+@st.defines_strategy
+def fields(type_strategy=primitive_types):
+    return st.builds(pa.field, name=custom_text, type=type_strategy,
+                     nullable=st.booleans(), metadata=metadata)
+
+
+@st.defines_strategy
+def list_types(item_strategy=primitive_types):
+    return st.builds(pa.list_, item_strategy)
+
+
+@st.defines_strategy
+def struct_types(item_strategy=primitive_types):
+    return st.builds(pa.struct, st.lists(fields(item_strategy)))
+
+
+@st.defines_strategy
+def complex_types(inner_strategy=primitive_types):
+    return list_types(inner_strategy) | struct_types(inner_strategy)
+
+
+@st.defines_strategy
+def nested_list_types(item_strategy=primitive_types):
+    return st.recursive(item_strategy, list_types)
+
+
+@st.defines_strategy
+def nested_struct_types(item_strategy=primitive_types):
+    return st.recursive(item_strategy, struct_types)
+
+
+@st.defines_strategy
+def nested_complex_types(inner_strategy=primitive_types):
+    return st.recursive(inner_strategy, complex_types)
+
+
+@st.defines_strategy
+def schemas(type_strategy=primitive_types):
+    return st.builds(pa.schema, st.lists(fields(type_strategy)))
+
+
+complex_schemas = schemas(complex_types())
+
+
+all_types = st.one_of(primitive_types, complex_types(), nested_complex_types())
+all_fields = fields(all_types)
+all_schemas = schemas(all_types)
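
The strategies above compose freely: any type strategy lifts into fields and
schemas. A sketch of exploring them interactively (names as defined in the
new module; .example() is for exploration only, tests should use
@hypothesis.given):

    import pyarrow.tests.strategies as past

    # draw one random schema whose columns use nested list/struct types
    schema = past.schemas(past.nested_complex_types()).example()
    print(schema)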
diff --git a/python/pyarrow/tests/test_types.py b/python/pyarrow/tests/test_types.py
index 176ce87..310656d 100644
--- a/python/pyarrow/tests/test_types.py
+++ b/python/pyarrow/tests/test_types.py
@@ -19,11 +19,14 @@ from collections import OrderedDict
 
 import pickle
 import pytest
+import hypothesis as h
+import hypothesis.strategies as st
 
 import pandas as pd
 import numpy as np
 import pyarrow as pa
 import pyarrow.types as types
+import pyarrow.tests.strategies as past
 
 
 def get_many_types():
@@ -466,15 +469,27 @@ def test_field_metadata():
 
 
 def test_field_add_remove_metadata():
+    import collections
+
     f0 = pa.field('foo', pa.int32())
 
     assert f0.metadata is None
 
     metadata = {b'foo': b'bar', b'pandas': b'badger'}
+    metadata2 = collections.OrderedDict([
+        (b'a', b'alpha'),
+        (b'b', b'beta')
+    ])
 
     f1 = f0.add_metadata(metadata)
     assert f1.metadata == metadata
 
+    f2 = f0.add_metadata(metadata2)
+    assert f2.metadata == metadata2
+
+    with pytest.raises(TypeError):
+        f0.add_metadata([1, 2, 3])
+
     f3 = f1.remove_metadata()
     assert f3.metadata is None
 
@@ -533,3 +548,38 @@ def test_schema_from_pandas(data):
     schema = pa.Schema.from_pandas(df)
     expected = pa.Table.from_pandas(df).schema
     assert schema == expected
+
+
+@h.given(
+    past.all_types |
+    past.all_fields |
+    past.all_schemas
+)
+@h.example(
+    pa.field(name='', type=pa.null(), metadata={'0': '', '': ''})
+)
+def test_pickling(field):
+    data = pickle.dumps(field)
+    assert pickle.loads(data) == field
+
+
+@h.given(
+    st.lists(past.all_types) |
+    st.lists(past.all_fields) |
+    st.lists(past.all_schemas)
+)
+def test_hashing(items):
+    h.assume(
+        # well, this is still O(n^2), but makes the input unique
+        all(not a.equals(b) for i, a in enumerate(items) for b in items[:i])
+    )
+
+    container = {}
+    for i, item in enumerate(items):
+        assert hash(item) == hash(item)
+        container[item] = i
+
+    assert len(container) == len(items)
+
+    for i, item in enumerate(items):
+        assert container[item] == i
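
test_hashing above exercises the ARROW-3901 fix: Schema now implements
__hash__ consistently with equality, so schemas can be used as dict keys.
A minimal sketch:

    import pyarrow as pa

    s1 = pa.schema([pa.field('a', pa.int32())])
    s2 = pa.schema([pa.field('a', pa.int32())])
    cache = {s1: 'cached-plan'}
    # equal schemas hash equally, so a rebuilt schema hits the cache
    assert cache[s2] == 'cached-plan'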
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 1ebd196..f69190c 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -430,11 +430,9 @@ cdef class Field:
 
     @property
     def metadata(self):
-        cdef shared_ptr[const CKeyValueMetadata] metadata = (
-            self.field.metadata())
-        return box_metadata(metadata.get())
+        return pyarrow_wrap_metadata(self.field.metadata())
 
-    def add_metadata(self, dict metadata):
+    def add_metadata(self, metadata):
         """
         Add metadata as dict of string keys and values to Field
 
@@ -447,14 +445,18 @@ cdef class Field:
         -------
         field : pyarrow.Field
         """
-        cdef shared_ptr[CKeyValueMetadata] c_meta
-        convert_metadata(metadata, &c_meta)
+        cdef:
+            shared_ptr[CField] c_field
+            shared_ptr[CKeyValueMetadata] c_meta
 
-        cdef shared_ptr[CField] new_field
+        if not isinstance(metadata, dict):
+            raise TypeError('Metadata must be an instance of dict')
+
+        c_meta = pyarrow_unwrap_metadata(metadata)
         with nogil:
-            new_field = self.field.AddMetadata(c_meta)
+            c_field = self.field.AddMetadata(c_meta)
 
-        return pyarrow_wrap_field(new_field)
+        return pyarrow_wrap_field(c_field)
 
     def remove_metadata(self):
         """
@@ -515,6 +517,9 @@ cdef class Schema:
     def __reduce__(self):
         return schema, (list(self), self.metadata)
 
+    def __hash__(self):
+        return hash((tuple(self), self.metadata))
+
     @property
     def names(self):
         """
@@ -544,9 +549,7 @@ cdef class Schema:
 
     @property
     def metadata(self):
-        cdef shared_ptr[const CKeyValueMetadata] metadata = (
-            self.schema.metadata())
-        return box_metadata(metadata.get())
+        return pyarrow_wrap_metadata(self.schema.metadata())
 
     def __eq__(self, other):
         try:
@@ -728,7 +731,7 @@ cdef class Schema:
 
         return pyarrow_wrap_schema(new_schema)
 
-    def add_metadata(self, dict metadata):
+    def add_metadata(self, metadata):
         """
         Add metadata as dict of string keys and values to Schema
 
@@ -741,14 +744,18 @@ cdef class Schema:
         -------
         schema : pyarrow.Schema
         """
-        cdef shared_ptr[CKeyValueMetadata] c_meta
-        convert_metadata(metadata, &c_meta)
+        cdef:
+            shared_ptr[CKeyValueMetadata] c_meta
+            shared_ptr[CSchema] c_schema
 
-        cdef shared_ptr[CSchema] new_schema
+        if not isinstance(metadata, dict):
+            raise TypeError('Metadata must be an instance of dict')
+
+        c_meta = pyarrow_unwrap_metadata(metadata)
         with nogil:
-            new_schema = self.schema.AddMetadata(c_meta)
+            c_schema = self.schema.AddMetadata(c_meta)
 
-        return pyarrow_wrap_schema(new_schema)
+        return pyarrow_wrap_schema(c_schema)
 
     def serialize(self, memory_pool=None):
         """
@@ -810,15 +817,6 @@ cdef class Schema:
         return self.__str__()
 
 
-cdef dict box_metadata(const CKeyValueMetadata* metadata):
-    cdef unordered_map[c_string, c_string] result
-    if metadata != nullptr:
-        metadata.ToUnorderedMap(&result)
-        return result
-    else:
-        return None
-
-
 cdef dict _type_cache = {}
 
 
@@ -832,25 +830,12 @@ cdef DataType primitive_type(Type type):
     _type_cache[type] = out
     return out
 
+
 # -----------------------------------------------------------
 # Type factory functions
 
-cdef int convert_metadata(dict metadata,
-                          shared_ptr[CKeyValueMetadata]* out) except -1:
-    cdef:
-        shared_ptr[CKeyValueMetadata] meta = (
-            make_shared[CKeyValueMetadata]())
-        c_string key, value
-
-    for py_key, py_value in metadata.items():
-        key = tobytes(py_key)
-        value = tobytes(py_value)
-        meta.get().Append(key, value)
-    out[0] = meta
-    return 0
-
 
-def field(name, type, bint nullable=True, dict metadata=None):
+def field(name, type, bint nullable=True, metadata=None):
     """
     Create a pyarrow.Field instance
 
@@ -867,17 +852,21 @@ def field(name, type, bint nullable=True, dict metadata=None):
     field : pyarrow.Field
     """
     cdef:
-        shared_ptr[CKeyValueMetadata] c_meta
         Field result = Field.__new__(Field)
         DataType _type = ensure_type(type, allow_none=False)
+        shared_ptr[CKeyValueMetadata] c_meta
 
     if metadata is not None:
-        convert_metadata(metadata, &c_meta)
+        if not isinstance(metadata, dict):
+            raise TypeError('Metadata must be an instance of dict')
+        c_meta = pyarrow_unwrap_metadata(metadata)
 
-    result.sp_field.reset(new CField(tobytes(name), _type.sp_type,
-                                     nullable == 1, c_meta))
+    result.sp_field.reset(
+        new CField(tobytes(name), _type.sp_type, nullable, c_meta)
+    )
     result.field = result.sp_field.get()
     result.type = _type
+
     return result
 
 
@@ -1490,7 +1479,7 @@ cdef DataType ensure_type(object ty, c_bool allow_none=False):
         raise TypeError('DataType expected, got {!r}'.format(type(ty)))
 
 
-def schema(fields, dict metadata=None):
+def schema(fields, metadata=None):
     """
     Construct pyarrow.Schema from collection of fields
 
@@ -1535,11 +1524,14 @@ def schema(fields, dict metadata=None):
         c_fields.push_back(py_field.sp_field)
 
     if metadata is not None:
-        convert_metadata(metadata, &c_meta)
+        if not isinstance(metadata, dict):
+            raise TypeError('Metadata must be an instance of dict')
+        c_meta = pyarrow_unwrap_metadata(metadata)
 
     c_schema.reset(new CSchema(c_fields, c_meta))
     result = Schema.__new__(Schema)
     result.init_schema(c_schema)
+
     return result
 
 
diff --git a/python/requirements-test.txt b/python/requirements-test.txt
new file mode 100644
index 0000000..482e888
--- /dev/null
+++ b/python/requirements-test.txt
@@ -0,0 +1,5 @@
+-r requirements.txt
+pandas
+pytest
+hypothesis
+pathlib2; python_version < "3.4"
diff --git a/python/requirements.txt b/python/requirements.txt
index ddedd75..3a23d1d 100644
--- a/python/requirements.txt
+++ b/python/requirements.txt
@@ -1,6 +1,3 @@
-six
-pytest
-cloudpickle>=0.4.0
-numpy>=1.14.0
-futures; python_version < "3"
-pathlib2; python_version < "3.4"
+six>=1.0.0
+numpy>=1.14
+futures; python_version < "3.2"
diff --git a/python/setup.py b/python/setup.py
index e6a8871..b8d192d 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -577,7 +577,8 @@ setup(
                      },
     setup_requires=['setuptools_scm', 'cython >= 0.27'] + setup_requires,
     install_requires=install_requires,
-    tests_require=['pytest', 'pandas', 'pathlib2; python_version < "3.4"'],
+    tests_require=['pytest', 'pandas', 'hypothesis',
+                   'pathlib2; python_version < "3.4"'],
     description="Python library for Apache Arrow",
     long_description=long_description,
     long_description_content_type="text/markdown",