Posted to commits@arrow.apache.org by ks...@apache.org on 2018/12/10 15:43:02 UTC
[arrow] branch master updated: ARROW-2624: [Python] Random schema
generator for Arrow conversion and Parquet testing
This is an automated email from the ASF dual-hosted git repository.
kszucs pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 9da4584 ARROW-2624: [Python] Random schema generator for Arrow conversion and Parquet testing
9da4584 is described below
commit 9da458437162574f3e0d82e4a51dc6c1589b9f94
Author: Krisztián Szűcs <sz...@gmail.com>
AuthorDate: Mon Dec 10 16:42:53 2018 +0100
ARROW-2624: [Python] Random schema generator for Arrow conversion and Parquet testing
- introduced hypothesis to generate pyarrow types, fields and schemas
- test cases to highlight the functionality provided by hypothesis
- hypothesis tests are disabled by default
- represent key-value metadata as OrderedDict on the Python side instead of plain dicts (pickling was nondeterministic; hypothesis found this bug)
- unified multiple metadata conversion paths into a single one (pyarrow_wrap_metadata, pyarrow_unwrap_metadata)
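
To illustrate why the ordering of key-value metadata matters for pickling, here is a rough standard-library sketch. It is not pyarrow code: `unwrap_metadata` and `wrap_metadata` are hypothetical pure-Python stand-ins for the Cython helpers named above, and the example assumes Python 3.7+, where plain dicts iterate in insertion order.

```python
import pickle
from collections import OrderedDict

# Same content, different insertion order (assumes Python 3.7+ dict ordering).
first = {b'foo': b'bar', b'pandas': b'badger'}
second = {b'pandas': b'badger', b'foo': b'bar'}

# Plain dicts compare equal regardless of order...
assert first == second
# ...but pickle serializes items in iteration order, so the bytes differ.
assert pickle.dumps(first) != pickle.dumps(second)


def unwrap_metadata(meta):
    """Hypothetical stand-in: split a mapping into parallel key/value lists."""
    return list(meta.keys()), list(meta.values())


def wrap_metadata(keys, values):
    """Hypothetical stand-in: rebuild an ordered mapping from parallel lists."""
    return OrderedDict(zip(keys, values))


original = OrderedDict([(b'a', b'alpha'), (b'b', b'beta')])
roundtripped = wrap_metadata(*unwrap_metadata(original))

# The round trip keeps both the content and the order of the entries.
assert roundtripped == original
assert list(roundtripped.items()) == list(original.items())
```

Carrying keys and values as two parallel sequences, rather than going through an unordered map, is what lets the order survive the conversion in this sketch.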
Also resolves: [ARROW-3901: [Python] Make Schema hashable](https://issues.apache.org/jira/browse/ARROW-3901)
Follow-up issue: [ARROW-3903: [Python] Random data generator for ... testing](https://issues.apache.org/jira/browse/ARROW-3903)
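
As a rough illustration of what a hashable schema has to guarantee, the sketch below uses a hypothetical `SketchSchema` class (not the pyarrow `Schema`) together with a `check_hashing` helper modeled on the container check in the hashing test. Freezing the metadata into a tuple of items is an assumption made for this sketch, since mapping types such as OrderedDict are not hashable themselves.

```python
from collections import OrderedDict


class SketchSchema:
    """Hypothetical stand-in for a schema: field names plus optional metadata."""

    def __init__(self, names, metadata=None):
        self.names = tuple(names)
        self.metadata = None if metadata is None else OrderedDict(metadata)

    def _key(self):
        # Mappings are unhashable, so freeze the metadata into a tuple of items.
        meta = None if self.metadata is None else tuple(self.metadata.items())
        return (self.names, meta)

    def __eq__(self, other):
        if not isinstance(other, SketchSchema):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Equal objects share a key, so they also share a hash.
        return hash(self._key())


def check_hashing(items):
    """Store pairwise-distinct items in a dict and look each one up again."""
    container = {}
    for i, item in enumerate(items):
        assert hash(item) == hash(item)
        container[item] = i
    assert len(container) == len(items)
    for i, item in enumerate(items):
        assert container[item] == i
    return container


items = [
    SketchSchema(['a']),
    SketchSchema(['a'], {b'k': b'v'}),
    SketchSchema(['a', 'b']),
]
container = check_hashing(items)
```

Deriving both `__eq__` and `__hash__` from the same key keeps the two consistent, which is the property a dict lookup relies on.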
Author: Krisztián Szűcs <sz...@gmail.com>
Closes #3046 from kszucs/ARROW-2624 and squashes the following commits:
3e27ad15 <Krisztián Szűcs> hypo profiles
88b107bb <Krisztián Szűcs> install hypothesis for msvc wheel test
8fb6d0bc <Krisztián Szűcs> make pyarrow_wrap_metadata private
80a276be <Krisztián Szűcs> manylinux
26e6ecd6 <Krisztián Szűcs> manylinux
e385d243 <Krisztián Szűcs> manylinux
b6fe7576 <Krisztián Szűcs> append in unwrap
0e28e5df <Krisztián Szűcs> ci fixes
efeb65ee <Krisztián Szűcs> use conde_env_python.yml in travis
1f7ad6b6 <Krisztián Szűcs> don't validate metadata type pyarrow_wrap_metadata
14e444d9 <Krisztián Szűcs> introduce requirements-test.txt
11b020c0 <Krisztián Szűcs> install hypothesis on appveyor and travis
6bd5b21e <Krisztián Szűcs> license header
a8fae546 <Krisztián Szűcs> remove unbox_metadata
e8c0f3f5 <Krisztián Szűcs> add hypo as test dependency; hashing test
e7bab691 <Krisztián Szűcs> remove box_metadata
f1ae290e <Krisztián Szűcs> hypothesis strategies for pyarrow types; deterministic key-value metadata conversions
---
ci/appveyor-cpp-build.bat | 2 +-
ci/conda_env_python.yml | 2 +
ci/cpp-msvc-build-main.bat | 2 +-
ci/travis_script_python.sh | 10 +-
dev/release/rat_exclude_files.txt | 1 +
dev/release/verify-release-candidate.sh | 2 +-
python/manylinux1/build_arrow.sh | 5 +-
python/manylinux1/scripts/build_virtualenvs.sh | 2 +-
python/pyarrow/includes/libarrow.pxd | 8 +-
python/pyarrow/lib.pxd | 6 +-
python/pyarrow/public-api.pxi | 25 +++++
python/pyarrow/table.pxi | 60 ++++++-----
python/pyarrow/tests/conftest.py | 34 ++++--
python/pyarrow/tests/strategies.py | 138 +++++++++++++++++++++++++
python/pyarrow/tests/test_types.py | 50 +++++++++
python/pyarrow/types.pxi | 88 +++++++---------
python/requirements-test.txt | 5 +
python/requirements.txt | 9 +-
python/setup.py | 3 +-
19 files changed, 348 insertions(+), 104 deletions(-)
diff --git a/ci/appveyor-cpp-build.bat b/ci/appveyor-cpp-build.bat
index 91212a6..b8e4316 100644
--- a/ci/appveyor-cpp-build.bat
+++ b/ci/appveyor-cpp-build.bat
@@ -91,7 +91,7 @@ if "%JOB%" == "Build_Debug" (
conda create -n arrow -q -y ^
python=%PYTHON% ^
- six pytest setuptools numpy pandas cython ^
+ six pytest setuptools numpy pandas cython hypothesis ^
thrift-cpp=0.11.0 boost-cpp ^
-c conda-forge
diff --git a/ci/conda_env_python.yml b/ci/conda_env_python.yml
index 429851e..c187155 100644
--- a/ci/conda_env_python.yml
+++ b/ci/conda_env_python.yml
@@ -16,6 +16,8 @@
# under the License.
cython
+cloudpickle
+hypothesis
nomkl
numpy
pandas
diff --git a/ci/cpp-msvc-build-main.bat b/ci/cpp-msvc-build-main.bat
index ef961b2..7349f8d 100644
--- a/ci/cpp-msvc-build-main.bat
+++ b/ci/cpp-msvc-build-main.bat
@@ -112,6 +112,6 @@ pip install %WHEEL_PATH% || exit /B
python -c "import pyarrow" || exit /B
python -c "import pyarrow.parquet" || exit /B
-pip install pandas pickle5 pytest pytest-faulthandler || exit /B
+pip install pandas pickle5 pytest pytest-faulthandler hypothesis || exit /B
py.test -r sxX --durations=15 --pyargs pyarrow.tests || exit /B
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index e4290ed..b316c81 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -51,13 +51,11 @@ if [ $ARROW_TRAVIS_PYTHON_JVM == "1" ]; then
CONDA_JVM_DEPS="jpype1"
fi
-conda install -y -q pip \
- nomkl \
- cloudpickle \
+conda install -y -q \
+ --file $TRAVIS_BUILD_DIR/ci/conda_env_python.yml \
+ pip \
numpy=1.13.1 \
- ${CONDA_JVM_DEPS} \
- pandas \
- cython
+ ${CONDA_JVM_DEPS}
if [ "$ARROW_TRAVIS_PYTHON_DOCS" == "1" ] && [ "$PYTHON_VERSION" == "3.6" ]; then
# Install documentation dependencies
diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt
index 0baf29e..e274d97 100644
--- a/dev/release/rat_exclude_files.txt
+++ b/dev/release/rat_exclude_files.txt
@@ -129,6 +129,7 @@ python/MANIFEST.in
python/pyarrow/includes/__init__.pxd
python/pyarrow/tests/__init__.py
python/requirements.txt
+python/requirements-test.txt
pax_global_header
MANIFEST.in
__init__.pxd
diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh
index 5b66663..57b1850 100755
--- a/dev/release/verify-release-candidate.sh
+++ b/dev/release/verify-release-candidate.sh
@@ -189,7 +189,7 @@ test_and_install_cpp() {
test_python() {
pushd python
- pip install -r requirements.txt
+ pip install -r requirements-test.txt
python setup.py build_ext --inplace --with-parquet --with-plasma
py.test pyarrow -v --pdb
diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh
index 4481652..9042973 100755
--- a/python/manylinux1/build_arrow.sh
+++ b/python/manylinux1/build_arrow.sh
@@ -107,7 +107,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py bdist_wheel
PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER setup.py sdist
- echo "=== (${PYTHON}) Test the existence of optional modules ==="
+ echo "=== (${PYTHON}) Ensure the existence of mandatory modules ==="
$PIP install -r requirements.txt
echo "=== (${PYTHON}) Tag the wheel with manylinux1 ==="
@@ -122,6 +122,9 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
PATH="$PATH:${CPYTHON_PATH}/bin" $PYTHON_INTERPRETER -c "import pyarrow.plasma"
+ echo "=== (${PYTHON}) Install modules required for testing ==="
+ pip install -r requirements-test.txt
+
# The TensorFlow test will be skipped here, since TensorFlow is not
# manylinux1 compatible; however, the wheels will support TensorFlow on
# a TensorFlow compatible system
diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh
index 18f3b0d..1410031 100755
--- a/python/manylinux1/scripts/build_virtualenvs.sh
+++ b/python/manylinux1/scripts/build_virtualenvs.sh
@@ -41,7 +41,7 @@ for PYTHON_TUPLE in ${PYTHON_VERSIONS}; do
echo "=== (${PYTHON}, ${U_WIDTH}) Preparing virtualenv for tests ==="
"$(cpython_path $PYTHON ${U_WIDTH})/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download /venv-test-${PYTHON}-${U_WIDTH}
source /venv-test-${PYTHON}-${U_WIDTH}/bin/activate
- pip install pytest 'numpy==1.14.5' 'pandas==0.23.4'
+ pip install pytest hypothesis 'numpy==1.14.5' 'pandas==0.23.4'
deactivate
done
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index c5e7457..61517e4 100644
--- a/python/pyarrow/includes/libarrow.pxd
+++ b/python/pyarrow/includes/libarrow.pxd
@@ -23,9 +23,15 @@ cdef extern from "arrow/util/key_value_metadata.h" namespace "arrow" nogil:
cdef cppclass CKeyValueMetadata" arrow::KeyValueMetadata":
CKeyValueMetadata()
CKeyValueMetadata(const unordered_map[c_string, c_string]&)
+ CKeyValueMetadata(const vector[c_string]& keys,
+ const vector[c_string]& values)
- c_bool Equals(const CKeyValueMetadata& other)
+ void reserve(int64_t n)
+ int64_t size() const
+ c_string key(int64_t i) const
+ c_string value(int64_t i) const
+ c_bool Equals(const CKeyValueMetadata& other)
void Append(const c_string& key, const c_string& value)
void ToUnorderedMap(unordered_map[c_string, c_string]*) const
diff --git a/python/pyarrow/lib.pxd b/python/pyarrow/lib.pxd
index 098ae62..745a049 100644
--- a/python/pyarrow/lib.pxd
+++ b/python/pyarrow/lib.pxd
@@ -384,11 +384,13 @@ cdef get_reader(object source, c_bool use_memory_map,
shared_ptr[RandomAccessFile]* reader)
cdef get_writer(object source, shared_ptr[OutputStream]* writer)
-cdef dict box_metadata(const CKeyValueMetadata* sp_metadata)
-
# Default is allow_none=False
cdef DataType ensure_type(object type, c_bool allow_none=*)
+cdef shared_ptr[CKeyValueMetadata] pyarrow_unwrap_metadata(object meta)
+cdef object pyarrow_wrap_metadata(
+ const shared_ptr[const CKeyValueMetadata]& meta)
+
#
# Public Cython API for 3rd party code
#
diff --git a/python/pyarrow/public-api.pxi b/python/pyarrow/public-api.pxi
index e8798c5..ef54c7a 100644
--- a/python/pyarrow/public-api.pxi
+++ b/python/pyarrow/public-api.pxi
@@ -92,6 +92,31 @@ cdef public api object pyarrow_wrap_data_type(
return out
+cdef object pyarrow_wrap_metadata(
+ const shared_ptr[const CKeyValueMetadata]& meta):
+ cdef const CKeyValueMetadata* cmeta = meta.get()
+
+ if cmeta == nullptr:
+ return None
+
+ result = OrderedDict()
+ for i in range(cmeta.size()):
+ result[cmeta.key(i)] = cmeta.value(i)
+
+ return result
+
+
+cdef shared_ptr[CKeyValueMetadata] pyarrow_unwrap_metadata(object meta):
+ cdef vector[c_string] keys, values
+
+ if isinstance(meta, dict):
+ keys = map(tobytes, meta.keys())
+ values = map(tobytes, meta.values())
+ return make_shared[CKeyValueMetadata](keys, values)
+
+ return shared_ptr[CKeyValueMetadata]()
+
+
cdef public api bint pyarrow_is_field(object field):
return isinstance(field, Field)
diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index 0d529d3..fd565af 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -634,26 +634,22 @@ cdef class Column:
return pyarrow_wrap_chunked_array(self.column.data())
-cdef shared_ptr[const CKeyValueMetadata] unbox_metadata(dict metadata):
- if metadata is None:
- return <shared_ptr[const CKeyValueMetadata]> nullptr
- cdef:
- unordered_map[c_string, c_string] unordered_metadata = metadata
- return (<shared_ptr[const CKeyValueMetadata]>
- make_shared[CKeyValueMetadata](unordered_metadata))
-
-
-cdef _schema_from_arrays(arrays, names, dict metadata,
- shared_ptr[CSchema]* schema):
+cdef _schema_from_arrays(arrays, names, metadata, shared_ptr[CSchema]* schema):
cdef:
Column col
c_string c_name
vector[shared_ptr[CField]] fields
shared_ptr[CDataType] type_
Py_ssize_t K = len(arrays)
+ shared_ptr[CKeyValueMetadata] c_meta
+
+ if metadata is not None:
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+ c_meta = pyarrow_unwrap_metadata(metadata)
if K == 0:
- schema.reset(new CSchema(fields, unbox_metadata(metadata)))
+ schema.reset(new CSchema(fields, c_meta))
return
fields.resize(K)
@@ -684,7 +680,7 @@ cdef _schema_from_arrays(arrays, names, dict metadata,
c_name = tobytes(names[i])
fields[i].reset(new CField(c_name, type_, True))
- schema.reset(new CSchema(fields, unbox_metadata(metadata)))
+ schema.reset(new CSchema(fields, c_meta))
cdef class RecordBatch:
@@ -715,7 +711,7 @@ cdef class RecordBatch:
def __len__(self):
return self.batch.num_rows()
- def replace_schema_metadata(self, dict metadata=None):
+ def replace_schema_metadata(self, metadata=None):
"""
EXPERIMENTAL: Create shallow copy of record batch by replacing schema
key-value metadata with the indicated new metadata (which may be None,
@@ -729,15 +725,19 @@ cdef class RecordBatch:
-------
shallow_copy : RecordBatch
"""
- cdef shared_ptr[CKeyValueMetadata] c_meta
+ cdef:
+ shared_ptr[CKeyValueMetadata] c_meta
+ shared_ptr[CRecordBatch] c_batch
+
if metadata is not None:
- convert_metadata(metadata, &c_meta)
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+ c_meta = pyarrow_unwrap_metadata(metadata)
- cdef shared_ptr[CRecordBatch] new_batch
with nogil:
- new_batch = self.batch.ReplaceSchemaMetadata(c_meta)
+ c_batch = self.batch.ReplaceSchemaMetadata(c_meta)
- return pyarrow_wrap_batch(new_batch)
+ return pyarrow_wrap_batch(c_batch)
@property
def num_columns(self):
@@ -953,7 +953,7 @@ cdef class RecordBatch:
return cls.from_arrays(arrays, names, metadata)
@staticmethod
- def from_arrays(list arrays, names, dict metadata=None):
+ def from_arrays(list arrays, names, metadata=None):
"""
Construct a RecordBatch from multiple pyarrow.Arrays
@@ -1062,7 +1062,7 @@ cdef class Table:
columns = [col.data for col in self.columns]
return _reconstruct_table, (columns, self.schema)
- def replace_schema_metadata(self, dict metadata=None):
+ def replace_schema_metadata(self, metadata=None):
"""
EXPERIMENTAL: Create shallow copy of table by replacing schema
key-value metadata with the indicated new metadata (which may be None,
@@ -1076,15 +1076,19 @@ cdef class Table:
-------
shallow_copy : Table
"""
- cdef shared_ptr[CKeyValueMetadata] c_meta
+ cdef:
+ shared_ptr[CKeyValueMetadata] c_meta
+ shared_ptr[CTable] c_table
+
if metadata is not None:
- convert_metadata(metadata, &c_meta)
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+ c_meta = pyarrow_unwrap_metadata(metadata)
- cdef shared_ptr[CTable] new_table
with nogil:
- new_table = self.table.ReplaceSchemaMetadata(c_meta)
+ c_table = self.table.ReplaceSchemaMetadata(c_meta)
- return pyarrow_wrap_table(new_table)
+ return pyarrow_wrap_table(c_table)
def flatten(self, MemoryPool memory_pool=None):
"""
@@ -1225,7 +1229,7 @@ cdef class Table:
return cls.from_arrays(arrays, names=names, metadata=metadata)
@staticmethod
- def from_arrays(arrays, names=None, schema=None, dict metadata=None):
+ def from_arrays(arrays, names=None, schema=None, metadata=None):
"""
Construct a Table from Arrow arrays or columns
@@ -1236,6 +1240,8 @@ cdef class Table:
names: list of str, optional
Names for the table columns. If Columns passed, will be
inferred. If Arrays passed, this argument is required
+ schema : Schema, default None
+ If not passed, will be inferred from the arrays
Returns
-------
diff --git a/python/pyarrow/tests/conftest.py b/python/pyarrow/tests/conftest.py
index 6cdedbb..69e8e82 100644
--- a/python/pyarrow/tests/conftest.py
+++ b/python/pyarrow/tests/conftest.py
@@ -15,7 +15,9 @@
# specific language governing permissions and limitations
# under the License.
+import os
import pytest
+import hypothesis as h
try:
import pathlib
@@ -23,7 +25,20 @@ except ImportError:
import pathlib2 as pathlib # py2 compat
+# setup hypothesis profiles
+h.settings.register_profile('ci', max_examples=1000)
+h.settings.register_profile('dev', max_examples=10)
+h.settings.register_profile('debug', max_examples=10,
+ verbosity=h.Verbosity.verbose)
+
+# load default hypothesis profile, either set HYPOTHESIS_PROFILE environment
+# variable or pass --hypothesis-profile option to pytest, to see the generated
+# examples try: pytest pyarrow -sv --only-hypothesis --hypothesis-profile=debug
+h.settings.load_profile(os.environ.get('HYPOTHESIS_PROFILE', 'default'))
+
+
groups = [
+ 'hypothesis',
'gandiva',
'hdfs',
'large_memory',
@@ -36,6 +51,7 @@ groups = [
defaults = {
+ 'hypothesis': False,
'gandiva': False,
'hdfs': False,
'large_memory': False,
@@ -84,16 +100,15 @@ def pytest_configure(config):
def pytest_addoption(parser):
for group in groups:
- parser.addoption('--{0}'.format(group), action='store_true',
- default=defaults[group],
- help=('Enable the {0} test group'.format(group)))
+ for flag in ['--{0}', '--enable-{0}']:
+ parser.addoption(flag.format(group), action='store_true',
+ default=defaults[group],
+ help=('Enable the {0} test group'.format(group)))
- for group in groups:
parser.addoption('--disable-{0}'.format(group), action='store_true',
default=False,
help=('Disable the {0} test group'.format(group)))
- for group in groups:
parser.addoption('--only-{0}'.format(group), action='store_true',
default=False,
help=('Run only the {0} test group'.format(group)))
@@ -115,15 +130,18 @@ def pytest_runtest_setup(item):
only_set = False
for group in groups:
+ flag = '--{0}'.format(group)
only_flag = '--only-{0}'.format(group)
+ enable_flag = '--enable-{0}'.format(group)
disable_flag = '--disable-{0}'.format(group)
- flag = '--{0}'.format(group)
if item.config.getoption(only_flag):
only_set = True
elif getattr(item.obj, group, None):
- if (item.config.getoption(disable_flag) or
- not item.config.getoption(flag)):
+ is_enabled = (item.config.getoption(flag) or
+ item.config.getoption(enable_flag))
+ is_disabled = item.config.getoption(disable_flag)
+ if is_disabled or not is_enabled:
pytest.skip('{0} NOT enabled'.format(flag))
if only_set:
diff --git a/python/pyarrow/tests/strategies.py b/python/pyarrow/tests/strategies.py
new file mode 100644
index 0000000..bc8ded2
--- /dev/null
+++ b/python/pyarrow/tests/strategies.py
@@ -0,0 +1,138 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import pyarrow as pa
+import hypothesis.strategies as st
+
+
+# TODO(kszucs): alphanum_text, surrogate_text
+custom_text = st.text(
+ alphabet=st.characters(
+ min_codepoint=0x41,
+ max_codepoint=0x7E
+ )
+)
+
+null_type = st.just(pa.null())
+bool_type = st.just(pa.bool_())
+
+binary_type = st.just(pa.binary())
+string_type = st.just(pa.string())
+
+signed_integer_types = st.sampled_from([
+ pa.int8(),
+ pa.int16(),
+ pa.int32(),
+ pa.int64()
+])
+unsigned_integer_types = st.sampled_from([
+ pa.uint8(),
+ pa.uint16(),
+ pa.uint32(),
+ pa.uint64()
+])
+integer_types = st.one_of(signed_integer_types, unsigned_integer_types)
+
+floating_types = st.sampled_from([
+ pa.float16(),
+ pa.float32(),
+ pa.float64()
+])
+decimal_type = st.builds(
+ pa.decimal128,
+ precision=st.integers(min_value=0, max_value=38),
+ scale=st.integers(min_value=0, max_value=38)
+)
+numeric_types = st.one_of(integer_types, floating_types, decimal_type)
+
+date_types = st.sampled_from([
+ pa.date32(),
+ pa.date64()
+])
+time_types = st.sampled_from([
+ pa.time32('s'),
+ pa.time32('ms'),
+ pa.time64('us'),
+ pa.time64('ns')
+])
+timestamp_types = st.sampled_from([
+ pa.timestamp('s'),
+ pa.timestamp('ms'),
+ pa.timestamp('us'),
+ pa.timestamp('ns')
+])
+temporal_types = st.one_of(date_types, time_types, timestamp_types)
+
+primitive_types = st.one_of(
+ null_type,
+ bool_type,
+ binary_type,
+ string_type,
+ numeric_types,
+ temporal_types
+)
+
+metadata = st.dictionaries(st.text(), st.text())
+
+
+@st.defines_strategy
+def fields(type_strategy=primitive_types):
+ return st.builds(pa.field, name=custom_text, type=type_strategy,
+ nullable=st.booleans(), metadata=metadata)
+
+
+@st.defines_strategy
+def list_types(item_strategy=primitive_types):
+ return st.builds(pa.list_, item_strategy)
+
+
+@st.defines_strategy
+def struct_types(item_strategy=primitive_types):
+ return st.builds(pa.struct, st.lists(fields(item_strategy)))
+
+
+@st.defines_strategy
+def complex_types(inner_strategy=primitive_types):
+ return list_types(inner_strategy) | struct_types(inner_strategy)
+
+
+@st.defines_strategy
+def nested_list_types(item_strategy=primitive_types):
+ return st.recursive(item_strategy, list_types)
+
+
+@st.defines_strategy
+def nested_struct_types(item_strategy=primitive_types):
+ return st.recursive(item_strategy, struct_types)
+
+
+@st.defines_strategy
+def nested_complex_types(inner_strategy=primitive_types):
+ return st.recursive(inner_strategy, complex_types)
+
+
+@st.defines_strategy
+def schemas(type_strategy=primitive_types):
+ return st.builds(pa.schema, st.lists(fields(type_strategy)))
+
+
+complex_schemas = schemas(complex_types())
+
+
+all_types = st.one_of(primitive_types, complex_types(), nested_complex_types())
+all_fields = fields(all_types)
+all_schemas = schemas(all_types)
diff --git a/python/pyarrow/tests/test_types.py b/python/pyarrow/tests/test_types.py
index 176ce87..310656d 100644
--- a/python/pyarrow/tests/test_types.py
+++ b/python/pyarrow/tests/test_types.py
@@ -19,11 +19,14 @@ from collections import OrderedDict
import pickle
import pytest
+import hypothesis as h
+import hypothesis.strategies as st
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.types as types
+import pyarrow.tests.strategies as past
def get_many_types():
@@ -466,15 +469,27 @@ def test_field_metadata():
def test_field_add_remove_metadata():
+ import collections
+
f0 = pa.field('foo', pa.int32())
assert f0.metadata is None
metadata = {b'foo': b'bar', b'pandas': b'badger'}
+ metadata2 = collections.OrderedDict([
+ (b'a', b'alpha'),
+ (b'b', b'beta')
+ ])
f1 = f0.add_metadata(metadata)
assert f1.metadata == metadata
+ f2 = f0.add_metadata(metadata2)
+ assert f2.metadata == metadata2
+
+ with pytest.raises(TypeError):
+ f0.add_metadata([1, 2, 3])
+
f3 = f1.remove_metadata()
assert f3.metadata is None
@@ -533,3 +548,38 @@ def test_schema_from_pandas(data):
schema = pa.Schema.from_pandas(df)
expected = pa.Table.from_pandas(df).schema
assert schema == expected
+
+
+@h.given(
+ past.all_types |
+ past.all_fields |
+ past.all_schemas
+)
+@h.example(
+ pa.field(name='', type=pa.null(), metadata={'0': '', '': ''})
+)
+def test_pickling(field):
+ data = pickle.dumps(field)
+ assert pickle.loads(data) == field
+
+
+@h.given(
+ st.lists(past.all_types) |
+ st.lists(past.all_fields) |
+ st.lists(past.all_schemas)
+)
+def test_hashing(items):
+ h.assume(
+ # well, this is still O(n^2), but makes the input unique
+ all(not a.equals(b) for i, a in enumerate(items) for b in items[:i])
+ )
+
+ container = {}
+ for i, item in enumerate(items):
+ assert hash(item) == hash(item)
+ container[item] = i
+
+ assert len(container) == len(items)
+
+ for i, item in enumerate(items):
+ assert container[item] == i
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 1ebd196..f69190c 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -430,11 +430,9 @@ cdef class Field:
@property
def metadata(self):
- cdef shared_ptr[const CKeyValueMetadata] metadata = (
- self.field.metadata())
- return box_metadata(metadata.get())
+ return pyarrow_wrap_metadata(self.field.metadata())
- def add_metadata(self, dict metadata):
+ def add_metadata(self, metadata):
"""
Add metadata as dict of string keys and values to Field
@@ -447,14 +445,18 @@ cdef class Field:
-------
field : pyarrow.Field
"""
- cdef shared_ptr[CKeyValueMetadata] c_meta
- convert_metadata(metadata, &c_meta)
+ cdef:
+ shared_ptr[CField] c_field
+ shared_ptr[CKeyValueMetadata] c_meta
- cdef shared_ptr[CField] new_field
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+
+ c_meta = pyarrow_unwrap_metadata(metadata)
with nogil:
- new_field = self.field.AddMetadata(c_meta)
+ c_field = self.field.AddMetadata(c_meta)
- return pyarrow_wrap_field(new_field)
+ return pyarrow_wrap_field(c_field)
def remove_metadata(self):
"""
@@ -515,6 +517,9 @@ cdef class Schema:
def __reduce__(self):
return schema, (list(self), self.metadata)
+ def __hash__(self):
+ return hash((tuple(self), self.metadata))
+
@property
def names(self):
"""
@@ -544,9 +549,7 @@ cdef class Schema:
@property
def metadata(self):
- cdef shared_ptr[const CKeyValueMetadata] metadata = (
- self.schema.metadata())
- return box_metadata(metadata.get())
+ return pyarrow_wrap_metadata(self.schema.metadata())
def __eq__(self, other):
try:
@@ -728,7 +731,7 @@ cdef class Schema:
return pyarrow_wrap_schema(new_schema)
- def add_metadata(self, dict metadata):
+ def add_metadata(self, metadata):
"""
Add metadata as dict of string keys and values to Schema
@@ -741,14 +744,18 @@ cdef class Schema:
-------
schema : pyarrow.Schema
"""
- cdef shared_ptr[CKeyValueMetadata] c_meta
- convert_metadata(metadata, &c_meta)
+ cdef:
+ shared_ptr[CKeyValueMetadata] c_meta
+ shared_ptr[CSchema] c_schema
- cdef shared_ptr[CSchema] new_schema
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+
+ c_meta = pyarrow_unwrap_metadata(metadata)
with nogil:
- new_schema = self.schema.AddMetadata(c_meta)
+ c_schema = self.schema.AddMetadata(c_meta)
- return pyarrow_wrap_schema(new_schema)
+ return pyarrow_wrap_schema(c_schema)
def serialize(self, memory_pool=None):
"""
@@ -810,15 +817,6 @@ cdef class Schema:
return self.__str__()
-cdef dict box_metadata(const CKeyValueMetadata* metadata):
- cdef unordered_map[c_string, c_string] result
- if metadata != nullptr:
- metadata.ToUnorderedMap(&result)
- return result
- else:
- return None
-
-
cdef dict _type_cache = {}
@@ -832,25 +830,12 @@ cdef DataType primitive_type(Type type):
_type_cache[type] = out
return out
+
# -----------------------------------------------------------
# Type factory functions
-cdef int convert_metadata(dict metadata,
- shared_ptr[CKeyValueMetadata]* out) except -1:
- cdef:
- shared_ptr[CKeyValueMetadata] meta = (
- make_shared[CKeyValueMetadata]())
- c_string key, value
-
- for py_key, py_value in metadata.items():
- key = tobytes(py_key)
- value = tobytes(py_value)
- meta.get().Append(key, value)
- out[0] = meta
- return 0
-
-def field(name, type, bint nullable=True, dict metadata=None):
+def field(name, type, bint nullable=True, metadata=None):
"""
Create a pyarrow.Field instance
@@ -867,17 +852,21 @@ def field(name, type, bint nullable=True, dict metadata=None):
field : pyarrow.Field
"""
cdef:
- shared_ptr[CKeyValueMetadata] c_meta
Field result = Field.__new__(Field)
DataType _type = ensure_type(type, allow_none=False)
+ shared_ptr[CKeyValueMetadata] c_meta
if metadata is not None:
- convert_metadata(metadata, &c_meta)
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+ c_meta = pyarrow_unwrap_metadata(metadata)
- result.sp_field.reset(new CField(tobytes(name), _type.sp_type,
- nullable == 1, c_meta))
+ result.sp_field.reset(
+ new CField(tobytes(name), _type.sp_type, nullable, c_meta)
+ )
result.field = result.sp_field.get()
result.type = _type
+
return result
@@ -1490,7 +1479,7 @@ cdef DataType ensure_type(object ty, c_bool allow_none=False):
raise TypeError('DataType expected, got {!r}'.format(type(ty)))
-def schema(fields, dict metadata=None):
+def schema(fields, metadata=None):
"""
Construct pyarrow.Schema from collection of fields
@@ -1535,11 +1524,14 @@ def schema(fields, dict metadata=None):
c_fields.push_back(py_field.sp_field)
if metadata is not None:
- convert_metadata(metadata, &c_meta)
+ if not isinstance(metadata, dict):
+ raise TypeError('Metadata must be an instance of dict')
+ c_meta = pyarrow_unwrap_metadata(metadata)
c_schema.reset(new CSchema(c_fields, c_meta))
result = Schema.__new__(Schema)
result.init_schema(c_schema)
+
return result
diff --git a/python/requirements-test.txt b/python/requirements-test.txt
new file mode 100644
index 0000000..482e888
--- /dev/null
+++ b/python/requirements-test.txt
@@ -0,0 +1,5 @@
+-r requirements.txt
+pandas
+pytest
+hypothesis
+pathlib2; python_version < "3.4"
diff --git a/python/requirements.txt b/python/requirements.txt
index ddedd75..3a23d1d 100644
--- a/python/requirements.txt
+++ b/python/requirements.txt
@@ -1,6 +1,3 @@
-six
-pytest
-cloudpickle>=0.4.0
-numpy>=1.14.0
-futures; python_version < "3"
-pathlib2; python_version < "3.4"
+six>=1.0.0
+numpy>=1.14
+futures; python_version < "3.2"
diff --git a/python/setup.py b/python/setup.py
index e6a8871..b8d192d 100755
--- a/python/setup.py
+++ b/python/setup.py
@@ -577,7 +577,8 @@ setup(
},
setup_requires=['setuptools_scm', 'cython >= 0.27'] + setup_requires,
install_requires=install_requires,
- tests_require=['pytest', 'pandas', 'pathlib2; python_version < "3.4"'],
+ tests_require=['pytest', 'pandas', 'hypothesis',
+ 'pathlib2; python_version < "3.4"'],
description="Python library for Apache Arrow",
long_description=long_description,
long_description_content_type="text/markdown",