Posted to commits@arrow.apache.org by ra...@apache.org on 2024/02/20 11:21:12 UTC

(arrow) branch maint-15.0.x created (now 0e9bd55b65)

This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a change to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git


      at 0e9bd55b65 GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output (#39804)

This branch includes the following new commits:

     new dec703aa28 GH-39313: [Python] Fix race condition in _pandas_api#_check_import (#39314)
     new 75c9e02934 GH-39525: [C++][Parquet] Pass memory pool to decoders (#39526)
     new 31ecb4ccb4 GH-39504: [Docs] Update footer in main sphinx docs with correct attribution (#39505)
     new 34b54f08cb GH-39577: [C++] Fix tail-word access cross buffer boundary in `CompareBinaryColumnToRow` (#39606)
     new fc586a091b GH-39599: [Python] Avoid leaking references to Numpy dtypes (#39636)
     new e388eb57df GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.1 (#39658)
     new ef4b4de21e GH-39583: [C++] Fix the issue of ExecBatchBuilder when appending consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (#39585)
     new ee7f54cd1b GH-39656: [Release] Update platform tags for macOS wheels to macosx_10_15 (#39657)
     new 881eec5142 GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) (#39383)
     new 914e62dca7 GH-39672: [Go] Time to Date32/Date64 conversion issues for non-UTC timezones (#39674)
     new 87eae3d4eb GH-39690: [C++][FlightRPC] Fix nullptr dereference in PollInfo (#39711)
     new 215bcf9a5f GH-38655: [C++] "iso_calendar" kernel returns incorrect results for array length > 32 (#39360)
     new 5e8e20de3c GH-39732: [Python][CI] Fix test failures with latest/nightly pandas (#39760)
     new faeab94e48 GH-39778: [C++] Fix tail-byte access cross buffer boundary in key hash avx2 (#39800)
     new 8d7d90a5ba GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 (#39528)
     new a1ada92d17 GH-39740: [C++] Fix filter and take kernel for month_day_nano intervals (#39795)
     new 66def3d6bf GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.* (#39758)
     new 26e957883c GH-39876: [C++] Thirdparty: Bump zlib to 1.3.1 (#39877)
     new ba6a8f5663 MINOR: [Python][CI] Add upper bound on pytest version (#39827)
     new aa172aa007 GH-39849: [Python] Remove the use of pytest-lazy-fixture (#39850)
     new 4f7819a743 GH-39880: [Python][CI] Pin moto<5 for dask integration tests (#39881)
     new 91be098b56 GH-39865: [C++] Strip extension metadata when importing a registered extension (#39866)
     new e19e1817b2 GH-39860: [C++] Expression ExecuteScalarExpression execute empty args function with a wrong result (#39908)
     new 0d6e95b490 GH-39737: [Release][Docs] Update post release documentation task (#39762)
     new 23a8991bee GH-39976: [C++] Fix out-of-line data size calculation in BinaryViewBuilder::AppendArraySlice (#39994)
     new 4b1153620a GH-40004: [Python][FlightRPC] Release GIL in GeneratorStream (#40005)
     new ecfc9979c6 GH-39916: [C#] Restore support for .NET 4.6.2 (#40008)
     new 0d0be3b5a0 GH-39999: [Python] Fix tests for pandas with CoW / nightly integration tests (#40000)
     new b59bec36b7 GH-40009: [C++] Add missing "#include <algorithm>" (#40010)
     new 0e9bd55b65 GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output (#39804)

The 30 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



(arrow) 24/30: GH-39737: [Release][Docs] Update post release documentation task (#39762)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 0d6e95b490c07256e24f6d2d01e0b6d010e980c0
Author: Alenka Frim <Al...@users.noreply.github.com>
AuthorDate: Tue Feb 6 01:54:13 2024 +0100

    GH-39737: [Release][Docs] Update post release documentation task (#39762)
    
    This PR updates the `dev/release/post-08-docs.sh` task so that
    
    - `DOCUMENTATION_OPTIONS.theme_switcher_version_match` changes from `""` to `"{previous_version}"`
    - `DOCUMENTATION_OPTIONS.show_version_warning_banner` changes from `false` to `true`
    
    for the documentation that is moved to a subfolder when a new major release is done.
    * Closes: #39737
    
    Lead-authored-by: AlenkaF <fr...@gmail.com>
    Co-authored-by: Alenka Frim <Al...@users.noreply.github.com>
    Co-authored-by: Raúl Cumplido <ra...@gmail.com>
    Co-authored-by: Sutou Kouhei <ko...@cozmixng.org>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 dev/release/post-08-docs.sh | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/dev/release/post-08-docs.sh b/dev/release/post-08-docs.sh
index f18f7d10c7..4df574700e 100755
--- a/dev/release/post-08-docs.sh
+++ b/dev/release/post-08-docs.sh
@@ -86,6 +86,21 @@ if [ "$is_major_release" = "yes" ] ; then
 fi
 git add docs
 git commit -m "[Website] Update documentations for ${version}"
+
+# Update DOCUMENTATION_OPTIONS.theme_switcher_version_match and
+# DOCUMENTATION_OPTIONS.show_version_warning_banner
+pushd docs/${previous_series}
+find ./ \
+  -type f \
+  -exec \
+    sed -i.bak \
+      -e "s/DOCUMENTATION_OPTIONS.theme_switcher_version_match = '';/DOCUMENTATION_OPTIONS.theme_switcher_version_match = '${previous_version}';/g" \
+      -e "s/DOCUMENTATION_OPTIONS.show_version_warning_banner = false/DOCUMENTATION_OPTIONS.show_version_warning_banner = true/g" \
+      {} \;
+find ./ -name '*.bak' -delete
+popd
+git add docs/${previous_series}
+git commit -m "[Website] Update warning banner for ${previous_series}"
 git clean -d -f -x
 popd
 


(arrow) 06/30: GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.1 (#39658)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit e388eb57dfc1417590364d8d84e67cbba1fa539b
Author: Alenka Frim <Al...@users.noreply.github.com>
AuthorDate: Wed Jan 17 16:58:48 2024 +0100

    GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.1 (#39658)
    
    The version warning banner in the documentation has the wrong version: it currently uses the `pydata-sphinx-theme` version instead of the Arrow dev version. Testing whether the upstream update in version `0.14.1` fixes this error.
    * Closes: #39640
    
    Authored-by: AlenkaF <fr...@gmail.com>
    Signed-off-by: AlenkaF <fr...@gmail.com>
---
 ci/conda_env_sphinx.txt | 2 +-
 docs/requirements.txt   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ci/conda_env_sphinx.txt b/ci/conda_env_sphinx.txt
index 0e50875fc1..d0f494d2e0 100644
--- a/ci/conda_env_sphinx.txt
+++ b/ci/conda_env_sphinx.txt
@@ -20,7 +20,7 @@ breathe
 doxygen
 ipython
 numpydoc
-pydata-sphinx-theme=0.14
+pydata-sphinx-theme=0.14.1
 sphinx-autobuild
 sphinx-design
 sphinx-copybutton
diff --git a/docs/requirements.txt b/docs/requirements.txt
index da2327a6df..aee2eb662c 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -5,7 +5,7 @@
 breathe
 ipython
 numpydoc
-pydata-sphinx-theme==0.14
+pydata-sphinx-theme==0.14.1
 sphinx-autobuild
 sphinx-design
 sphinx-copybutton


(arrow) 01/30: GH-39313: [Python] Fix race condition in _pandas_api#_check_import (#39314)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit dec703aa28347d5e8f6d568fb80bfb4a4c6aa79a
Author: Tom Jarosz <th...@c3.ai>
AuthorDate: Tue Jan 9 05:25:21 2024 -0800

    GH-39313: [Python] Fix race condition in _pandas_api#_check_import (#39314)
    
    ### Rationale for this change
    
    See:
    ```
        cdef inline bint _have_pandas_internal(self):
            if not self._tried_importing_pandas:
                self._check_import(raise_=False)
            return self._have_pandas
    ```
    
    The method `_check_import`:
    1) sets `_tried_importing_pandas` to true
    2) does some things which take time...
    3) sets `_have_pandas` to true (if we indeed do have pandas)
    
    Suppose thread 1 calls `_have_pandas_internal` and is still at step 2 when thread 2 calls `_have_pandas_internal`. Thread 2 may then incorrectly get False: thread 1 has already set `_tried_importing_pandas` to true but has not yet (though it will) set `_have_pandas` to True. Thread 1 itself will still get True.
    
    After my fix, `_have_pandas_internal` no longer returns an incorrect value in the scenario described above. At worst it performs a redundant, but (I believe) harmless, invocation of `_check_import`.
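    
    As an illustration (not part of the original patch), here is a minimal plain-Python sketch of the double-checked locking pattern the fix applies; the class and attribute names are invented for the example and are not the actual Cython `_PandasAPIShim`:
    
    ```
    from threading import Lock
    
    class _LazyPandas:
        # Illustrative sketch only -- not the real pyarrow shim.
        def __init__(self):
            self._lock = Lock()
            self._tried = False
            self._have = False
    
        def _import(self):
            try:
                import pandas  # noqa: F401
                self._have = True
            except ImportError:
                self._have = False
    
        def _check_import(self):
            # Double-checked locking: take the lock only on first use, and mark
            # "tried" only after the import result has been recorded, so another
            # thread can never observe _tried == True while _have is still stale.
            if not self._tried:
                with self._lock:
                    if not self._tried:
                        try:
                            self._import()
                        finally:
                            self._tried = True
    
        def have_pandas(self):
            self._check_import()
            return self._have
    ```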
    
    ### What changes are included in this PR?
    
    Changes the ordering of "trying to import pandas" and "recording that the pandas import has been tried".
    
    ### Are these changes tested?
    Yes, see the test committed with this PR.
    
    ### Are there any user-facing changes?
    
    This PR resolves a user-facing race condition https://github.com/apache/arrow/issues/39313
    * Closes: #39313
    
    Lead-authored-by: Thomas Jarosz <th...@c3.ai>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 python/pyarrow/pandas-shim.pxi      | 22 ++++++++++-------
 python/pyarrow/tests/arrow_39313.py | 47 +++++++++++++++++++++++++++++++++++++
 python/pyarrow/tests/test_pandas.py |  6 +++++
 3 files changed, 67 insertions(+), 8 deletions(-)

diff --git a/python/pyarrow/pandas-shim.pxi b/python/pyarrow/pandas-shim.pxi
index 273575b779..0409e133ad 100644
--- a/python/pyarrow/pandas-shim.pxi
+++ b/python/pyarrow/pandas-shim.pxi
@@ -18,6 +18,7 @@
 # pandas lazy-loading API shim that reduces API call and import overhead
 
 import warnings
+from threading import Lock
 
 
 cdef class _PandasAPIShim(object):
@@ -34,12 +35,13 @@ cdef class _PandasAPIShim(object):
         object _pd, _types_api, _compat_module
         object _data_frame, _index, _series, _categorical_type
         object _datetimetz_type, _extension_array, _extension_dtype
-        object _array_like_types, _is_extension_array_dtype
+        object _array_like_types, _is_extension_array_dtype, _lock
         bint has_sparse
         bint _pd024
         bint _is_v1, _is_ge_v21
 
     def __init__(self):
+        self._lock = Lock()
         self._tried_importing_pandas = False
         self._have_pandas = 0
 
@@ -96,13 +98,17 @@ cdef class _PandasAPIShim(object):
         self.has_sparse = False
 
     cdef inline _check_import(self, bint raise_=True):
-        if self._tried_importing_pandas:
-            if not self._have_pandas and raise_:
-                self._import_pandas(raise_)
-            return
-
-        self._tried_importing_pandas = True
-        self._import_pandas(raise_)
+        if not self._tried_importing_pandas:
+            with self._lock:
+                if not self._tried_importing_pandas:
+                    try:
+                        self._import_pandas(raise_)
+                    finally:
+                        self._tried_importing_pandas = True
+                    return
+
+        if not self._have_pandas and raise_:
+            self._import_pandas(raise_)
 
     def series(self, *args, **kwargs):
         self._check_import()
diff --git a/python/pyarrow/tests/arrow_39313.py b/python/pyarrow/tests/arrow_39313.py
new file mode 100644
index 0000000000..1e769f49d9
--- /dev/null
+++ b/python/pyarrow/tests/arrow_39313.py
@@ -0,0 +1,47 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# This file is called from a test in test_pandas.py.
+
+from threading import Thread
+
+import pandas as pd
+from pyarrow.pandas_compat import _pandas_api
+
+if __name__ == "__main__":
+    wait = True
+    num_threads = 10
+    df = pd.DataFrame()
+    results = []
+
+    def rc():
+        while wait:
+            pass
+        results.append(_pandas_api.is_data_frame(df))
+
+    threads = [Thread(target=rc) for _ in range(num_threads)]
+
+    for t in threads:
+        t.start()
+
+    wait = False
+
+    for t in threads:
+        t.join()
+
+    assert len(results) == num_threads
+    assert all(results), "`is_data_frame` returned False when given a DataFrame"
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 3353bebce7..d15ee82d5d 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -34,6 +34,7 @@ import pytest
 from pyarrow.pandas_compat import get_logical_type, _pandas_api
 from pyarrow.tests.util import invoke_script, random_ascii, rands
 import pyarrow.tests.strategies as past
+import pyarrow.tests.util as test_util
 from pyarrow.vendored.version import Version
 
 import pyarrow as pa
@@ -5008,3 +5009,8 @@ def test_nested_chunking_valid():
     schema = pa.schema([("maps", map_type)])
     roundtrip(pd.DataFrame({"maps": [map_of_los, map_of_los, map_of_los]}),
               schema=schema)
+
+
+def test_is_data_frame_race_condition():
+    # See https://github.com/apache/arrow/issues/39313
+    test_util.invoke_script('arrow_39313.py')


(arrow) 19/30: MINOR: [Python][CI] Add upper bound on pytest version (#39827)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ba6a8f56635fedd8e1d2604664b9dddccc11ddc0
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Mon Jan 29 17:41:47 2024 +0100

    MINOR: [Python][CI] Add upper bound on pytest version (#39827)
    
    ### Rationale for this change
    
    The PyArrow test suite relies on the pytest-lazy-fixture plugin, which breaks on pytest 8.0.0: https://github.com/TvoroG/pytest-lazy-fixture/issues/65
    
    ### What changes are included in this PR?
    
    Avoid installing pytest 8 on CI builds, by putting an upper bound on the pytest version.
    
    ### Are these changes tested?
    
    Yes, by construction.
    
    ### Are there any user-facing changes?
    
    No.
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 ci/conda_env_python.txt            | 2 +-
 python/requirements-test.txt       | 2 +-
 python/requirements-wheel-test.txt | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/ci/conda_env_python.txt b/ci/conda_env_python.txt
index 9720344212..5fdd21d2bd 100644
--- a/ci/conda_env_python.txt
+++ b/ci/conda_env_python.txt
@@ -23,7 +23,7 @@ cloudpickle
 fsspec
 hypothesis
 numpy>=1.16.6
-pytest
+pytest<8  # pytest-lazy-fixture broken on pytest 8.0.0
 pytest-faulthandler
 pytest-lazy-fixture
 s3fs>=2023.10.0
diff --git a/python/requirements-test.txt b/python/requirements-test.txt
index 9f07e5c57b..b3ba5d852b 100644
--- a/python/requirements-test.txt
+++ b/python/requirements-test.txt
@@ -1,6 +1,6 @@
 cffi
 hypothesis
 pandas
-pytest
+pytest<8
 pytest-lazy-fixture
 pytz
diff --git a/python/requirements-wheel-test.txt b/python/requirements-wheel-test.txt
index 516ec0fccc..c74a8ca690 100644
--- a/python/requirements-wheel-test.txt
+++ b/python/requirements-wheel-test.txt
@@ -1,7 +1,7 @@
 cffi
 cython
 hypothesis
-pytest
+pytest<8
 pytest-lazy-fixture
 pytz
 tzdata; sys_platform == 'win32'


(arrow) 25/30: GH-39976: [C++] Fix out-of-line data size calculation in BinaryViewBuilder::AppendArraySlice (#39994)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 23a8991bee10900f10d25f0c2d43c47c11ea20e0
Author: Rossi Sun <za...@gmail.com>
AuthorDate: Fri Feb 9 00:05:50 2024 +0800

    GH-39976: [C++] Fix out-of-line data size calculation in BinaryViewBuilder::AppendArraySlice (#39994)
    
    
    
    ### Rationale for this change
    
    Fix the bug in `BinaryViewBuilder::AppendArraySlice` where the out-of-line data size calculation iterated the array incorrectly (over the whole array instead of the requested slice).
    
    ### What changes are included in this PR?
    
    Fix and UT.
    
    ### Are these changes tested?
    
    UT included.
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39976
    
    Authored-by: Ruoxi Sun <za...@gmail.com>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/array/array_test.cc     | 23 +++++++++++++++++++++++
 cpp/src/arrow/array/builder_binary.cc |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/array/array_test.cc b/cpp/src/arrow/array/array_test.cc
index e9d478f108..21ac1a09f5 100644
--- a/cpp/src/arrow/array/array_test.cc
+++ b/cpp/src/arrow/array/array_test.cc
@@ -905,6 +905,29 @@ TEST_F(TestArray, TestAppendArraySlice) {
   }
 }
 
+// GH-39976: Test out-of-line data size calculation in
+// BinaryViewBuilder::AppendArraySlice.
+TEST_F(TestArray, TestBinaryViewAppendArraySlice) {
+  BinaryViewBuilder src_builder(pool_);
+  ASSERT_OK(src_builder.AppendNull());
+  ASSERT_OK(src_builder.Append("long string; not inlined"));
+  ASSERT_EQ(2, src_builder.length());
+  ASSERT_OK_AND_ASSIGN(auto src, src_builder.Finish());
+  ASSERT_OK(src->ValidateFull());
+
+  ArraySpan span;
+  span.SetMembers(*src->data());
+  BinaryViewBuilder dst_builder(pool_);
+  ASSERT_OK(dst_builder.AppendArraySlice(span, 0, 1));
+  ASSERT_EQ(1, dst_builder.length());
+  ASSERT_OK(dst_builder.AppendArraySlice(span, 1, 1));
+  ASSERT_EQ(2, dst_builder.length());
+  ASSERT_OK_AND_ASSIGN(auto dst, dst_builder.Finish());
+  ASSERT_OK(dst->ValidateFull());
+
+  AssertArraysEqual(*src, *dst);
+}
+
 TEST_F(TestArray, ValidateBuffersPrimitive) {
   auto empty_buffer = std::make_shared<Buffer>("");
   auto null_buffer = Buffer::FromString("\xff");
diff --git a/cpp/src/arrow/array/builder_binary.cc b/cpp/src/arrow/array/builder_binary.cc
index f85852fa0e..7e5721917f 100644
--- a/cpp/src/arrow/array/builder_binary.cc
+++ b/cpp/src/arrow/array/builder_binary.cc
@@ -54,7 +54,7 @@ Status BinaryViewBuilder::AppendArraySlice(const ArraySpan& array, int64_t offse
 
   int64_t out_of_line_total = 0, i = 0;
   VisitNullBitmapInline(
-      array.buffers[0].data, array.offset, array.length, array.null_count,
+      array.buffers[0].data, array.offset + offset, length, array.null_count,
       [&] {
         if (!values[i].is_inline()) {
           out_of_line_total += static_cast<int64_t>(values[i].size());


(arrow) 05/30: GH-39599: [Python] Avoid leaking references to Numpy dtypes (#39636)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit fc586a091be82a19bb5b7f9f0a23ae7bb4e36ea6
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Wed Jan 17 11:26:37 2024 +0100

    GH-39599: [Python] Avoid leaking references to Numpy dtypes (#39636)
    
    ### Rationale for this change
    
    `PyArray_DescrFromScalar` returns a new reference, so we should be careful to decref it when we don't use it anymore.
    
    ### Are these changes tested?
    
    No.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #39599
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 python/pyarrow/array.pxi                           |  3 +-
 python/pyarrow/includes/libarrow_python.pxd        |  2 +-
 python/pyarrow/src/arrow/python/inference.cc       |  5 +-
 python/pyarrow/src/arrow/python/numpy_convert.cc   | 77 ++++++++++------------
 python/pyarrow/src/arrow/python/numpy_convert.h    |  6 +-
 python/pyarrow/src/arrow/python/numpy_to_arrow.cc  | 11 ++--
 python/pyarrow/src/arrow/python/python_to_arrow.cc |  6 +-
 python/pyarrow/types.pxi                           |  6 +-
 8 files changed, 48 insertions(+), 68 deletions(-)

diff --git a/python/pyarrow/array.pxi b/python/pyarrow/array.pxi
index 5c2d22aef1..1416f5f434 100644
--- a/python/pyarrow/array.pxi
+++ b/python/pyarrow/array.pxi
@@ -66,8 +66,7 @@ cdef shared_ptr[CDataType] _ndarray_to_type(object values,
     dtype = values.dtype
 
     if type is None and dtype != object:
-        with nogil:
-            check_status(NumPyDtypeToArrow(dtype, &c_type))
+        c_type = GetResultValue(NumPyDtypeToArrow(dtype))
 
     if type is not None:
         c_type = type.sp_type
diff --git a/python/pyarrow/includes/libarrow_python.pxd b/python/pyarrow/includes/libarrow_python.pxd
index e3179062a1..906f0b7d28 100644
--- a/python/pyarrow/includes/libarrow_python.pxd
+++ b/python/pyarrow/includes/libarrow_python.pxd
@@ -73,7 +73,7 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil:
         object obj, object mask, const PyConversionOptions& options,
         CMemoryPool* pool)
 
-    CStatus NumPyDtypeToArrow(object dtype, shared_ptr[CDataType]* type)
+    CResult[shared_ptr[CDataType]] NumPyDtypeToArrow(object dtype)
 
     CStatus NdarrayToArrow(CMemoryPool* pool, object ao, object mo,
                            c_bool from_pandas,
diff --git a/python/pyarrow/src/arrow/python/inference.cc b/python/pyarrow/src/arrow/python/inference.cc
index 9537aec574..10116f9afa 100644
--- a/python/pyarrow/src/arrow/python/inference.cc
+++ b/python/pyarrow/src/arrow/python/inference.cc
@@ -468,10 +468,7 @@ class TypeInferrer {
     if (numpy_dtype_count_ > 0) {
       // All NumPy scalars and Nones/nulls
       if (numpy_dtype_count_ + none_count_ == total_count_) {
-        std::shared_ptr<DataType> type;
-        RETURN_NOT_OK(NumPyDtypeToArrow(numpy_unifier_.current_dtype(), &type));
-        *out = type;
-        return Status::OK();
+        return NumPyDtypeToArrow(numpy_unifier_.current_dtype()).Value(out);
       }
 
       // The "bad path": data contains a mix of NumPy scalars and
diff --git a/python/pyarrow/src/arrow/python/numpy_convert.cc b/python/pyarrow/src/arrow/python/numpy_convert.cc
index 4970680764..dfee88c092 100644
--- a/python/pyarrow/src/arrow/python/numpy_convert.cc
+++ b/python/pyarrow/src/arrow/python/numpy_convert.cc
@@ -59,12 +59,11 @@ NumPyBuffer::~NumPyBuffer() {
 
 #define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \
   case NPY_##NPY_NAME:                        \
-    *out = FACTORY();                         \
-    break;
+    return FACTORY();
 
 namespace {
 
-Status GetTensorType(PyObject* dtype, std::shared_ptr<DataType>* out) {
+Result<std::shared_ptr<DataType>> GetTensorType(PyObject* dtype) {
   if (!PyObject_TypeCheck(dtype, &PyArrayDescr_Type)) {
     return Status::TypeError("Did not pass numpy.dtype object");
   }
@@ -84,11 +83,8 @@ Status GetTensorType(PyObject* dtype, std::shared_ptr<DataType>* out) {
     TO_ARROW_TYPE_CASE(FLOAT16, float16);
     TO_ARROW_TYPE_CASE(FLOAT32, float32);
     TO_ARROW_TYPE_CASE(FLOAT64, float64);
-    default: {
-      return Status::NotImplemented("Unsupported numpy type ", descr->type_num);
-    }
   }
-  return Status::OK();
+  return Status::NotImplemented("Unsupported numpy type ", descr->type_num);
 }
 
 Status GetNumPyType(const DataType& type, int* type_num) {
@@ -120,15 +116,21 @@ Status GetNumPyType(const DataType& type, int* type_num) {
 
 }  // namespace
 
-Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr<DataType>* out) {
+Result<std::shared_ptr<DataType>> NumPyScalarToArrowDataType(PyObject* scalar) {
+  PyArray_Descr* descr = PyArray_DescrFromScalar(scalar);
+  OwnedRef descr_ref(reinterpret_cast<PyObject*>(descr));
+  return NumPyDtypeToArrow(descr);
+}
+
+Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyObject* dtype) {
   if (!PyObject_TypeCheck(dtype, &PyArrayDescr_Type)) {
     return Status::TypeError("Did not pass numpy.dtype object");
   }
   PyArray_Descr* descr = reinterpret_cast<PyArray_Descr*>(dtype);
-  return NumPyDtypeToArrow(descr, out);
+  return NumPyDtypeToArrow(descr);
 }
 
-Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr<DataType>* out) {
+Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyArray_Descr* descr) {
   int type_num = fix_numpy_type_num(descr->type_num);
 
   switch (type_num) {
@@ -151,20 +153,15 @@ Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr<DataType>* out) {
           reinterpret_cast<PyArray_DatetimeDTypeMetaData*>(descr->c_metadata);
       switch (date_dtype->meta.base) {
         case NPY_FR_s:
-          *out = timestamp(TimeUnit::SECOND);
-          break;
+          return timestamp(TimeUnit::SECOND);
         case NPY_FR_ms:
-          *out = timestamp(TimeUnit::MILLI);
-          break;
+          return timestamp(TimeUnit::MILLI);
         case NPY_FR_us:
-          *out = timestamp(TimeUnit::MICRO);
-          break;
+          return timestamp(TimeUnit::MICRO);
         case NPY_FR_ns:
-          *out = timestamp(TimeUnit::NANO);
-          break;
+          return timestamp(TimeUnit::NANO);
         case NPY_FR_D:
-          *out = date32();
-          break;
+          return date32();
         case NPY_FR_GENERIC:
           return Status::NotImplemented("Unbound or generic datetime64 time unit");
         default:
@@ -176,29 +173,22 @@ Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr<DataType>* out) {
           reinterpret_cast<PyArray_DatetimeDTypeMetaData*>(descr->c_metadata);
       switch (timedelta_dtype->meta.base) {
         case NPY_FR_s:
-          *out = duration(TimeUnit::SECOND);
-          break;
+          return duration(TimeUnit::SECOND);
         case NPY_FR_ms:
-          *out = duration(TimeUnit::MILLI);
-          break;
+          return duration(TimeUnit::MILLI);
         case NPY_FR_us:
-          *out = duration(TimeUnit::MICRO);
-          break;
+          return duration(TimeUnit::MICRO);
         case NPY_FR_ns:
-          *out = duration(TimeUnit::NANO);
-          break;
+          return duration(TimeUnit::NANO);
         case NPY_FR_GENERIC:
           return Status::NotImplemented("Unbound or generic timedelta64 time unit");
         default:
           return Status::NotImplemented("Unsupported timedelta64 time unit");
       }
     } break;
-    default: {
-      return Status::NotImplemented("Unsupported numpy type ", descr->type_num);
-    }
   }
 
-  return Status::OK();
+  return Status::NotImplemented("Unsupported numpy type ", descr->type_num);
 }
 
 #undef TO_ARROW_TYPE_CASE
@@ -230,9 +220,8 @@ Status NdarrayToTensor(MemoryPool* pool, PyObject* ao,
     strides[i] = array_strides[i];
   }
 
-  std::shared_ptr<DataType> type;
-  RETURN_NOT_OK(
-      GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray)), &type));
+  ARROW_ASSIGN_OR_RAISE(
+      auto type, GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray))));
   *out = std::make_shared<Tensor>(type, data, shape, strides, dim_names);
   return Status::OK();
 }
@@ -435,9 +424,9 @@ Status NdarraysToSparseCOOTensor(MemoryPool* pool, PyObject* data_ao, PyObject*
 
   PyArrayObject* ndarray_data = reinterpret_cast<PyArrayObject*>(data_ao);
   std::shared_ptr<Buffer> data = std::make_shared<NumPyBuffer>(data_ao);
-  std::shared_ptr<DataType> type_data;
-  RETURN_NOT_OK(GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data)),
-                              &type_data));
+  ARROW_ASSIGN_OR_RAISE(
+      auto type_data,
+      GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data))));
 
   std::shared_ptr<Tensor> coords;
   RETURN_NOT_OK(NdarrayToTensor(pool, coords_ao, {}, &coords));
@@ -462,9 +451,9 @@ Status NdarraysToSparseCSXMatrix(MemoryPool* pool, PyObject* data_ao, PyObject*
 
   PyArrayObject* ndarray_data = reinterpret_cast<PyArrayObject*>(data_ao);
   std::shared_ptr<Buffer> data = std::make_shared<NumPyBuffer>(data_ao);
-  std::shared_ptr<DataType> type_data;
-  RETURN_NOT_OK(GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data)),
-                              &type_data));
+  ARROW_ASSIGN_OR_RAISE(
+      auto type_data,
+      GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data))));
 
   std::shared_ptr<Tensor> indptr, indices;
   RETURN_NOT_OK(NdarrayToTensor(pool, indptr_ao, {}, &indptr));
@@ -491,9 +480,9 @@ Status NdarraysToSparseCSFTensor(MemoryPool* pool, PyObject* data_ao, PyObject*
   const int ndim = static_cast<const int>(shape.size());
   PyArrayObject* ndarray_data = reinterpret_cast<PyArrayObject*>(data_ao);
   std::shared_ptr<Buffer> data = std::make_shared<NumPyBuffer>(data_ao);
-  std::shared_ptr<DataType> type_data;
-  RETURN_NOT_OK(GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data)),
-                              &type_data));
+  ARROW_ASSIGN_OR_RAISE(
+      auto type_data,
+      GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray_data))));
 
   std::vector<std::shared_ptr<Tensor>> indptr(ndim - 1);
   std::vector<std::shared_ptr<Tensor>> indices(ndim);
diff --git a/python/pyarrow/src/arrow/python/numpy_convert.h b/python/pyarrow/src/arrow/python/numpy_convert.h
index 10451077a2..2d1086e135 100644
--- a/python/pyarrow/src/arrow/python/numpy_convert.h
+++ b/python/pyarrow/src/arrow/python/numpy_convert.h
@@ -49,9 +49,11 @@ class ARROW_PYTHON_EXPORT NumPyBuffer : public Buffer {
 };
 
 ARROW_PYTHON_EXPORT
-Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr<DataType>* out);
+Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyObject* dtype);
 ARROW_PYTHON_EXPORT
-Status NumPyDtypeToArrow(PyArray_Descr* descr, std::shared_ptr<DataType>* out);
+Result<std::shared_ptr<DataType>> NumPyDtypeToArrow(PyArray_Descr* descr);
+ARROW_PYTHON_EXPORT
+Result<std::shared_ptr<DataType>> NumPyScalarToArrowDataType(PyObject* scalar);
 
 ARROW_PYTHON_EXPORT Status NdarrayToTensor(MemoryPool* pool, PyObject* ao,
                                            const std::vector<std::string>& dim_names,
diff --git a/python/pyarrow/src/arrow/python/numpy_to_arrow.cc b/python/pyarrow/src/arrow/python/numpy_to_arrow.cc
index 2727ce32f4..8903df31be 100644
--- a/python/pyarrow/src/arrow/python/numpy_to_arrow.cc
+++ b/python/pyarrow/src/arrow/python/numpy_to_arrow.cc
@@ -462,8 +462,7 @@ template <typename ArrowType>
 inline Status NumPyConverter::ConvertData(std::shared_ptr<Buffer>* data) {
   RETURN_NOT_OK(PrepareInputData<ArrowType>(data));
 
-  std::shared_ptr<DataType> input_type;
-  RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
+  ARROW_ASSIGN_OR_RAISE(auto input_type, NumPyDtypeToArrow(dtype_));
 
   if (!input_type->Equals(*type_)) {
     RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_, type_,
@@ -490,7 +489,7 @@ inline Status NumPyConverter::ConvertData<Date32Type>(std::shared_ptr<Buffer>* d
       Status s = StaticCastBuffer<int64_t, int32_t>(**data, length_, pool_, data);
       RETURN_NOT_OK(s);
     } else {
-      RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
+      ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_));
       if (!input_type->Equals(*type_)) {
         // The null bitmap was already computed in VisitNative()
         RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_,
@@ -498,7 +497,7 @@ inline Status NumPyConverter::ConvertData<Date32Type>(std::shared_ptr<Buffer>* d
       }
     }
   } else {
-    RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
+    ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_));
     if (!input_type->Equals(*type_)) {
       RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_,
                                type_, cast_options_, pool_, data));
@@ -531,7 +530,7 @@ inline Status NumPyConverter::ConvertData<Date64Type>(std::shared_ptr<Buffer>* d
       }
       *data = std::move(result);
     } else {
-      RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
+      ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_));
       if (!input_type->Equals(*type_)) {
         // The null bitmap was already computed in VisitNative()
         RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_,
@@ -539,7 +538,7 @@ inline Status NumPyConverter::ConvertData<Date64Type>(std::shared_ptr<Buffer>* d
       }
     }
   } else {
-    RETURN_NOT_OK(NumPyDtypeToArrow(reinterpret_cast<PyObject*>(dtype_), &input_type));
+    ARROW_ASSIGN_OR_RAISE(input_type, NumPyDtypeToArrow(dtype_));
     if (!input_type->Equals(*type_)) {
       RETURN_NOT_OK(CastBuffer(input_type, *data, length_, null_bitmap_, null_count_,
                                type_, cast_options_, pool_, data));
diff --git a/python/pyarrow/src/arrow/python/python_to_arrow.cc b/python/pyarrow/src/arrow/python/python_to_arrow.cc
index 23b92598e3..d1d94ac17a 100644
--- a/python/pyarrow/src/arrow/python/python_to_arrow.cc
+++ b/python/pyarrow/src/arrow/python/python_to_arrow.cc
@@ -386,8 +386,7 @@ class PyValue {
       }
     } else if (PyArray_CheckAnyScalarExact(obj)) {
       // validate that the numpy scalar has np.datetime64 dtype
-      std::shared_ptr<DataType> numpy_type;
-      RETURN_NOT_OK(NumPyDtypeToArrow(PyArray_DescrFromScalar(obj), &numpy_type));
+      ARROW_ASSIGN_OR_RAISE(auto numpy_type, NumPyScalarToArrowDataType(obj));
       if (!numpy_type->Equals(*type)) {
         return Status::NotImplemented("Expected np.datetime64 but got: ",
                                       numpy_type->ToString());
@@ -466,8 +465,7 @@ class PyValue {
       }
     } else if (PyArray_CheckAnyScalarExact(obj)) {
       // validate that the numpy scalar has np.datetime64 dtype
-      std::shared_ptr<DataType> numpy_type;
-      RETURN_NOT_OK(NumPyDtypeToArrow(PyArray_DescrFromScalar(obj), &numpy_type));
+      ARROW_ASSIGN_OR_RAISE(auto numpy_type, NumPyScalarToArrowDataType(obj));
       if (!numpy_type->Equals(*type)) {
         return Status::NotImplemented("Expected np.timedelta64 but got: ",
                                       numpy_type->ToString());
diff --git a/python/pyarrow/types.pxi b/python/pyarrow/types.pxi
index 912ee39f7d..b6dc53d633 100644
--- a/python/pyarrow/types.pxi
+++ b/python/pyarrow/types.pxi
@@ -5140,12 +5140,8 @@ def from_numpy_dtype(object dtype):
     >>> pa.from_numpy_dtype(np.str_)
     DataType(string)
     """
-    cdef shared_ptr[CDataType] c_type
     dtype = np.dtype(dtype)
-    with nogil:
-        check_status(NumPyDtypeToArrow(dtype, &c_type))
-
-    return pyarrow_wrap_data_type(c_type)
+    return pyarrow_wrap_data_type(GetResultValue(NumPyDtypeToArrow(dtype)))
 
 
 def is_boolean_value(object obj):


(arrow) 12/30: GH-38655: [C++] "iso_calendar" kernel returns incorrect results for array length > 32 (#39360)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 215bcf9a5fc2a729bc4bc142f7a01a8a0dfd670b
Author: Rok Mihevc <ro...@mihevc.org>
AuthorDate: Tue Jan 23 12:43:05 2024 +0100

    GH-38655: [C++] "iso_calendar" kernel returns incorrect results for array length > 32 (#39360)
    
    ### Rationale for this change
    
    When defining `StructArray`'s field builders for `ISOCalendar`, we don't pre-allocate enough memory before using unsafe append. This causes the resulting array to be at most 32 rows long.
    
    ### What changes are included in this PR?
    
    This introduces the required memory pre-allocation in the `ISOCalendar` C++ kernel.
    
    ### Are these changes tested?
    
    This adds a test for the Python wrapper.
    
    ### Are there any user-facing changes?
    
    Fixes the behavior of the `iso_calendar` kernel.
    * Closes: #38655
    
    Lead-authored-by: Rok Mihevc <ro...@mihevc.org>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc |  2 +-
 python/pyarrow/tests/test_compute.py                   | 13 +++++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc b/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc
index a88ce38936..f49e201492 100644
--- a/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc
@@ -1510,7 +1510,7 @@ struct ISOCalendar {
     for (int i = 0; i < 3; i++) {
       field_builders.push_back(
           checked_cast<BuilderType*>(struct_builder->field_builder(i)));
-      RETURN_NOT_OK(field_builders[i]->Reserve(1));
+      RETURN_NOT_OK(field_builders[i]->Reserve(in.length));
     }
     auto visit_null = [&]() { return struct_builder->AppendNull(); };
     std::function<Status(typename InType::c_type arg)> visit_value;
diff --git a/python/pyarrow/tests/test_compute.py b/python/pyarrow/tests/test_compute.py
index 7c5a134d33..9ceb2fd730 100644
--- a/python/pyarrow/tests/test_compute.py
+++ b/python/pyarrow/tests/test_compute.py
@@ -2255,6 +2255,19 @@ def test_extract_datetime_components():
             _check_datetime_components(timestamps, timezone)
 
 
+@pytest.mark.parametrize("unit", ["s", "ms", "us", "ns"])
+def test_iso_calendar_longer_array(unit):
+    # https://github.com/apache/arrow/issues/38655
+    # ensure correct result for array length > 32
+    arr = pa.array([datetime.datetime(2022, 1, 2, 9)]*50, pa.timestamp(unit))
+    result = pc.iso_calendar(arr)
+    expected = pa.StructArray.from_arrays(
+        [[2021]*50, [52]*50, [7]*50],
+        names=['iso_year', 'iso_week', 'iso_day_of_week']
+    )
+    assert result.equals(expected)
+
+
 @pytest.mark.pandas
 @pytest.mark.skipif(sys.platform == "win32" and not util.windows_has_tzdata(),
                     reason="Timezone database is not installed on Windows")


(arrow) 27/30: GH-39916: [C#] Restore support for .NET 4.6.2 (#40008)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ecfc9979c6ed6884b35c8f4d659e0721c109ad32
Author: Curt Hagenlocher <cu...@hagenlocher.org>
AuthorDate: Thu Feb 8 14:26:06 2024 -0800

    GH-39916: [C#] Restore support for .NET 4.6.2 (#40008)
    
    ### What changes are included in this PR?
    
    Project targets have been added for net462, which is still in support. A few tests have been modified to allow them to build against that target.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    There are new build artifacts for Apache.Arrow.dll and Apache.Arrow.Compression.dll.
    
    * Closes: #39916
    
    Authored-by: Curt Hagenlocher <cu...@hagenlocher.org>
    Signed-off-by: Curt Hagenlocher <cu...@hagenlocher.org>
---
 .../Apache.Arrow.Compression/Apache.Arrow.Compression.csproj |  8 +++++++-
 csharp/src/Apache.Arrow/Apache.Arrow.csproj                  | 12 +++++++++---
 .../Apache.Arrow/Extensions/TupleExtensions.netstandard.cs   |  7 +++++++
 csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj     |  2 +-
 csharp/test/Apache.Arrow.Tests/BinaryArrayBuilderTests.cs    |  8 ++++----
 5 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/csharp/src/Apache.Arrow.Compression/Apache.Arrow.Compression.csproj b/csharp/src/Apache.Arrow.Compression/Apache.Arrow.Compression.csproj
index fded629112..6988567193 100644
--- a/csharp/src/Apache.Arrow.Compression/Apache.Arrow.Compression.csproj
+++ b/csharp/src/Apache.Arrow.Compression/Apache.Arrow.Compression.csproj
@@ -1,10 +1,16 @@
 <Project Sdk="Microsoft.NET.Sdk">
 
   <PropertyGroup>
-    <TargetFramework>netstandard2.0</TargetFramework>
     <Description>Provides decompression support for the Arrow IPC format</Description>
   </PropertyGroup>
 
+  <PropertyGroup Condition="'$(IsWindows)'=='true'">
+    <TargetFrameworks>netstandard2.0;net462</TargetFrameworks>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(IsWindows)'!='true'">
+    <TargetFrameworks>netstandard2.0</TargetFrameworks>
+  </PropertyGroup>
+
   <ItemGroup>
     <PackageReference Include="K4os.Compression.LZ4.Streams" Version="1.3.6" />
     <PackageReference Include="ZstdSharp.Port" Version="0.7.3" />
diff --git a/csharp/src/Apache.Arrow/Apache.Arrow.csproj b/csharp/src/Apache.Arrow/Apache.Arrow.csproj
index 3a229f4ffc..c4bb64b73a 100644
--- a/csharp/src/Apache.Arrow/Apache.Arrow.csproj
+++ b/csharp/src/Apache.Arrow/Apache.Arrow.csproj
@@ -1,14 +1,20 @@
 <Project Sdk="Microsoft.NET.Sdk">
 
   <PropertyGroup>
-    <TargetFrameworks>netstandard2.0;net6.0</TargetFrameworks>
     <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
     <DefineConstants>$(DefineConstants);UNSAFE_BYTEBUFFER;BYTEBUFFER_NO_BOUNDS_CHECK;ENABLE_SPAN_T</DefineConstants>
     
     <Description>Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.</Description>
   </PropertyGroup>
 
-  <ItemGroup Condition="'$(TargetFrameworkIdentifier)' == '.NETStandard'">
+  <PropertyGroup Condition="'$(IsWindows)'=='true'">
+    <TargetFrameworks>netstandard2.0;net6.0;net462</TargetFrameworks>
+  </PropertyGroup>
+  <PropertyGroup Condition="'$(IsWindows)'!='true'">
+    <TargetFrameworks>netstandard2.0;net6.0</TargetFrameworks>
+  </PropertyGroup>
+
+  <ItemGroup Condition="'$(TargetFrameworkIdentifier)' == '.NETStandard' or '$(TargetFramework)' == 'net462'">
     <PackageReference Include="System.Buffers" Version="4.5.1" />
     <PackageReference Include="System.Memory" Version="4.5.5" />
     <PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="4.7.1" />
@@ -34,7 +40,7 @@
     </EmbeddedResource>
   </ItemGroup>
 
-  <ItemGroup Condition="'$(TargetFrameworkIdentifier)' == '.NETStandard'">
+  <ItemGroup Condition="'$(TargetFrameworkIdentifier)' == '.NETStandard' or '$(TargetFramework)' == 'net462'">
     <Compile Remove="Extensions\StreamExtensions.netcoreapp.cs" />
   </ItemGroup>
   <ItemGroup Condition="'$(TargetFrameworkIdentifier)' == '.NETCoreApp'">
diff --git a/csharp/src/Apache.Arrow/Extensions/TupleExtensions.netstandard.cs b/csharp/src/Apache.Arrow/Extensions/TupleExtensions.netstandard.cs
index fe42075f14..e0e0f57070 100644
--- a/csharp/src/Apache.Arrow/Extensions/TupleExtensions.netstandard.cs
+++ b/csharp/src/Apache.Arrow/Extensions/TupleExtensions.netstandard.cs
@@ -25,5 +25,12 @@ namespace Apache.Arrow
             item1 = value.Item1;
             item2 = value.Item2;
         }
+
+        public static void Deconstruct<T1, T2, T3>(this Tuple<T1, T2, T3> value, out T1 item1, out T2 item2, out T3 item3)
+        {
+            item1 = value.Item1;
+            item2 = value.Item2;
+            item3 = value.Item3;
+        }
     }
 }
diff --git a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
index 0afd1490e7..d8f7a9566b 100644
--- a/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
+++ b/csharp/test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj
@@ -7,7 +7,7 @@
   </PropertyGroup>
 
   <PropertyGroup Condition="'$(IsWindows)'=='true'">
-    <TargetFrameworks>net7.0;net472</TargetFrameworks>
+    <TargetFrameworks>net7.0;net472;net462</TargetFrameworks>
   </PropertyGroup>
   <PropertyGroup Condition="'$(IsWindows)'!='true'">
     <TargetFrameworks>net7.0</TargetFrameworks>
diff --git a/csharp/test/Apache.Arrow.Tests/BinaryArrayBuilderTests.cs b/csharp/test/Apache.Arrow.Tests/BinaryArrayBuilderTests.cs
index 4c2b050d0c..447572dda0 100644
--- a/csharp/test/Apache.Arrow.Tests/BinaryArrayBuilderTests.cs
+++ b/csharp/test/Apache.Arrow.Tests/BinaryArrayBuilderTests.cs
@@ -83,7 +83,7 @@ namespace Apache.Arrow.Tests
                     builder.AppendRange(initialContents);
                 int initialLength = builder.Length;
                 int expectedLength = initialLength + 1;
-                var expectedArrayContents = initialContents.Append(new[] { singleByte });
+                var expectedArrayContents = initialContents.Concat(new[] { new[] { singleByte } });
 
                 // Act
                 var actualReturnValue = builder.Append(singleByte);
@@ -130,7 +130,7 @@ namespace Apache.Arrow.Tests
                     builder.AppendRange(initialContents);
                 int initialLength = builder.Length;
                 int expectedLength = initialLength + 1;
-                var expectedArrayContents = initialContents.Append(null);
+                var expectedArrayContents = initialContents.Concat(new byte[][] { null });
 
                 // Act
                 var actualReturnValue = builder.AppendNull();
@@ -180,7 +180,7 @@ namespace Apache.Arrow.Tests
                 int initialLength = builder.Length;
                 var span = (ReadOnlySpan<byte>)bytes;
                 int expectedLength = initialLength + 1;
-                var expectedArrayContents = initialContents.Append(bytes);
+                var expectedArrayContents = initialContents.Concat(new[] { bytes });
 
                 // Act
                 var actualReturnValue = builder.Append(span);
@@ -230,7 +230,7 @@ namespace Apache.Arrow.Tests
                 int initialLength = builder.Length;
                 int expectedLength = initialLength + 1;
                 var enumerable = (IEnumerable<byte>)bytes;
-                var expectedArrayContents = initialContents.Append(bytes);
+                var expectedArrayContents = initialContents.Concat(new[] { bytes });
 
                 // Act
                 var actualReturnValue = builder.Append(enumerable);


(arrow) 26/30: GH-40004: [Python][FlightRPC] Release GIL in GeneratorStream (#40005)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 4b1153620a973ad95484ec008419cf806e35b589
Author: Lubo Slivka <lu...@gooddata.com>
AuthorDate: Thu Feb 8 22:58:07 2024 +0100

    GH-40004: [Python][FlightRPC] Release GIL in GeneratorStream (#40005)
    
    Fixes #40004.
    
    * Closes: #40004
    
    Authored-by: lupko <lu...@gooddata.com>
    Signed-off-by: David Li <li...@gmail.com>
---
 python/pyarrow/_flight.pyx | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/python/pyarrow/_flight.pyx b/python/pyarrow/_flight.pyx
index a2ff045f25..67ee759056 100644
--- a/python/pyarrow/_flight.pyx
+++ b/python/pyarrow/_flight.pyx
@@ -2013,8 +2013,9 @@ cdef CStatus _data_stream_next(void* self, CFlightPayload* payload) except *:
     max_attempts = 128
     for _ in range(max_attempts):
         if stream.current_stream != nullptr:
-            check_flight_status(
-                stream.current_stream.get().Next().Value(payload))
+            with nogil:
+                check_flight_status(
+                    stream.current_stream.get().Next().Value(payload))
             # If the stream ended, see if there's another stream from the
             # generator
             if payload.ipc_message.metadata != nullptr:


(arrow) 09/30: GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) (#39383)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 881eec514255e6cdc78e28c390d674d218b86059
Author: Rossi(Ruoxi) Sun <za...@gmail.com>
AuthorDate: Thu Jan 18 19:44:26 2024 +0800

    GH-39332: [C++] Explicit error in ExecBatchBuilder when appending var length data exceeds offset limit (int32 max) (#39383)
    
    
    
    ### Rationale for this change
    
    When appending variable-length data in `ExecBatchBuilder`, the offset can overflow if the batch contains 4GB of data or more. This may further result in a segmentation fault during the subsequent copying of the data contents. For details, please refer to this comment: https://github.com/apache/arrow/issues/39332#issuecomment-1870690063.
    
    The solution is to let the user use the "large" counterpart data type to avoid the overflow, but we still need explicit error information when such an overflow happens.
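    
    As an illustration (not part of the original patch), a minimal PyArrow sketch of the suggested mitigation: declaring the column with the 64-bit-offset "large" type instead of the 32-bit-offset one. The data here is tiny; only the type choice matters:
    
    ```
    import pyarrow as pa
    
    # string/binary use int32 offsets, capping one array's variable-length data
    # near 2 GiB; large_string/large_binary use int64 offsets instead.
    regular = pa.array(["some", "values"], type=pa.string())
    large = pa.array(["some", "values"], type=pa.large_string())
    
    print(regular.type)  # string
    print(large.type)    # large_string
    ```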
    
    ### What changes are included in this PR?
    
    1. Detect the offset overflow when appending data in `ExecBatchBuilder` and report an explicit error.
    2. Change the offset type from `uint32_t` to `int32_t` in `ExecBatchBuilder` and respect the `BinaryBuilder::memory_limit()`, which is `2GB - 2B`, as in the rest of the codebase.
    
    ### Are these changes tested?
    
    UT included.
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39332
    
    Lead-authored-by: zanmato <za...@gmail.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: Rossi(Ruoxi) Sun <za...@gmail.com>
    Co-authored-by: Antoine Pitrou <pi...@free.fr>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/compute/light_array.cc      | 47 +++++++++++++----------
 cpp/src/arrow/compute/light_array.h       |  2 +-
 cpp/src/arrow/compute/light_array_test.cc | 64 +++++++++++++++++++++++++++++++
 cpp/src/arrow/testing/generator.cc        |  9 ++++-
 4 files changed, 100 insertions(+), 22 deletions(-)

diff --git a/cpp/src/arrow/compute/light_array.cc b/cpp/src/arrow/compute/light_array.cc
index 66d8477b02..b225e04b05 100644
--- a/cpp/src/arrow/compute/light_array.cc
+++ b/cpp/src/arrow/compute/light_array.cc
@@ -20,6 +20,8 @@
 #include <type_traits>
 
 #include "arrow/util/bitmap_ops.h"
+#include "arrow/util/int_util_overflow.h"
+#include "arrow/util/macros.h"
 
 namespace arrow {
 namespace compute {
@@ -325,11 +327,10 @@ Status ResizableArrayData::ResizeVaryingLengthBuffer() {
   column_metadata = ColumnMetadataFromDataType(data_type_).ValueOrDie();
 
   if (!column_metadata.is_fixed_length) {
-    int min_new_size = static_cast<int>(reinterpret_cast<const uint32_t*>(
-        buffers_[kFixedLengthBuffer]->data())[num_rows_]);
+    int64_t min_new_size = buffers_[kFixedLengthBuffer]->data_as<int32_t>()[num_rows_];
     ARROW_DCHECK(var_len_buf_size_ > 0);
     if (var_len_buf_size_ < min_new_size) {
-      int new_size = var_len_buf_size_;
+      int64_t new_size = var_len_buf_size_;
       while (new_size < min_new_size) {
         new_size *= 2;
       }
@@ -465,12 +466,11 @@ void ExecBatchBuilder::Visit(const std::shared_ptr<ArrayData>& column, int num_r
 
   if (!metadata.is_fixed_length) {
     const uint8_t* ptr_base = column->buffers[2]->data();
-    const uint32_t* offsets =
-        reinterpret_cast<const uint32_t*>(column->buffers[1]->data()) + column->offset;
+    const int32_t* offsets = column->GetValues<int32_t>(1);
     for (int i = 0; i < num_rows; ++i) {
       uint16_t row_id = row_ids[i];
       const uint8_t* field_ptr = ptr_base + offsets[row_id];
-      uint32_t field_length = offsets[row_id + 1] - offsets[row_id];
+      int32_t field_length = offsets[row_id + 1] - offsets[row_id];
       process_value_fn(i, field_ptr, field_length);
     }
   } else {
@@ -480,7 +480,7 @@ void ExecBatchBuilder::Visit(const std::shared_ptr<ArrayData>& column, int num_r
       const uint8_t* field_ptr =
           column->buffers[1]->data() +
           (column->offset + row_id) * static_cast<int64_t>(metadata.fixed_length);
-      process_value_fn(i, field_ptr, metadata.fixed_length);
+      process_value_fn(i, field_ptr, static_cast<int32_t>(metadata.fixed_length));
     }
   }
 }
@@ -511,14 +511,14 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
         break;
       case 1:
         Visit(source, num_rows_to_append, row_ids,
-              [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+              [&](int i, const uint8_t* ptr, int32_t num_bytes) {
                 target->mutable_data(1)[num_rows_before + i] = *ptr;
               });
         break;
       case 2:
         Visit(
             source, num_rows_to_append, row_ids,
-            [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+            [&](int i, const uint8_t* ptr, int32_t num_bytes) {
               reinterpret_cast<uint16_t*>(target->mutable_data(1))[num_rows_before + i] =
                   *reinterpret_cast<const uint16_t*>(ptr);
             });
@@ -526,7 +526,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
       case 4:
         Visit(
             source, num_rows_to_append, row_ids,
-            [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+            [&](int i, const uint8_t* ptr, int32_t num_bytes) {
               reinterpret_cast<uint32_t*>(target->mutable_data(1))[num_rows_before + i] =
                   *reinterpret_cast<const uint32_t*>(ptr);
             });
@@ -534,7 +534,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
       case 8:
         Visit(
             source, num_rows_to_append, row_ids,
-            [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+            [&](int i, const uint8_t* ptr, int32_t num_bytes) {
               reinterpret_cast<uint64_t*>(target->mutable_data(1))[num_rows_before + i] =
                   *reinterpret_cast<const uint64_t*>(ptr);
             });
@@ -544,7 +544,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
             num_rows_to_append -
             NumRowsToSkip(source, num_rows_to_append, row_ids, sizeof(uint64_t));
         Visit(source, num_rows_to_process, row_ids,
-              [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+              [&](int i, const uint8_t* ptr, int32_t num_bytes) {
                 uint64_t* dst = reinterpret_cast<uint64_t*>(
                     target->mutable_data(1) +
                     static_cast<int64_t>(num_bytes) * (num_rows_before + i));
@@ -558,7 +558,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
         if (num_rows_to_append > num_rows_to_process) {
           Visit(source, num_rows_to_append - num_rows_to_process,
                 row_ids + num_rows_to_process,
-                [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+                [&](int i, const uint8_t* ptr, int32_t num_bytes) {
                   uint64_t* dst = reinterpret_cast<uint64_t*>(
                       target->mutable_data(1) +
                       static_cast<int64_t>(num_bytes) *
@@ -575,16 +575,23 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
 
     // Step 1: calculate target offsets
     //
-    uint32_t* offsets = reinterpret_cast<uint32_t*>(target->mutable_data(1));
-    uint32_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];
+    int32_t* offsets = reinterpret_cast<int32_t*>(target->mutable_data(1));
+    int32_t sum = num_rows_before == 0 ? 0 : offsets[num_rows_before];
     Visit(source, num_rows_to_append, row_ids,
-          [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+          [&](int i, const uint8_t* ptr, int32_t num_bytes) {
             offsets[num_rows_before + i] = num_bytes;
           });
     for (int i = 0; i < num_rows_to_append; ++i) {
-      uint32_t length = offsets[num_rows_before + i];
+      int32_t length = offsets[num_rows_before + i];
       offsets[num_rows_before + i] = sum;
-      sum += length;
+      int32_t new_sum_maybe_overflow = 0;
+      if (ARROW_PREDICT_FALSE(
+              arrow::internal::AddWithOverflow(sum, length, &new_sum_maybe_overflow))) {
+        return Status::Invalid("Overflow detected in ExecBatchBuilder when appending ",
+                               num_rows_before + i + 1, "-th element of length ", length,
+                               " bytes to current length ", sum, " bytes");
+      }
+      sum = new_sum_maybe_overflow;
     }
     offsets[num_rows_before + num_rows_to_append] = sum;
 
@@ -598,7 +605,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
         num_rows_to_append -
         NumRowsToSkip(source, num_rows_to_append, row_ids, sizeof(uint64_t));
     Visit(source, num_rows_to_process, row_ids,
-          [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+          [&](int i, const uint8_t* ptr, int32_t num_bytes) {
             uint64_t* dst = reinterpret_cast<uint64_t*>(target->mutable_data(2) +
                                                         offsets[num_rows_before + i]);
             const uint64_t* src = reinterpret_cast<const uint64_t*>(ptr);
@@ -608,7 +615,7 @@ Status ExecBatchBuilder::AppendSelected(const std::shared_ptr<ArrayData>& source
             }
           });
     Visit(source, num_rows_to_append - num_rows_to_process, row_ids + num_rows_to_process,
-          [&](int i, const uint8_t* ptr, uint32_t num_bytes) {
+          [&](int i, const uint8_t* ptr, int32_t num_bytes) {
             uint64_t* dst = reinterpret_cast<uint64_t*>(
                 target->mutable_data(2) +
                 offsets[num_rows_before + num_rows_to_process + i]);
diff --git a/cpp/src/arrow/compute/light_array.h b/cpp/src/arrow/compute/light_array.h
index 84aa86d64b..67de71bf56 100644
--- a/cpp/src/arrow/compute/light_array.h
+++ b/cpp/src/arrow/compute/light_array.h
@@ -353,7 +353,7 @@ class ARROW_EXPORT ResizableArrayData {
   MemoryPool* pool_;
   int num_rows_;
   int num_rows_allocated_;
-  int var_len_buf_size_;
+  int64_t var_len_buf_size_;
   static constexpr int kMaxBuffers = 3;
   std::shared_ptr<ResizableBuffer> buffers_[kMaxBuffers];
 };
diff --git a/cpp/src/arrow/compute/light_array_test.cc b/cpp/src/arrow/compute/light_array_test.cc
index d50e967551..ecc5f3ad37 100644
--- a/cpp/src/arrow/compute/light_array_test.cc
+++ b/cpp/src/arrow/compute/light_array_test.cc
@@ -407,6 +407,70 @@ TEST(ExecBatchBuilder, AppendValuesBeyondLimit) {
   ASSERT_EQ(0, pool->bytes_allocated());
 }
 
+TEST(ExecBatchBuilder, AppendVarLengthBeyondLimit) {
+  // GH-39332: check appending variable-length data past 2GB.
+  if constexpr (sizeof(void*) == 4) {
+    GTEST_SKIP() << "Test only works on 64-bit platforms";
+  }
+
+  std::unique_ptr<MemoryPool> owned_pool = MemoryPool::CreateDefault();
+  MemoryPool* pool = owned_pool.get();
+  constexpr auto eight_mb = 8 * 1024 * 1024;
+  constexpr auto eight_mb_minus_one = eight_mb - 1;
+  // String of size 8mb used repeatedly to fill the leading multiples of 8mb of an
+  // array of int32_max bytes.
+  std::string str_8mb(eight_mb, 'a');
+  // String of size (8mb - 1) to be the last element of an array of int32_max bytes.
+  std::string str_8mb_minus_1(eight_mb_minus_one, 'b');
+  std::shared_ptr<Array> values_8mb = ConstantArrayGenerator::String(1, str_8mb);
+  std::shared_ptr<Array> values_8mb_minus_1 =
+      ConstantArrayGenerator::String(1, str_8mb_minus_1);
+
+  ExecBatch batch_8mb({values_8mb}, 1);
+  ExecBatch batch_8mb_minus_1({values_8mb_minus_1}, 1);
+
+  auto num_rows = std::numeric_limits<int32_t>::max() / eight_mb;
+  std::vector<uint16_t> body_row_ids(num_rows, 0);
+  std::vector<uint16_t> tail_row_id(1, 0);
+
+  {
+    // Building an array of (int32_max + 1) = (8mb * num_rows + 8mb) bytes should raise an
+    // error of overflow.
+    ExecBatchBuilder builder;
+    ASSERT_OK(builder.AppendSelected(pool, batch_8mb, num_rows, body_row_ids.data(),
+                                     /*num_cols=*/1));
+    std::stringstream ss;
+    ss << "Invalid: Overflow detected in ExecBatchBuilder when appending " << num_rows + 1
+       << "-th element of length " << eight_mb << " bytes to current length "
+       << eight_mb * num_rows << " bytes";
+    ASSERT_RAISES_WITH_MESSAGE(
+        Invalid, ss.str(),
+        builder.AppendSelected(pool, batch_8mb, 1, tail_row_id.data(),
+                               /*num_cols=*/1));
+  }
+
+  {
+    // Building an array of int32_max = (8mb * num_rows + 8mb - 1) bytes should succeed.
+    ExecBatchBuilder builder;
+    ASSERT_OK(builder.AppendSelected(pool, batch_8mb, num_rows, body_row_ids.data(),
+                                     /*num_cols=*/1));
+    ASSERT_OK(builder.AppendSelected(pool, batch_8mb_minus_1, 1, tail_row_id.data(),
+                                     /*num_cols=*/1));
+    ExecBatch built = builder.Flush();
+    auto datum = built[0];
+    ASSERT_TRUE(datum.is_array());
+    auto array = datum.array_as<StringArray>();
+    ASSERT_EQ(array->length(), num_rows + 1);
+    for (int i = 0; i < num_rows; ++i) {
+      ASSERT_EQ(array->GetString(i), str_8mb);
+    }
+    ASSERT_EQ(array->GetString(num_rows), str_8mb_minus_1);
+    ASSERT_NE(0, pool->bytes_allocated());
+  }
+
+  ASSERT_EQ(0, pool->bytes_allocated());
+}
+
 TEST(KeyColumnArray, FromExecBatch) {
   ExecBatch batch =
       JSONToExecBatch({int64(), boolean()}, "[[1, true], [2, false], [null, null]]");
diff --git a/cpp/src/arrow/testing/generator.cc b/cpp/src/arrow/testing/generator.cc
index 36c88c20ef..5ea6a541e8 100644
--- a/cpp/src/arrow/testing/generator.cc
+++ b/cpp/src/arrow/testing/generator.cc
@@ -38,6 +38,7 @@
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/string.h"
 
@@ -103,7 +104,13 @@ std::shared_ptr<arrow::Array> ConstantArrayGenerator::Float64(int64_t size,
 
 std::shared_ptr<arrow::Array> ConstantArrayGenerator::String(int64_t size,
                                                              std::string value) {
-  return ConstantArray<StringType>(size, value);
+  using BuilderType = typename TypeTraits<StringType>::BuilderType;
+  auto type = TypeTraits<StringType>::type_singleton();
+  auto builder_fn = [&](BuilderType* builder) {
+    DCHECK_OK(builder->Append(std::string_view(value.data())));
+  };
+  return ArrayFromBuilderVisitor(type, value.size() * size, size, builder_fn)
+      .ValueOrDie();
 }
 
 std::shared_ptr<arrow::Array> ConstantArrayGenerator::Zeroes(
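
The guard added to AppendSelected above amounts to a checked accumulation of int32 offsets: StringArray offsets are 32-bit, so appending variable-length values whose total size passes INT32_MAX has to fail instead of silently wrapping. A minimal standalone sketch of the same pattern, using GCC/Clang's __builtin_add_overflow in place of arrow::internal::AddWithOverflow and illustrative names only:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Accumulate per-value lengths into int32 offsets, reporting failure when the
    // running total would exceed INT32_MAX instead of wrapping around.
    bool AccumulateOffsets(const std::vector<int32_t>& lengths,
                           std::vector<int32_t>* offsets) {
      int32_t sum = 0;
      offsets->assign(1, sum);
      for (int32_t length : lengths) {
        int32_t new_sum = 0;
        if (__builtin_add_overflow(sum, length, &new_sum)) {
          return false;  // more than 2^31 - 1 bytes of variable-length data
        }
        sum = new_sum;
        offsets->push_back(sum);
      }
      return true;
    }

    int main() {
      // Two 1.5 GB values cannot be addressed with int32 offsets.
      std::vector<int32_t> offsets;
      bool ok = AccumulateOffsets({1500000000, 1500000000}, &offsets);
      std::printf("fits in int32 offsets: %s\n", ok ? "yes" : "no");
      return 0;
    }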


(arrow) 29/30: GH-40009: [C++] Add missing "#include <algorithm>" (#40010)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit b59bec36b7eca72d289bf44d3b59ef3085521f54
Author: Sutou Kouhei <ko...@clear-code.com>
AuthorDate: Sat Feb 10 08:29:55 2024 +0900

    GH-40009: [C++] Add missing "#include <algorithm>" (#40010)
    
    ### Rationale for this change
    
    `std::find()` is defined in `<algorithm>`. If we don't include `<algorithm>` explicitly, g++-14 complains:
    
        cpp/src/arrow/filesystem/util_internal.cc: In function 'arrow::Result<std::__cxx11::basic_string<char> > arrow::fs::internal::PathFromUriHelper(const std::string&, std::vector<std::__cxx11::basic_string<char> >, bool, AuthorityHandlingBehavior)':
        cpp/src/arrow/filesystem/util_internal.cc:143:16: error: no matching function for call to 'find(std::vector<std::__cxx11::basic_string<char> >::iterator, std::vector<std::__cxx11::basic_string<char> >::iterator, const std::__cxx11::basic_string<char>&)'
          143 |   if (std::find(supported_schemes.begin(), supported_schemes.end(), scheme) ==
              |       ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note: candidate: 'template<class _CharT2> typename __gnu_cxx::__enable_if<std::__is_char<_CharT2>::__value, std::istreambuf_iterator<_CharT, std::char_traits<_CharT> > >::__type std::find(istreambuf_iterator<_CharT, char_traits<_CharT> >, istreambuf_iterator<_CharT, char_traits<_CharT> >, const _CharT2&)'
          435 |     find(istreambuf_iterator<_CharT> __first,
              |     ^~~~
        /usr/include/c++/14/bits/streambuf_iterator.h:435:5: note:   template argument deduction/substitution failed:
        cpp/src/arrow/filesystem/util_internal.cc:143:16: note:   '__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char>*, std::vector<std::__cxx11::basic_string<char> > >' is not derived from 'std::istreambuf_iterator<_CharT, std::char_traits<_CharT> >'
          143 |   if (std::find(supported_schemes.begin(), supported_schemes.end(), scheme) ==
              |       ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    ### What changes are included in this PR?
    
    Include `<algorithm>` explicitly.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #40009
    
    Authored-by: Sutou Kouhei <ko...@clear-code.com>
    Signed-off-by: Jacob Wujciak-Jens <ja...@wujciak.de>
---
 cpp/src/arrow/filesystem/util_internal.cc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cpp/src/arrow/filesystem/util_internal.cc b/cpp/src/arrow/filesystem/util_internal.cc
index 1ca5af27fc..13f43d45db 100644
--- a/cpp/src/arrow/filesystem/util_internal.cc
+++ b/cpp/src/arrow/filesystem/util_internal.cc
@@ -17,6 +17,7 @@
 
 #include "arrow/filesystem/util_internal.h"
 
+#include <algorithm>
 #include <cerrno>
 
 #include "arrow/buffer.h"


(arrow) 14/30: GH-39778: [C++] Fix tail-byte access cross buffer boundary in key hash avx2 (#39800)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit faeab94e486f9c3cfacec2ab27ea3b6c8c0c9e27
Author: Rossi Sun <za...@gmail.com>
AuthorDate: Fri Jan 26 22:43:08 2024 +0800

    GH-39778: [C++] Fix tail-byte access cross buffer boundary in key hash avx2 (#39800)
    
    
    
    ### Rationale for this change
    
    Issue #39778 appears to be caused by a careless (but hard-to-spot) bug in the AVX2 key hash code.
    
    ### What changes are included in this PR?
    
    Fix the careless bug.
    
    ### Are these changes tested?
    
    UT included.
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39778
    
    Authored-by: Ruoxi Sun <za...@gmail.com>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/compute/key_hash.cc      | 142 +++++++++++++++++----------------
 cpp/src/arrow/compute/key_hash.h       |  22 ++---
 cpp/src/arrow/compute/key_hash_avx2.cc |   2 +-
 cpp/src/arrow/compute/key_hash_test.cc |  59 ++++++++++++++
 4 files changed, 145 insertions(+), 80 deletions(-)

diff --git a/cpp/src/arrow/compute/key_hash.cc b/cpp/src/arrow/compute/key_hash.cc
index f5867b405e..1902b9ce9a 100644
--- a/cpp/src/arrow/compute/key_hash.cc
+++ b/cpp/src/arrow/compute/key_hash.cc
@@ -105,23 +105,23 @@ inline void Hashing32::StripeMask(int i, uint32_t* mask1, uint32_t* mask2,
 }
 
 template <bool T_COMBINE_HASHES>
-void Hashing32::HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_t* keys,
-                                uint32_t* hashes) {
+void Hashing32::HashFixedLenImp(uint32_t num_rows, uint64_t key_length,
+                                const uint8_t* keys, uint32_t* hashes) {
   // Calculate the number of rows that skip the last 16 bytes
   //
   uint32_t num_rows_safe = num_rows;
-  while (num_rows_safe > 0 && (num_rows - num_rows_safe) * length < kStripeSize) {
+  while (num_rows_safe > 0 && (num_rows - num_rows_safe) * key_length < kStripeSize) {
     --num_rows_safe;
   }
 
   // Compute masks for the last 16 byte stripe
   //
-  uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize);
+  uint64_t num_stripes = bit_util::CeilDiv(key_length, kStripeSize);
   uint32_t mask1, mask2, mask3, mask4;
-  StripeMask(((length - 1) & (kStripeSize - 1)) + 1, &mask1, &mask2, &mask3, &mask4);
+  StripeMask(((key_length - 1) & (kStripeSize - 1)) + 1, &mask1, &mask2, &mask3, &mask4);
 
   for (uint32_t i = 0; i < num_rows_safe; ++i) {
-    const uint8_t* key = keys + static_cast<uint64_t>(i) * length;
+    const uint8_t* key = keys + static_cast<uint64_t>(i) * key_length;
     uint32_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
     ProcessLastStripe(mask1, mask2, mask3, mask4, key + (num_stripes - 1) * kStripeSize,
@@ -138,11 +138,11 @@ void Hashing32::HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_
 
   uint32_t last_stripe_copy[4];
   for (uint32_t i = num_rows_safe; i < num_rows; ++i) {
-    const uint8_t* key = keys + static_cast<uint64_t>(i) * length;
+    const uint8_t* key = keys + static_cast<uint64_t>(i) * key_length;
     uint32_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
     memcpy(last_stripe_copy, key + (num_stripes - 1) * kStripeSize,
-           length - (num_stripes - 1) * kStripeSize);
+           key_length - (num_stripes - 1) * kStripeSize);
     ProcessLastStripe(mask1, mask2, mask3, mask4,
                       reinterpret_cast<const uint8_t*>(last_stripe_copy), &acc1, &acc2,
                       &acc3, &acc4);
@@ -168,15 +168,16 @@ void Hashing32::HashVarLenImp(uint32_t num_rows, const T* offsets,
   }
 
   for (uint32_t i = 0; i < num_rows_safe; ++i) {
-    uint64_t length = offsets[i + 1] - offsets[i];
+    uint64_t key_length = offsets[i + 1] - offsets[i];
 
     // Compute masks for the last 16 byte stripe.
     // For an empty string set number of stripes to 1 but mask to all zeroes.
     //
-    int is_non_empty = length == 0 ? 0 : 1;
-    uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize) + (1 - is_non_empty);
+    int is_non_empty = key_length == 0 ? 0 : 1;
+    uint64_t num_stripes =
+        bit_util::CeilDiv(key_length, kStripeSize) + (1 - is_non_empty);
     uint32_t mask1, mask2, mask3, mask4;
-    StripeMask(((length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
+    StripeMask(((key_length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
                &mask2, &mask3, &mask4);
 
     const uint8_t* key = concatenated_keys + offsets[i];
@@ -198,23 +199,24 @@ void Hashing32::HashVarLenImp(uint32_t num_rows, const T* offsets,
 
   uint32_t last_stripe_copy[4];
   for (uint32_t i = num_rows_safe; i < num_rows; ++i) {
-    uint64_t length = offsets[i + 1] - offsets[i];
+    uint64_t key_length = offsets[i + 1] - offsets[i];
 
     // Compute masks for the last 16 byte stripe.
     // For an empty string set number of stripes to 1 but mask to all zeroes.
     //
-    int is_non_empty = length == 0 ? 0 : 1;
-    uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize) + (1 - is_non_empty);
+    int is_non_empty = key_length == 0 ? 0 : 1;
+    uint64_t num_stripes =
+        bit_util::CeilDiv(key_length, kStripeSize) + (1 - is_non_empty);
     uint32_t mask1, mask2, mask3, mask4;
-    StripeMask(((length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
+    StripeMask(((key_length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
                &mask2, &mask3, &mask4);
 
     const uint8_t* key = concatenated_keys + offsets[i];
     uint32_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
-    if (length > 0) {
+    if (key_length > 0) {
       memcpy(last_stripe_copy, key + (num_stripes - 1) * kStripeSize,
-             length - (num_stripes - 1) * kStripeSize);
+             key_length - (num_stripes - 1) * kStripeSize);
     }
     if (num_stripes > 0) {
       ProcessLastStripe(mask1, mask2, mask3, mask4,
@@ -309,9 +311,9 @@ void Hashing32::HashIntImp(uint32_t num_keys, const T* keys, uint32_t* hashes) {
   }
 }
 
-void Hashing32::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t length_key,
+void Hashing32::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                         const uint8_t* keys, uint32_t* hashes) {
-  switch (length_key) {
+  switch (key_length) {
     case sizeof(uint8_t):
       if (combine_hashes) {
         HashIntImp<true, uint8_t>(num_keys, keys, hashes);
@@ -352,27 +354,27 @@ void Hashing32::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t length_
   }
 }
 
-void Hashing32::HashFixed(int64_t hardware_flags, bool combine_hashes, uint32_t num_rows,
-                          uint64_t length, const uint8_t* keys, uint32_t* hashes,
-                          uint32_t* hashes_temp_for_combine) {
-  if (ARROW_POPCOUNT64(length) == 1 && length <= sizeof(uint64_t)) {
-    HashInt(combine_hashes, num_rows, length, keys, hashes);
+void Hashing32::HashFixed(int64_t hardware_flags, bool combine_hashes, uint32_t num_keys,
+                          uint64_t key_length, const uint8_t* keys, uint32_t* hashes,
+                          uint32_t* temp_hashes_for_combine) {
+  if (ARROW_POPCOUNT64(key_length) == 1 && key_length <= sizeof(uint64_t)) {
+    HashInt(combine_hashes, num_keys, key_length, keys, hashes);
     return;
   }
 
   uint32_t num_processed = 0;
 #if defined(ARROW_HAVE_RUNTIME_AVX2)
   if (hardware_flags & arrow::internal::CpuInfo::AVX2) {
-    num_processed = HashFixedLen_avx2(combine_hashes, num_rows, length, keys, hashes,
-                                      hashes_temp_for_combine);
+    num_processed = HashFixedLen_avx2(combine_hashes, num_keys, key_length, keys, hashes,
+                                      temp_hashes_for_combine);
   }
 #endif
   if (combine_hashes) {
-    HashFixedLenImp<true>(num_rows - num_processed, length, keys + length * num_processed,
-                          hashes + num_processed);
+    HashFixedLenImp<true>(num_keys - num_processed, key_length,
+                          keys + key_length * num_processed, hashes + num_processed);
   } else {
-    HashFixedLenImp<false>(num_rows - num_processed, length,
-                           keys + length * num_processed, hashes + num_processed);
+    HashFixedLenImp<false>(num_keys - num_processed, key_length,
+                           keys + key_length * num_processed, hashes + num_processed);
   }
 }
 
@@ -423,13 +425,13 @@ void Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
       }
 
       if (cols[icol].metadata().is_fixed_length) {
-        uint32_t col_width = cols[icol].metadata().fixed_length;
-        if (col_width == 0) {
+        uint32_t key_length = cols[icol].metadata().fixed_length;
+        if (key_length == 0) {
           HashBit(icol > 0, cols[icol].bit_offset(1), batch_size_next,
                   cols[icol].data(1) + first_row / 8, hashes + first_row);
         } else {
-          HashFixed(ctx->hardware_flags, icol > 0, batch_size_next, col_width,
-                    cols[icol].data(1) + first_row * col_width, hashes + first_row,
+          HashFixed(ctx->hardware_flags, icol > 0, batch_size_next, key_length,
+                    cols[icol].data(1) + first_row * key_length, hashes + first_row,
                     hash_temp);
         }
       } else if (cols[icol].metadata().fixed_length == sizeof(uint32_t)) {
@@ -463,8 +465,9 @@ void Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
 Status Hashing32::HashBatch(const ExecBatch& key_batch, uint32_t* hashes,
                             std::vector<KeyColumnArray>& column_arrays,
                             int64_t hardware_flags, util::TempVectorStack* temp_stack,
-                            int64_t offset, int64_t length) {
-  RETURN_NOT_OK(ColumnArraysFromExecBatch(key_batch, offset, length, &column_arrays));
+                            int64_t start_rows, int64_t num_rows) {
+  RETURN_NOT_OK(
+      ColumnArraysFromExecBatch(key_batch, start_rows, num_rows, &column_arrays));
 
   LightContext ctx;
   ctx.hardware_flags = hardware_flags;
@@ -574,23 +577,23 @@ inline void Hashing64::StripeMask(int i, uint64_t* mask1, uint64_t* mask2,
 }
 
 template <bool T_COMBINE_HASHES>
-void Hashing64::HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_t* keys,
-                                uint64_t* hashes) {
+void Hashing64::HashFixedLenImp(uint32_t num_rows, uint64_t key_length,
+                                const uint8_t* keys, uint64_t* hashes) {
   // Calculate the number of rows that skip the last 32 bytes
   //
   uint32_t num_rows_safe = num_rows;
-  while (num_rows_safe > 0 && (num_rows - num_rows_safe) * length < kStripeSize) {
+  while (num_rows_safe > 0 && (num_rows - num_rows_safe) * key_length < kStripeSize) {
     --num_rows_safe;
   }
 
   // Compute masks for the last 32 byte stripe
   //
-  uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize);
+  uint64_t num_stripes = bit_util::CeilDiv(key_length, kStripeSize);
   uint64_t mask1, mask2, mask3, mask4;
-  StripeMask(((length - 1) & (kStripeSize - 1)) + 1, &mask1, &mask2, &mask3, &mask4);
+  StripeMask(((key_length - 1) & (kStripeSize - 1)) + 1, &mask1, &mask2, &mask3, &mask4);
 
   for (uint32_t i = 0; i < num_rows_safe; ++i) {
-    const uint8_t* key = keys + static_cast<uint64_t>(i) * length;
+    const uint8_t* key = keys + static_cast<uint64_t>(i) * key_length;
     uint64_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
     ProcessLastStripe(mask1, mask2, mask3, mask4, key + (num_stripes - 1) * kStripeSize,
@@ -607,11 +610,11 @@ void Hashing64::HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_
 
   uint64_t last_stripe_copy[4];
   for (uint32_t i = num_rows_safe; i < num_rows; ++i) {
-    const uint8_t* key = keys + static_cast<uint64_t>(i) * length;
+    const uint8_t* key = keys + static_cast<uint64_t>(i) * key_length;
     uint64_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
     memcpy(last_stripe_copy, key + (num_stripes - 1) * kStripeSize,
-           length - (num_stripes - 1) * kStripeSize);
+           key_length - (num_stripes - 1) * kStripeSize);
     ProcessLastStripe(mask1, mask2, mask3, mask4,
                       reinterpret_cast<const uint8_t*>(last_stripe_copy), &acc1, &acc2,
                       &acc3, &acc4);
@@ -637,15 +640,16 @@ void Hashing64::HashVarLenImp(uint32_t num_rows, const T* offsets,
   }
 
   for (uint32_t i = 0; i < num_rows_safe; ++i) {
-    uint64_t length = offsets[i + 1] - offsets[i];
+    uint64_t key_length = offsets[i + 1] - offsets[i];
 
     // Compute masks for the last 32 byte stripe.
     // For an empty string set number of stripes to 1 but mask to all zeroes.
     //
-    int is_non_empty = length == 0 ? 0 : 1;
-    uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize) + (1 - is_non_empty);
+    int is_non_empty = key_length == 0 ? 0 : 1;
+    uint64_t num_stripes =
+        bit_util::CeilDiv(key_length, kStripeSize) + (1 - is_non_empty);
     uint64_t mask1, mask2, mask3, mask4;
-    StripeMask(((length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
+    StripeMask(((key_length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
                &mask2, &mask3, &mask4);
 
     const uint8_t* key = concatenated_keys + offsets[i];
@@ -667,22 +671,23 @@ void Hashing64::HashVarLenImp(uint32_t num_rows, const T* offsets,
 
   uint64_t last_stripe_copy[4];
   for (uint32_t i = num_rows_safe; i < num_rows; ++i) {
-    uint64_t length = offsets[i + 1] - offsets[i];
+    uint64_t key_length = offsets[i + 1] - offsets[i];
 
     // Compute masks for the last 32 byte stripe
     //
-    int is_non_empty = length == 0 ? 0 : 1;
-    uint64_t num_stripes = bit_util::CeilDiv(length, kStripeSize) + (1 - is_non_empty);
+    int is_non_empty = key_length == 0 ? 0 : 1;
+    uint64_t num_stripes =
+        bit_util::CeilDiv(key_length, kStripeSize) + (1 - is_non_empty);
     uint64_t mask1, mask2, mask3, mask4;
-    StripeMask(((length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
+    StripeMask(((key_length - is_non_empty) & (kStripeSize - 1)) + is_non_empty, &mask1,
                &mask2, &mask3, &mask4);
 
     const uint8_t* key = concatenated_keys + offsets[i];
     uint64_t acc1, acc2, acc3, acc4;
     ProcessFullStripes(num_stripes, key, &acc1, &acc2, &acc3, &acc4);
-    if (length > 0) {
+    if (key_length > 0) {
       memcpy(last_stripe_copy, key + (num_stripes - 1) * kStripeSize,
-             length - (num_stripes - 1) * kStripeSize);
+             key_length - (num_stripes - 1) * kStripeSize);
     }
     if (num_stripes > 0) {
       ProcessLastStripe(mask1, mask2, mask3, mask4,
@@ -759,9 +764,9 @@ void Hashing64::HashIntImp(uint32_t num_keys, const T* keys, uint64_t* hashes) {
   }
 }
 
-void Hashing64::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t length_key,
+void Hashing64::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                         const uint8_t* keys, uint64_t* hashes) {
-  switch (length_key) {
+  switch (key_length) {
     case sizeof(uint8_t):
       if (combine_hashes) {
         HashIntImp<true, uint8_t>(num_keys, keys, hashes);
@@ -802,17 +807,17 @@ void Hashing64::HashInt(bool combine_hashes, uint32_t num_keys, uint64_t length_
   }
 }
 
-void Hashing64::HashFixed(bool combine_hashes, uint32_t num_rows, uint64_t length,
+void Hashing64::HashFixed(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                           const uint8_t* keys, uint64_t* hashes) {
-  if (ARROW_POPCOUNT64(length) == 1 && length <= sizeof(uint64_t)) {
-    HashInt(combine_hashes, num_rows, length, keys, hashes);
+  if (ARROW_POPCOUNT64(key_length) == 1 && key_length <= sizeof(uint64_t)) {
+    HashInt(combine_hashes, num_keys, key_length, keys, hashes);
     return;
   }
 
   if (combine_hashes) {
-    HashFixedLenImp<true>(num_rows, length, keys, hashes);
+    HashFixedLenImp<true>(num_keys, key_length, keys, hashes);
   } else {
-    HashFixedLenImp<false>(num_rows, length, keys, hashes);
+    HashFixedLenImp<false>(num_keys, key_length, keys, hashes);
   }
 }
 
@@ -860,13 +865,13 @@ void Hashing64::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
       }
 
       if (cols[icol].metadata().is_fixed_length) {
-        uint64_t col_width = cols[icol].metadata().fixed_length;
-        if (col_width == 0) {
+        uint64_t key_length = cols[icol].metadata().fixed_length;
+        if (key_length == 0) {
           HashBit(icol > 0, cols[icol].bit_offset(1), batch_size_next,
                   cols[icol].data(1) + first_row / 8, hashes + first_row);
         } else {
-          HashFixed(icol > 0, batch_size_next, col_width,
-                    cols[icol].data(1) + first_row * col_width, hashes + first_row);
+          HashFixed(icol > 0, batch_size_next, key_length,
+                    cols[icol].data(1) + first_row * key_length, hashes + first_row);
         }
       } else if (cols[icol].metadata().fixed_length == sizeof(uint32_t)) {
         HashVarLen(icol > 0, batch_size_next, cols[icol].offsets() + first_row,
@@ -897,8 +902,9 @@ void Hashing64::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
 Status Hashing64::HashBatch(const ExecBatch& key_batch, uint64_t* hashes,
                             std::vector<KeyColumnArray>& column_arrays,
                             int64_t hardware_flags, util::TempVectorStack* temp_stack,
-                            int64_t offset, int64_t length) {
-  RETURN_NOT_OK(ColumnArraysFromExecBatch(key_batch, offset, length, &column_arrays));
+                            int64_t start_row, int64_t num_rows) {
+  RETURN_NOT_OK(
+      ColumnArraysFromExecBatch(key_batch, start_row, num_rows, &column_arrays));
 
   LightContext ctx;
   ctx.hardware_flags = hardware_flags;
diff --git a/cpp/src/arrow/compute/key_hash.h b/cpp/src/arrow/compute/key_hash.h
index b193716c9b..1173df5ed1 100644
--- a/cpp/src/arrow/compute/key_hash.h
+++ b/cpp/src/arrow/compute/key_hash.h
@@ -51,10 +51,10 @@ class ARROW_EXPORT Hashing32 {
   static Status HashBatch(const ExecBatch& key_batch, uint32_t* hashes,
                           std::vector<KeyColumnArray>& column_arrays,
                           int64_t hardware_flags, util::TempVectorStack* temp_stack,
-                          int64_t offset, int64_t length);
+                          int64_t start_row, int64_t num_rows);
 
   static void HashFixed(int64_t hardware_flags, bool combine_hashes, uint32_t num_keys,
-                        uint64_t length_key, const uint8_t* keys, uint32_t* hashes,
+                        uint64_t key_length, const uint8_t* keys, uint32_t* hashes,
                         uint32_t* temp_hashes_for_combine);
 
  private:
@@ -100,7 +100,7 @@ class ARROW_EXPORT Hashing32 {
   static inline void StripeMask(int i, uint32_t* mask1, uint32_t* mask2, uint32_t* mask3,
                                 uint32_t* mask4);
   template <bool T_COMBINE_HASHES>
-  static void HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_t* keys,
+  static void HashFixedLenImp(uint32_t num_rows, uint64_t key_length, const uint8_t* keys,
                               uint32_t* hashes);
   template <typename T, bool T_COMBINE_HASHES>
   static void HashVarLenImp(uint32_t num_rows, const T* offsets,
@@ -112,7 +112,7 @@ class ARROW_EXPORT Hashing32 {
                       const uint8_t* keys, uint32_t* hashes);
   template <bool T_COMBINE_HASHES, typename T>
   static void HashIntImp(uint32_t num_keys, const T* keys, uint32_t* hashes);
-  static void HashInt(bool combine_hashes, uint32_t num_keys, uint64_t length_key,
+  static void HashInt(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                       const uint8_t* keys, uint32_t* hashes);
 
 #if defined(ARROW_HAVE_RUNTIME_AVX2)
@@ -129,11 +129,11 @@ class ARROW_EXPORT Hashing32 {
                                             __m256i mask_last_stripe, const uint8_t* keys,
                                             int64_t offset_A, int64_t offset_B);
   template <bool T_COMBINE_HASHES>
-  static uint32_t HashFixedLenImp_avx2(uint32_t num_rows, uint64_t length,
+  static uint32_t HashFixedLenImp_avx2(uint32_t num_rows, uint64_t key_length,
                                        const uint8_t* keys, uint32_t* hashes,
                                        uint32_t* hashes_temp_for_combine);
   static uint32_t HashFixedLen_avx2(bool combine_hashes, uint32_t num_rows,
-                                    uint64_t length, const uint8_t* keys,
+                                    uint64_t key_length, const uint8_t* keys,
                                     uint32_t* hashes, uint32_t* hashes_temp_for_combine);
   template <typename T, bool T_COMBINE_HASHES>
   static uint32_t HashVarLenImp_avx2(uint32_t num_rows, const T* offsets,
@@ -164,9 +164,9 @@ class ARROW_EXPORT Hashing64 {
   static Status HashBatch(const ExecBatch& key_batch, uint64_t* hashes,
                           std::vector<KeyColumnArray>& column_arrays,
                           int64_t hardware_flags, util::TempVectorStack* temp_stack,
-                          int64_t offset, int64_t length);
+                          int64_t start_row, int64_t num_rows);
 
-  static void HashFixed(bool combine_hashes, uint32_t num_keys, uint64_t length_key,
+  static void HashFixed(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                         const uint8_t* keys, uint64_t* hashes);
 
  private:
@@ -203,7 +203,7 @@ class ARROW_EXPORT Hashing64 {
   static inline void StripeMask(int i, uint64_t* mask1, uint64_t* mask2, uint64_t* mask3,
                                 uint64_t* mask4);
   template <bool T_COMBINE_HASHES>
-  static void HashFixedLenImp(uint32_t num_rows, uint64_t length, const uint8_t* keys,
+  static void HashFixedLenImp(uint32_t num_rows, uint64_t key_length, const uint8_t* keys,
                               uint64_t* hashes);
   template <typename T, bool T_COMBINE_HASHES>
   static void HashVarLenImp(uint32_t num_rows, const T* offsets,
@@ -211,11 +211,11 @@ class ARROW_EXPORT Hashing64 {
   template <bool T_COMBINE_HASHES>
   static void HashBitImp(int64_t bit_offset, uint32_t num_keys, const uint8_t* keys,
                          uint64_t* hashes);
-  static void HashBit(bool T_COMBINE_HASHES, int64_t bit_offset, uint32_t num_keys,
+  static void HashBit(bool combine_hashes, int64_t bit_offset, uint32_t num_keys,
                       const uint8_t* keys, uint64_t* hashes);
   template <bool T_COMBINE_HASHES, typename T>
   static void HashIntImp(uint32_t num_keys, const T* keys, uint64_t* hashes);
-  static void HashInt(bool T_COMBINE_HASHES, uint32_t num_keys, uint64_t length_key,
+  static void HashInt(bool combine_hashes, uint32_t num_keys, uint64_t key_length,
                       const uint8_t* keys, uint64_t* hashes);
 };
 
diff --git a/cpp/src/arrow/compute/key_hash_avx2.cc b/cpp/src/arrow/compute/key_hash_avx2.cc
index 1b444b5767..aec2800c64 100644
--- a/cpp/src/arrow/compute/key_hash_avx2.cc
+++ b/cpp/src/arrow/compute/key_hash_avx2.cc
@@ -190,7 +190,7 @@ uint32_t Hashing32::HashFixedLenImp_avx2(uint32_t num_rows, uint64_t length,
   // Do not process rows that could read past the end of the buffer using 16
   // byte loads. Round down number of rows to process to multiple of 2.
   //
-  uint64_t num_rows_to_skip = bit_util::CeilDiv(length, kStripeSize);
+  uint64_t num_rows_to_skip = bit_util::CeilDiv(kStripeSize, length);
   uint32_t num_rows_to_process =
       (num_rows_to_skip > num_rows)
           ? 0
diff --git a/cpp/src/arrow/compute/key_hash_test.cc b/cpp/src/arrow/compute/key_hash_test.cc
index 3e6d41525c..c998df7169 100644
--- a/cpp/src/arrow/compute/key_hash_test.cc
+++ b/cpp/src/arrow/compute/key_hash_test.cc
@@ -252,5 +252,64 @@ TEST(VectorHash, BasicString) { RunTestVectorHash<StringType>(); }
 
 TEST(VectorHash, BasicLargeString) { RunTestVectorHash<LargeStringType>(); }
 
+void HashFixedLengthFrom(int key_length, int num_rows, int start_row) {
+  int num_rows_to_hash = num_rows - start_row;
+  auto num_bytes_aligned = arrow::bit_util::RoundUpToMultipleOf64(key_length * num_rows);
+
+  const auto hardware_flags_for_testing = HardwareFlagsForTesting();
+  ASSERT_GT(hardware_flags_for_testing.size(), 0);
+
+  std::vector<std::vector<uint32_t>> hashes32(hardware_flags_for_testing.size());
+  std::vector<std::vector<uint64_t>> hashes64(hardware_flags_for_testing.size());
+  for (auto& h : hashes32) {
+    h.resize(num_rows_to_hash);
+  }
+  for (auto& h : hashes64) {
+    h.resize(num_rows_to_hash);
+  }
+
+  FixedSizeBinaryBuilder keys_builder(fixed_size_binary(key_length));
+  for (int j = 0; j < num_rows; ++j) {
+    ASSERT_OK(keys_builder.Append(std::string(key_length, 42)));
+  }
+  ASSERT_OK_AND_ASSIGN(auto keys, keys_builder.Finish());
+  // Make sure the buffer is aligned as expected.
+  ASSERT_EQ(keys->data()->buffers[1]->capacity(), num_bytes_aligned);
+
+  constexpr int mini_batch_size = 1024;
+  std::vector<uint32_t> temp_buffer;
+  temp_buffer.resize(mini_batch_size * 4);
+
+  for (int i = 0; i < static_cast<int>(hardware_flags_for_testing.size()); ++i) {
+    const auto hardware_flags = hardware_flags_for_testing[i];
+    Hashing32::HashFixed(hardware_flags,
+                         /*combine_hashes=*/false, num_rows_to_hash, key_length,
+                         keys->data()->GetValues<uint8_t>(1) + start_row * key_length,
+                         hashes32[i].data(), temp_buffer.data());
+    Hashing64::HashFixed(
+        /*combine_hashes=*/false, num_rows_to_hash, key_length,
+        keys->data()->GetValues<uint8_t>(1) + start_row * key_length, hashes64[i].data());
+  }
+
+  // Verify that all implementations (scalar, SIMD) give the same hashes.
+  for (int i = 1; i < static_cast<int>(hardware_flags_for_testing.size()); ++i) {
+    for (int j = 0; j < num_rows_to_hash; ++j) {
+      ASSERT_EQ(hashes32[i][j], hashes32[0][j])
+          << "scalar and simd approaches yielded different 32-bit hashes";
+      ASSERT_EQ(hashes64[i][j], hashes64[0][j])
+          << "scalar and simd approaches yielded different 64-bit hashes";
+    }
+  }
+}
+
+// Some carefully chosen cases that may cause troubles like GH-39778.
+TEST(VectorHash, FixedLengthTailByteSafety) {
+  // Two cases of key_length < stripe (16-byte).
+  HashFixedLengthFrom(/*key_length=*/3, /*num_rows=*/1450, /*start_row=*/1447);
+  HashFixedLengthFrom(/*key_length=*/5, /*num_rows=*/883, /*start_row=*/858);
+  // Case of key_length > stripe (16-byte).
+  HashFixedLengthFrom(/*key_length=*/19, /*num_rows=*/64, /*start_row=*/63);
+}
+
 }  // namespace compute
 }  // namespace arrow
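
The actual fix is the swapped CeilDiv arguments in key_hash_avx2.cc above: the rows that must avoid 16-byte stripe loads are the ones whose keys start within the last 16 bytes of the buffer, i.e. roughly ceil(stripe / key_length) trailing rows, not ceil(key_length / stripe). A short sketch of the arithmetic, assuming the 16-byte stripe used by the patch:

    #include <cstdint>
    #include <cstdio>

    constexpr uint64_t kStripeSize = 16;  // bytes read by one SIMD stripe load

    constexpr uint64_t CeilDiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

    int main() {
      // For 3-byte keys, a 16-byte load issued at one of the last few rows reads
      // past the end of the buffer, so ceil(16 / 3) = 6 trailing rows need the
      // scalar fallback; the old expression ceil(3 / 16) = 1 skipped too few.
      uint64_t key_length = 3;
      std::printf("rows to skip (fixed): %llu\n",
                  static_cast<unsigned long long>(CeilDiv(kStripeSize, key_length)));
      std::printf("rows to skip (buggy): %llu\n",
                  static_cast<unsigned long long>(CeilDiv(key_length, kStripeSize)));
      return 0;
    }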


(arrow) 04/30: GH-39577: [C++] Fix tail-word access cross buffer boundary in `CompareBinaryColumnToRow` (#39606)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 34b54f08cb283797f0e5b2f060637962b5198c4d
Author: Rossi(Ruoxi) Sun <za...@gmail.com>
AuthorDate: Wed Jan 17 01:14:03 2024 +0800

    GH-39577: [C++] Fix tail-word access cross buffer boundary in `CompareBinaryColumnToRow` (#39606)
    
    
    
    ### Rationale for this change
    
    Default buffer alignment (64 bytes) doesn't guarantee the safety of tail-word access in `KeyCompare::CompareBinaryColumnToRow`. Comment https://github.com/apache/arrow/issues/39577#issuecomment-1889090279 gives a concrete example.
    
    ### What changes are included in this PR?
    
    Make `KeyCompare::CompareBinaryColumnToRow` tail-word safe.
    
    ### Are these changes tested?
    
    UT included.
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39577
    
    Authored-by: zanmato1984 <za...@gmail.com>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/compute/CMakeLists.txt          |   3 +-
 cpp/src/arrow/compute/row/compare_internal.cc |  11 +--
 cpp/src/arrow/compute/row/compare_test.cc     | 110 ++++++++++++++++++++++++++
 3 files changed, 118 insertions(+), 6 deletions(-)

diff --git a/cpp/src/arrow/compute/CMakeLists.txt b/cpp/src/arrow/compute/CMakeLists.txt
index 1134e0a98a..e14d78ff6e 100644
--- a/cpp/src/arrow/compute/CMakeLists.txt
+++ b/cpp/src/arrow/compute/CMakeLists.txt
@@ -89,7 +89,8 @@ add_arrow_test(internals_test
                kernel_test.cc
                light_array_test.cc
                registry_test.cc
-               key_hash_test.cc)
+               key_hash_test.cc
+               row/compare_test.cc)
 
 add_arrow_compute_test(expression_test SOURCES expression_test.cc)
 
diff --git a/cpp/src/arrow/compute/row/compare_internal.cc b/cpp/src/arrow/compute/row/compare_internal.cc
index 7c402e7a23..078a8287c7 100644
--- a/cpp/src/arrow/compute/row/compare_internal.cc
+++ b/cpp/src/arrow/compute/row/compare_internal.cc
@@ -208,8 +208,7 @@ void KeyCompare::CompareBinaryColumnToRow(uint32_t offset_within_row,
           // Non-zero length guarantees no underflow
           int32_t num_loops_less_one =
               static_cast<int32_t>(bit_util::CeilDiv(length, 8)) - 1;
-
-          uint64_t tail_mask = ~0ULL >> (64 - 8 * (length - num_loops_less_one * 8));
+          int32_t num_tail_bytes = length - num_loops_less_one * 8;
 
           const uint64_t* key_left_ptr =
               reinterpret_cast<const uint64_t*>(left_base + irow_left * length);
@@ -224,9 +223,11 @@ void KeyCompare::CompareBinaryColumnToRow(uint32_t offset_within_row,
             uint64_t key_right = key_right_ptr[i];
             result_or |= key_left ^ key_right;
           }
-          uint64_t key_left = util::SafeLoad(key_left_ptr + i);
-          uint64_t key_right = key_right_ptr[i];
-          result_or |= tail_mask & (key_left ^ key_right);
+          uint64_t key_left = 0;
+          memcpy(&key_left, key_left_ptr + i, num_tail_bytes);
+          uint64_t key_right = 0;
+          memcpy(&key_right, key_right_ptr + i, num_tail_bytes);
+          result_or |= key_left ^ key_right;
           return result_or == 0 ? 0xff : 0;
         });
   }
diff --git a/cpp/src/arrow/compute/row/compare_test.cc b/cpp/src/arrow/compute/row/compare_test.cc
new file mode 100644
index 0000000000..1d8562cd56
--- /dev/null
+++ b/cpp/src/arrow/compute/row/compare_test.cc
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <numeric>
+
+#include "arrow/compute/row/compare_internal.h"
+#include "arrow/testing/gtest_util.h"
+
+namespace arrow {
+namespace compute {
+
+using arrow::bit_util::BytesForBits;
+using arrow::internal::CpuInfo;
+using arrow::util::MiniBatch;
+using arrow::util::TempVectorStack;
+
+// Specialized case for GH-39577.
+TEST(KeyCompare, CompareColumnsToRowsCuriousFSB) {
+  int fsb_length = 9;
+  MemoryPool* pool = default_memory_pool();
+  TempVectorStack stack;
+  ASSERT_OK(stack.Init(pool, 8 * MiniBatch::kMiniBatchLength * sizeof(uint64_t)));
+
+  int num_rows = 7;
+  auto column_right = ArrayFromJSON(fixed_size_binary(fsb_length), R"([
+      "000000000",
+      "111111111",
+      "222222222",
+      "333333333",
+      "444444444",
+      "555555555",
+      "666666666"])");
+  ExecBatch batch_right({column_right}, num_rows);
+
+  std::vector<KeyColumnMetadata> column_metadatas_right;
+  ASSERT_OK(ColumnMetadatasFromExecBatch(batch_right, &column_metadatas_right));
+
+  RowTableMetadata table_metadata_right;
+  table_metadata_right.FromColumnMetadataVector(column_metadatas_right, sizeof(uint64_t),
+                                                sizeof(uint64_t));
+
+  std::vector<KeyColumnArray> column_arrays_right;
+  ASSERT_OK(ColumnArraysFromExecBatch(batch_right, &column_arrays_right));
+
+  RowTableImpl row_table;
+  ASSERT_OK(row_table.Init(pool, table_metadata_right));
+
+  RowTableEncoder row_encoder;
+  row_encoder.Init(column_metadatas_right, sizeof(uint64_t), sizeof(uint64_t));
+  row_encoder.PrepareEncodeSelected(0, num_rows, column_arrays_right);
+
+  std::vector<uint16_t> row_ids_right(num_rows);
+  std::iota(row_ids_right.begin(), row_ids_right.end(), 0);
+  ASSERT_OK(row_encoder.EncodeSelected(&row_table, num_rows, row_ids_right.data()));
+
+  auto column_left = ArrayFromJSON(fixed_size_binary(fsb_length), R"([
+      "000000000",
+      "111111111",
+      "222222222",
+      "333333333",
+      "444444444",
+      "555555555",
+      "777777777"])");
+  ExecBatch batch_left({column_left}, num_rows);
+  std::vector<KeyColumnArray> column_arrays_left;
+  ASSERT_OK(ColumnArraysFromExecBatch(batch_left, &column_arrays_left));
+
+  std::vector<uint32_t> row_ids_left(num_rows);
+  std::iota(row_ids_left.begin(), row_ids_left.end(), 0);
+
+  LightContext ctx{CpuInfo::GetInstance()->hardware_flags(), &stack};
+
+  {
+    uint32_t num_rows_no_match;
+    std::vector<uint16_t> row_ids_out(num_rows);
+    KeyCompare::CompareColumnsToRows(num_rows, NULLPTR, row_ids_left.data(), &ctx,
+                                     &num_rows_no_match, row_ids_out.data(),
+                                     column_arrays_left, row_table, true, NULLPTR);
+    ASSERT_EQ(num_rows_no_match, 1);
+    ASSERT_EQ(row_ids_out[0], 6);
+  }
+
+  {
+    std::vector<uint8_t> match_bitvector(BytesForBits(num_rows));
+    KeyCompare::CompareColumnsToRows(num_rows, NULLPTR, row_ids_left.data(), &ctx,
+                                     NULLPTR, NULLPTR, column_arrays_left, row_table,
+                                     true, match_bitvector.data());
+    for (int i = 0; i < num_rows; ++i) {
+      SCOPED_TRACE(i);
+      ASSERT_EQ(arrow::bit_util::GetBit(match_bitvector.data(), i), i != 6);
+    }
+  }
+}
+
+}  // namespace compute
+}  // namespace arrow
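
The pattern applied in compare_internal.cc above is worth spelling out: instead of loading a full 8-byte word at the tail and masking it (which can read past the end of a 64-byte-aligned buffer), only the bytes that actually exist are memcpy'd into a zero-initialized word. A standalone sketch of a tail-safe word-wise comparison along those lines, with illustrative names:

    #include <cstdint>
    #include <cstring>

    // Compare `length` bytes of `left` and `right` word by word without ever
    // reading past `length` bytes from either buffer.
    bool EqualTailSafe(const uint8_t* left, const uint8_t* right, int32_t length) {
      uint64_t diff = 0;
      int32_t i = 0;
      for (; i + 8 <= length; i += 8) {
        uint64_t l, r;
        std::memcpy(&l, left + i, 8);
        std::memcpy(&r, right + i, 8);
        diff |= l ^ r;
      }
      if (int32_t tail = length - i; tail > 0) {
        uint64_t l = 0, r = 0;
        std::memcpy(&l, left + i, tail);   // copy only the bytes that exist
        std::memcpy(&r, right + i, tail);
        diff |= l ^ r;
      }
      return diff == 0;
    }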


(arrow) 02/30: GH-39525: [C++][Parquet] Pass memory pool to decoders (#39526)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 75c9e0293415f56b8bac1cadbfb71eb1318fde89
Author: emkornfield <mi...@google.com>
AuthorDate: Thu Jan 11 09:40:51 2024 -0800

    GH-39525: [C++][Parquet] Pass memory pool to decoders (#39526)
    
    ### Rationale for this change
    
    Memory pools should be plumbed through wherever possible.
    
    ### What changes are included in this PR?
    
    Pass through memory pool to decoders
    
    ### Are these changes tested?
    
    Not directly; this was caught via some internal fuzz targets.
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39525
    
    Authored-by: Micah Kornfield <mi...@google.com>
    Signed-off-by: mwish <ma...@gmail.com>
---
 cpp/src/parquet/column_reader.cc | 44 ++++++++--------------------------------
 1 file changed, 9 insertions(+), 35 deletions(-)

diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index 99978e283b..86c32e5e27 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -760,7 +760,7 @@ class ColumnReaderImplBase {
 
     if (page->encoding() == Encoding::PLAIN_DICTIONARY ||
         page->encoding() == Encoding::PLAIN) {
-      auto dictionary = MakeTypedDecoder<DType>(Encoding::PLAIN, descr_);
+      auto dictionary = MakeTypedDecoder<DType>(Encoding::PLAIN, descr_, pool_);
       dictionary->SetData(page->num_values(), page->data(), page->size());
 
       // The dictionary is fully decoded during DictionaryDecoder::Init, so the
@@ -883,47 +883,21 @@ class ColumnReaderImplBase {
       current_decoder_ = it->second.get();
     } else {
       switch (encoding) {
-        case Encoding::PLAIN: {
-          auto decoder = MakeTypedDecoder<DType>(Encoding::PLAIN, descr_);
-          current_decoder_ = decoder.get();
-          decoders_[static_cast<int>(encoding)] = std::move(decoder);
-          break;
-        }
-        case Encoding::BYTE_STREAM_SPLIT: {
-          auto decoder = MakeTypedDecoder<DType>(Encoding::BYTE_STREAM_SPLIT, descr_);
-          current_decoder_ = decoder.get();
-          decoders_[static_cast<int>(encoding)] = std::move(decoder);
-          break;
-        }
-        case Encoding::RLE: {
-          auto decoder = MakeTypedDecoder<DType>(Encoding::RLE, descr_);
+        case Encoding::PLAIN:
+        case Encoding::BYTE_STREAM_SPLIT:
+        case Encoding::RLE:
+        case Encoding::DELTA_BINARY_PACKED:
+        case Encoding::DELTA_BYTE_ARRAY:
+        case Encoding::DELTA_LENGTH_BYTE_ARRAY: {
+          auto decoder = MakeTypedDecoder<DType>(encoding, descr_, pool_);
           current_decoder_ = decoder.get();
           decoders_[static_cast<int>(encoding)] = std::move(decoder);
           break;
         }
+
         case Encoding::RLE_DICTIONARY:
           throw ParquetException("Dictionary page must be before data page.");
 
-        case Encoding::DELTA_BINARY_PACKED: {
-          auto decoder = MakeTypedDecoder<DType>(Encoding::DELTA_BINARY_PACKED, descr_);
-          current_decoder_ = decoder.get();
-          decoders_[static_cast<int>(encoding)] = std::move(decoder);
-          break;
-        }
-        case Encoding::DELTA_BYTE_ARRAY: {
-          auto decoder = MakeTypedDecoder<DType>(Encoding::DELTA_BYTE_ARRAY, descr_);
-          current_decoder_ = decoder.get();
-          decoders_[static_cast<int>(encoding)] = std::move(decoder);
-          break;
-        }
-        case Encoding::DELTA_LENGTH_BYTE_ARRAY: {
-          auto decoder =
-              MakeTypedDecoder<DType>(Encoding::DELTA_LENGTH_BYTE_ARRAY, descr_);
-          current_decoder_ = decoder.get();
-          decoders_[static_cast<int>(encoding)] = std::move(decoder);
-          break;
-        }
-
         default:
           throw ParquetException("Unknown encoding type.");
       }
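
Besides consolidating the duplicated switch cases, the point of the change is that the caller's memory pool reaches every decoder the reader constructs. A generic sketch of that plumbing pattern with hypothetical names (this is not the Parquet decoder API):

    #include <memory>

    // Hypothetical stand-ins for the real pool, encoding, and decoder types.
    struct MemoryPool {};
    enum class Encoding { kPlain, kRle, kDeltaBinaryPacked, kDictionary };
    struct Decoder {
      Encoding encoding;
      MemoryPool* pool;  // all allocations are charged to this pool
    };

    std::unique_ptr<Decoder> MakeDecoder(Encoding encoding, MemoryPool* pool) {
      return std::unique_ptr<Decoder>(new Decoder{encoding, pool});
    }

    std::unique_ptr<Decoder> SelectDecoder(Encoding encoding, MemoryPool* pool) {
      switch (encoding) {
        // Cases that construct identically share one body, and every one of them
        // forwards the caller's pool instead of an implicit default.
        case Encoding::kPlain:
        case Encoding::kRle:
        case Encoding::kDeltaBinaryPacked:
          return MakeDecoder(encoding, pool);
        default:
          return nullptr;  // e.g. dictionary pages are handled elsewhere
      }
    }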


(arrow) 03/30: GH-39504: [Docs] Update footer in main sphinx docs with correct attribution (#39505)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 31ecb4ccb476773aebd5cebbd1e4e012a9ac6878
Author: Joris Van den Bossche <jo...@gmail.com>
AuthorDate: Mon Jan 15 15:02:45 2024 +0100

    GH-39504: [Docs] Update footer in main sphinx docs with correct attribution (#39505)
    
    
    * Closes: #39504
    
    Lead-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Alenka Frim <Al...@users.noreply.github.com>
    Signed-off-by: AlenkaF <fr...@gmail.com>
---
 docs/source/conf.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/source/conf.py b/docs/source/conf.py
index cde0c2b31f..5af7b7955f 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -201,7 +201,12 @@ master_doc = 'index'
 
 # General information about the project.
 project = u'Apache Arrow'
-copyright = f'2016-{datetime.datetime.now().year} Apache Software Foundation'
+copyright = (
+    f"2016-{datetime.datetime.now().year} Apache Software Foundation.\n"
+    "Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow "
+    "project logo are either registered trademarks or trademarks of The Apache "
+    "Software Foundation in the United States and other countries"
+)
 author = u'Apache Software Foundation'
 
 # The version info for the project you're documenting, acts as replacement for


(arrow) 21/30: GH-39880: [Python][CI] Pin moto<5 for dask integration tests (#39881)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 4f7819a743b0c1ff695ecfdbf75d8986bb6d35e1
Author: Joris Van den Bossche <jo...@gmail.com>
AuthorDate: Thu Feb 1 14:54:14 2024 +0100

    GH-39880: [Python][CI] Pin moto<5 for dask integration tests (#39881)
    
    See the upstream pin being added (https://github.com/dask/dask/pull/10868 / https://github.com/dask/dask/issues/10869); we are seeing the same failures
    * Closes: #39880
    
    Lead-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Raúl Cumplido <ra...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 ci/scripts/install_dask.sh | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ci/scripts/install_dask.sh b/ci/scripts/install_dask.sh
index 8d712a88a6..478c1d5997 100755
--- a/ci/scripts/install_dask.sh
+++ b/ci/scripts/install_dask.sh
@@ -35,4 +35,5 @@ else
 fi
 
 # additional dependencies needed for dask's s3 tests
-pip install moto[server] flask requests
+# Moto 5 results in timeouts in s3 tests: https://github.com/dask/dask/issues/10869
+pip install "moto[server]<5" flask requests


(arrow) 22/30: GH-39865: [C++] Strip extension metadata when importing a registered extension (#39866)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 91be098b56021b1f9569986b038bd46c3ed53701
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Mon Feb 5 17:15:44 2024 +0100

    GH-39865: [C++] Strip extension metadata when importing a registered extension (#39866)
    
    ### Rationale for this change
    
    When importing an extension type from the C Data Interface and the extension type is registered, we would still leave the extension-related metadata on the storage type.
    
    ### What changes are included in this PR?
    
    Strip extension-related metadata on the storage type if we succeed in recreating the extension type.
    This matches the behavior of the IPC layer and allows for more exact roundtripping.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    No, unless people mistakenly rely on the presence of said metadata.
    * Closes: #39865
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/c/bridge.cc                |  6 ++++
 cpp/src/arrow/c/bridge_test.cc           | 48 +++++++++++++++++++++-----------
 cpp/src/arrow/util/key_value_metadata.cc | 18 ++++++------
 cpp/src/arrow/util/key_value_metadata.h  | 11 ++++----
 4 files changed, 52 insertions(+), 31 deletions(-)

diff --git a/cpp/src/arrow/c/bridge.cc b/cpp/src/arrow/c/bridge.cc
index 238afb0328..4751f65632 100644
--- a/cpp/src/arrow/c/bridge.cc
+++ b/cpp/src/arrow/c/bridge.cc
@@ -914,6 +914,8 @@ struct DecodedMetadata {
   std::shared_ptr<KeyValueMetadata> metadata;
   std::string extension_name;
   std::string extension_serialized;
+  int extension_name_index = -1;        // index of extension_name in metadata
+  int extension_serialized_index = -1;  // index of extension_serialized in metadata
 };
 
 Result<DecodedMetadata> DecodeMetadata(const char* metadata) {
@@ -956,8 +958,10 @@ Result<DecodedMetadata> DecodeMetadata(const char* metadata) {
     RETURN_NOT_OK(read_string(&values[i]));
     if (keys[i] == kExtensionTypeKeyName) {
       decoded.extension_name = values[i];
+      decoded.extension_name_index = i;
     } else if (keys[i] == kExtensionMetadataKeyName) {
       decoded.extension_serialized = values[i];
+      decoded.extension_serialized_index = i;
     }
   }
   decoded.metadata = key_value_metadata(std::move(keys), std::move(values));
@@ -1046,6 +1050,8 @@ struct SchemaImporter {
         ARROW_ASSIGN_OR_RAISE(
             type_, registered_ext_type->Deserialize(std::move(type_),
                                                     metadata_.extension_serialized));
+        RETURN_NOT_OK(metadata_.metadata->DeleteMany(
+            {metadata_.extension_name_index, metadata_.extension_serialized_index}));
       }
     }
 
diff --git a/cpp/src/arrow/c/bridge_test.cc b/cpp/src/arrow/c/bridge_test.cc
index 58bbc9282c..5dcb38185f 100644
--- a/cpp/src/arrow/c/bridge_test.cc
+++ b/cpp/src/arrow/c/bridge_test.cc
@@ -1870,7 +1870,7 @@ class TestSchemaImport : public ::testing::Test, public SchemaStructBuilder {
     ASSERT_TRUE(ArrowSchemaIsReleased(&c_struct_));
     Reset();            // for further tests
     cb.AssertCalled();  // was released
-    AssertTypeEqual(*expected, *type);
+    AssertTypeEqual(*expected, *type, /*check_metadata=*/true);
   }
 
   void CheckImport(const std::shared_ptr<Field>& expected) {
@@ -1890,7 +1890,7 @@ class TestSchemaImport : public ::testing::Test, public SchemaStructBuilder {
     ASSERT_TRUE(ArrowSchemaIsReleased(&c_struct_));
     Reset();            // for further tests
     cb.AssertCalled();  // was released
-    AssertSchemaEqual(*expected, *schema);
+    AssertSchemaEqual(*expected, *schema, /*check_metadata=*/true);
   }
 
   void CheckImportError() {
@@ -3569,7 +3569,7 @@ class TestSchemaRoundtrip : public ::testing::Test {
     // Recreate the type
     ASSERT_OK_AND_ASSIGN(actual, ImportType(&c_schema));
     type = factory_expected();
-    AssertTypeEqual(*type, *actual);
+    AssertTypeEqual(*type, *actual, /*check_metadata=*/true);
     type.reset();
     actual.reset();
 
@@ -3600,7 +3600,7 @@ class TestSchemaRoundtrip : public ::testing::Test {
     // Recreate the schema
     ASSERT_OK_AND_ASSIGN(actual, ImportSchema(&c_schema));
     schema = factory();
-    AssertSchemaEqual(*schema, *actual);
+    AssertSchemaEqual(*schema, *actual, /*check_metadata=*/true);
     schema.reset();
     actual.reset();
 
@@ -3693,13 +3693,27 @@ TEST_F(TestSchemaRoundtrip, Dictionary) {
   }
 }
 
+// Given an extension type, return a field of its storage type + the
+// serialized extension metadata.
+std::shared_ptr<Field> GetStorageWithMetadata(const std::string& field_name,
+                                              const std::shared_ptr<DataType>& type) {
+  const auto& ext_type = checked_cast<const ExtensionType&>(*type);
+  auto storage_type = ext_type.storage_type();
+  auto md = KeyValueMetadata::Make({kExtensionTypeKeyName, kExtensionMetadataKeyName},
+                                   {ext_type.extension_name(), ext_type.Serialize()});
+  return field(field_name, storage_type, /*nullable=*/true, md);
+}
+
 TEST_F(TestSchemaRoundtrip, UnregisteredExtension) {
   TestWithTypeFactory(uuid, []() { return fixed_size_binary(16); });
   TestWithTypeFactory(dict_extension_type, []() { return dictionary(int8(), utf8()); });
 
-  // Inside nested type
-  TestWithTypeFactory([]() { return list(dict_extension_type()); },
-                      []() { return list(dictionary(int8(), utf8())); });
+  // Inside nested type.
+  // When an extension type is not known by the importer, it is imported
+  // as its storage type and the extension metadata is preserved on the field.
+  TestWithTypeFactory(
+      []() { return list(dict_extension_type()); },
+      []() { return list(GetStorageWithMetadata("item", dict_extension_type())); });
 }
 
 TEST_F(TestSchemaRoundtrip, RegisteredExtension) {
@@ -3708,7 +3722,9 @@ TEST_F(TestSchemaRoundtrip, RegisteredExtension) {
   TestWithTypeFactory(dict_extension_type);
   TestWithTypeFactory(complex128);
 
-  // Inside nested type
+  // Inside nested type.
+  // When the extension type is registered, the extension metadata is removed
+  // from the storage type's field to ensure roundtripping (GH-39865).
   TestWithTypeFactory([]() { return list(uuid()); });
   TestWithTypeFactory([]() { return list(dict_extension_type()); });
   TestWithTypeFactory([]() { return list(complex128()); });
@@ -3808,7 +3824,7 @@ class TestArrayRoundtrip : public ::testing::Test {
     {
       std::shared_ptr<Array> expected;
       ASSERT_OK_AND_ASSIGN(expected, ToResult(factory_expected()));
-      AssertTypeEqual(*expected->type(), *array->type());
+      AssertTypeEqual(*expected->type(), *array->type(), /*check_metadata=*/true);
       AssertArraysEqual(*expected, *array, true);
     }
     array.reset();
@@ -3848,7 +3864,7 @@ class TestArrayRoundtrip : public ::testing::Test {
     {
       std::shared_ptr<RecordBatch> expected;
       ASSERT_OK_AND_ASSIGN(expected, ToResult(factory()));
-      AssertSchemaEqual(*expected->schema(), *batch->schema());
+      AssertSchemaEqual(*expected->schema(), *batch->schema(), /*check_metadata=*/true);
       AssertBatchesEqual(*expected, *batch);
     }
     batch.reset();
@@ -4228,7 +4244,7 @@ class TestDeviceArrayRoundtrip : public ::testing::Test {
     {
       std::shared_ptr<Array> expected;
       ASSERT_OK_AND_ASSIGN(expected, ToResult(factory_expected()));
-      AssertTypeEqual(*expected->type(), *array->type());
+      AssertTypeEqual(*expected->type(), *array->type(), /*check_metadata=*/true);
       AssertArraysEqual(*expected, *array, true);
     }
     array.reset();
@@ -4274,7 +4290,7 @@ class TestDeviceArrayRoundtrip : public ::testing::Test {
     {
       std::shared_ptr<RecordBatch> expected;
       ASSERT_OK_AND_ASSIGN(expected, ToResult(factory()));
-      AssertSchemaEqual(*expected->schema(), *batch->schema());
+      AssertSchemaEqual(*expected->schema(), *batch->schema(), /*check_metadata=*/true);
       AssertBatchesEqual(*expected, *batch);
     }
     batch.reset();
@@ -4351,7 +4367,7 @@ class TestArrayStreamExport : public BaseArrayStreamTest {
     SchemaExportGuard schema_guard(&c_schema);
     ASSERT_FALSE(ArrowSchemaIsReleased(&c_schema));
     ASSERT_OK_AND_ASSIGN(auto schema, ImportSchema(&c_schema));
-    AssertSchemaEqual(expected, *schema);
+    AssertSchemaEqual(expected, *schema, /*check_metadata=*/true);
   }
 
   void AssertStreamEnd(struct ArrowArrayStream* c_stream) {
@@ -4435,7 +4451,7 @@ TEST_F(TestArrayStreamExport, ArrayLifetime) {
   {
     SchemaExportGuard schema_guard(&c_schema);
     ASSERT_OK_AND_ASSIGN(auto got_schema, ImportSchema(&c_schema));
-    AssertSchemaEqual(*schema, *got_schema);
+    AssertSchemaEqual(*schema, *got_schema, /*check_metadata=*/true);
   }
 
   ASSERT_GT(pool_->bytes_allocated(), orig_allocated_);
@@ -4460,7 +4476,7 @@ TEST_F(TestArrayStreamExport, Errors) {
   {
     SchemaExportGuard schema_guard(&c_schema);
     ASSERT_OK_AND_ASSIGN(auto schema, ImportSchema(&c_schema));
-    AssertSchemaEqual(schema, arrow::schema({}));
+    AssertSchemaEqual(schema, arrow::schema({}), /*check_metadata=*/true);
   }
 
   struct ArrowArray c_array;
@@ -4537,7 +4553,7 @@ TEST_F(TestArrayStreamRoundtrip, Simple) {
   ASSERT_OK_AND_ASSIGN(auto reader, RecordBatchReader::Make(batches, orig_schema));
 
   Roundtrip(std::move(reader), [&](const std::shared_ptr<RecordBatchReader>& reader) {
-    AssertSchemaEqual(*orig_schema, *reader->schema());
+    AssertSchemaEqual(*orig_schema, *reader->schema(), /*check_metadata=*/true);
     AssertReaderNext(reader, *batches[0]);
     AssertReaderNext(reader, *batches[1]);
     AssertReaderEnd(reader);
diff --git a/cpp/src/arrow/util/key_value_metadata.cc b/cpp/src/arrow/util/key_value_metadata.cc
index bc48ae76c2..002e8b0975 100644
--- a/cpp/src/arrow/util/key_value_metadata.cc
+++ b/cpp/src/arrow/util/key_value_metadata.cc
@@ -90,7 +90,7 @@ void KeyValueMetadata::Append(std::string key, std::string value) {
   values_.push_back(std::move(value));
 }
 
-Result<std::string> KeyValueMetadata::Get(const std::string& key) const {
+Result<std::string> KeyValueMetadata::Get(std::string_view key) const {
   auto index = FindKey(key);
   if (index < 0) {
     return Status::KeyError(key);
@@ -129,7 +129,7 @@ Status KeyValueMetadata::DeleteMany(std::vector<int64_t> indices) {
   return Status::OK();
 }
 
-Status KeyValueMetadata::Delete(const std::string& key) {
+Status KeyValueMetadata::Delete(std::string_view key) {
   auto index = FindKey(key);
   if (index < 0) {
     return Status::KeyError(key);
@@ -138,20 +138,18 @@ Status KeyValueMetadata::Delete(const std::string& key) {
   }
 }
 
-Status KeyValueMetadata::Set(const std::string& key, const std::string& value) {
+Status KeyValueMetadata::Set(std::string key, std::string value) {
   auto index = FindKey(key);
   if (index < 0) {
-    Append(key, value);
+    Append(std::move(key), std::move(value));
   } else {
-    keys_[index] = key;
-    values_[index] = value;
+    keys_[index] = std::move(key);
+    values_[index] = std::move(value);
   }
   return Status::OK();
 }
 
-bool KeyValueMetadata::Contains(const std::string& key) const {
-  return FindKey(key) >= 0;
-}
+bool KeyValueMetadata::Contains(std::string_view key) const { return FindKey(key) >= 0; }
 
 void KeyValueMetadata::reserve(int64_t n) {
   DCHECK_GE(n, 0);
@@ -188,7 +186,7 @@ std::vector<std::pair<std::string, std::string>> KeyValueMetadata::sorted_pairs(
   return pairs;
 }
 
-int KeyValueMetadata::FindKey(const std::string& key) const {
+int KeyValueMetadata::FindKey(std::string_view key) const {
   for (size_t i = 0; i < keys_.size(); ++i) {
     if (keys_[i] == key) {
       return static_cast<int>(i);
diff --git a/cpp/src/arrow/util/key_value_metadata.h b/cpp/src/arrow/util/key_value_metadata.h
index 8702ce73a6..57ade11e75 100644
--- a/cpp/src/arrow/util/key_value_metadata.h
+++ b/cpp/src/arrow/util/key_value_metadata.h
@@ -20,6 +20,7 @@
 #include <cstdint>
 #include <memory>
 #include <string>
+#include <string_view>
 #include <unordered_map>
 #include <utility>
 #include <vector>
@@ -44,13 +45,13 @@ class ARROW_EXPORT KeyValueMetadata {
   void ToUnorderedMap(std::unordered_map<std::string, std::string>* out) const;
   void Append(std::string key, std::string value);
 
-  Result<std::string> Get(const std::string& key) const;
-  bool Contains(const std::string& key) const;
+  Result<std::string> Get(std::string_view key) const;
+  bool Contains(std::string_view key) const;
   // Note that deleting may invalidate known indices
-  Status Delete(const std::string& key);
+  Status Delete(std::string_view key);
   Status Delete(int64_t index);
   Status DeleteMany(std::vector<int64_t> indices);
-  Status Set(const std::string& key, const std::string& value);
+  Status Set(std::string key, std::string value);
 
   void reserve(int64_t n);
 
@@ -63,7 +64,7 @@ class ARROW_EXPORT KeyValueMetadata {
   std::vector<std::pair<std::string, std::string>> sorted_pairs() const;
 
   /// \brief Perform linear search for key, returning -1 if not found
-  int FindKey(const std::string& key) const;
+  int FindKey(std::string_view key) const;
 
   std::shared_ptr<KeyValueMetadata> Copy() const;
 


(arrow) 07/30: GH-39583: [C++] Fix the issue of ExecBatchBuilder when appending consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (#39585)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ef4b4de21e748da0007e42ba5d72d49f78fbcdcf
Author: Rossi(Ruoxi) Sun <za...@gmail.com>
AuthorDate: Thu Jan 18 00:26:48 2024 +0800

    GH-39583: [C++] Fix the issue of ExecBatchBuilder when appending consecutive tail rows with the same id may exceed buffer boundary (for fixed size types) (#39585)
    
    
    
    ### Rationale for this change
    
    #39583 is a follow-up to #32570 (fixed by #39234). The earlier fix only covered variable-length types; it turns out fixed-size types have the same issue.
    
    ### What changes are included in this PR?
    
    Apply the same fix as #39234 to fixed-size types.
    
    ### Are these changes tested?
    
    Unit tests are included.
    
    ### Are there any user-facing changes?
    
    * Closes: #39583
    
    Authored-by: zanmato1984 <za...@gmail.com>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/compute/light_array.cc      | 21 ++++-----
 cpp/src/arrow/compute/light_array_test.cc | 75 ++++++++++++++++++++++++++++---
 2 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/cpp/src/arrow/compute/light_array.cc b/cpp/src/arrow/compute/light_array.cc
index 73ea01a03a..66d8477b02 100644
--- a/cpp/src/arrow/compute/light_array.cc
+++ b/cpp/src/arrow/compute/light_array.cc
@@ -383,27 +383,22 @@ int ExecBatchBuilder::NumRowsToSkip(const std::shared_ptr<ArrayData>& column,
 
   KeyColumnMetadata column_metadata =
       ColumnMetadataFromDataType(column->type).ValueOrDie();
+  ARROW_DCHECK(!column_metadata.is_fixed_length || column_metadata.fixed_length > 0);
 
   int num_rows_left = num_rows;
   int num_bytes_skipped = 0;
   while (num_rows_left > 0 && num_bytes_skipped < num_tail_bytes_to_skip) {
+    --num_rows_left;
+    int row_id_removed = row_ids[num_rows_left];
     if (column_metadata.is_fixed_length) {
-      if (column_metadata.fixed_length == 0) {
-        num_rows_left = std::max(num_rows_left, 8) - 8;
-        ++num_bytes_skipped;
-      } else {
-        --num_rows_left;
-        num_bytes_skipped += column_metadata.fixed_length;
-      }
+      num_bytes_skipped += column_metadata.fixed_length;
     } else {
-      --num_rows_left;
-      int row_id_removed = row_ids[num_rows_left];
       const int32_t* offsets = column->GetValues<int32_t>(1);
       num_bytes_skipped += offsets[row_id_removed + 1] - offsets[row_id_removed];
-      // Skip consecutive rows with the same id
-      while (num_rows_left > 0 && row_id_removed == row_ids[num_rows_left - 1]) {
-        --num_rows_left;
-      }
+    }
+    // Skip consecutive rows with the same id
+    while (num_rows_left > 0 && row_id_removed == row_ids[num_rows_left - 1]) {
+      --num_rows_left;
     }
   }
 
diff --git a/cpp/src/arrow/compute/light_array_test.cc b/cpp/src/arrow/compute/light_array_test.cc
index 3ceba43604..d50e967551 100644
--- a/cpp/src/arrow/compute/light_array_test.cc
+++ b/cpp/src/arrow/compute/light_array_test.cc
@@ -474,15 +474,18 @@ TEST(ExecBatchBuilder, AppendBatchesSomeRows) {
 TEST(ExecBatchBuilder, AppendBatchDupRows) {
   std::unique_ptr<MemoryPool> owned_pool = MemoryPool::CreateDefault();
   MemoryPool* pool = owned_pool.get();
+
   // Case of cross-word copying for the last row, which may exceed the buffer boundary.
-  // This is a simplified case of GH-32570
+  //
   {
+    // This is a simplified case of GH-32570
     // 64-byte data fully occupying one minimal 64-byte aligned memory region.
-    ExecBatch batch_string = JSONToExecBatch({binary()}, R"([["123456789ABCDEF0"],
-      ["123456789ABCDEF0"],
-      ["123456789ABCDEF0"],
-      ["ABCDEF0"],
-      ["123456789"]])");  // 9-byte tail row, larger than a word.
+    ExecBatch batch_string = JSONToExecBatch({binary()}, R"([
+        ["123456789ABCDEF0"],
+        ["123456789ABCDEF0"],
+        ["123456789ABCDEF0"],
+        ["ABCDEF0"],
+        ["123456789"]])");  // 9-byte tail row, larger than a word.
     ASSERT_EQ(batch_string[0].array()->buffers[1]->capacity(), 64);
     ASSERT_EQ(batch_string[0].array()->buffers[2]->capacity(), 64);
     ExecBatchBuilder builder;
@@ -494,6 +497,66 @@ TEST(ExecBatchBuilder, AppendBatchDupRows) {
     ASSERT_EQ(batch_string_appended, built);
     ASSERT_NE(0, pool->bytes_allocated());
   }
+
+  {
+    // This is a simplified case of GH-39583, using fsb(3) type.
+    // 63-byte data occupying almost one minimal 64-byte aligned memory region.
+    ExecBatch batch_fsb = JSONToExecBatch({fixed_size_binary(3)}, R"([
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["000"],
+        ["123"]])");  // 3-byte tail row, not aligned to a word.
+    ASSERT_EQ(batch_fsb[0].array()->buffers[1]->capacity(), 64);
+    ExecBatchBuilder builder;
+    uint16_t row_ids[4] = {20, 20, 20,
+                           20};  // Get the last row 4 times, 3 to skip a word.
+    ASSERT_OK(builder.AppendSelected(pool, batch_fsb, 4, row_ids, /*num_cols=*/1));
+    ExecBatch built = builder.Flush();
+    ExecBatch batch_fsb_appended = JSONToExecBatch(
+        {fixed_size_binary(3)}, R"([["123"], ["123"], ["123"], ["123"]])");
+    ASSERT_EQ(batch_fsb_appended, built);
+    ASSERT_NE(0, pool->bytes_allocated());
+  }
+
+  {
+    // This is a simplified case of GH-39583, using fsb(9) type.
+    // 63-byte data occupying almost one minimal 64-byte aligned memory region.
+    ExecBatch batch_fsb = JSONToExecBatch({fixed_size_binary(9)}, R"([
+        ["000000000"],
+        ["000000000"],
+        ["000000000"],
+        ["000000000"],
+        ["000000000"],
+        ["000000000"],
+        ["123456789"]])");  // 9-byte tail row, not aligned to a word.
+    ASSERT_EQ(batch_fsb[0].array()->buffers[1]->capacity(), 64);
+    ExecBatchBuilder builder;
+    uint16_t row_ids[2] = {6, 6};  // Get the last row 2 times, 1 to skip a word.
+    ASSERT_OK(builder.AppendSelected(pool, batch_fsb, 2, row_ids, /*num_cols=*/1));
+    ExecBatch built = builder.Flush();
+    ExecBatch batch_fsb_appended =
+        JSONToExecBatch({fixed_size_binary(9)}, R"([["123456789"], ["123456789"]])");
+    ASSERT_EQ(batch_fsb_appended, built);
+    ASSERT_NE(0, pool->bytes_allocated());
+  }
+
   ASSERT_EQ(0, pool->bytes_allocated());
 }
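
A standalone illustration of the hazard this commit fixes (my own sketch, not Arrow code): the row-encoding copy routines move whole 8-byte words for speed, so appending a short fixed-width tail element can touch bytes past the end of a minimally sized source buffer unless enough trailing rows are routed through a byte-exact path.

```cpp
#include <cstdio>

int main() {
  // Mirrors the fsb(3) test case above: 21 three-byte values occupy 63 of the
  // 64 bytes in a minimally aligned buffer, so the last value starts at 60.
  constexpr int kFixedWidth = 3;
  constexpr int kNumRows = 21;
  constexpr int kBufferCapacity = 64;
  constexpr int kLastRowOffset = (kNumRows - 1) * kFixedWidth;  // 60

  // A copy that always reads a full 8-byte word would touch bytes [60, 68),
  // i.e. 4 bytes beyond the 64-byte buffer.
  constexpr int kWordReadEnd = kLastRowOffset + 8;
  std::printf("word-wide read ends at byte %d of a %d-byte buffer\n",
              kWordReadEnd, kBufferCapacity);

  // NumRowsToSkip (patched above) therefore counts enough trailing rows --
  // including repeated selections of the same row id -- so that the tail
  // bytes are appended through the safe per-byte path instead.
  return 0;
}
```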
 


(arrow) 10/30: GH-39672: [Go] Time to Date32/Date64 conversion issues for non-UTC timezones (#39674)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 914e62dca7bb43c44a707b815a98579608d180fa
Author: Matt Topol <zo...@gmail.com>
AuthorDate: Thu Jan 18 15:30:38 2024 -0500

    GH-39672: [Go] Time to Date32/Date64 conversion issues for non-UTC timezones (#39674)
    
    
    
    A failing unit test during release verification revealed an issue with timestamp-to-date conversions for non-UTC timezones.
    
    Investigation showed that casting behavior (which normalizes a Timestamp before casting it to a Date) had been conflated with plain conversion. This change separates those concerns so that the exported methods correctly convert non-UTC times to dates without affecting casting behavior.
    
    ### Are these changes tested?
    yes
    
    ### Are there any user-facing changes?
    The methods `Date32FromTime` and `Date64FromTime` will properly handle timezones now.
    
    * Closes: #39672
    
    Authored-by: Matt Topol <zo...@gmail.com>
    Signed-off-by: Matt Topol <zo...@gmail.com>
---
 go/arrow/compute/internal/kernels/cast_temporal.go |  8 ++++++++
 go/arrow/datatype_fixedwidth.go                    | 10 ----------
 go/arrow/datatype_fixedwidth_test.go               | 10 ++++++++++
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/go/arrow/compute/internal/kernels/cast_temporal.go b/go/arrow/compute/internal/kernels/cast_temporal.go
index 542a8a4590..48e2bfb6ca 100644
--- a/go/arrow/compute/internal/kernels/cast_temporal.go
+++ b/go/arrow/compute/internal/kernels/cast_temporal.go
@@ -112,6 +112,10 @@ func TimestampToDate32(ctx *exec.KernelCtx, batch *exec.ExecSpan, out *exec.Exec
 
 	return ScalarUnaryNotNull(func(_ *exec.KernelCtx, arg0 arrow.Timestamp, _ *error) arrow.Date32 {
 		tm := fnToTime(arg0)
+		if _, offset := tm.Zone(); offset != 0 {
+			// normalize the tm
+			tm = tm.Add(time.Duration(offset) * time.Second).UTC()
+		}
 		return arrow.Date32FromTime(tm)
 	})(ctx, batch, out)
 }
@@ -125,6 +129,10 @@ func TimestampToDate64(ctx *exec.KernelCtx, batch *exec.ExecSpan, out *exec.Exec
 
 	return ScalarUnaryNotNull(func(_ *exec.KernelCtx, arg0 arrow.Timestamp, _ *error) arrow.Date64 {
 		tm := fnToTime(arg0)
+		if _, offset := tm.Zone(); offset != 0 {
+			// normalize the tm
+			tm = tm.Add(time.Duration(offset) * time.Second).UTC()
+		}
 		return arrow.Date64FromTime(tm)
 	})(ctx, batch, out)
 }
diff --git a/go/arrow/datatype_fixedwidth.go b/go/arrow/datatype_fixedwidth.go
index 1a3074e59e..6a7071422f 100644
--- a/go/arrow/datatype_fixedwidth.go
+++ b/go/arrow/datatype_fixedwidth.go
@@ -70,11 +70,6 @@ type (
 
 // Date32FromTime returns a Date32 value from a time object
 func Date32FromTime(t time.Time) Date32 {
-	if _, offset := t.Zone(); offset != 0 {
-		// properly account for timezone adjustments before we calculate
-		// the number of days by adjusting the time and converting to UTC
-		t = t.Add(time.Duration(offset) * time.Second).UTC()
-	}
 	return Date32(t.Truncate(24*time.Hour).Unix() / int64((time.Hour * 24).Seconds()))
 }
 
@@ -88,11 +83,6 @@ func (d Date32) FormattedString() string {
 
 // Date64FromTime returns a Date64 value from a time object
 func Date64FromTime(t time.Time) Date64 {
-	if _, offset := t.Zone(); offset != 0 {
-		// properly account for timezone adjustments before we calculate
-		// the actual value by adjusting the time and converting to UTC
-		t = t.Add(time.Duration(offset) * time.Second).UTC()
-	}
 	// truncate to the start of the day to get the correct value
 	t = t.Truncate(24 * time.Hour)
 	return Date64(t.Unix()*1e3 + int64(t.Nanosecond())/1e6)
diff --git a/go/arrow/datatype_fixedwidth_test.go b/go/arrow/datatype_fixedwidth_test.go
index b3cbb465f3..d6caa21e1a 100644
--- a/go/arrow/datatype_fixedwidth_test.go
+++ b/go/arrow/datatype_fixedwidth_test.go
@@ -428,3 +428,13 @@ func TestMonthIntervalType(t *testing.T) {
 		t.Fatalf("invalid type stringer: got=%q, want=%q", got, want)
 	}
 }
+
+func TestDateFromTime(t *testing.T) {
+	loc, _ := time.LoadLocation("Asia/Hong_Kong")
+	tm := time.Date(2024, time.January, 18, 3, 0, 0, 0, loc)
+
+	wantD32 := time.Date(2024, time.January, 17, 0, 0, 0, 0, time.UTC).Truncate(24*time.Hour).Unix() / int64((time.Hour * 24).Seconds())
+	wantD64 := time.Date(2024, time.January, 17, 0, 0, 0, 0, time.UTC).UnixMilli()
+	assert.EqualValues(t, wantD64, arrow.Date64FromTime(tm))
+	assert.EqualValues(t, wantD32, arrow.Date32FromTime(tm))
+}


(arrow) 13/30: GH-39732: [Python][CI] Fix test failures with latest/nightly pandas (#39760)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 5e8e20de3cdf697230651f729475d7795106ab2a
Author: Alenka Frim <Al...@users.noreply.github.com>
AuthorDate: Thu Jan 25 10:21:57 2024 +0100

    GH-39732: [Python][CI] Fix test failures with latest/nightly pandas (#39760)
    
    This PR rearranges the if-else blocks in the `table` function (`table.pxi`) so that the check for a pandas DataFrame comes before the checks for `__arrow_c_stream__` and `__arrow_c_array__`.
    * Closes: #39732
    
    Authored-by: AlenkaF <fr...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 python/pyarrow/table.pxi | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/python/pyarrow/table.pxi b/python/pyarrow/table.pxi
index d98c93e1c0..3c450d61a7 100644
--- a/python/pyarrow/table.pxi
+++ b/python/pyarrow/table.pxi
@@ -5202,7 +5202,17 @@ def table(data, names=None, schema=None, metadata=None, nthreads=None):
             raise ValueError(
                 "The 'names' argument is not valid when passing a dictionary")
         return Table.from_pydict(data, schema=schema, metadata=metadata)
+    elif _pandas_api.is_data_frame(data):
+        if names is not None or metadata is not None:
+            raise ValueError(
+                "The 'names' and 'metadata' arguments are not valid when "
+                "passing a pandas DataFrame")
+        return Table.from_pandas(data, schema=schema, nthreads=nthreads)
     elif hasattr(data, "__arrow_c_stream__"):
+        if names is not None or metadata is not None:
+            raise ValueError(
+                "The 'names' and 'metadata' arguments are not valid when "
+                "using Arrow PyCapsule Interface")
         if schema is not None:
             requested = schema.__arrow_c_schema__()
         else:
@@ -5216,14 +5226,12 @@ def table(data, names=None, schema=None, metadata=None, nthreads=None):
             table = table.cast(schema)
         return table
     elif hasattr(data, "__arrow_c_array__"):
-        batch = record_batch(data, schema)
-        return Table.from_batches([batch])
-    elif _pandas_api.is_data_frame(data):
         if names is not None or metadata is not None:
             raise ValueError(
                 "The 'names' and 'metadata' arguments are not valid when "
-                "passing a pandas DataFrame")
-        return Table.from_pandas(data, schema=schema, nthreads=nthreads)
+                "using Arrow PyCapsule Interface")
+        batch = record_batch(data, schema)
+        return Table.from_batches([batch])
     else:
         raise TypeError(
             "Expected pandas DataFrame, python dictionary or list of arrays")


(arrow) 18/30: GH-39876: [C++] Thirdparty: Bump zlib to 1.3.1 (#39877)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 26e957883cb83a08b584370dac03dc0d006f9731
Author: mwish <ma...@gmail.com>
AuthorDate: Thu Feb 1 21:14:47 2024 +0800

    GH-39876: [C++] Thirdparty: Bump zlib to 1.3.1 (#39877)
    
    zlib 1.3.1 is the latest release.
    
    Bump the zlib version in cpp/thirdparty/versions.txt to 1.3.1.
    
    The change is covered by existing tests.
    
    No user-facing changes.
    
    * Closes: #39876
    
    Authored-by: mwish <ma...@gmail.com>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 cpp/thirdparty/versions.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cpp/thirdparty/versions.txt b/cpp/thirdparty/versions.txt
index e9df0c8d75..dd3f5da84f 100644
--- a/cpp/thirdparty/versions.txt
+++ b/cpp/thirdparty/versions.txt
@@ -115,8 +115,8 @@ ARROW_UTF8PROC_BUILD_VERSION=v2.7.0
 ARROW_UTF8PROC_BUILD_SHA256_CHECKSUM=4bb121e297293c0fd55f08f83afab6d35d48f0af4ecc07523ad8ec99aa2b12a1
 ARROW_XSIMD_BUILD_VERSION=9.0.1
 ARROW_XSIMD_BUILD_SHA256_CHECKSUM=b1bb5f92167fd3a4f25749db0be7e61ed37e0a5d943490f3accdcd2cd2918cc0
-ARROW_ZLIB_BUILD_VERSION=1.2.13
-ARROW_ZLIB_BUILD_SHA256_CHECKSUM=b3a24de97a8fdbc835b9833169501030b8977031bcb54b3b3ac13740f846ab30
+ARROW_ZLIB_BUILD_VERSION=1.3.1
+ARROW_ZLIB_BUILD_SHA256_CHECKSUM=9a93b2b7dfdac77ceba5a558a580e74667dd6fede4585b91eefb60f03b72df23
 ARROW_ZSTD_BUILD_VERSION=1.5.5
 ARROW_ZSTD_BUILD_SHA256_CHECKSUM=9c4396cc829cfae319a6e2615202e82aad41372073482fce286fac78646d3ee4
 


(arrow) 23/30: GH-39860: [C++] Expression ExecuteScalarExpression execute empty args function with a wrong result (#39908)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit e19e1817b2ce0732ceddc1aa5c21c713116708d3
Author: ZhangHuiGui <10...@users.noreply.github.com>
AuthorDate: Tue Feb 6 06:11:36 2024 +0800

    GH-39860: [C++] Expression ExecuteScalarExpression execute empty args function with a wrong result (#39908)
    
    
    
    ### Rationale for this change
    
    Try to fix #39860.
    
    ### What changes are included in this PR?
    
    Handle the `call->arguments.size() == 0` case in ExecuteScalarExpression, which arises when calling functions
    that take no arguments (such as `random` or `hash_count`).
    
    ### Are these changes tested?
    
    Yes
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #39860
    
    Lead-authored-by: hugo.zhang <hu...@openpie.com>
    Co-authored-by: 张回归 <zh...@zhanghuiguideMacBook-Pro-1681.local>
    Signed-off-by: Benjamin Kietzman <be...@gmail.com>
---
 cpp/src/arrow/compute/expression.cc      | 13 +++++++++++--
 cpp/src/arrow/compute/expression_test.cc | 19 +++++++++++++++++++
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/cpp/src/arrow/compute/expression.cc b/cpp/src/arrow/compute/expression.cc
index b47e0a3552..8c59ad1df8 100644
--- a/cpp/src/arrow/compute/expression.cc
+++ b/cpp/src/arrow/compute/expression.cc
@@ -761,6 +761,15 @@ Result<Datum> ExecuteScalarExpression(const Expression& expr, const ExecBatch& i
     }
   }
 
+  int64_t input_length;
+  if (!arguments.empty() && all_scalar) {
+    // all inputs are scalar, so use a 1-long batch to avoid
+    // computing input.length equivalent outputs
+    input_length = 1;
+  } else {
+    input_length = input.length;
+  }
+
   auto executor = compute::detail::KernelExecutor::MakeScalar();
 
   compute::KernelContext kernel_context(exec_context, call->kernel);
@@ -772,8 +781,8 @@ Result<Datum> ExecuteScalarExpression(const Expression& expr, const ExecBatch& i
   RETURN_NOT_OK(executor->Init(&kernel_context, {kernel, types, options}));
 
   compute::detail::DatumAccumulator listener;
-  RETURN_NOT_OK(executor->Execute(
-      ExecBatch(std::move(arguments), all_scalar ? 1 : input.length), &listener));
+  RETURN_NOT_OK(
+      executor->Execute(ExecBatch(std::move(arguments), input_length), &listener));
   const auto out = executor->WrapResults(arguments, listener.values());
 #ifndef NDEBUG
   DCHECK_OK(executor->CheckResultType(out, call->function_name.c_str()));
diff --git a/cpp/src/arrow/compute/expression_test.cc b/cpp/src/arrow/compute/expression_test.cc
index 44159e7660..d33c348cd7 100644
--- a/cpp/src/arrow/compute/expression_test.cc
+++ b/cpp/src/arrow/compute/expression_test.cc
@@ -863,6 +863,25 @@ TEST(Expression, ExecuteCall) {
   ])"));
 }
 
+TEST(Expression, ExecuteCallWithNoArguments) {
+  const int kCount = 10;
+  auto random_options = RandomOptions::FromSeed(/*seed=*/0);
+  ExecBatch input({}, kCount);
+
+  Expression random_expr = call("random", {}, random_options);
+  ASSERT_OK_AND_ASSIGN(random_expr, random_expr.Bind(float64()));
+
+  ASSERT_OK_AND_ASSIGN(Datum actual, ExecuteScalarExpression(random_expr, input));
+  compute::ExecContext* exec_context = default_exec_context();
+  ASSERT_OK_AND_ASSIGN(auto function,
+                       exec_context->func_registry()->GetFunction("random"));
+  ASSERT_OK_AND_ASSIGN(Datum expected,
+                       function->Execute(input, &random_options, exec_context));
+  AssertDatumsEqual(actual, expected, /*verbose=*/true);
+
+  EXPECT_EQ(actual.length(), kCount);
+}
+
 TEST(Expression, ExecuteDictionaryTransparent) {
   ExpectExecute(
       equal(field_ref("a"), field_ref("b")),
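
To make the corrected behaviour concrete, here is the length selection the patch introduces, restated as a tiny standalone helper (an illustration only, not Arrow's API): a call with no arguments must keep the input batch length, and only a call whose arguments are all scalars may collapse to a 1-row batch.

```cpp
#include <cstddef>
#include <cstdint>

// Batch length handed to the scalar kernel executor after this fix.
int64_t KernelBatchLength(std::size_t num_args, bool all_scalar,
                          int64_t input_length) {
  return (num_args != 0 && all_scalar) ? 1 : input_length;
}

// KernelBatchLength(0, /*all_scalar=*/true, 10) == 10, so a nullary call such
// as "random" over a 10-row batch now yields 10 values, matching the
// ExecuteCallWithNoArguments test above. Previously the all-scalar shortcut
// collapsed such a call to a single row.
```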


(arrow) 20/30: GH-39849: [Python] Remove the use of pytest-lazy-fixture (#39850)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit aa172aa00776f20dc05d8d400d60f0d5ba3d076e
Author: Alenka Frim <Al...@users.noreply.github.com>
AuthorDate: Thu Feb 1 14:35:32 2024 +0100

    GH-39849: [Python] Remove the use of pytest-lazy-fixture (#39850)
    
    ### Rationale for this change
    
    Removing the use of `pytest-lazy-fixture` in our test suite as it is unmaintained.
    Changes in this PR include:
    
    - Remove the use of `pytest-lazy-fixture`
    - Remove marks from fixtures to avoid future error, see
       ```
       PytestRemovedIn9Warning: Marks applied to fixtures have no effect
         See docs: https://docs.pytest.org/en/stable/deprecations.html#applying-a-mark-to-a-fixture-function
       ```
    - Catch two different warnings in `def test_legacy_int_type()`
    
    ### Are these changes tested?
    
    The changes affect the tests so they must pass.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #39849
    
    Lead-authored-by: AlenkaF <fr...@gmail.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 ci/conda_env_python.txt                     |  3 +--
 dev/tasks/conda-recipes/arrow-cpp/meta.yaml |  1 -
 python/pyarrow/tests/conftest.py            |  7 +++---
 python/pyarrow/tests/test_dataset.py        |  3 ---
 python/pyarrow/tests/test_extension_type.py |  5 +----
 python/pyarrow/tests/test_fs.py             | 34 ++++++++++++++---------------
 python/pyarrow/tests/test_ipc.py            |  6 ++---
 python/requirements-test.txt                |  1 -
 python/requirements-wheel-test.txt          |  1 -
 9 files changed, 25 insertions(+), 36 deletions(-)

diff --git a/ci/conda_env_python.txt b/ci/conda_env_python.txt
index 5fdd21d2bd..59e2def1bf 100644
--- a/ci/conda_env_python.txt
+++ b/ci/conda_env_python.txt
@@ -23,9 +23,8 @@ cloudpickle
 fsspec
 hypothesis
 numpy>=1.16.6
-pytest<8  # pytest-lazy-fixture broken on pytest 8.0.0
+pytest<8
 pytest-faulthandler
-pytest-lazy-fixture
 s3fs>=2023.10.0
 setuptools
 setuptools_scm<8.0.0
diff --git a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
index b8ffbfdb71..367445c595 100644
--- a/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
+++ b/dev/tasks/conda-recipes/arrow-cpp/meta.yaml
@@ -340,7 +340,6 @@ outputs:
         # test_cpp_extension_in_python requires a compiler
         - {{ compiler("cxx") }}  # [linux]
         - pytest
-        - pytest-lazy-fixture
         - backports.zoneinfo     # [py<39]
         - boto3
         - cffi
diff --git a/python/pyarrow/tests/conftest.py b/python/pyarrow/tests/conftest.py
index a5941e8c8d..0da757a4bc 100644
--- a/python/pyarrow/tests/conftest.py
+++ b/python/pyarrow/tests/conftest.py
@@ -24,7 +24,6 @@ import time
 import urllib.request
 
 import pytest
-from pytest_lazyfixture import lazy_fixture
 import hypothesis as h
 from ..conftest import groups, defaults
 
@@ -259,13 +258,13 @@ def gcs_server():
 
 @pytest.fixture(
     params=[
-        lazy_fixture('builtin_pickle'),
-        lazy_fixture('cloudpickle')
+        'builtin_pickle',
+        'cloudpickle'
     ],
     scope='session'
 )
 def pickle_module(request):
-    return request.param
+    return request.getfixturevalue(request.param)
 
 
 @pytest.fixture(scope='session')
diff --git a/python/pyarrow/tests/test_dataset.py b/python/pyarrow/tests/test_dataset.py
index ae2146c0bd..04732fefbd 100644
--- a/python/pyarrow/tests/test_dataset.py
+++ b/python/pyarrow/tests/test_dataset.py
@@ -100,7 +100,6 @@ def assert_dataset_fragment_convenience_methods(dataset):
 
 
 @pytest.fixture
-@pytest.mark.parquet
 def mockfs():
     mockfs = fs._MockFileSystem()
 
@@ -219,7 +218,6 @@ def multisourcefs(request):
 
 
 @pytest.fixture
-@pytest.mark.parquet
 def dataset(mockfs):
     format = ds.ParquetFileFormat()
     selector = fs.FileSelector('subdir', recursive=True)
@@ -2679,7 +2677,6 @@ def test_dataset_partitioned_dictionary_type_reconstruct(tempdir, pickle_module)
 
 
 @pytest.fixture
-@pytest.mark.parquet
 def s3_example_simple(s3_server):
     from pyarrow.fs import FileSystem
 
diff --git a/python/pyarrow/tests/test_extension_type.py b/python/pyarrow/tests/test_extension_type.py
index a88e20eefe..d8c792ef00 100644
--- a/python/pyarrow/tests/test_extension_type.py
+++ b/python/pyarrow/tests/test_extension_type.py
@@ -1485,10 +1485,7 @@ def test_legacy_int_type():
     batch = pa.RecordBatch.from_arrays([ext_arr], names=['ext'])
     buf = ipc_write_batch(batch)
 
-    with pytest.warns(
-            RuntimeWarning,
-            match="pickle-based deserialization of pyarrow.PyExtensionType "
-                  "subclasses is disabled by default"):
+    with pytest.warns((RuntimeWarning, FutureWarning)):
         batch = ipc_read_batch(buf)
         assert isinstance(batch.column(0).type, pa.UnknownExtensionType)
 
diff --git a/python/pyarrow/tests/test_fs.py b/python/pyarrow/tests/test_fs.py
index d0fa253e31..ab10addfc3 100644
--- a/python/pyarrow/tests/test_fs.py
+++ b/python/pyarrow/tests/test_fs.py
@@ -362,79 +362,79 @@ def py_fsspec_s3fs(request, s3_server):
 
 @pytest.fixture(params=[
     pytest.param(
-        pytest.lazy_fixture('localfs'),
+        'localfs',
         id='LocalFileSystem()'
     ),
     pytest.param(
-        pytest.lazy_fixture('localfs_with_mmap'),
+        'localfs_with_mmap',
         id='LocalFileSystem(use_mmap=True)'
     ),
     pytest.param(
-        pytest.lazy_fixture('subtree_localfs'),
+        'subtree_localfs',
         id='SubTreeFileSystem(LocalFileSystem())'
     ),
     pytest.param(
-        pytest.lazy_fixture('s3fs'),
+        's3fs',
         id='S3FileSystem',
         marks=pytest.mark.s3
     ),
     pytest.param(
-        pytest.lazy_fixture('gcsfs'),
+        'gcsfs',
         id='GcsFileSystem',
         marks=pytest.mark.gcs
     ),
     pytest.param(
-        pytest.lazy_fixture('hdfs'),
+        'hdfs',
         id='HadoopFileSystem',
         marks=pytest.mark.hdfs
     ),
     pytest.param(
-        pytest.lazy_fixture('mockfs'),
+        'mockfs',
         id='_MockFileSystem()'
     ),
     pytest.param(
-        pytest.lazy_fixture('py_localfs'),
+        'py_localfs',
         id='PyFileSystem(ProxyHandler(LocalFileSystem()))'
     ),
     pytest.param(
-        pytest.lazy_fixture('py_mockfs'),
+        'py_mockfs',
         id='PyFileSystem(ProxyHandler(_MockFileSystem()))'
     ),
     pytest.param(
-        pytest.lazy_fixture('py_fsspec_localfs'),
+        'py_fsspec_localfs',
         id='PyFileSystem(FSSpecHandler(fsspec.LocalFileSystem()))'
     ),
     pytest.param(
-        pytest.lazy_fixture('py_fsspec_memoryfs'),
+        'py_fsspec_memoryfs',
         id='PyFileSystem(FSSpecHandler(fsspec.filesystem("memory")))'
     ),
     pytest.param(
-        pytest.lazy_fixture('py_fsspec_s3fs'),
+        'py_fsspec_s3fs',
         id='PyFileSystem(FSSpecHandler(s3fs.S3FileSystem()))',
         marks=pytest.mark.s3
     ),
 ])
 def filesystem_config(request):
-    return request.param
+    return request.getfixturevalue(request.param)
 
 
 @pytest.fixture
-def fs(request, filesystem_config):
+def fs(filesystem_config):
     return filesystem_config['fs']
 
 
 @pytest.fixture
-def pathfn(request, filesystem_config):
+def pathfn(filesystem_config):
     return filesystem_config['pathfn']
 
 
 @pytest.fixture
-def allow_move_dir(request, filesystem_config):
+def allow_move_dir(filesystem_config):
     return filesystem_config['allow_move_dir']
 
 
 @pytest.fixture
-def allow_append_to_file(request, filesystem_config):
+def allow_append_to_file(filesystem_config):
     return filesystem_config['allow_append_to_file']
 
 
diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py
index f75ec8158a..407011d90b 100644
--- a/python/pyarrow/tests/test_ipc.py
+++ b/python/pyarrow/tests/test_ipc.py
@@ -142,16 +142,16 @@ def stream_fixture():
 
 @pytest.fixture(params=[
     pytest.param(
-        pytest.lazy_fixture('file_fixture'),
+        'file_fixture',
         id='File Format'
     ),
     pytest.param(
-        pytest.lazy_fixture('stream_fixture'),
+        'stream_fixture',
         id='Stream Format'
     )
 ])
 def format_fixture(request):
-    return request.param
+    return request.getfixturevalue(request.param)
 
 
 def test_empty_file():
diff --git a/python/requirements-test.txt b/python/requirements-test.txt
index b3ba5d852b..2108d70a54 100644
--- a/python/requirements-test.txt
+++ b/python/requirements-test.txt
@@ -2,5 +2,4 @@ cffi
 hypothesis
 pandas
 pytest<8
-pytest-lazy-fixture
 pytz
diff --git a/python/requirements-wheel-test.txt b/python/requirements-wheel-test.txt
index c74a8ca690..a1046bc18c 100644
--- a/python/requirements-wheel-test.txt
+++ b/python/requirements-wheel-test.txt
@@ -2,7 +2,6 @@ cffi
 cython
 hypothesis
 pytest<8
-pytest-lazy-fixture
 pytz
 tzdata; sys_platform == 'win32'
 


(arrow) 11/30: GH-39690: [C++][FlightRPC] Fix nullptr dereference in PollInfo (#39711)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 87eae3d4ebab43304e3ba24bbdb386db1ec8cbc6
Author: David Li <li...@gmail.com>
AuthorDate: Mon Jan 22 09:38:57 2024 -0500

    GH-39690: [C++][FlightRPC] Fix nullptr dereference in PollInfo (#39711)
    
    
    
    ### Rationale for this change
    
    The current implementation is a bit painful to use due to the lack of a move constructor.
    
    ### What changes are included in this PR?
    
    - Fix a crash in PollInfo with a nullptr FlightInfo.
    - Declare all necessary constructors (https://en.cppreference.com/w/cpp/language/rule_of_three)
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    Yes, this adds new copy constructors.
    
    * Closes: #39673.
    * Closes: #39690
    
    Authored-by: David Li <li...@gmail.com>
    Signed-off-by: David Li <li...@gmail.com>
---
 cpp/cmake_modules/FindClangTools.cmake         |  3 ++-
 cpp/src/arrow/flight/flight_internals_test.cc  |  2 ++
 cpp/src/arrow/flight/serialization_internal.cc | 10 +++++++---
 cpp/src/arrow/flight/types.cc                  |  7 ++++++-
 cpp/src/arrow/flight/types.h                   | 13 ++++++++++++-
 5 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/cpp/cmake_modules/FindClangTools.cmake b/cpp/cmake_modules/FindClangTools.cmake
index 90df60bf54..1364ccbed8 100644
--- a/cpp/cmake_modules/FindClangTools.cmake
+++ b/cpp/cmake_modules/FindClangTools.cmake
@@ -40,7 +40,8 @@ set(CLANG_TOOLS_SEARCH_PATHS
     /usr/local/bin
     /usr/bin
     "C:/Program Files/LLVM/bin" # Windows, non-conda
-    "$ENV{CONDA_PREFIX}/Library/bin") # Windows, conda
+    "$ENV{CONDA_PREFIX}/Library/bin" # Windows, conda
+    "$ENV{CONDA_PREFIX}/bin") # Unix, conda
 if(APPLE)
   find_program(BREW brew)
   if(BREW)
diff --git a/cpp/src/arrow/flight/flight_internals_test.cc b/cpp/src/arrow/flight/flight_internals_test.cc
index 522973bec7..a1c5250ba6 100644
--- a/cpp/src/arrow/flight/flight_internals_test.cc
+++ b/cpp/src/arrow/flight/flight_internals_test.cc
@@ -282,6 +282,7 @@ TEST(FlightTypes, PollInfo) {
                std::nullopt},
       PollInfo{std::make_unique<FlightInfo>(info), FlightDescriptor::Command("poll"), 0.1,
                expiration_time},
+      PollInfo{},
   };
   std::vector<std::string> reprs = {
       "<PollInfo info=" + info.ToString() +
@@ -290,6 +291,7 @@ TEST(FlightTypes, PollInfo) {
       "<PollInfo info=" + info.ToString() +
           " descriptor=<FlightDescriptor cmd='poll'> "
           "progress=0.1 expiration_time=2023-06-19 03:14:06.004339000>",
+      "<PollInfo info=null descriptor=null progress=null expiration_time=null>",
   };
 
   ASSERT_NO_FATAL_FAILURE(TestRoundtrip<pb::PollInfo>(values, reprs));
diff --git a/cpp/src/arrow/flight/serialization_internal.cc b/cpp/src/arrow/flight/serialization_internal.cc
index 64a40564af..e5a7503a63 100644
--- a/cpp/src/arrow/flight/serialization_internal.cc
+++ b/cpp/src/arrow/flight/serialization_internal.cc
@@ -306,8 +306,10 @@ Status ToProto(const FlightInfo& info, pb::FlightInfo* pb_info) {
 // PollInfo
 
 Status FromProto(const pb::PollInfo& pb_info, PollInfo* info) {
-  ARROW_ASSIGN_OR_RAISE(auto flight_info, FromProto(pb_info.info()));
-  info->info = std::make_unique<FlightInfo>(std::move(flight_info));
+  if (pb_info.has_info()) {
+    ARROW_ASSIGN_OR_RAISE(auto flight_info, FromProto(pb_info.info()));
+    info->info = std::make_unique<FlightInfo>(std::move(flight_info));
+  }
   if (pb_info.has_flight_descriptor()) {
     FlightDescriptor descriptor;
     RETURN_NOT_OK(FromProto(pb_info.flight_descriptor(), &descriptor));
@@ -331,7 +333,9 @@ Status FromProto(const pb::PollInfo& pb_info, PollInfo* info) {
 }
 
 Status ToProto(const PollInfo& info, pb::PollInfo* pb_info) {
-  RETURN_NOT_OK(ToProto(*info.info, pb_info->mutable_info()));
+  if (info.info) {
+    RETURN_NOT_OK(ToProto(*info.info, pb_info->mutable_info()));
+  }
   if (info.descriptor) {
     RETURN_NOT_OK(ToProto(*info.descriptor, pb_info->mutable_flight_descriptor()));
   }
diff --git a/cpp/src/arrow/flight/types.cc b/cpp/src/arrow/flight/types.cc
index 9da83fa8a1..1d43c41b69 100644
--- a/cpp/src/arrow/flight/types.cc
+++ b/cpp/src/arrow/flight/types.cc
@@ -373,7 +373,12 @@ arrow::Result<std::unique_ptr<PollInfo>> PollInfo::Deserialize(
 
 std::string PollInfo::ToString() const {
   std::stringstream ss;
-  ss << "<PollInfo info=" << info->ToString();
+  ss << "<PollInfo info=";
+  if (info) {
+    ss << info->ToString();
+  } else {
+    ss << "null";
+  }
   ss << " descriptor=";
   if (descriptor) {
     ss << descriptor->ToString();
diff --git a/cpp/src/arrow/flight/types.h b/cpp/src/arrow/flight/types.h
index 2342c75827..790a2067dd 100644
--- a/cpp/src/arrow/flight/types.h
+++ b/cpp/src/arrow/flight/types.h
@@ -693,11 +693,22 @@ class ARROW_FLIGHT_EXPORT PollInfo {
         progress(progress),
         expiration_time(expiration_time) {}
 
-  explicit PollInfo(const PollInfo& other)
+  // Must not be explicit; to declare one we must declare all ("rule of five")
+  PollInfo(const PollInfo& other)  // NOLINT(runtime/explicit)
       : info(other.info ? std::make_unique<FlightInfo>(*other.info) : NULLPTR),
         descriptor(other.descriptor),
         progress(other.progress),
         expiration_time(other.expiration_time) {}
+  PollInfo(PollInfo&& other) noexcept = default;  // NOLINT(runtime/explicit)
+  ~PollInfo() = default;
+  PollInfo& operator=(const PollInfo& other) {
+    info = other.info ? std::make_unique<FlightInfo>(*other.info) : NULLPTR;
+    descriptor = other.descriptor;
+    progress = other.progress;
+    expiration_time = other.expiration_time;
+    return *this;
+  }
+  PollInfo& operator=(PollInfo&& other) = default;
 
   /// \brief Get the wire-format representation of this type.
   ///
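
With the constructors added above, `PollInfo` follows the rule of five: it can be copied (deep-copying the owned `FlightInfo`), moved, and printed even when no `FlightInfo` is attached. A minimal sketch of that usage (my own illustration, assuming only the declarations shown in this diff):

```cpp
#include <iostream>
#include <utility>

#include "arrow/flight/types.h"

int main() {
  arrow::flight::PollInfo empty;                    // info is nullptr
  std::cout << empty.ToString() << std::endl;       // "<PollInfo info=null ...>"

  arrow::flight::PollInfo copy = empty;             // copy ctor, no longer explicit
  arrow::flight::PollInfo moved = std::move(copy);  // move ctor now declared
  std::cout << moved.ToString() << std::endl;
  return 0;
}
```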


(arrow) 16/30: GH-39740: [C++] Fix filter and take kernel for month_day_nano intervals (#39795)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit a1ada92d17569d4fac6d6ddfc8f6d9b8018a9f14
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Mon Jan 29 17:27:36 2024 +0100

    GH-39740: [C++] Fix filter and take kernel for month_day_nano intervals (#39795)
    
    ### Rationale for this change
    
    The filter and take functions did not correctly support month_day_nano intervals.
    
    ### What changes are included in this PR?
    
    * Expand the primitive filter implementation to handle all possible fixed-width primitive types (including fixed-size binary)
    * Expand the take implementation to handle all well-known fixed-width primitive types (including month_day_nano, decimal128 and decimal256)
    * Add benchmarks for taking and filtering fixed-size binary
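
The expanded fixed-width paths listed above are reachable through the public compute API. A minimal sketch (my own illustration, not part of the patch) filtering a `fixed_size_binary(3)` array:

```cpp
#include <iostream>

#include "arrow/api.h"
#include "arrow/compute/api.h"

arrow::Status RunExample() {
  // Build a fixed_size_binary(3) array, one of the fixed-width layouts the
  // generic filter/take paths now handle directly.
  arrow::FixedSizeBinaryBuilder values_builder(arrow::fixed_size_binary(3));
  ARROW_RETURN_NOT_OK(values_builder.Append("foo"));
  ARROW_RETURN_NOT_OK(values_builder.Append("bar"));
  ARROW_RETURN_NOT_OK(values_builder.Append("baz"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> values, values_builder.Finish());

  // Boolean mask keeping the first and last elements.
  arrow::BooleanBuilder mask_builder;
  ARROW_RETURN_NOT_OK(mask_builder.AppendValues({true, false, true}));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Array> mask, mask_builder.Finish());

  ARROW_ASSIGN_OR_RAISE(arrow::Datum filtered, arrow::compute::Filter(values, mask));
  std::cout << filtered.make_array()->ToString() << std::endl;  // keeps "foo" and "baz"
  return arrow::Status::OK();
}

int main() { return RunExample().ok() ? 0 : 1; }
```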
    
    These changes allow for very significant performance improvements filtering and taking fixed-size binary data:
    ```
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Non-regressions: (90)
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                  benchmark           baseline          contender  change %                                                                                                                                                                                                                                                                   counters
              FilterFixedSizeBinaryFilterNoNulls/524288/0/8      1.716 GiB/sec     33.814 GiB/sec  1870.862      {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2462, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 99.9}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1/8 380.056M items/sec   7.098G items/sec  1767.491                                {'family_index': 3, 'per_family_instance_index': 6, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 505, 'byte_width': 8.0, 'null_percent': 100.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/0/9      1.916 GiB/sec     33.721 GiB/sec  1659.766      {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2750, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/9/8    917.713 MiB/sec      9.193 GiB/sec   925.719    {'family_index': 0, 'per_family_instance_index': 18, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/9/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1271, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 99.9}
             FilterFixedSizeBinaryFilterNoNulls/524288/12/8      1.004 GiB/sec      9.374 GiB/sec   833.673   {'family_index': 0, 'per_family_instance_index': 24, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/12/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1440, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/3/8      1.625 GiB/sec     15.009 GiB/sec   823.442      {'family_index': 0, 'per_family_instance_index': 6, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/3/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2328, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/9/9   1021.638 MiB/sec      9.126 GiB/sec   814.670    {'family_index': 0, 'per_family_instance_index': 19, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/9/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1428, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/6/8      1.235 GiB/sec     10.814 GiB/sec   775.869     {'family_index': 0, 'per_family_instance_index': 12, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/6/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1762, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 99.9}
             FilterFixedSizeBinaryFilterNoNulls/524288/12/9      1.123 GiB/sec      9.120 GiB/sec   712.196   {'family_index': 0, 'per_family_instance_index': 25, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/12/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1598, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/6/9      1.370 GiB/sec     10.499 GiB/sec   666.348     {'family_index': 0, 'per_family_instance_index': 13, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/6/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1958, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/3/9      1.814 GiB/sec     13.394 GiB/sec   638.343      {'family_index': 0, 'per_family_instance_index': 7, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/3/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2600, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/2/8     12.155 GiB/sec     77.799 GiB/sec   540.051      {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17222, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 1.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/2/9     13.507 GiB/sec     84.361 GiB/sec   524.592      {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19469, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 1.0}
             TakeFixedSizeBinaryMonotonicIndices/524288/1/8 194.493M items/sec 732.378M items/sec   276.557                                      {'family_index': 4, 'per_family_instance_index': 6, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 259, 'byte_width': 8.0, 'null_percent': 100.0}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1/8 200.981M items/sec 747.628M items/sec   271.989                                  {'family_index': 2, 'per_family_instance_index': 6, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 268, 'byte_width': 8.0, 'null_percent': 100.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/0/8    947.631 MiB/sec      3.318 GiB/sec   258.565    {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1329, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 99.9}
            FilterFixedSizeBinaryFilterWithNulls/524288/3/8    911.406 MiB/sec      3.121 GiB/sec   250.677    {'family_index': 1, 'per_family_instance_index': 6, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/3/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1275, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 99.9}
              FilterFixedSizeBinaryFilterNoNulls/524288/1/8      1.045 GiB/sec      3.535 GiB/sec   238.406      {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1496, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/6/8    899.161 MiB/sec      2.915 GiB/sec   232.029   {'family_index': 1, 'per_family_instance_index': 12, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/6/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1260, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 99.9}
            FilterFixedSizeBinaryFilterWithNulls/524288/9/8    829.852 MiB/sec      2.617 GiB/sec   222.914  {'family_index': 1, 'per_family_instance_index': 18, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/9/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1157, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 99.9}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/0/8 234.268M items/sec 752.809M items/sec   221.345                                    {'family_index': 2, 'per_family_instance_index': 8, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 312, 'byte_width': 8.0, 'null_percent': 0.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/1/9      1.171 GiB/sec      3.711 GiB/sec   216.957      {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1674, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 0.0, 'select%': 50.0}
             TakeFixedSizeBinaryMonotonicIndices/524288/0/8 249.393M items/sec 787.274M items/sec   215.676                                        {'family_index': 4, 'per_family_instance_index': 8, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 333, 'byte_width': 8.0, 'null_percent': 0.0}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/0/8 234.268M items/sec 736.727M items/sec   214.481                                  {'family_index': 3, 'per_family_instance_index': 8, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/0/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 313, 'byte_width': 8.0, 'null_percent': 0.0}
          TakeFixedSizeBinaryMonotonicIndices/524288/1000/8 134.852M items/sec 423.748M items/sec   214.231                                     {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 202, 'byte_width': 8.0, 'null_percent': 0.1}
           FilterFixedSizeBinaryFilterWithNulls/524288/12/8    913.734 MiB/sec      2.599 GiB/sec   191.245 {'family_index': 1, 'per_family_instance_index': 24, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/12/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1292, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 99.9}
      TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1000/8 138.218M items/sec 309.307M items/sec   123.783                                 {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 184, 'byte_width': 8.0, 'null_percent': 0.1}
    TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1000/8 132.755M items/sec 293.027M items/sec   120.727                               {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1000/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 179, 'byte_width': 8.0, 'null_percent': 0.1}
        TakeFixedSizeBinaryRandomIndicesNoNulls/524288/10/8 125.492M items/sec 272.996M items/sec   117.540                                  {'family_index': 2, 'per_family_instance_index': 2, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 174, 'byte_width': 8.0, 'null_percent': 10.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/9/9    926.938 MiB/sec      1.904 GiB/sec   110.379  {'family_index': 1, 'per_family_instance_index': 19, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/9/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1295, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 99.9}
            TakeFixedSizeBinaryMonotonicIndices/524288/10/8 158.754M items/sec 331.106M items/sec   108.565                                      {'family_index': 4, 'per_family_instance_index': 2, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 167, 'byte_width': 8.0, 'null_percent': 10.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/0/9      1.031 GiB/sec      2.129 GiB/sec   106.621    {'family_index': 1, 'per_family_instance_index': 1, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1477, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 99.9}
            FilterFixedSizeBinaryFilterWithNulls/524288/3/9   1020.776 MiB/sec      2.056 GiB/sec   106.293    {'family_index': 1, 'per_family_instance_index': 7, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/3/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1430, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 99.9}
            FilterFixedSizeBinaryFilterWithNulls/524288/4/8    890.785 MiB/sec      1.768 GiB/sec   103.293    {'family_index': 1, 'per_family_instance_index': 8, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/4/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1242, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/6/9   1005.839 MiB/sec      1.984 GiB/sec   102.023   {'family_index': 1, 'per_family_instance_index': 13, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/6/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1407, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 99.9}
            FilterFixedSizeBinaryFilterWithNulls/524288/1/8    916.810 MiB/sec      1.762 GiB/sec    96.757    {'family_index': 1, 'per_family_instance_index': 2, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/1/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1270, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/7/8    890.211 MiB/sec      1.694 GiB/sec    94.853   {'family_index': 1, 'per_family_instance_index': 14, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/7/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1235, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 50.0}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/2/8  95.788M items/sec 184.004M items/sec    92.095                                   {'family_index': 2, 'per_family_instance_index': 4, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 124, 'byte_width': 8.0, 'null_percent': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/10/8    862.497 MiB/sec      1.616 GiB/sec    91.823 {'family_index': 1, 'per_family_instance_index': 20, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1200, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/12/9      1.005 GiB/sec      1.904 GiB/sec    89.431 {'family_index': 1, 'per_family_instance_index': 25, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/12/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1442, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 99.9}
             TakeFixedSizeBinaryMonotonicIndices/524288/2/8 123.065M items/sec 228.755M items/sec    85.881                                       {'family_index': 4, 'per_family_instance_index': 4, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 164, 'byte_width': 8.0, 'null_percent': 50.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/10/8    930.637 MiB/sec      1.669 GiB/sec    83.659   {'family_index': 0, 'per_family_instance_index': 20, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1293, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 50.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/4/8      1.034 GiB/sec      1.871 GiB/sec    81.019      {'family_index': 0, 'per_family_instance_index': 8, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/4/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1482, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 50.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/7/8   1004.789 MiB/sec      1.772 GiB/sec    80.538     {'family_index': 0, 'per_family_instance_index': 14, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/7/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1404, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/13/8    920.819 MiB/sec      1.616 GiB/sec    79.686 {'family_index': 1, 'per_family_instance_index': 26, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/13/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1285, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 50.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/13/8    974.713 MiB/sec      1.669 GiB/sec    75.388   {'family_index': 0, 'per_family_instance_index': 26, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/13/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1363, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 50.0}
      TakeFixedSizeBinaryRandomIndicesWithNulls/524288/10/8 107.165M items/sec 187.372M items/sec    74.845                                {'family_index': 3, 'per_family_instance_index': 2, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/10/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 143, 'byte_width': 8.0, 'null_percent': 10.0}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/2/8  72.662M items/sec 114.781M items/sec    57.965                                  {'family_index': 3, 'per_family_instance_index': 4, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96, 'byte_width': 8.0, 'null_percent': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/10/9    976.180 MiB/sec      1.480 GiB/sec    55.260 {'family_index': 1, 'per_family_instance_index': 21, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1358, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 50.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/10/9      1.023 GiB/sec      1.581 GiB/sec    54.502   {'family_index': 0, 'per_family_instance_index': 21, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1466, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/4/9    992.477 MiB/sec      1.453 GiB/sec    49.957    {'family_index': 1, 'per_family_instance_index': 9, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/4/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1400, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/7/9    997.679 MiB/sec      1.450 GiB/sec    48.846   {'family_index': 1, 'per_family_instance_index': 15, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/7/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1389, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 50.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/13/9      1.071 GiB/sec      1.581 GiB/sec    47.526   {'family_index': 0, 'per_family_instance_index': 27, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/13/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1538, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/13/9      1.008 GiB/sec      1.485 GiB/sec    47.328 {'family_index': 1, 'per_family_instance_index': 27, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/13/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1446, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 50.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/1/9      1.003 GiB/sec      1.452 GiB/sec    44.708    {'family_index': 1, 'per_family_instance_index': 3, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1437, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 50.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/7/9      1.105 GiB/sec      1.568 GiB/sec    41.954     {'family_index': 0, 'per_family_instance_index': 15, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/7/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1587, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 50.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/4/9      1.163 GiB/sec      1.613 GiB/sec    38.639      {'family_index': 0, 'per_family_instance_index': 9, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/4/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1662, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 50.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/14/9      8.884 GiB/sec     12.117 GiB/sec    36.381 {'family_index': 1, 'per_family_instance_index': 29, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/14/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12508, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 1.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/11/9      8.886 GiB/sec     12.075 GiB/sec    35.892 {'family_index': 1, 'per_family_instance_index': 23, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/11/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12716, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 1.0}
          TakeFixedSizeBinaryMonotonicIndices/524288/1000/9 134.765M items/sec 182.868M items/sec    35.694                                     {'family_index': 4, 'per_family_instance_index': 1, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 206, 'byte_width': 9.0, 'null_percent': 0.1}
              FilterFixedSizeBinaryFilterNoNulls/524288/5/8     11.393 GiB/sec     15.091 GiB/sec    32.453     {'family_index': 0, 'per_family_instance_index': 10, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/5/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16510, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 1.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/8/8     11.573 GiB/sec     15.102 GiB/sec    30.496     {'family_index': 0, 'per_family_instance_index': 16, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/8/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 16684, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 1.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/11/8      7.740 GiB/sec     10.059 GiB/sec    29.956 {'family_index': 1, 'per_family_instance_index': 22, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/11/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10972, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 5.0, 'select%': 1.0}
           FilterFixedSizeBinaryFilterWithNulls/524288/14/8      7.733 GiB/sec      9.915 GiB/sec    28.213 {'family_index': 1, 'per_family_instance_index': 28, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/14/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10991, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 5.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/5/8      7.682 GiB/sec      9.765 GiB/sec    27.109   {'family_index': 1, 'per_family_instance_index': 10, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/5/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 10991, 'byte_width': 8.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/8/9      8.856 GiB/sec     11.180 GiB/sec    26.241   {'family_index': 1, 'per_family_instance_index': 17, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/8/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12571, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/8/8      7.735 GiB/sec      9.710 GiB/sec    25.530   {'family_index': 1, 'per_family_instance_index': 16, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/8/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11069, 'byte_width': 8.0, 'data null%': 1.0, 'mask null%': 5.0, 'select%': 1.0}
            TakeFixedSizeBinaryMonotonicIndices/524288/10/9 128.606M items/sec 160.249M items/sec    24.604                                      {'family_index': 4, 'per_family_instance_index': 3, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 209, 'byte_width': 9.0, 'null_percent': 10.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/11/8     12.033 GiB/sec     14.737 GiB/sec    22.478   {'family_index': 0, 'per_family_instance_index': 22, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/11/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17220, 'byte_width': 8.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 1.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/14/8     12.141 GiB/sec     14.761 GiB/sec    21.579   {'family_index': 0, 'per_family_instance_index': 28, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/14/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 17343, 'byte_width': 8.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/5/9      8.825 GiB/sec     10.633 GiB/sec    20.489   {'family_index': 1, 'per_family_instance_index': 11, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/5/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 12543, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 5.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/2/8      8.300 GiB/sec      9.969 GiB/sec    20.117    {'family_index': 1, 'per_family_instance_index': 4, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/2/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 11819, 'byte_width': 8.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 1.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/5/9     12.954 GiB/sec     15.192 GiB/sec    17.273     {'family_index': 0, 'per_family_instance_index': 11, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/5/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18572, 'byte_width': 9.0, 'data null%': 0.1, 'mask null%': 0.0, 'select%': 1.0}
              FilterFixedSizeBinaryFilterNoNulls/524288/8/9     13.181 GiB/sec     15.222 GiB/sec    15.490     {'family_index': 0, 'per_family_instance_index': 17, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/8/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 18904, 'byte_width': 9.0, 'data null%': 1.0, 'mask null%': 0.0, 'select%': 1.0}
            FilterFixedSizeBinaryFilterWithNulls/524288/2/9      9.344 GiB/sec     10.632 GiB/sec    13.784    {'family_index': 1, 'per_family_instance_index': 5, 'run_name': 'FilterFixedSizeBinaryFilterWithNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 13291, 'byte_width': 9.0, 'data null%': 0.0, 'mask null%': 5.0, 'select%': 1.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/11/9     13.566 GiB/sec     14.894 GiB/sec     9.789   {'family_index': 0, 'per_family_instance_index': 23, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/11/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19349, 'byte_width': 9.0, 'data null%': 10.0, 'mask null%': 0.0, 'select%': 1.0}
             FilterFixedSizeBinaryFilterNoNulls/524288/14/9     13.603 GiB/sec     14.863 GiB/sec     9.265   {'family_index': 0, 'per_family_instance_index': 29, 'run_name': 'FilterFixedSizeBinaryFilterNoNulls/524288/14/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 19490, 'byte_width': 9.0, 'data null%': 90.0, 'mask null%': 0.0, 'select%': 1.0}
        TakeFixedSizeBinaryRandomIndicesNoNulls/524288/10/9 124.390M items/sec 133.566M items/sec     7.377                                  {'family_index': 2, 'per_family_instance_index': 3, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 164, 'byte_width': 9.0, 'null_percent': 10.0}
             TakeFixedSizeBinaryMonotonicIndices/524288/2/9 116.792M items/sec 124.182M items/sec     6.328                                       {'family_index': 4, 'per_family_instance_index': 5, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 161, 'byte_width': 9.0, 'null_percent': 50.0}
      TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1000/9 135.860M items/sec 142.524M items/sec     4.905                                 {'family_index': 2, 'per_family_instance_index': 1, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 180, 'byte_width': 9.0, 'null_percent': 0.1}
    TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1000/9 131.123M items/sec 137.400M items/sec     4.788                               {'family_index': 3, 'per_family_instance_index': 1, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1000/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 176, 'byte_width': 9.0, 'null_percent': 0.1}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/0/9 220.634M items/sec 230.872M items/sec     4.640                                    {'family_index': 2, 'per_family_instance_index': 9, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 295, 'byte_width': 9.0, 'null_percent': 0.0}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/2/9  97.425M items/sec 101.477M items/sec     4.159                                   {'family_index': 2, 'per_family_instance_index': 5, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 130, 'byte_width': 9.0, 'null_percent': 50.0}
      TakeFixedSizeBinaryRandomIndicesWithNulls/524288/10/9 104.830M items/sec 108.346M items/sec     3.354                                {'family_index': 3, 'per_family_instance_index': 3, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/10/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 100, 'byte_width': 9.0, 'null_percent': 10.0}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1/9 378.858M items/sec 387.322M items/sec     2.234                                {'family_index': 3, 'per_family_instance_index': 7, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 506, 'byte_width': 9.0, 'null_percent': 100.0}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/0/9 221.900M items/sec 226.450M items/sec     2.050                                  {'family_index': 3, 'per_family_instance_index': 9, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 295, 'byte_width': 9.0, 'null_percent': 0.0}
             TakeFixedSizeBinaryMonotonicIndices/524288/0/9 248.664M items/sec 253.037M items/sec     1.758                                        {'family_index': 4, 'per_family_instance_index': 9, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/0/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 332, 'byte_width': 9.0, 'null_percent': 0.0}
         TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1/9 197.730M items/sec 201.173M items/sec     1.741                                  {'family_index': 2, 'per_family_instance_index': 7, 'run_name': 'TakeFixedSizeBinaryRandomIndicesNoNulls/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 264, 'byte_width': 9.0, 'null_percent': 100.0}
       TakeFixedSizeBinaryRandomIndicesWithNulls/524288/2/9  73.196M items/sec  74.167M items/sec     1.327                                  {'family_index': 3, 'per_family_instance_index': 5, 'run_name': 'TakeFixedSizeBinaryRandomIndicesWithNulls/524288/2/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 96, 'byte_width': 9.0, 'null_percent': 50.0}
             TakeFixedSizeBinaryMonotonicIndices/524288/1/9 192.545M items/sec 188.138M items/sec    -2.289                                      {'family_index': 4, 'per_family_instance_index': 7, 'run_name': 'TakeFixedSizeBinaryMonotonicIndices/524288/1/9', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 257, 'byte_width': 9.0, 'null_percent': 100.0}
    
    ```
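
    For orientation, here is a minimal, self-contained sketch (hypothetical names, not the actual kernel code) of the compile-time byte-width specialization these benchmarks exercise: primitive widths such as 8 bytes get a dedicated template instantiation, while other widths such as the 9-byte case above fall back to a runtime-width loop, similar in spirit to the new `PrimitiveFilterImpl<kByteWidth, kIsBoolean>` on the Filter side.

    ```cpp
    // Sketch only: specializing a copy loop on byte width at compile time,
    // with kByteWidth == -1 meaning "width only known at runtime".
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <vector>

    template <int32_t kByteWidth>
    struct CopySelected {
      explicit CopySelected(int32_t runtime_width) : runtime_width_(runtime_width) {}

      constexpr int32_t byte_width() const {
        return kByteWidth >= 0 ? kByteWidth : runtime_width_;
      }

      // Copy the rows whose filter flag is set from `values` into `out`.
      void operator()(const uint8_t* values, const std::vector<bool>& filter,
                      uint8_t* out) const {
        int64_t out_pos = 0;
        for (size_t i = 0; i < filter.size(); ++i) {
          if (filter[i]) {
            std::memcpy(out + out_pos * byte_width(), values + i * byte_width(),
                        byte_width());
            ++out_pos;
          }
        }
      }

      int32_t runtime_width_;
    };

    int main() {
      std::vector<uint8_t> values(3 * 8);  // 3 rows of 8-byte values
      for (size_t i = 0; i < values.size(); ++i) values[i] = static_cast<uint8_t>(i);
      std::vector<bool> filter = {true, false, true};  // keep rows 0 and 2
      std::vector<uint8_t> out(2 * 8);

      // Specialized path: the width is a compile-time constant, so the memcpy
      // can be lowered to fixed-size loads and stores.
      CopySelected<8>(/*runtime_width=*/8)(values.data(), filter, out.data());
      // Generic path: same logic, width resolved at runtime (e.g. 9 bytes).
      CopySelected<-1>(/*runtime_width=*/8)(values.data(), filter, out.data());

      std::cout << "first byte of second output row: " << int(out[8]) << "\n";
      return 0;
    }
    ```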
    
    ### Are these changes tested?
    
    Yes.
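
    The new tests exercise scenarios like the following (a sketch that reuses the test-only `ArrayFromJSON` helper, so it is illustrative rather than production code):

    ```cpp
    #include <iostream>

    #include "arrow/compute/api_vector.h"
    #include "arrow/testing/gtest_util.h"  // ArrayFromJSON (test-only helper)

    int main() {
      // 16-byte month_day_nano values previously fell through the primitive
      // kernel's byte-width dispatch (the removed DCHECK default branch).
      auto values = arrow::ArrayFromJSON(
          arrow::month_day_nano_interval(),
          "[[1, -2, 34567890123456789], [2, 3, -34567890123456789], null]");
      auto filter = arrow::ArrayFromJSON(arrow::boolean(), "[null, 1, 0]");

      auto filtered = arrow::compute::Filter(values, filter).ValueOrDie();
      // The surviving rows should be [null, [2, 3, -34567890123456789]],
      // matching the expectations added in vector_selection_test.cc.
      std::cout << filtered.make_array()->ToString() << std::endl;
      return 0;
    }
    ```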
    
    ### Are there any user-facing changes?
    
    No.
    
    * Closes: #39740
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 .../compute/kernels/vector_selection_benchmark.cc  |  72 ++++++++-
 .../kernels/vector_selection_filter_internal.cc    | 167 ++++++++++++---------
 .../compute/kernels/vector_selection_internal.cc   |  22 ++-
 .../compute/kernels/vector_selection_internal.h    |   2 +-
 .../kernels/vector_selection_take_internal.cc      |  76 +++++++---
 .../arrow/compute/kernels/vector_selection_test.cc | 150 +++++++++++++-----
 6 files changed, 348 insertions(+), 141 deletions(-)

diff --git a/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc b/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc
index 25e30e65a3..e65d5dbcab 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection_benchmark.cc
@@ -128,6 +128,13 @@ struct TakeBenchmark {
     Bench(values);
   }
 
+  void FixedSizeBinary() {
+    const int32_t byte_width = static_cast<int32_t>(state.range(2));
+    auto values = rand.FixedSizeBinary(args.size, byte_width, args.null_proportion);
+    Bench(values);
+    state.counters["byte_width"] = byte_width;
+  }
+
   void String() {
     int32_t string_min_length = 0, string_max_length = 32;
     auto values = std::static_pointer_cast<StringArray>(rand.String(
@@ -149,6 +156,7 @@ struct TakeBenchmark {
     for (auto _ : state) {
       ABORT_NOT_OK(Take(values, indices).status());
     }
+    state.SetItemsProcessed(state.iterations() * values->length());
   }
 };
 
@@ -166,8 +174,7 @@ struct FilterBenchmark {
 
   void Int64() {
     const int64_t array_size = args.size / sizeof(int64_t);
-    auto values = std::static_pointer_cast<NumericArray<Int64Type>>(
-        rand.Int64(array_size, -100, 100, args.values_null_proportion));
+    auto values = rand.Int64(array_size, -100, 100, args.values_null_proportion);
     Bench(values);
   }
 
@@ -181,6 +188,15 @@ struct FilterBenchmark {
     Bench(values);
   }
 
+  void FixedSizeBinary() {
+    const int32_t byte_width = static_cast<int32_t>(state.range(2));
+    const int64_t array_size = args.size / byte_width;
+    auto values =
+        rand.FixedSizeBinary(array_size, byte_width, args.values_null_proportion);
+    Bench(values);
+    state.counters["byte_width"] = byte_width;
+  }
+
   void String() {
     int32_t string_min_length = 0, string_max_length = 32;
     int32_t string_mean_length = (string_max_length + string_min_length) / 2;
@@ -202,6 +218,7 @@ struct FilterBenchmark {
     for (auto _ : state) {
       ABORT_NOT_OK(Filter(values, filter).status());
     }
+    state.SetItemsProcessed(state.iterations() * values->length());
   }
 
   void BenchRecordBatch() {
@@ -236,6 +253,7 @@ struct FilterBenchmark {
     for (auto _ : state) {
       ABORT_NOT_OK(Filter(batch, filter).status());
     }
+    state.SetItemsProcessed(state.iterations() * num_rows);
   }
 };
 
@@ -255,6 +273,14 @@ static void FilterFSLInt64FilterWithNulls(benchmark::State& state) {
   FilterBenchmark(state, true).FSLInt64();
 }
 
+static void FilterFixedSizeBinaryFilterNoNulls(benchmark::State& state) {
+  FilterBenchmark(state, false).FixedSizeBinary();
+}
+
+static void FilterFixedSizeBinaryFilterWithNulls(benchmark::State& state) {
+  FilterBenchmark(state, true).FixedSizeBinary();
+}
+
 static void FilterStringFilterNoNulls(benchmark::State& state) {
   FilterBenchmark(state, false).String();
 }
@@ -283,6 +309,19 @@ static void TakeInt64MonotonicIndices(benchmark::State& state) {
   TakeBenchmark(state, /*indices_with_nulls=*/false, /*monotonic=*/true).Int64();
 }
 
+static void TakeFixedSizeBinaryRandomIndicesNoNulls(benchmark::State& state) {
+  TakeBenchmark(state, false).FixedSizeBinary();
+}
+
+static void TakeFixedSizeBinaryRandomIndicesWithNulls(benchmark::State& state) {
+  TakeBenchmark(state, true).FixedSizeBinary();
+}
+
+static void TakeFixedSizeBinaryMonotonicIndices(benchmark::State& state) {
+  TakeBenchmark(state, /*indices_with_nulls=*/false, /*monotonic=*/true)
+      .FixedSizeBinary();
+}
+
 static void TakeFSLInt64RandomIndicesNoNulls(benchmark::State& state) {
   TakeBenchmark(state, false).FSLInt64();
 }
@@ -315,8 +354,22 @@ void FilterSetArgs(benchmark::internal::Benchmark* bench) {
   }
 }
 
+void FilterFSBSetArgs(benchmark::internal::Benchmark* bench) {
+  for (int64_t size : g_data_sizes) {
+    for (int i = 0; i < static_cast<int>(g_filter_params.size()); ++i) {
+      // FixedSizeBinary of primitive sizes (powers of two up to 32)
+      // have a faster path.
+      for (int32_t byte_width : {8, 9}) {
+        bench->Args({static_cast<ArgsType>(size), i, byte_width});
+      }
+    }
+  }
+}
+
 BENCHMARK(FilterInt64FilterNoNulls)->Apply(FilterSetArgs);
 BENCHMARK(FilterInt64FilterWithNulls)->Apply(FilterSetArgs);
+BENCHMARK(FilterFixedSizeBinaryFilterNoNulls)->Apply(FilterFSBSetArgs);
+BENCHMARK(FilterFixedSizeBinaryFilterWithNulls)->Apply(FilterFSBSetArgs);
 BENCHMARK(FilterFSLInt64FilterNoNulls)->Apply(FilterSetArgs);
 BENCHMARK(FilterFSLInt64FilterWithNulls)->Apply(FilterSetArgs);
 BENCHMARK(FilterStringFilterNoNulls)->Apply(FilterSetArgs);
@@ -340,9 +393,24 @@ void TakeSetArgs(benchmark::internal::Benchmark* bench) {
   }
 }
 
+void TakeFSBSetArgs(benchmark::internal::Benchmark* bench) {
+  for (int64_t size : g_data_sizes) {
+    for (auto nulls : std::vector<ArgsType>({1000, 10, 2, 1, 0})) {
+      // FixedSizeBinary of primitive sizes (powers of two up to 32)
+      // have a faster path.
+      for (int32_t byte_width : {8, 9}) {
+        bench->Args({static_cast<ArgsType>(size), nulls, byte_width});
+      }
+    }
+  }
+}
+
 BENCHMARK(TakeInt64RandomIndicesNoNulls)->Apply(TakeSetArgs);
 BENCHMARK(TakeInt64RandomIndicesWithNulls)->Apply(TakeSetArgs);
 BENCHMARK(TakeInt64MonotonicIndices)->Apply(TakeSetArgs);
+BENCHMARK(TakeFixedSizeBinaryRandomIndicesNoNulls)->Apply(TakeFSBSetArgs);
+BENCHMARK(TakeFixedSizeBinaryRandomIndicesWithNulls)->Apply(TakeFSBSetArgs);
+BENCHMARK(TakeFixedSizeBinaryMonotonicIndices)->Apply(TakeFSBSetArgs);
 BENCHMARK(TakeFSLInt64RandomIndicesNoNulls)->Apply(TakeSetArgs);
 BENCHMARK(TakeFSLInt64RandomIndicesWithNulls)->Apply(TakeSetArgs);
 BENCHMARK(TakeFSLInt64MonotonicIndices)->Apply(TakeSetArgs);
diff --git a/cpp/src/arrow/compute/kernels/vector_selection_filter_internal.cc b/cpp/src/arrow/compute/kernels/vector_selection_filter_internal.cc
index a25b04ae4f..8825d697fd 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_filter_internal.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection_filter_internal.cc
@@ -146,36 +146,40 @@ class DropNullCounter {
 
 /// \brief The Filter implementation for primitive (fixed-width) types does not
 /// use the logical Arrow type but rather the physical C type. This way we only
-/// generate one take function for each byte width. We use the same
-/// implementation here for boolean and fixed-byte-size inputs with some
-/// template specialization.
-template <typename ArrowType>
+/// generate one take function for each byte width.
+///
+/// We use compile-time specialization for two variations:
+/// - operating on boolean data (using kIsBoolean = true)
+/// - operating on fixed-width data of arbitrary width (using kByteWidth = -1),
+///   with the actual width only known at runtime
+template <int32_t kByteWidth, bool kIsBoolean = false>
 class PrimitiveFilterImpl {
  public:
-  using T = typename std::conditional<std::is_same<ArrowType, BooleanType>::value,
-                                      uint8_t, typename ArrowType::c_type>::type;
-
   PrimitiveFilterImpl(const ArraySpan& values, const ArraySpan& filter,
                       FilterOptions::NullSelectionBehavior null_selection,
                       ArrayData* out_arr)
-      : values_is_valid_(values.buffers[0].data),
-        values_data_(reinterpret_cast<const T*>(values.buffers[1].data)),
+      : byte_width_(values.type->byte_width()),
+        values_is_valid_(values.buffers[0].data),
+        values_data_(values.buffers[1].data),
         values_null_count_(values.null_count),
         values_offset_(values.offset),
         values_length_(values.length),
         filter_(filter),
         null_selection_(null_selection) {
-    if (values.type->id() != Type::BOOL) {
+    if constexpr (kByteWidth >= 0 && !kIsBoolean) {
+      DCHECK_EQ(kByteWidth, byte_width_);
+    }
+    if constexpr (!kIsBoolean) {
       // No offset applied for boolean because it's a bitmap
-      values_data_ += values.offset;
+      values_data_ += values.offset * byte_width();
     }
 
     if (out_arr->buffers[0] != nullptr) {
       // May be unallocated if neither filter nor values contain nulls
       out_is_valid_ = out_arr->buffers[0]->mutable_data();
     }
-    out_data_ = reinterpret_cast<T*>(out_arr->buffers[1]->mutable_data());
-    out_offset_ = out_arr->offset;
+    out_data_ = out_arr->buffers[1]->mutable_data();
+    DCHECK_EQ(out_arr->offset, 0);
     out_length_ = out_arr->length;
     out_position_ = 0;
   }
@@ -201,14 +205,11 @@ class PrimitiveFilterImpl {
           [&](int64_t position, int64_t segment_length, bool filter_valid) {
             if (filter_valid) {
               CopyBitmap(values_is_valid_, values_offset_ + position, segment_length,
-                         out_is_valid_, out_offset_ + out_position_);
+                         out_is_valid_, out_position_);
               WriteValueSegment(position, segment_length);
             } else {
-              bit_util::SetBitsTo(out_is_valid_, out_offset_ + out_position_,
-                                  segment_length, false);
-              memset(out_data_ + out_offset_ + out_position_, 0,
-                     segment_length * sizeof(T));
-              out_position_ += segment_length;
+              bit_util::SetBitsTo(out_is_valid_, out_position_, segment_length, false);
+              WriteNullSegment(segment_length);
             }
             return true;
           });
@@ -218,7 +219,7 @@ class PrimitiveFilterImpl {
     if (out_is_valid_) {
       // Set all to valid, so only if nulls are produced by EMIT_NULL, we need
       // to set out_is_valid[i] to false.
-      bit_util::SetBitsTo(out_is_valid_, out_offset_, out_length_, true);
+      bit_util::SetBitsTo(out_is_valid_, 0, out_length_, true);
     }
     return VisitPlainxREEFilterOutputSegments(
         filter_, /*filter_may_have_nulls=*/true, null_selection_,
@@ -226,11 +227,8 @@ class PrimitiveFilterImpl {
           if (filter_valid) {
             WriteValueSegment(position, segment_length);
           } else {
-            bit_util::SetBitsTo(out_is_valid_, out_offset_ + out_position_,
-                                segment_length, false);
-            memset(out_data_ + out_offset_ + out_position_, 0,
-                   segment_length * sizeof(T));
-            out_position_ += segment_length;
+            bit_util::SetBitsTo(out_is_valid_, out_position_, segment_length, false);
+            WriteNullSegment(segment_length);
           }
           return true;
         });
@@ -260,13 +258,13 @@ class PrimitiveFilterImpl {
                                                  values_length_);
 
     auto WriteNotNull = [&](int64_t index) {
-      bit_util::SetBit(out_is_valid_, out_offset_ + out_position_);
+      bit_util::SetBit(out_is_valid_, out_position_);
       // Increments out_position_
       WriteValue(index);
     };
 
     auto WriteMaybeNull = [&](int64_t index) {
-      bit_util::SetBitTo(out_is_valid_, out_offset_ + out_position_,
+      bit_util::SetBitTo(out_is_valid_, out_position_,
                          bit_util::GetBit(values_is_valid_, values_offset_ + index));
       // Increments out_position_
       WriteValue(index);
@@ -279,15 +277,14 @@ class PrimitiveFilterImpl {
       BitBlockCount data_block = data_counter.NextWord();
       if (filter_block.AllSet() && data_block.AllSet()) {
         // Fastest path: all values in block are included and not null
-        bit_util::SetBitsTo(out_is_valid_, out_offset_ + out_position_,
-                            filter_block.length, true);
+        bit_util::SetBitsTo(out_is_valid_, out_position_, filter_block.length, true);
         WriteValueSegment(in_position, filter_block.length);
         in_position += filter_block.length;
       } else if (filter_block.AllSet()) {
         // Faster: all values are selected, but some values are null
         // Batch copy bits from values validity bitmap to output validity bitmap
         CopyBitmap(values_is_valid_, values_offset_ + in_position, filter_block.length,
-                   out_is_valid_, out_offset_ + out_position_);
+                   out_is_valid_, out_position_);
         WriteValueSegment(in_position, filter_block.length);
         in_position += filter_block.length;
       } else if (filter_block.NoneSet() && null_selection_ == FilterOptions::DROP) {
@@ -326,7 +323,7 @@ class PrimitiveFilterImpl {
                 WriteNotNull(in_position);
               } else if (!is_valid) {
                 // Filter slot is null, so we have a null in the output
-                bit_util::ClearBit(out_is_valid_, out_offset_ + out_position_);
+                bit_util::ClearBit(out_is_valid_, out_position_);
                 WriteNull();
               }
               ++in_position;
@@ -362,7 +359,7 @@ class PrimitiveFilterImpl {
                 WriteMaybeNull(in_position);
               } else if (!is_valid) {
                 // Filter slot is null, so we have a null in the output
-                bit_util::ClearBit(out_is_valid_, out_offset_ + out_position_);
+                bit_util::ClearBit(out_is_valid_, out_position_);
                 WriteNull();
               }
               ++in_position;
@@ -376,54 +373,72 @@ class PrimitiveFilterImpl {
   // Write the next out_position given the selected in_position for the input
   // data and advance out_position
   void WriteValue(int64_t in_position) {
-    out_data_[out_offset_ + out_position_++] = values_data_[in_position];
+    if constexpr (kIsBoolean) {
+      bit_util::SetBitTo(out_data_, out_position_,
+                         bit_util::GetBit(values_data_, values_offset_ + in_position));
+    } else {
+      memcpy(out_data_ + out_position_ * byte_width(),
+             values_data_ + in_position * byte_width(), byte_width());
+    }
+    ++out_position_;
   }
 
   void WriteValueSegment(int64_t in_start, int64_t length) {
-    std::memcpy(out_data_ + out_position_, values_data_ + in_start, length * sizeof(T));
+    if constexpr (kIsBoolean) {
+      CopyBitmap(values_data_, values_offset_ + in_start, length, out_data_,
+                 out_position_);
+    } else {
+      memcpy(out_data_ + out_position_ * byte_width(),
+             values_data_ + in_start * byte_width(), length * byte_width());
+    }
     out_position_ += length;
   }
 
   void WriteNull() {
-    // Zero the memory
-    out_data_[out_offset_ + out_position_++] = T{};
+    if constexpr (kIsBoolean) {
+      // Zero the bit
+      bit_util::ClearBit(out_data_, out_position_);
+    } else {
+      // Zero the memory
+      memset(out_data_ + out_position_ * byte_width(), 0, byte_width());
+    }
+    ++out_position_;
+  }
+
+  void WriteNullSegment(int64_t length) {
+    if constexpr (kIsBoolean) {
+      // Zero the bits
+      bit_util::SetBitsTo(out_data_, out_position_, length, false);
+    } else {
+      // Zero the memory
+      memset(out_data_ + out_position_ * byte_width(), 0, length * byte_width());
+    }
+    out_position_ += length;
+  }
+
+  constexpr int32_t byte_width() const {
+    if constexpr (kByteWidth >= 0) {
+      return kByteWidth;
+    } else {
+      return byte_width_;
+    }
   }
 
  private:
+  int32_t byte_width_;
   const uint8_t* values_is_valid_;
-  const T* values_data_;
+  const uint8_t* values_data_;
   int64_t values_null_count_;
   int64_t values_offset_;
   int64_t values_length_;
   const ArraySpan& filter_;
   FilterOptions::NullSelectionBehavior null_selection_;
   uint8_t* out_is_valid_ = NULLPTR;
-  T* out_data_;
-  int64_t out_offset_;
+  uint8_t* out_data_;
   int64_t out_length_;
   int64_t out_position_;
 };
 
-template <>
-inline void PrimitiveFilterImpl<BooleanType>::WriteValue(int64_t in_position) {
-  bit_util::SetBitTo(out_data_, out_offset_ + out_position_++,
-                     bit_util::GetBit(values_data_, values_offset_ + in_position));
-}
-
-template <>
-inline void PrimitiveFilterImpl<BooleanType>::WriteValueSegment(int64_t in_start,
-                                                                int64_t length) {
-  CopyBitmap(values_data_, values_offset_ + in_start, length, out_data_,
-             out_offset_ + out_position_);
-  out_position_ += length;
-}
-
-template <>
-inline void PrimitiveFilterImpl<BooleanType>::WriteNull() {
-  // Zero the bit
-  bit_util::ClearBit(out_data_, out_offset_ + out_position_++);
-}
-
 Status PrimitiveFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
   const ArraySpan& values = batch[0].array;
   const ArraySpan& filter = batch[1].array;
@@ -459,22 +474,32 @@ Status PrimitiveFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult
 
   switch (bit_width) {
     case 1:
-      PrimitiveFilterImpl<BooleanType>(values, filter, null_selection, out_arr).Exec();
+      PrimitiveFilterImpl<1, /*kIsBoolean=*/true>(values, filter, null_selection, out_arr)
+          .Exec();
       break;
     case 8:
-      PrimitiveFilterImpl<UInt8Type>(values, filter, null_selection, out_arr).Exec();
+      PrimitiveFilterImpl<1>(values, filter, null_selection, out_arr).Exec();
       break;
     case 16:
-      PrimitiveFilterImpl<UInt16Type>(values, filter, null_selection, out_arr).Exec();
+      PrimitiveFilterImpl<2>(values, filter, null_selection, out_arr).Exec();
       break;
     case 32:
-      PrimitiveFilterImpl<UInt32Type>(values, filter, null_selection, out_arr).Exec();
+      PrimitiveFilterImpl<4>(values, filter, null_selection, out_arr).Exec();
       break;
     case 64:
-      PrimitiveFilterImpl<UInt64Type>(values, filter, null_selection, out_arr).Exec();
+      PrimitiveFilterImpl<8>(values, filter, null_selection, out_arr).Exec();
+      break;
+    case 128:
+      // For INTERVAL_MONTH_DAY_NANO, DECIMAL128
+      PrimitiveFilterImpl<16>(values, filter, null_selection, out_arr).Exec();
+      break;
+    case 256:
+      // For DECIMAL256
+      PrimitiveFilterImpl<32>(values, filter, null_selection, out_arr).Exec();
       break;
     default:
-      DCHECK(false) << "Invalid values bit width";
+      // Non-specializing on byte width
+      PrimitiveFilterImpl<-1>(values, filter, null_selection, out_arr).Exec();
       break;
   }
   return Status::OK();
@@ -1050,10 +1075,10 @@ void PopulateFilterKernels(std::vector<SelectionKernelData>* out) {
       {InputType(match::Primitive()), plain_filter, PrimitiveFilterExec},
       {InputType(match::BinaryLike()), plain_filter, BinaryFilterExec},
       {InputType(match::LargeBinaryLike()), plain_filter, BinaryFilterExec},
-      {InputType(Type::FIXED_SIZE_BINARY), plain_filter, FSBFilterExec},
       {InputType(null()), plain_filter, NullFilterExec},
-      {InputType(Type::DECIMAL128), plain_filter, FSBFilterExec},
-      {InputType(Type::DECIMAL256), plain_filter, FSBFilterExec},
+      {InputType(Type::FIXED_SIZE_BINARY), plain_filter, PrimitiveFilterExec},
+      {InputType(Type::DECIMAL128), plain_filter, PrimitiveFilterExec},
+      {InputType(Type::DECIMAL256), plain_filter, PrimitiveFilterExec},
       {InputType(Type::DICTIONARY), plain_filter, DictionaryFilterExec},
       {InputType(Type::EXTENSION), plain_filter, ExtensionFilterExec},
       {InputType(Type::LIST), plain_filter, ListFilterExec},
@@ -1068,10 +1093,10 @@ void PopulateFilterKernels(std::vector<SelectionKernelData>* out) {
       {InputType(match::Primitive()), ree_filter, PrimitiveFilterExec},
       {InputType(match::BinaryLike()), ree_filter, BinaryFilterExec},
       {InputType(match::LargeBinaryLike()), ree_filter, BinaryFilterExec},
-      {InputType(Type::FIXED_SIZE_BINARY), ree_filter, FSBFilterExec},
       {InputType(null()), ree_filter, NullFilterExec},
-      {InputType(Type::DECIMAL128), ree_filter, FSBFilterExec},
-      {InputType(Type::DECIMAL256), ree_filter, FSBFilterExec},
+      {InputType(Type::FIXED_SIZE_BINARY), ree_filter, PrimitiveFilterExec},
+      {InputType(Type::DECIMAL128), ree_filter, PrimitiveFilterExec},
+      {InputType(Type::DECIMAL256), ree_filter, PrimitiveFilterExec},
       {InputType(Type::DICTIONARY), ree_filter, DictionaryFilterExec},
       {InputType(Type::EXTENSION), ree_filter, ExtensionFilterExec},
       {InputType(Type::LIST), ree_filter, ListFilterExec},
diff --git a/cpp/src/arrow/compute/kernels/vector_selection_internal.cc b/cpp/src/arrow/compute/kernels/vector_selection_internal.cc
index 98eb37e9c5..a0fe2808e3 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_internal.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection_internal.cc
@@ -77,7 +77,8 @@ Status PreallocatePrimitiveArrayData(KernelContext* ctx, int64_t length, int bit
   if (bit_width == 1) {
     ARROW_ASSIGN_OR_RAISE(out->buffers[1], ctx->AllocateBitmap(length));
   } else {
-    ARROW_ASSIGN_OR_RAISE(out->buffers[1], ctx->Allocate(length * bit_width / 8));
+    ARROW_ASSIGN_OR_RAISE(out->buffers[1],
+                          ctx->Allocate(bit_util::BytesForBits(length * bit_width)));
   }
   return Status::OK();
 }
@@ -899,10 +900,6 @@ Status FilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
 
 }  // namespace
 
-Status FSBFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
-  return FilterExec<FSBSelectionImpl>(ctx, batch, out);
-}
-
 Status ListFilterExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
   return FilterExec<ListSelectionImpl<ListType>>(ctx, batch, out);
 }
@@ -946,7 +943,20 @@ Status LargeVarBinaryTakeExec(KernelContext* ctx, const ExecSpan& batch,
 }
 
 Status FSBTakeExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
-  return TakeExec<FSBSelectionImpl>(ctx, batch, out);
+  const ArraySpan& values = batch[0].array;
+  const auto byte_width = values.type->byte_width();
+  // Use primitive Take implementation (presumably faster) for some byte widths
+  switch (byte_width) {
+    case 1:
+    case 2:
+    case 4:
+    case 8:
+    case 16:
+    case 32:
+      return PrimitiveTakeExec(ctx, batch, out);
+    default:
+      return TakeExec<FSBSelectionImpl>(ctx, batch, out);
+  }
 }
 
 Status ListTakeExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
diff --git a/cpp/src/arrow/compute/kernels/vector_selection_internal.h b/cpp/src/arrow/compute/kernels/vector_selection_internal.h
index b9eba6ea66..95f3e51cd6 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_internal.h
+++ b/cpp/src/arrow/compute/kernels/vector_selection_internal.h
@@ -70,7 +70,6 @@ void VisitPlainxREEFilterOutputSegments(
     FilterOptions::NullSelectionBehavior null_selection,
     const EmitREEFilterSegment& emit_segment);
 
-Status FSBFilterExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status ListFilterExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status LargeListFilterExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status FSLFilterExec(KernelContext*, const ExecSpan&, ExecResult*);
@@ -79,6 +78,7 @@ Status MapFilterExec(KernelContext*, const ExecSpan&, ExecResult*);
 
 Status VarBinaryTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status LargeVarBinaryTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
+Status PrimitiveTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status FSBTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status ListTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
 Status LargeListTakeExec(KernelContext*, const ExecSpan&, ExecResult*);
diff --git a/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc b/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc
index 612de8505d..89b3f7d0d3 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection_take_internal.cc
@@ -334,11 +334,15 @@ using TakeState = OptionsWrapper<TakeOptions>;
 /// only generate one take function for each byte width.
 ///
 /// This function assumes that the indices have been boundschecked.
-template <typename IndexCType, typename ValueCType>
+template <typename IndexCType, typename ValueWidthConstant>
 struct PrimitiveTakeImpl {
+  static constexpr int kValueWidth = ValueWidthConstant::value;
+
   static void Exec(const ArraySpan& values, const ArraySpan& indices,
                    ArrayData* out_arr) {
-    const auto* values_data = values.GetValues<ValueCType>(1);
+    DCHECK_EQ(values.type->byte_width(), kValueWidth);
+    const auto* values_data =
+        values.GetValues<uint8_t>(1, 0) + kValueWidth * values.offset;
     const uint8_t* values_is_valid = values.buffers[0].data;
     auto values_offset = values.offset;
 
@@ -346,9 +350,10 @@ struct PrimitiveTakeImpl {
     const uint8_t* indices_is_valid = indices.buffers[0].data;
     auto indices_offset = indices.offset;
 
-    auto out = out_arr->GetMutableValues<ValueCType>(1);
+    auto out = out_arr->GetMutableValues<uint8_t>(1, 0) + kValueWidth * out_arr->offset;
     auto out_is_valid = out_arr->buffers[0]->mutable_data();
     auto out_offset = out_arr->offset;
+    DCHECK_EQ(out_offset, 0);
 
     // If either the values or indices have nulls, we preemptively zero out the
     // out validity bitmap so that we don't have to use ClearBit in each
@@ -357,6 +362,19 @@ struct PrimitiveTakeImpl {
       bit_util::SetBitsTo(out_is_valid, out_offset, indices.length, false);
     }
 
+    auto WriteValue = [&](int64_t position) {
+      memcpy(out + position * kValueWidth,
+             values_data + indices_data[position] * kValueWidth, kValueWidth);
+    };
+
+    auto WriteZero = [&](int64_t position) {
+      memset(out + position * kValueWidth, 0, kValueWidth);
+    };
+
+    auto WriteZeroSegment = [&](int64_t position, int64_t length) {
+      memset(out + position * kValueWidth, 0, kValueWidth * length);
+    };
+
     OptionalBitBlockCounter indices_bit_counter(indices_is_valid, indices_offset,
                                                 indices.length);
     int64_t position = 0;
@@ -370,7 +388,7 @@ struct PrimitiveTakeImpl {
           // Fastest path: neither values nor index nulls
           bit_util::SetBitsTo(out_is_valid, out_offset + position, block.length, true);
           for (int64_t i = 0; i < block.length; ++i) {
-            out[position] = values_data[indices_data[position]];
+            WriteValue(position);
             ++position;
           }
         } else if (block.popcount > 0) {
@@ -379,14 +397,14 @@ struct PrimitiveTakeImpl {
             if (bit_util::GetBit(indices_is_valid, indices_offset + position)) {
               // index is not null
               bit_util::SetBit(out_is_valid, out_offset + position);
-              out[position] = values_data[indices_data[position]];
+              WriteValue(position);
             } else {
-              out[position] = ValueCType{};
+              WriteZero(position);
             }
             ++position;
           }
         } else {
-          memset(out + position, 0, sizeof(ValueCType) * block.length);
+          WriteZeroSegment(position, block.length);
           position += block.length;
         }
       } else {
@@ -397,11 +415,11 @@ struct PrimitiveTakeImpl {
             if (bit_util::GetBit(values_is_valid,
                                  values_offset + indices_data[position])) {
               // value is not null
-              out[position] = values_data[indices_data[position]];
+              WriteValue(position);
               bit_util::SetBit(out_is_valid, out_offset + position);
               ++valid_count;
             } else {
-              out[position] = ValueCType{};
+              WriteZero(position);
             }
             ++position;
           }
@@ -414,16 +432,16 @@ struct PrimitiveTakeImpl {
                 bit_util::GetBit(values_is_valid,
                                  values_offset + indices_data[position])) {
               // index is not null && value is not null
-              out[position] = values_data[indices_data[position]];
+              WriteValue(position);
               bit_util::SetBit(out_is_valid, out_offset + position);
               ++valid_count;
             } else {
-              out[position] = ValueCType{};
+              WriteZero(position);
             }
             ++position;
           }
         } else {
-          memset(out + position, 0, sizeof(ValueCType) * block.length);
+          WriteZeroSegment(position, block.length);
           position += block.length;
         }
       }
@@ -554,6 +572,8 @@ void TakeIndexDispatch(const ArraySpan& values, const ArraySpan& indices,
   }
 }
 
+}  // namespace
+
 Status PrimitiveTakeExec(KernelContext* ctx, const ExecSpan& batch, ExecResult* out) {
   const ArraySpan& values = batch[0].array;
   const ArraySpan& indices = batch[1].array;
@@ -577,24 +597,40 @@ Status PrimitiveTakeExec(KernelContext* ctx, const ExecSpan& batch, ExecResult*
       TakeIndexDispatch<BooleanTakeImpl>(values, indices, out_arr);
       break;
     case 8:
-      TakeIndexDispatch<PrimitiveTakeImpl, int8_t>(values, indices, out_arr);
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 1>>(
+          values, indices, out_arr);
       break;
     case 16:
-      TakeIndexDispatch<PrimitiveTakeImpl, int16_t>(values, indices, out_arr);
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 2>>(
+          values, indices, out_arr);
       break;
     case 32:
-      TakeIndexDispatch<PrimitiveTakeImpl, int32_t>(values, indices, out_arr);
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 4>>(
+          values, indices, out_arr);
       break;
     case 64:
-      TakeIndexDispatch<PrimitiveTakeImpl, int64_t>(values, indices, out_arr);
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 8>>(
+          values, indices, out_arr);
       break;
-    default:
-      DCHECK(false) << "Invalid values byte width";
+    case 128:
+      // For INTERVAL_MONTH_DAY_NANO, DECIMAL128
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 16>>(
+          values, indices, out_arr);
+      break;
+    case 256:
+      // For DECIMAL256
+      TakeIndexDispatch<PrimitiveTakeImpl, std::integral_constant<int, 32>>(
+          values, indices, out_arr);
       break;
+    default:
+      return Status::NotImplemented("Unsupported primitive type for take: ",
+                                    *values.type);
   }
   return Status::OK();
 }
 
+namespace {
+
 // ----------------------------------------------------------------------
 // Null take
 
@@ -836,8 +872,8 @@ void PopulateTakeKernels(std::vector<SelectionKernelData>* out) {
       {InputType(match::LargeBinaryLike()), take_indices, LargeVarBinaryTakeExec},
       {InputType(Type::FIXED_SIZE_BINARY), take_indices, FSBTakeExec},
       {InputType(null()), take_indices, NullTakeExec},
-      {InputType(Type::DECIMAL128), take_indices, FSBTakeExec},
-      {InputType(Type::DECIMAL256), take_indices, FSBTakeExec},
+      {InputType(Type::DECIMAL128), take_indices, PrimitiveTakeExec},
+      {InputType(Type::DECIMAL256), take_indices, PrimitiveTakeExec},
       {InputType(Type::DICTIONARY), take_indices, DictionaryTake},
       {InputType(Type::EXTENSION), take_indices, ExtensionTake},
       {InputType(Type::LIST), take_indices, ListTakeExec},
diff --git a/cpp/src/arrow/compute/kernels/vector_selection_test.cc b/cpp/src/arrow/compute/kernels/vector_selection_test.cc
index bdf9f5454f..ec94b328ea 100644
--- a/cpp/src/arrow/compute/kernels/vector_selection_test.cc
+++ b/cpp/src/arrow/compute/kernels/vector_selection_test.cc
@@ -309,6 +309,33 @@ class TestFilterKernel : public ::testing::Test {
     AssertFilter(values_array, ree_filter, expected_array);
   }
 
+  void TestNumericBasics(const std::shared_ptr<DataType>& type) {
+    ARROW_SCOPED_TRACE("type = ", *type);
+    AssertFilter(type, "[]", "[]", "[]");
+
+    AssertFilter(type, "[9]", "[0]", "[]");
+    AssertFilter(type, "[9]", "[1]", "[9]");
+    AssertFilter(type, "[9]", "[null]", "[null]");
+    AssertFilter(type, "[null]", "[0]", "[]");
+    AssertFilter(type, "[null]", "[1]", "[null]");
+    AssertFilter(type, "[null]", "[null]", "[null]");
+
+    AssertFilter(type, "[7, 8, 9]", "[0, 1, 0]", "[8]");
+    AssertFilter(type, "[7, 8, 9]", "[1, 0, 1]", "[7, 9]");
+    AssertFilter(type, "[null, 8, 9]", "[0, 1, 0]", "[8]");
+    AssertFilter(type, "[7, 8, 9]", "[null, 1, 0]", "[null, 8]");
+    AssertFilter(type, "[7, 8, 9]", "[1, null, 1]", "[7, null, 9]");
+
+    AssertFilter(ArrayFromJSON(type, "[7, 8, 9]"),
+                 ArrayFromJSON(boolean(), "[0, 1, 1, 1, 0, 1]")->Slice(3, 3),
+                 ArrayFromJSON(type, "[7, 9]"));
+
+    ASSERT_RAISES(Invalid, Filter(ArrayFromJSON(type, "[7, 8, 9]"),
+                                  ArrayFromJSON(boolean(), "[]"), emit_null_));
+    ASSERT_RAISES(Invalid, Filter(ArrayFromJSON(type, "[7, 8, 9]"),
+                                  ArrayFromJSON(boolean(), "[]"), drop_));
+  }
+
   const FilterOptions emit_null_, drop_;
 };
 
@@ -342,6 +369,33 @@ void ValidateFilter(const std::shared_ptr<Array>& values,
                     /*verbose=*/true);
 }
 
+TEST_F(TestFilterKernel, Temporal) {
+  this->TestNumericBasics(time32(TimeUnit::MILLI));
+  this->TestNumericBasics(time64(TimeUnit::MICRO));
+  this->TestNumericBasics(timestamp(TimeUnit::NANO, "Europe/Paris"));
+  this->TestNumericBasics(duration(TimeUnit::SECOND));
+  this->TestNumericBasics(date32());
+  this->AssertFilter(date64(), "[0, 86400000, null]", "[null, 1, 0]", "[null, 86400000]");
+}
+
+TEST_F(TestFilterKernel, Duration) {
+  for (auto type : DurationTypes()) {
+    this->TestNumericBasics(type);
+  }
+}
+
+TEST_F(TestFilterKernel, Interval) {
+  this->TestNumericBasics(month_interval());
+
+  auto type = day_time_interval();
+  this->AssertFilter(type, "[[1, -600], [2, 3000], null]", "[null, 1, 0]",
+                     "[null, [2, 3000]]");
+  type = month_day_nano_interval();
+  this->AssertFilter(type,
+                     "[[1, -2, 34567890123456789], [2, 3, -34567890123456789], null]",
+                     "[null, 1, 0]", "[null, [2, 3, -34567890123456789]]");
+}
+
 class TestFilterKernelWithNull : public TestFilterKernel {
  protected:
   void AssertFilter(const std::string& values, const std::string& filter,
@@ -401,30 +455,7 @@ class TestFilterKernelWithNumeric : public TestFilterKernel {
 
 TYPED_TEST_SUITE(TestFilterKernelWithNumeric, NumericArrowTypes);
 TYPED_TEST(TestFilterKernelWithNumeric, FilterNumeric) {
-  auto type = this->type_singleton();
-  this->AssertFilter(type, "[]", "[]", "[]");
-
-  this->AssertFilter(type, "[9]", "[0]", "[]");
-  this->AssertFilter(type, "[9]", "[1]", "[9]");
-  this->AssertFilter(type, "[9]", "[null]", "[null]");
-  this->AssertFilter(type, "[null]", "[0]", "[]");
-  this->AssertFilter(type, "[null]", "[1]", "[null]");
-  this->AssertFilter(type, "[null]", "[null]", "[null]");
-
-  this->AssertFilter(type, "[7, 8, 9]", "[0, 1, 0]", "[8]");
-  this->AssertFilter(type, "[7, 8, 9]", "[1, 0, 1]", "[7, 9]");
-  this->AssertFilter(type, "[null, 8, 9]", "[0, 1, 0]", "[8]");
-  this->AssertFilter(type, "[7, 8, 9]", "[null, 1, 0]", "[null, 8]");
-  this->AssertFilter(type, "[7, 8, 9]", "[1, null, 1]", "[7, null, 9]");
-
-  this->AssertFilter(ArrayFromJSON(type, "[7, 8, 9]"),
-                     ArrayFromJSON(boolean(), "[0, 1, 1, 1, 0, 1]")->Slice(3, 3),
-                     ArrayFromJSON(type, "[7, 9]"));
-
-  ASSERT_RAISES(Invalid, Filter(ArrayFromJSON(type, "[7, 8, 9]"),
-                                ArrayFromJSON(boolean(), "[]"), this->emit_null_));
-  ASSERT_RAISES(Invalid, Filter(ArrayFromJSON(type, "[7, 8, 9]"),
-                                ArrayFromJSON(boolean(), "[]"), this->drop_));
+  this->TestNumericBasics(this->type_singleton());
 }
 
 template <typename CType>
@@ -588,7 +619,7 @@ TYPED_TEST(TestFilterKernelWithDecimal, FilterNumeric) {
                                 ArrayFromJSON(boolean(), "[]"), this->drop_));
 }
 
-TEST(TestFilterKernel, NoValidityBitmapButUnknownNullCount) {
+TEST_F(TestFilterKernel, NoValidityBitmapButUnknownNullCount) {
   auto values = ArrayFromJSON(int32(), "[1, 2, 3, 4]");
   auto filter = ArrayFromJSON(boolean(), "[true, true, false, true]");
 
@@ -1136,6 +1167,20 @@ class TestTakeKernel : public ::testing::Test {
     TestNoValidityBitmapButUnknownNullCount(ArrayFromJSON(type, values),
                                             ArrayFromJSON(int16(), indices));
   }
+
+  void TestNumericBasics(const std::shared_ptr<DataType>& type) {
+    ARROW_SCOPED_TRACE("type = ", *type);
+    CheckTake(type, "[7, 8, 9]", "[]", "[]");
+    CheckTake(type, "[7, 8, 9]", "[0, 1, 0]", "[7, 8, 7]");
+    CheckTake(type, "[null, 8, 9]", "[0, 1, 0]", "[null, 8, null]");
+    CheckTake(type, "[7, 8, 9]", "[null, 1, 0]", "[null, 8, 7]");
+    CheckTake(type, "[null, 8, 9]", "[]", "[]");
+    CheckTake(type, "[7, 8, 9]", "[0, 0, 0, 0, 0, 0, 2]", "[7, 7, 7, 7, 7, 7, 9]");
+
+    std::shared_ptr<Array> arr;
+    ASSERT_RAISES(IndexError, TakeJSON(type, "[7, 8, 9]", int8(), "[0, 9, 0]", &arr));
+    ASSERT_RAISES(IndexError, TakeJSON(type, "[7, 8, 9]", int8(), "[0, -1, 0]", &arr));
+  }
 };
 
 template <typename ArrowType>
@@ -1201,6 +1246,34 @@ TEST_F(TestTakeKernel, TakeBoolean) {
                 TakeJSON(boolean(), "[true, false, true]", int8(), "[0, -1, 0]", &arr));
 }
 
+TEST_F(TestTakeKernel, Temporal) {
+  this->TestNumericBasics(time32(TimeUnit::MILLI));
+  this->TestNumericBasics(time64(TimeUnit::MICRO));
+  this->TestNumericBasics(timestamp(TimeUnit::NANO, "Europe/Paris"));
+  this->TestNumericBasics(duration(TimeUnit::SECOND));
+  this->TestNumericBasics(date32());
+  CheckTake(date64(), "[0, 86400000, null]", "[null, 1, 1, 0]",
+            "[null, 86400000, 86400000, 0]");
+}
+
+TEST_F(TestTakeKernel, Duration) {
+  for (auto type : DurationTypes()) {
+    this->TestNumericBasics(type);
+  }
+}
+
+TEST_F(TestTakeKernel, Interval) {
+  this->TestNumericBasics(month_interval());
+
+  auto type = day_time_interval();
+  CheckTake(type, "[[1, -600], [2, 3000], null]", "[0, null, 2, 1]",
+            "[[1, -600], null, null, [2, 3000]]");
+  type = month_day_nano_interval();
+  CheckTake(type, "[[1, -2, 34567890123456789], [2, 3, -34567890123456789], null]",
+            "[0, null, 2, 1]",
+            "[[1, -2, 34567890123456789], null, null, [2, 3, -34567890123456789]]");
+}
+
 template <typename ArrowType>
 class TestTakeKernelWithNumeric : public TestTakeKernelTyped<ArrowType> {
  protected:
@@ -1216,18 +1289,7 @@ class TestTakeKernelWithNumeric : public TestTakeKernelTyped<ArrowType> {
 
 TYPED_TEST_SUITE(TestTakeKernelWithNumeric, NumericArrowTypes);
 TYPED_TEST(TestTakeKernelWithNumeric, TakeNumeric) {
-  this->AssertTake("[7, 8, 9]", "[]", "[]");
-  this->AssertTake("[7, 8, 9]", "[0, 1, 0]", "[7, 8, 7]");
-  this->AssertTake("[null, 8, 9]", "[0, 1, 0]", "[null, 8, null]");
-  this->AssertTake("[7, 8, 9]", "[null, 1, 0]", "[null, 8, 7]");
-  this->AssertTake("[null, 8, 9]", "[]", "[]");
-  this->AssertTake("[7, 8, 9]", "[0, 0, 0, 0, 0, 0, 2]", "[7, 7, 7, 7, 7, 7, 9]");
-
-  std::shared_ptr<Array> arr;
-  ASSERT_RAISES(IndexError,
-                TakeJSON(this->type_singleton(), "[7, 8, 9]", int8(), "[0, 9, 0]", &arr));
-  ASSERT_RAISES(IndexError, TakeJSON(this->type_singleton(), "[7, 8, 9]", int8(),
-                                     "[0, -1, 0]", &arr));
+  this->TestNumericBasics(this->type_singleton());
 }
 
 template <typename TypeClass>
@@ -1816,6 +1878,7 @@ TEST(TestTakeMetaFunction, ArityChecking) {
 template <typename Unused = void>
 struct FilterRandomTest {
   static void Test(const std::shared_ptr<DataType>& type) {
+    ARROW_SCOPED_TRACE("type = ", *type);
     auto rand = random::RandomArrayGenerator(kRandomSeed);
     const int64_t length = static_cast<int64_t>(1ULL << 10);
     for (auto null_probability : {0.0, 0.01, 0.1, 0.999, 1.0}) {
@@ -1856,6 +1919,7 @@ void CheckTakeRandom(const std::shared_ptr<Array>& values, int64_t indices_lengt
 template <typename ValuesType>
 struct TakeRandomTest {
   static void Test(const std::shared_ptr<DataType>& type) {
+    ARROW_SCOPED_TRACE("type = ", *type);
     auto rand = random::RandomArrayGenerator(kRandomSeed);
     const int64_t values_length = 64 * 16 + 1;
     const int64_t indices_length = 64 * 4 + 1;
@@ -1897,8 +1961,10 @@ TEST(TestFilter, RandomString) {
 }
 
 TEST(TestFilter, RandomFixedSizeBinary) {
-  FilterRandomTest<>::Test(fixed_size_binary(0));
-  FilterRandomTest<>::Test(fixed_size_binary(16));
+  // FixedSizeBinary filter is special-cased for some widths
+  for (int32_t width : {0, 1, 16, 32, 35}) {
+    FilterRandomTest<>::Test(fixed_size_binary(width));
+  }
 }
 
 TEST(TestTake, PrimitiveRandom) { TestRandomPrimitiveCTypes<TakeRandomTest>(); }
@@ -1911,8 +1977,10 @@ TEST(TestTake, RandomString) {
 }
 
 TEST(TestTake, RandomFixedSizeBinary) {
-  TakeRandomTest<FixedSizeBinaryType>::Test(fixed_size_binary(0));
-  TakeRandomTest<FixedSizeBinaryType>::Test(fixed_size_binary(16));
+  // FixedSizeBinary take is special-cased for some widths
+  for (int32_t width : {0, 1, 16, 32, 35}) {
+    TakeRandomTest<FixedSizeBinaryType>::Test(fixed_size_binary(width));
+  }
 }
 
 // ----------------------------------------------------------------------


(arrow) 30/30: GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output (#39804)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 0e9bd55b6584441fa078337728d703c9dc1c2049
Author: Jeremy Aguilon <je...@gmail.com>
AuthorDate: Mon Feb 19 09:54:57 2024 -0500

    GH-39803: [C++][Acero] Fix AsOfJoin with differently ordered schemas than the output (#39804)
    
    ### Rationale for this change
    
    The issue is described visually in https://github.com/apache/arrow/issues/39803.
    
    The key hasher works by hashing every row of the input tables' key columns. An important step is inspecting the [column metadata](https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/asof_join_node.cc#L412) for the asof-join key fields. This returns whether columns are fixed width, among other things.
    
    The issue is that we are passing the `output_schema` rather than the input's schema.
    
    If an input looks like
    
    ```
    key_string_type,ts_int32_type,val
    ```
    
    But our expected output schema looks like:
    
    ```
    ts_int32,key_string_type,...
    ```
    Then the hasher will think that the `key_string_type` column is an int32, which completely throws off the hashes. Existing tests get away with this because they use ints across the board.
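
    The mismatch can be reproduced without Arrow at all. Below is a hypothetical, standalone sketch (it does not use Acero's real key hasher or column metadata types): a row hasher that slices each row into key columns using the byte widths listed in a schema will hash the same bytes differently as soon as the widths come from a differently ordered schema, which is the effect of passing `output_schema`.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in: a "schema" is just the byte width of each key column, in order.
    using Schema = std::vector<std::size_t>;

    // FNV-1a over one column's bytes.
    uint64_t HashColumn(const uint8_t* data, std::size_t width) {
      uint64_t h = 1469598103934665603ULL;
      for (std::size_t i = 0; i < width; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;
      }
      return h;
    }

    // Hash a row by slicing it into columns according to `schema` and combining
    // the per-column hashes.
    uint64_t HashRow(const std::vector<uint8_t>& row, const Schema& schema) {
      uint64_t h = 0;
      std::size_t offset = 0;
      for (std::size_t width : schema) {
        h = h * 31 + HashColumn(row.data() + offset, width);
        offset += width;
      }
      return h;
    }

    int main() {
      // The same 12 bytes of row data: an 8-byte key followed by a 4-byte timestamp.
      std::vector<uint8_t> row = {'k', 'e', 'y', '0', 0, 0, 0, 0, /*ts=*/42, 0, 0, 0};
      Schema input_schema = {8, 4};   // layout of the input batch: key, then ts
      Schema output_schema = {4, 8};  // order of the output schema: ts, then key

      // Slicing the same bytes with the wrong widths yields a different hash,
      // which is the kind of mismatch visible in the key hasher logs below.
      std::cout << HashRow(row, input_schema) << " != " << HashRow(row, output_schema) << "\n";
      return 0;
    }
    ```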
    
    ### What changes are included in this PR?
    
    A one-line fix and a test with string types.
    
    ### Are these changes tested?
    
    Yes. You can see the test run before and after the changes here: https://gist.github.com/JerAguilon/953d82ed288d58f9ce24d1a925def2cc
    
    Before the change, notice that inputs 0 and 1 have mismatched hashes:
    
    ```
    AsofjoinNode(0x16cf9e2d8): key hasher 1 got hashes [0, 9784892099856512926, 1050982531982388796, 10763536662319179482, 2029627098739957112, 11814237723602982167, 3080328155728858293, 12792882290360550483, 4058972722486426609, 13771526852823217039]
    ...
    AsofjoinNode(0x16cf9dd18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
    
    ```
    
    And after, they do match:
    
    ```
    AsofjoinNode(0x16f2ea2d8): key hasher 1 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
    ...
    AsofjoinNode(0x16f2e9d18): key hasher 0 got hashes [17528465654998409509, 12047706865972860560, 18017664240540048750, 12358837084497432044, 8151160321586084686, 8691136767698756332, 15973065724125580046, 9654919479117127288, 618127929167745505, 3403805303373270709]
    ```
    
    ...which is exactly what you want, since the `key` column for both tables looks like `["0", "1", ..."9"]`
    
    ### Are there any user-facing changes?
    
    * Closes: #39803
    
    Lead-authored-by: Jeremy Aguilon <je...@gmail.com>
    Co-authored-by: Antoine Pitrou <pi...@free.fr>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/acero/asof_join_node.cc      |  2 +-
 cpp/src/arrow/acero/asof_join_node_test.cc | 64 ++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/cpp/src/arrow/acero/asof_join_node.cc b/cpp/src/arrow/acero/asof_join_node.cc
index 2609905a0b..e96d5ad44a 100644
--- a/cpp/src/arrow/acero/asof_join_node.cc
+++ b/cpp/src/arrow/acero/asof_join_node.cc
@@ -1098,7 +1098,7 @@ class AsofJoinNode : public ExecNode {
     auto inputs = this->inputs();
     for (size_t i = 0; i < inputs.size(); i++) {
       RETURN_NOT_OK(key_hashers_[i]->Init(plan()->query_context()->exec_context(),
-                                          output_schema()));
+                                          inputs[i]->output_schema()));
       ARROW_ASSIGN_OR_RAISE(
           auto input_state,
           InputState::Make(i, tolerance_, must_hash_, may_rehash_, key_hashers_[i].get(),
diff --git a/cpp/src/arrow/acero/asof_join_node_test.cc b/cpp/src/arrow/acero/asof_join_node_test.cc
index e400cc0316..d95d2aaad3 100644
--- a/cpp/src/arrow/acero/asof_join_node_test.cc
+++ b/cpp/src/arrow/acero/asof_join_node_test.cc
@@ -1582,6 +1582,70 @@ TEST(AsofJoinTest, BatchSequencing) {
   return TestSequencing(MakeIntegerBatches, /*num_batches=*/32, /*batch_size=*/1);
 }
 
+template <typename BatchesMaker>
+void TestSchemaResolution(BatchesMaker maker, int num_batches, int batch_size) {
+  // GH-39803: The key hasher needs to resolve the types of key columns. All other
+  // tests use int32 for all columns, but this test converts the key columns to
+  // strings via a projection node to test that the column is correctly resolved
+  // to string.
+  auto l_schema =
+      schema({field("time", int32()), field("key", int32()), field("l_value", int32())});
+  auto r_schema =
+      schema({field("time", int32()), field("key", int32()), field("r0_value", int32())});
+
+  auto make_shift = [&maker, num_batches, batch_size](
+                        const std::shared_ptr<Schema>& schema, int shift) {
+    return maker({[](int row) -> int64_t { return row; },
+                  [num_batches](int row) -> int64_t { return row / num_batches; },
+                  [shift](int row) -> int64_t { return row * 10 + shift; }},
+                 schema, num_batches, batch_size);
+  };
+  ASSERT_OK_AND_ASSIGN(auto l_batches, make_shift(l_schema, 0));
+  ASSERT_OK_AND_ASSIGN(auto r_batches, make_shift(r_schema, 1));
+
+  Declaration l_src = {"source",
+                       SourceNodeOptions(l_schema, l_batches.gen(false, false))};
+  Declaration r_src = {"source",
+                       SourceNodeOptions(r_schema, r_batches.gen(false, false))};
+  Declaration l_project = {
+      "project",
+      {std::move(l_src)},
+      ProjectNodeOptions({compute::field_ref("time"),
+                          compute::call("cast", {compute::field_ref("key")},
+                                        compute::CastOptions::Safe(utf8())),
+                          compute::field_ref("l_value")},
+                         {"time", "key", "l_value"})};
+  Declaration r_project = {
+      "project",
+      {std::move(r_src)},
+      ProjectNodeOptions({compute::call("cast", {compute::field_ref("key")},
+                                        compute::CastOptions::Safe(utf8())),
+                          compute::field_ref("r0_value"), compute::field_ref("time")},
+                         {"key", "r0_value", "time"})};
+
+  Declaration asofjoin = {
+      "asofjoin", {l_project, r_project}, GetRepeatedOptions(2, "time", {"key"}, 1000)};
+
+  QueryOptions query_options;
+  query_options.use_threads = false;
+  ASSERT_OK_AND_ASSIGN(auto table, DeclarationToTable(asofjoin, query_options));
+
+  Int32Builder expected_r0_b;
+  for (int i = 1; i <= 91; i += 10) {
+    ASSERT_OK(expected_r0_b.Append(i));
+  }
+  ASSERT_OK_AND_ASSIGN(auto expected_r0, expected_r0_b.Finish());
+
+  auto actual_r0 = table->GetColumnByName("r0_value");
+  std::vector<std::shared_ptr<arrow::Array>> chunks = {expected_r0};
+  auto expected_r0_chunked = std::make_shared<arrow::ChunkedArray>(chunks);
+  ASSERT_TRUE(actual_r0->Equals(expected_r0_chunked));
+}
+
+TEST(AsofJoinTest, OutputSchemaResolution) {
+  return TestSchemaResolution(MakeIntegerBatches, /*num_batches=*/1, /*batch_size=*/10);
+}
+
 namespace {
 
 Result<AsyncGenerator<std::optional<ExecBatch>>> MakeIntegerBatchGenForTest(


(arrow) 28/30: GH-39999: [Python] Fix tests for pandas with CoW / nightly integration tests (#40000)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 0d0be3b5a0d233a9287121f3fd5a4c92d7538112
Author: Joris Van den Bossche <jo...@gmail.com>
AuthorDate: Fri Feb 9 09:04:16 2024 +0100

    GH-39999: [Python] Fix tests for pandas with CoW / nightly integration tests (#40000)
    
    ### Rationale for this change
    
    Fixing a failing test with pandas nightly because of CoW changes.
    
    * Closes: #39999
    
    Authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 python/pyarrow/tests/test_pandas.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index d15ee82d5d..8fd4b3041b 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -3643,7 +3643,8 @@ def test_singleton_blocks_zero_copy():
 
     prior_allocation = pa.total_allocated_bytes()
     result = t.to_pandas()
-    assert result['f0'].values.flags.writeable
+    # access private `_values` because the public `values` is made read-only by pandas
+    assert result['f0']._values.flags.writeable
     assert pa.total_allocated_bytes() > prior_allocation
 
 


(arrow) 15/30: GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 (#39528)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 8d7d90a5ba806ca3ac5a69ffeddae55f4dfabc2c
Author: emkornfield <em...@gmail.com>
AuthorDate: Fri Jan 26 23:02:12 2024 -0800

    GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 (#39528)
    
    Be defensive instead of writing invalid data.
    
    ### Rationale for this change
    
    Users can provide this API with pages that are too large to write validly, and we silently truncate their lengths before writing.
    
    ### What changes are included in this PR?
    
    Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being built).
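
    Below is a minimal, standalone sketch of the guard pattern the change applies; `CheckedPageSize` and `std::length_error` are illustrative stand-ins, not the actual Parquet writer API, which throws `ParquetException` inside `SerializedPageWriter`.

    ```cpp
    #include <cstdint>
    #include <limits>
    #include <stdexcept>
    #include <string>

    // Validate that a 64-bit size fits into int32_t *before* the narrowing cast,
    // instead of silently truncating it.
    int32_t CheckedPageSize(int64_t size, const std::string& what) {
      if (size > std::numeric_limits<int32_t>::max()) {
        throw std::length_error(what + " overflows INT32_MAX. Size: " + std::to_string(size));
      }
      return static_cast<int32_t>(size);
    }

    int main() {
      // Small sizes pass through unchanged.
      int32_t ok = CheckedPageSize(4096, "Compressed data page size");

      // A size past INT32_MAX now raises instead of being written out truncated.
      try {
        CheckedPageSize(int64_t{std::numeric_limits<int32_t>::max()} + 1,
                        "Uncompressed data page size");
      } catch (const std::length_error&) {
        // The exception message names the size that overflowed.
      }
      return ok == 4096 ? 0 : 1;
    }
    ```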
    
    ### Are these changes tested?
    
    Unit tested
    
    ### Are there any user-facing changes?
    
    This might start raising exceptions instead of writing out invalid parquet files.
    
    Closes #39527
    
    **This PR contains a "Critical Fix".**
    * Closes: #39527
    
    Lead-authored-by: emkornfield <em...@gmail.com>
    Co-authored-by: Micah Kornfield <mi...@google.com>
    Co-authored-by: mwish <ma...@gmail.com>
    Co-authored-by: Antoine Pitrou <pi...@free.fr>
    Co-authored-by: Gang Wu <us...@gmail.com>
    Signed-off-by: mwish <ma...@gmail.com>
---
 cpp/src/parquet/column_writer.cc      | 29 ++++++++++++++++++----
 cpp/src/parquet/column_writer_test.cc | 45 +++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 4 deletions(-)

diff --git a/cpp/src/parquet/column_writer.cc b/cpp/src/parquet/column_writer.cc
index 12b2837fbf..23366b2daa 100644
--- a/cpp/src/parquet/column_writer.cc
+++ b/cpp/src/parquet/column_writer.cc
@@ -271,7 +271,12 @@ class SerializedPageWriter : public PageWriter {
   }
 
   int64_t WriteDictionaryPage(const DictionaryPage& page) override {
-    int64_t uncompressed_size = page.size();
+    int64_t uncompressed_size = page.buffer()->size();
+    if (uncompressed_size > std::numeric_limits<int32_t>::max()) {
+      throw ParquetException(
+          "Uncompressed dictionary page size overflows INT32_MAX. Size:",
+          uncompressed_size);
+    }
     std::shared_ptr<Buffer> compressed_data;
     if (has_compressor()) {
       auto buffer = std::static_pointer_cast<ResizableBuffer>(
@@ -288,6 +293,11 @@ class SerializedPageWriter : public PageWriter {
     dict_page_header.__set_is_sorted(page.is_sorted());
 
     const uint8_t* output_data_buffer = compressed_data->data();
+    if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
+      throw ParquetException(
+          "Compressed dictionary page size overflows INT32_MAX. Size: ",
+          uncompressed_size);
+    }
     int32_t output_data_len = static_cast<int32_t>(compressed_data->size());
 
     if (data_encryptor_.get()) {
@@ -371,18 +381,29 @@ class SerializedPageWriter : public PageWriter {
     const int64_t uncompressed_size = page.uncompressed_size();
     std::shared_ptr<Buffer> compressed_data = page.buffer();
     const uint8_t* output_data_buffer = compressed_data->data();
-    int32_t output_data_len = static_cast<int32_t>(compressed_data->size());
+    int64_t output_data_len = compressed_data->size();
+
+    if (output_data_len > std::numeric_limits<int32_t>::max()) {
+      throw ParquetException("Compressed data page size overflows INT32_MAX. Size:",
+                             output_data_len);
+    }
 
     if (data_encryptor_.get()) {
       PARQUET_THROW_NOT_OK(encryption_buffer_->Resize(
           data_encryptor_->CiphertextSizeDelta() + output_data_len, false));
       UpdateEncryption(encryption::kDataPage);
-      output_data_len = data_encryptor_->Encrypt(compressed_data->data(), output_data_len,
+      output_data_len = data_encryptor_->Encrypt(compressed_data->data(),
+                                                 static_cast<int32_t>(output_data_len),
                                                  encryption_buffer_->mutable_data());
       output_data_buffer = encryption_buffer_->data();
     }
 
     format::PageHeader page_header;
+
+    if (uncompressed_size > std::numeric_limits<int32_t>::max()) {
+      throw ParquetException("Uncompressed data page size overflows INT32_MAX. Size:",
+                             uncompressed_size);
+    }
     page_header.__set_uncompressed_page_size(static_cast<int32_t>(uncompressed_size));
     page_header.__set_compressed_page_size(static_cast<int32_t>(output_data_len));
 
@@ -421,7 +442,7 @@ class SerializedPageWriter : public PageWriter {
     if (offset_index_builder_ != nullptr) {
       const int64_t compressed_size = output_data_len + header_size;
       if (compressed_size > std::numeric_limits<int32_t>::max()) {
-        throw ParquetException("Compressed page size overflows to INT32_MAX.");
+        throw ParquetException("Compressed page size overflows INT32_MAX.");
       }
       if (!page.first_row_index().has_value()) {
         throw ParquetException("First row index is not set in data page.");
diff --git a/cpp/src/parquet/column_writer_test.cc b/cpp/src/parquet/column_writer_test.cc
index 59fc848d7f..97421629d2 100644
--- a/cpp/src/parquet/column_writer_test.cc
+++ b/cpp/src/parquet/column_writer_test.cc
@@ -15,9 +15,11 @@
 // specific language governing permissions and limitations
 // under the License.
 
+#include <memory>
 #include <utility>
 #include <vector>
 
+#include <gmock/gmock.h>
 #include <gtest/gtest.h>
 
 #include "arrow/io/buffered.h"
@@ -25,6 +27,7 @@
 #include "arrow/util/bit_util.h"
 #include "arrow/util/bitmap_builders.h"
 
+#include "parquet/column_page.h"
 #include "parquet/column_reader.h"
 #include "parquet/column_writer.h"
 #include "parquet/file_reader.h"
@@ -479,6 +482,9 @@ using TestValuesWriterInt64Type = TestPrimitiveWriter<Int64Type>;
 using TestByteArrayValuesWriter = TestPrimitiveWriter<ByteArrayType>;
 using TestFixedLengthByteArrayValuesWriter = TestPrimitiveWriter<FLBAType>;
 
+using ::testing::HasSubstr;
+using ::testing::ThrowsMessage;
+
 TYPED_TEST(TestPrimitiveWriter, RequiredPlain) {
   this->TestRequiredWithEncoding(Encoding::PLAIN);
 }
@@ -889,6 +895,45 @@ TEST_F(TestByteArrayValuesWriter, CheckDefaultStats) {
   ASSERT_TRUE(this->metadata_is_stats_set());
 }
 
+TEST(TestPageWriter, ThrowsOnPagesTooLarge) {
+  NodePtr item = schema::Int32("item");  // optional item
+  NodePtr list(GroupNode::Make("b", Repetition::REPEATED, {item}, ConvertedType::LIST));
+  NodePtr bag(GroupNode::Make("bag", Repetition::OPTIONAL, {list}));  // optional list
+  std::vector<NodePtr> fields = {bag};
+  NodePtr root = GroupNode::Make("schema", Repetition::REPEATED, fields);
+
+  SchemaDescriptor schema;
+  schema.Init(root);
+
+  auto sink = CreateOutputStream();
+  auto props = WriterProperties::Builder().build();
+
+  auto metadata = ColumnChunkMetaDataBuilder::Make(props, schema.Column(0));
+  std::unique_ptr<PageWriter> pager =
+      PageWriter::Open(sink, Compression::UNCOMPRESSED, metadata.get());
+
+  uint8_t data;
+  std::shared_ptr<Buffer> buffer =
+      std::make_shared<Buffer>(&data, std::numeric_limits<int32_t>::max() + int64_t{1});
+  DataPageV1 over_compressed_limit(buffer, /*num_values=*/100, Encoding::BIT_PACKED,
+                                   Encoding::BIT_PACKED, Encoding::BIT_PACKED,
+                                   /*uncompressed_size=*/100);
+  EXPECT_THAT([&]() { pager->WriteDataPage(over_compressed_limit); },
+              ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
+  DictionaryPage dictionary_over_compressed_limit(buffer, /*num_values=*/100,
+                                                  Encoding::PLAIN);
+  EXPECT_THAT([&]() { pager->WriteDictionaryPage(dictionary_over_compressed_limit); },
+              ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
+
+  buffer = std::make_shared<Buffer>(&data, 1);
+  DataPageV1 over_uncompressed_limit(
+      buffer, /*num_values=*/100, Encoding::BIT_PACKED, Encoding::BIT_PACKED,
+      Encoding::BIT_PACKED,
+      /*uncompressed_size=*/std::numeric_limits<int32_t>::max() + int64_t{1});
+  EXPECT_THAT([&]() { pager->WriteDataPage(over_uncompressed_limit); },
+              ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
+}
+
 TEST(TestColumnWriter, RepeatedListsUpdateSpacedBug) {
   // In ARROW-3930 we discovered a bug when writing from Arrow when we had data
   // that looks like this:


(arrow) 17/30: GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.* (#39758)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 66def3d6bf624c4d4b7a864ae71ecef43c51ab87
Author: Joris Van den Bossche <jo...@gmail.com>
AuthorDate: Tue Jan 30 09:16:53 2024 +0100

    GH-39640: [Docs] Pin pydata-sphinx-theme to 0.14.* (#39758)
    
    ### Rationale for this change
    
    Fixing the pinning syntax so we get the latest 0.14.x version (which is currently 0.14.4)
    
    * Closes: #39640
    
    Authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 ci/conda_env_sphinx.txt            | 2 +-
 docs/requirements.txt              | 2 +-
 docs/source/python/api/compute.rst | 2 +-
 docs/source/python/compute.rst     | 4 ++--
 docs/source/python/pandas.rst      | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/ci/conda_env_sphinx.txt b/ci/conda_env_sphinx.txt
index d0f494d2e0..0e50875fc1 100644
--- a/ci/conda_env_sphinx.txt
+++ b/ci/conda_env_sphinx.txt
@@ -20,7 +20,7 @@ breathe
 doxygen
 ipython
 numpydoc
-pydata-sphinx-theme=0.14.1
+pydata-sphinx-theme=0.14
 sphinx-autobuild
 sphinx-design
 sphinx-copybutton
diff --git a/docs/requirements.txt b/docs/requirements.txt
index aee2eb662c..5d6fec7ddf 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -5,7 +5,7 @@
 breathe
 ipython
 numpydoc
-pydata-sphinx-theme==0.14.1
+pydata-sphinx-theme~=0.14
 sphinx-autobuild
 sphinx-design
 sphinx-copybutton
diff --git a/docs/source/python/api/compute.rst b/docs/source/python/api/compute.rst
index b879643017..928c607d13 100644
--- a/docs/source/python/api/compute.rst
+++ b/docs/source/python/api/compute.rst
@@ -590,4 +590,4 @@ User-Defined Functions
    :toctree: ../generated/
 
    register_scalar_function
-   ScalarUdfContext
+   UdfContext
diff --git a/docs/source/python/compute.rst b/docs/source/python/compute.rst
index e8a5b613c6..c02059a4f8 100644
--- a/docs/source/python/compute.rst
+++ b/docs/source/python/compute.rst
@@ -445,9 +445,9 @@ output type need to be defined. Using :func:`pyarrow.compute.register_scalar_fun
 
 The implementation of a user-defined function always takes a first *context*
 parameter (named ``ctx`` in the example above) which is an instance of
-:class:`pyarrow.compute.ScalarUdfContext`.
+:class:`pyarrow.compute.UdfContext`.
 This context exposes several useful attributes, particularly a
-:attr:`~pyarrow.compute.ScalarUdfContext.memory_pool` to be used for
+:attr:`~pyarrow.compute.UdfContext.memory_pool` to be used for
 allocations in the context of the user-defined function.
 
 You can call a user-defined function directly using :func:`pyarrow.compute.call_function`:
diff --git a/docs/source/python/pandas.rst b/docs/source/python/pandas.rst
index fda90c4f2a..23a4b73bd0 100644
--- a/docs/source/python/pandas.rst
+++ b/docs/source/python/pandas.rst
@@ -197,7 +197,7 @@ use the ``datetime64[ns]`` type in Pandas and are converted to an Arrow
 
 .. ipython:: python
 
-   df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="H", periods=3)})
+   df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)})
    df.dtypes
    df
 


(arrow) 08/30: GH-39656: [Release] Update platform tags for macOS wheels to macosx_10_15 (#39657)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-15.0.x
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ee7f54cd1b0681540ae7fe852de2d1ed4ad6535b
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Thu Jan 18 04:22:57 2024 +0100

    GH-39656: [Release] Update platform tags for macOS wheels to macosx_10_15 (#39657)
    
    ### Rationale for this change
    
    Currently the binary verification for releases fails due to a wrong macOS platform version.
    
    ### What changes are included in this PR?
    
    Update to the current generated platform tag.
    
    ### Are these changes tested?
    
    No, but I've validated that this is the correct generated platform tag for the wheels on the Release Candidate: https://apache.jfrog.io/ui/native/arrow/python-rc/15.0.0-rc1/
    
    ### Are there any user-facing changes?
    
    Not because of this change, but the PR that caused this issue did include a change to the minimum supported macOS version.
    * Closes: #39656
    
    Authored-by: Raúl Cumplido <ra...@gmail.com>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 dev/release/verify-release-candidate.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh
index c5e27d0830..90f071c5b4 100755
--- a/dev/release/verify-release-candidate.sh
+++ b/dev/release/verify-release-candidate.sh
@@ -1136,7 +1136,7 @@ test_macos_wheels() {
     local check_flight=OFF
   else
     local python_versions="3.8 3.9 3.10 3.11 3.12"
-    local platform_tags="macosx_10_14_x86_64"
+    local platform_tags="macosx_10_15_x86_64"
   fi
 
   # verify arch-native wheels inside an arch-native conda environment