Posted to commits@arrow.apache.org by ra...@apache.org on 2023/01/18 08:34:16 UTC

[arrow] branch maint-11.0.0 updated (2f8e2b2051 -> ccd169caed)

This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a change to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git


    from 2f8e2b2051 MINOR: [Python] Fix pre-commit and archery flake8 to honor pyarrow config (#15067)
     new 461c17d0da GH-33687: [Dev] Fix commit message generation in merge script (#33691)
     new fa52bb02ae GH-15243: [C++] fix for potential deadlock in the group-by node (#33700)
     new 3a2f3650ed GH-33705: [R] Fix link on README (#33706)
     new 057f291eff GH-33666: [R] Remove extraneous argument to semi_join (#33693)
     new 5bda66a96a GH-25633: [CI][Java][macOS] Ensure using bundled RE2 (#33711)
     new 7b759bcb1f GH-15265: [Java] Publish SBOM artifacts (#15267)
     new 8d1e357b77 GH-14875: [C++] C Data Interface: check imported buffer for non-null (#14814)
     new c2199dc490 GH-20512: [Python] Numpy conversion doesn't account for ListArray offset (#15210)
     new 66332515d0 GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow (#33614)
     new ccd169caed GH-14997: [Release] Ensure archery release tasks works with both new style GitHub issues and old style JIRA issues (#33615)

The 10 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .github/workflows/java_nightly.yml                 |   2 +-
 ci/scripts/java_full_build.sh                      |   8 +-
 ci/scripts/java_jni_macos_build.sh                 |   1 +
 cpp/src/arrow/array/array_nested.h                 |   4 +-
 cpp/src/arrow/array/validate.cc                    |  11 +-
 cpp/src/arrow/c/bridge.cc                          |  40 +++-
 cpp/src/arrow/c/bridge_test.cc                     | 132 ++++++++++-
 cpp/src/arrow/compute/exec/aggregate_node.cc       |  32 ++-
 dev/archery/archery/release/cli.py                 |  41 ++--
 dev/archery/archery/release/core.py                | 258 ++++++++++++++-------
 dev/archery/archery/release/reports.py             |   7 +-
 dev/archery/archery/release/tests/test_release.py  |  91 +++++---
 .../archery/templates/release_changelog.md.j2      |   4 +
 .../archery/templates/release_curation.txt.j2      |  20 +-
 dev/archery/setup.py                               |   2 +-
 dev/merge_arrow_pr.py                              |  13 +-
 dev/tasks/java-jars/github.yml                     |   9 +
 dev/tasks/tasks.yml                                |  48 ++++
 java/pom.xml                                       |  13 ++
 python/pyarrow/src/arrow/python/arrow_to_pandas.cc |  23 +-
 python/pyarrow/tests/test_pandas.py                |  45 ++++
 r/NAMESPACE                                        |   3 +
 r/R/csv.R                                          |   2 +-
 r/R/dataset-format.R                               | 252 +++++++++++++-------
 r/R/dataset.R                                      | 124 ++++++++++
 r/README.md                                        |   2 +-
 r/_pkgdown.yml                                     |   4 +
 r/man/CsvFileFormat.Rd                             |  41 ++++
 r/man/FileFormat.Rd                                |   3 +-
 r/man/acero.Rd                                     |   4 +-
 r/man/open_delim_dataset.Rd                        | 216 +++++++++++++++++
 r/tests/testthat/test-dataset-csv.R                |  90 ++++++-
 r/tests/testthat/test-dplyr-join.R                 |  11 +-
 33 files changed, 1275 insertions(+), 281 deletions(-)
 create mode 100644 r/man/CsvFileFormat.Rd
 create mode 100644 r/man/open_delim_dataset.Rd


[arrow] 07/10: GH-14875: [C++] C Data Interface: check imported buffer for non-null (#14814)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 8d1e357b7721af971ff02a0325fced448b78a26b
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Tue Jan 17 18:53:02 2023 +0100

    GH-14875: [C++] C Data Interface: check imported buffer for non-null (#14814)
    
    The C data interface may expose null data pointers for zero-sized buffers.
    Make sure that all buffer pointers remain non-null internally.
    
    Followup to GH-14805
    
    * Closes: #14875
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
---
 cpp/src/arrow/array/validate.cc |  11 ++--
 cpp/src/arrow/c/bridge.cc       |  40 +++++++++---
 cpp/src/arrow/c/bridge_test.cc  | 132 +++++++++++++++++++++++++++++++++++++++-
 3 files changed, 167 insertions(+), 16 deletions(-)
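
A minimal, hypothetical sketch (not part of this patch) of the case the fix targets: a zero-length int32 array exported over the C data interface with every buffer pointer left null. Assuming the post-patch behavior described in the commit message, the import succeeds and an internal zero-size buffer is substituted, so no null buffer pointers are visible on the imported array. The no-op release callback is an assumption for illustration only.

#include <arrow/api.h>
#include <arrow/c/abi.h>
#include <arrow/c/bridge.h>

#include <cassert>

// Hypothetical no-op release callback: the producer exported no buffers,
// so there is nothing to free.
static void NoOpRelease(struct ArrowArray* array) { array->release = nullptr; }

int main() {
  // Zero-length int32 array: validity and data buffers are both omitted
  // (null pointers), which the C data interface allows for empty buffers.
  const void* buffers[2] = {nullptr, nullptr};

  struct ArrowArray c_array = {};
  c_array.length = 0;
  c_array.null_count = 0;
  c_array.offset = 0;
  c_array.n_buffers = 2;
  c_array.n_children = 0;
  c_array.buffers = buffers;
  c_array.release = NoOpRelease;

  auto result = arrow::ImportArray(&c_array, arrow::int32());
  assert(result.ok());                    // accepted: computed buffer size is 0
  assert((*result)->length() == 0);
  assert((*result)->ValidateFull().ok());
  return 0;
}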

diff --git a/cpp/src/arrow/array/validate.cc b/cpp/src/arrow/array/validate.cc
index 56470ac74b..c1a37c4234 100644
--- a/cpp/src/arrow/array/validate.cc
+++ b/cpp/src/arrow/array/validate.cc
@@ -459,14 +459,17 @@ struct ValidateArrayImpl {
       if (buffer == nullptr) {
         continue;
       }
-      int64_t min_buffer_size = -1;
+      int64_t min_buffer_size = 0;
       switch (spec.kind) {
         case DataTypeLayout::BITMAP:
-          min_buffer_size = bit_util::BytesForBits(length_plus_offset);
+          // If length == 0, buffer size can be 0 regardless of offset
+          if (data.length > 0) {
+            min_buffer_size = bit_util::BytesForBits(length_plus_offset);
+          }
           break;
         case DataTypeLayout::FIXED_WIDTH:
-          if (MultiplyWithOverflow(length_plus_offset, spec.byte_width,
-                                   &min_buffer_size)) {
+          if (data.length > 0 && MultiplyWithOverflow(length_plus_offset, spec.byte_width,
+                                                      &min_buffer_size)) {
             return Status::Invalid("Array of type ", type.ToString(),
                                    " has impossibly large length and offset");
           }
diff --git a/cpp/src/arrow/c/bridge.cc b/cpp/src/arrow/c/bridge.cc
index 7579a1c489..d6ea60f520 100644
--- a/cpp/src/arrow/c/bridge.cc
+++ b/cpp/src/arrow/c/bridge.cc
@@ -31,6 +31,7 @@
 #include "arrow/c/util_internal.h"
 #include "arrow/extension_type.h"
 #include "arrow/memory_pool.h"
+#include "arrow/memory_pool_internal.h"  // for kZeroSizeArea
 #include "arrow/record_batch.h"
 #include "arrow/result.h"
 #include "arrow/stl_allocator.h"
@@ -60,6 +61,8 @@ using internal::SchemaExportTraits;
 
 using internal::ToChars;
 
+using memory_pool::internal::kZeroSizeArea;
+
 namespace {
 
 Status ExportingNotImplemented(const DataType& type) {
@@ -1265,7 +1268,8 @@ class ImportedBuffer : public Buffer {
 };
 
 struct ArrayImporter {
-  explicit ArrayImporter(const std::shared_ptr<DataType>& type) : type_(type) {}
+  explicit ArrayImporter(const std::shared_ptr<DataType>& type)
+      : type_(type), zero_size_buffer_(std::make_shared<Buffer>(kZeroSizeArea, 0)) {}
 
   Status Import(struct ArrowArray* src) {
     if (ArrowArrayIsReleased(src)) {
@@ -1529,7 +1533,7 @@ struct ArrayImporter {
   }
 
   Status ImportNullBitmap(int32_t buffer_id = 0) {
-    RETURN_NOT_OK(ImportBitsBuffer(buffer_id));
+    RETURN_NOT_OK(ImportBitsBuffer(buffer_id, /*is_null_bitmap=*/true));
     if (data_->null_count > 0 && data_->buffers[buffer_id] == nullptr) {
       return Status::Invalid(
           "ArrowArray struct has null bitmap buffer but non-zero null_count ",
@@ -1538,15 +1542,20 @@ struct ArrayImporter {
     return Status::OK();
   }
 
-  Status ImportBitsBuffer(int32_t buffer_id) {
+  Status ImportBitsBuffer(int32_t buffer_id, bool is_null_bitmap = false) {
     // Compute visible size of buffer
-    int64_t buffer_size = bit_util::BytesForBits(c_struct_->length + c_struct_->offset);
-    return ImportBuffer(buffer_id, buffer_size);
+    int64_t buffer_size =
+        (c_struct_->length > 0)
+            ? bit_util::BytesForBits(c_struct_->length + c_struct_->offset)
+            : 0;
+    return ImportBuffer(buffer_id, buffer_size, is_null_bitmap);
   }
 
   Status ImportFixedSizeBuffer(int32_t buffer_id, int64_t byte_width) {
     // Compute visible size of buffer
-    int64_t buffer_size = byte_width * (c_struct_->length + c_struct_->offset);
+    int64_t buffer_size = (c_struct_->length > 0)
+                              ? byte_width * (c_struct_->length + c_struct_->offset)
+                              : 0;
     return ImportBuffer(buffer_id, buffer_size);
   }
 
@@ -1563,17 +1572,27 @@ struct ArrayImporter {
                                   int64_t byte_width = 1) {
     auto offsets = data_->GetValues<OffsetType>(offsets_buffer_id);
     // Compute visible size of buffer
-    int64_t buffer_size = byte_width * offsets[c_struct_->length];
+    int64_t buffer_size =
+        (c_struct_->length > 0) ? byte_width * offsets[c_struct_->length] : 0;
     return ImportBuffer(buffer_id, buffer_size);
   }
 
-  Status ImportBuffer(int32_t buffer_id, int64_t buffer_size) {
+  Status ImportBuffer(int32_t buffer_id, int64_t buffer_size,
+                      bool is_null_bitmap = false) {
     std::shared_ptr<Buffer>* out = &data_->buffers[buffer_id];
     auto data = reinterpret_cast<const uint8_t*>(c_struct_->buffers[buffer_id]);
     if (data != nullptr) {
       *out = std::make_shared<ImportedBuffer>(data, buffer_size, import_);
-    } else {
+    } else if (is_null_bitmap) {
       out->reset();
+    } else {
+      // Ensure that imported buffers are never null (except for the null bitmap)
+      if (buffer_size != 0) {
+        return Status::Invalid(
+            "ArrowArrayStruct contains null data pointer "
+            "for a buffer with non-zero computed size");
+      }
+      *out = zero_size_buffer_;
     }
     return Status::OK();
   }
@@ -1585,6 +1604,9 @@ struct ArrayImporter {
   std::shared_ptr<ImportedArrayData> import_;
   std::shared_ptr<ArrayData> data_;
   std::vector<ArrayImporter> child_importers_;
+
+  // For imported null buffer pointers
+  std::shared_ptr<Buffer> zero_size_buffer_;
 };
 
 }  // namespace
diff --git a/cpp/src/arrow/c/bridge_test.cc b/cpp/src/arrow/c/bridge_test.cc
index 813d429533..90fe9d5965 100644
--- a/cpp/src/arrow/c/bridge_test.cc
+++ b/cpp/src/arrow/c/bridge_test.cc
@@ -124,6 +124,24 @@ class ReleaseCallback {
 using SchemaReleaseCallback = ReleaseCallback<SchemaExportTraits>;
 using ArrayReleaseCallback = ReleaseCallback<ArrayExportTraits>;
 
+// Whether c_struct or any of its descendents have non-null data pointers.
+bool HasData(const ArrowArray* c_struct) {
+  for (int64_t i = 0; i < c_struct->n_buffers; ++i) {
+    if (c_struct->buffers[i] != nullptr) {
+      return true;
+    }
+  }
+  if (c_struct->dictionary && HasData(c_struct->dictionary)) {
+    return true;
+  }
+  for (int64_t i = 0; i < c_struct->n_children; ++i) {
+    if (HasData(c_struct->children[i])) {
+      return true;
+    }
+  }
+  return false;
+}
+
 static const std::vector<std::string> kMetadataKeys1{"key1", "key2"};
 static const std::vector<std::string> kMetadataValues1{"", "bar"};
 
@@ -1659,6 +1677,8 @@ static const uint8_t bits_buffer1[] = {0xed, 0xed};
 static const void* buffers_no_nulls_no_data[1] = {nullptr};
 static const void* buffers_nulls_no_data1[1] = {bits_buffer1};
 
+static const void* all_buffers_omitted[3] = {nullptr, nullptr, nullptr};
+
 static const uint8_t data_buffer1[] = {1, 2,  3,  4,  5,  6,  7,  8,
                                        9, 10, 11, 12, 13, 14, 15, 16};
 static const uint8_t data_buffer2[] = "abcdefghijklmnopqrstuvwxyz";
@@ -1724,10 +1744,13 @@ static const uint8_t string_data_buffer1[] = "foobarquuxxyzzy";
 static const int32_t string_offsets_buffer1[] = {0, 3, 3, 6, 10, 15};
 static const void* string_buffers_no_nulls1[3] = {nullptr, string_offsets_buffer1,
                                                   string_data_buffer1};
+static const void* string_buffers_omitted[3] = {nullptr, string_offsets_buffer1, nullptr};
 
 static const int64_t large_string_offsets_buffer1[] = {0, 3, 3, 6, 10};
 static const void* large_string_buffers_no_nulls1[3] = {
     nullptr, large_string_offsets_buffer1, string_data_buffer1};
+static const void* large_string_buffers_omitted[3] = {
+    nullptr, large_string_offsets_buffer1, nullptr};
 
 static const int32_t list_offsets_buffer1[] = {0, 2, 2, 5, 6, 8};
 static const void* list_buffers_no_nulls1[2] = {nullptr, list_offsets_buffer1};
@@ -1901,9 +1924,9 @@ class TestArrayImport : public ::testing::Test {
     Reset();                                        // for further tests
 
     ASSERT_OK(array->ValidateFull());
-    // Special case: Null array doesn't have any data, so it needn't
-    // keep the ArrowArray struct alive.
-    if (type->id() != Type::NA) {
+    // Special case: arrays without data (such as Null arrays) needn't keep
+    // the ArrowArray struct alive.
+    if (HasData(&c_struct_)) {
       cb.AssertNotCalled();
     }
     AssertArraysEqual(*expected, *array, true);
@@ -1990,6 +2013,10 @@ TEST_F(TestArrayImport, Primitive) {
   CheckImport(ArrayFromJSON(boolean(), "[true, null, false]"));
   FillPrimitive(3, 1, 0, primitive_buffers_nulls1_8);
   CheckImport(ArrayFromJSON(boolean(), "[true, null, false]"));
+
+  // Empty array with null data pointers
+  FillPrimitive(0, 0, 0, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(int32(), "[]"));
 }
 
 TEST_F(TestArrayImport, Temporal) {
@@ -2070,6 +2097,12 @@ TEST_F(TestArrayImport, PrimitiveWithOffset) {
 
   FillPrimitive(4, 0, 7, primitive_buffers_no_nulls1_8);
   CheckImport(ArrayFromJSON(boolean(), "[false, false, true, false]"));
+
+  // Empty array with null data pointers
+  FillPrimitive(0, 0, 2, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(int32(), "[]"));
+  FillPrimitive(0, 0, 3, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(boolean(), "[]"));
 }
 
 TEST_F(TestArrayImport, NullWithOffset) {
@@ -2092,10 +2125,48 @@ TEST_F(TestArrayImport, String) {
   FillStringLike(4, 0, 0, large_string_buffers_no_nulls1);
   CheckImport(ArrayFromJSON(large_binary(), R"(["foo", "", "bar", "quux"])"));
 
+  // Empty array with null data pointers
+  FillStringLike(0, 0, 0, string_buffers_omitted);
+  CheckImport(ArrayFromJSON(utf8(), "[]"));
+  FillStringLike(0, 0, 0, large_string_buffers_omitted);
+  CheckImport(ArrayFromJSON(large_binary(), "[]"));
+}
+
+TEST_F(TestArrayImport, StringWithOffset) {
+  FillStringLike(3, 0, 1, string_buffers_no_nulls1);
+  CheckImport(ArrayFromJSON(utf8(), R"(["", "bar", "quux"])"));
+  FillStringLike(2, 0, 2, large_string_buffers_no_nulls1);
+  CheckImport(ArrayFromJSON(large_utf8(), R"(["bar", "quux"])"));
+
+  // Empty array with null data pointers
+  FillStringLike(0, 0, 1, string_buffers_omitted);
+  CheckImport(ArrayFromJSON(utf8(), "[]"));
+}
+
+TEST_F(TestArrayImport, FixedSizeBinary) {
   FillPrimitive(2, 0, 0, primitive_buffers_no_nulls2);
   CheckImport(ArrayFromJSON(fixed_size_binary(3), R"(["abc", "def"])"));
   FillPrimitive(2, 0, 0, primitive_buffers_no_nulls3);
   CheckImport(ArrayFromJSON(decimal(15, 4), R"(["12345.6789", "98765.4321"])"));
+
+  // Empty array with null data pointers
+  FillPrimitive(0, 0, 0, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(fixed_size_binary(3), "[]"));
+  FillPrimitive(0, 0, 0, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(decimal(15, 4), "[]"));
+}
+
+TEST_F(TestArrayImport, FixedSizeBinaryWithOffset) {
+  FillPrimitive(1, 0, 1, primitive_buffers_no_nulls2);
+  CheckImport(ArrayFromJSON(fixed_size_binary(3), R"(["def"])"));
+  FillPrimitive(1, 0, 1, primitive_buffers_no_nulls3);
+  CheckImport(ArrayFromJSON(decimal(15, 4), R"(["98765.4321"])"));
+
+  // Empty array with null data pointers
+  FillPrimitive(0, 0, 1, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(fixed_size_binary(3), "[]"));
+  FillPrimitive(0, 0, 1, all_buffers_omitted);
+  CheckImport(ArrayFromJSON(decimal(15, 4), "[]"));
 }
 
 TEST_F(TestArrayImport, List) {
@@ -2117,6 +2188,11 @@ TEST_F(TestArrayImport, List) {
   FillFixedSizeListLike(3, 0, 0, buffers_no_nulls_no_data);
   CheckImport(
       ArrayFromJSON(fixed_size_list(int8(), 3), "[[1, 2, 3], [4, 5, 6], [7, 8, 9]]"));
+
+  // Empty child array with null data pointers
+  FillPrimitive(AddChild(), 0, 0, 0, all_buffers_omitted);
+  FillFixedSizeListLike(0, 0, 0, buffers_no_nulls_no_data);
+  CheckImport(ArrayFromJSON(fixed_size_list(int8(), 3), "[]"));
 }
 
 TEST_F(TestArrayImport, NestedList) {
@@ -2205,6 +2281,15 @@ TEST_F(TestArrayImport, SparseUnion) {
   FillUnionLike(UnionMode::SPARSE, 4, 0, 0, 2, sparse_union_buffers1_legacy,
                 /*legacy=*/true);
   CheckImport(expected);
+
+  // Empty array with null data pointers
+  expected = ArrayFromJSON(type, "[]");
+  FillStringLike(AddChild(), 0, 0, 0, string_buffers_omitted);
+  FillPrimitive(AddChild(), 0, 0, 0, all_buffers_omitted);
+  FillUnionLike(UnionMode::SPARSE, 0, 0, 0, 2, all_buffers_omitted, /*legacy=*/false);
+  FillStringLike(AddChild(), 0, 0, 0, string_buffers_omitted);
+  FillPrimitive(AddChild(), 0, 0, 0, all_buffers_omitted);
+  FillUnionLike(UnionMode::SPARSE, 0, 0, 3, 2, all_buffers_omitted, /*legacy=*/false);
 }
 
 TEST_F(TestArrayImport, DenseUnion) {
@@ -2223,6 +2308,15 @@ TEST_F(TestArrayImport, DenseUnion) {
   FillUnionLike(UnionMode::DENSE, 5, 0, 0, 2, dense_union_buffers1_legacy,
                 /*legacy=*/true);
   CheckImport(expected);
+
+  // Empty array with null data pointers
+  expected = ArrayFromJSON(type, "[]");
+  FillStringLike(AddChild(), 0, 0, 0, string_buffers_omitted);
+  FillPrimitive(AddChild(), 0, 0, 0, all_buffers_omitted);
+  FillUnionLike(UnionMode::DENSE, 0, 0, 0, 2, all_buffers_omitted, /*legacy=*/false);
+  FillStringLike(AddChild(), 0, 0, 0, string_buffers_omitted);
+  FillPrimitive(AddChild(), 0, 0, 0, all_buffers_omitted);
+  FillUnionLike(UnionMode::DENSE, 0, 0, 3, 2, all_buffers_omitted, /*legacy=*/false);
 }
 
 TEST_F(TestArrayImport, StructWithOffset) {
@@ -2359,6 +2453,29 @@ TEST_F(TestArrayImport, PrimitiveError) {
   // Zero null bitmap but non-zero null_count
   FillPrimitive(3, 1, 0, primitive_buffers_no_nulls1_8);
   CheckImportError(int8());
+
+  // Null data pointers with non-zero length
+  FillPrimitive(1, 0, 0, all_buffers_omitted);
+  CheckImportError(int8());
+  FillPrimitive(1, 0, 0, all_buffers_omitted);
+  CheckImportError(boolean());
+  FillPrimitive(1, 0, 0, all_buffers_omitted);
+  CheckImportError(fixed_size_binary(3));
+}
+
+TEST_F(TestArrayImport, StringError) {
+  // Bad number of buffers
+  FillStringLike(4, 0, 0, string_buffers_no_nulls1);
+  c_struct_.n_buffers = 2;
+  CheckImportError(utf8());
+
+  // Null data pointers with non-zero length
+  FillStringLike(4, 0, 0, string_buffers_omitted);
+  CheckImportError(utf8());
+
+  // Null offsets pointer
+  FillStringLike(0, 0, 0, all_buffers_omitted);
+  CheckImportError(utf8());
 }
 
 TEST_F(TestArrayImport, StructError) {
@@ -2369,6 +2486,13 @@ TEST_F(TestArrayImport, StructError) {
   CheckImportError(struct_({field("strs", utf8())}));
 }
 
+TEST_F(TestArrayImport, ListError) {
+  // Null offsets pointer
+  FillPrimitive(AddChild(), 0, 0, 0, primitive_buffers_no_nulls1_8);
+  FillListLike(0, 0, 0, all_buffers_omitted);
+  CheckImportError(list(int8()));
+}
+
 TEST_F(TestArrayImport, MapError) {
   // Bad number of (struct) children in map child
   FillStringLike(AddChild(), 5, 0, 0, string_buffers_no_nulls1);
@@ -2859,8 +2983,10 @@ TEST_F(TestArrayRoundtrip, UnknownNullCount) {
 TEST_F(TestArrayRoundtrip, List) {
   TestWithJSON(list(int32()), "[]");
   TestWithJSON(list(int32()), "[[4, 5], [6, null], null]");
+  TestWithJSON(fixed_size_list(int32(), 3), "[[4, 5, 6], null, [7, 8, null]]");
 
   TestWithJSONSliced(list(int32()), "[[4, 5], [6, null], null]");
+  TestWithJSONSliced(fixed_size_list(int32(), 3), "[[4, 5, 6], null, [7, 8, null]]");
 }
 
 TEST_F(TestArrayRoundtrip, Struct) {


[arrow] 02/10: GH-15243: [C++] fix for potential deadlock in the group-by node (#33700)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit fa52bb02aee393e0fc078400335636b5b11b1dd2
Author: Weston Pace <we...@gmail.com>
AuthorDate: Mon Jan 16 07:17:02 2023 -0800

    GH-15243: [C++] fix for potential deadlock in the group-by node (#33700)
    
    
    * Closes: #15243
    
    Authored-by: Weston Pace <we...@gmail.com>
    Signed-off-by: Weston Pace <we...@gmail.com>
---
 cpp/src/arrow/compute/exec/aggregate_node.cc | 32 ++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/cpp/src/arrow/compute/exec/aggregate_node.cc b/cpp/src/arrow/compute/exec/aggregate_node.cc
index 0b70577ae7..725372700c 100644
--- a/cpp/src/arrow/compute/exec/aggregate_node.cc
+++ b/cpp/src/arrow/compute/exec/aggregate_node.cc
@@ -486,7 +486,7 @@ class GroupByNode : public ExecNode {
     outputs_[0]->InputReceived(this, out_data_.Slice(batch_size * n, batch_size));
   }
 
-  Status OutputResult() {
+  Status DoOutputResult() {
     // To simplify merging, ensure that the first grouper is nonempty
     for (size_t i = 0; i < local_states_.size(); i++) {
       if (local_states_[i].grouper) {
@@ -500,11 +500,28 @@ class GroupByNode : public ExecNode {
 
     int64_t num_output_batches = bit_util::CeilDiv(out_data_.length, output_batch_size());
     outputs_[0]->InputFinished(this, static_cast<int>(num_output_batches));
-    RETURN_NOT_OK(plan_->query_context()->StartTaskGroup(output_task_group_id_,
-                                                         num_output_batches));
+    Status st =
+        plan_->query_context()->StartTaskGroup(output_task_group_id_, num_output_batches);
+    if (st.IsCancelled()) {
+      // This means the user has cancelled/aborted the plan.  We will not send any batches
+      // and end immediately.
+      finished_.MarkFinished();
+      return Status::OK();
+    } else {
+      return st;
+    }
     return Status::OK();
   }
 
+  void OutputResult() {
+    // If something goes wrong outputting the result we need to make sure
+    // we still mark finished.
+    Status st = DoOutputResult();
+    if (!st.ok()) {
+      finished_.MarkFinished(st);
+    }
+  }
+
   void InputReceived(ExecNode* input, ExecBatch batch) override {
     EVENT(span_, "InputReceived", {{"batch.length", batch.length}});
     util::tracing::Span span;
@@ -521,7 +538,7 @@ class GroupByNode : public ExecNode {
     if (ErrorIfNotOk(Consume(ExecSpan(batch)))) return;
 
     if (input_counter_.Increment()) {
-      ErrorIfNotOk(OutputResult());
+      OutputResult();
     }
   }
 
@@ -542,7 +559,7 @@ class GroupByNode : public ExecNode {
     DCHECK_EQ(input, inputs_[0]);
 
     if (input_counter_.SetTotal(total_batches)) {
-      ErrorIfNotOk(OutputResult());
+      OutputResult();
     }
   }
 
@@ -551,7 +568,6 @@ class GroupByNode : public ExecNode {
                        {{"node.label", label()},
                         {"node.detail", ToString()},
                         {"node.kind", kind_name()}});
-
     local_states_.resize(plan_->query_context()->max_concurrency());
     return Status::OK();
   }
@@ -570,7 +586,9 @@ class GroupByNode : public ExecNode {
     EVENT(span_, "StopProducing");
     DCHECK_EQ(output, outputs_[0]);
 
-    if (input_counter_.Cancel()) finished_.MarkFinished();
+    if (input_counter_.Cancel()) {
+      finished_.MarkFinished();
+    }
     inputs_[0]->StopProducing(this);
   }
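
The shape of this fix is a general one: split the fallible body into a Status-returning helper (DoOutputResult) and have the void wrapper mark the node's finished future on failure, so an error from StartTaskGroup can no longer leave downstream waiters blocked. A standalone, hypothetical sketch of that pattern using only the public arrow::Future API (DoOutput and the fail flag are illustrative stand-ins, not Acero code):

#include <arrow/status.h>
#include <arrow/util/future.h>

#include <iostream>

// Hypothetical stand-in for the fallible output step (DoOutputResult above).
arrow::Status DoOutput(bool fail) {
  if (fail) return arrow::Status::Invalid("could not start output task group");
  return arrow::Status::OK();
}

// Mirrors the new OutputResult(): if the body fails, mark the finished
// future with the error so nothing blocks on it forever; on success the
// future is marked later, once the scheduled output tasks complete.
void OutputResultLike(arrow::Future<>& finished, bool fail) {
  arrow::Status st = DoOutput(fail);
  if (!st.ok()) {
    finished.MarkFinished(st);
  }
}

int main() {
  auto finished = arrow::Future<>::Make();
  OutputResultLike(finished, /*fail=*/true);
  // The error surfaces on the future instead of deadlocking a waiter.
  std::cout << finished.status().ToString() << std::endl;
  return 0;
}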
 


[arrow] 05/10: GH-25633: [CI][Java][macOS] Ensure using bundled RE2 (#33711)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 5bda66a96ad8c107cc9e54ed91be2ada63602bf1
Author: Sutou Kouhei <ko...@clear-code.com>
AuthorDate: Tue Jan 17 13:27:41 2023 +0900

    GH-25633: [CI][Java][macOS] Ensure using bundled RE2 (#33711)
    
    ### Rationale for this change
    
    If Homebrew's RE2 is installed, we may mix Homebrew's re2.h with the bundled RE2.
    Mixing Homebrew's re2.h with the bundled libre2.a may generate a wrong re2::RE2::Options, which may crash our program.
    
    ### What changes are included in this PR?
    
    Ensure Homebrew's RE2 is removed.
    
    ### Are these changes tested?
    
    Yes.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #25633
    
    Authored-by: Sutou Kouhei <ko...@clear-code.com>
    Signed-off-by: Jacob Wujciak-Jens <ja...@wujciak.de>
---
 ci/scripts/java_jni_macos_build.sh | 1 +
 dev/tasks/java-jars/github.yml     | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/ci/scripts/java_jni_macos_build.sh b/ci/scripts/java_jni_macos_build.sh
index 912638b508..e82936c711 100755
--- a/ci/scripts/java_jni_macos_build.sh
+++ b/ci/scripts/java_jni_macos_build.sh
@@ -133,6 +133,7 @@ archery linking check-dependencies \
   --allow libcurl \
   --allow libgandiva_jni \
   --allow libncurses \
+  --allow libobjc \
   --allow libplasma_java \
   --allow libz \
   libarrow_cdata_jni.dylib \
diff --git a/dev/tasks/java-jars/github.yml b/dev/tasks/java-jars/github.yml
index cfa1dbed49..3dcce6d950 100644
--- a/dev/tasks/java-jars/github.yml
+++ b/dev/tasks/java-jars/github.yml
@@ -86,6 +86,7 @@ jobs:
           # If llvm is installed, Apache Arrow C++ uses llvm rather than
           # llvm@14 because llvm is newer than llvm@14.
           brew uninstall llvm || :
+
           brew bundle --file=arrow/cpp/Brewfile
           # We want to link aws-sdk-cpp statically but Homebrew's
           # aws-sdk-cpp provides only shared library. If we have
@@ -93,6 +94,12 @@ jobs:
           # aws-sdk-cpp and bundled aws-sdk-cpp. We uninstall Homebrew's
           # aws-sdk-cpp to ensure using only bundled aws-sdk-cpp.
           brew uninstall aws-sdk-cpp
+          # We want to use bundled RE2 for static linking. If
+          # Homebrew's RE2 is installed, its header file may be used.
+          # We uninstall Homebrew's RE2 to ensure using bundled RE2.
+          brew uninstall grpc || : # gRPC depends on RE2
+          brew uninstall re2 || :
+
           brew bundle --file=arrow/java/Brewfile
       - name: Build C++ libraries
         env:


[arrow] 03/10: GH-33705: [R] Fix link on README (#33706)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 3a2f3650eda22d109d5082c42b0fea0bfbb61310
Author: Dewey Dunnington <de...@fishandwhistle.net>
AuthorDate: Mon Jan 16 13:41:45 2023 -0400

    GH-33705: [R] Fix link on README (#33706)
    
    Fixes a link and removes a reference to "feather" that was sitting front and centre.
    * Closes: #33705
    
    Lead-authored-by: Dewey Dunnington <de...@voltrondata.com>
    Co-authored-by: Dewey Dunnington <de...@fishandwhistle.net>
    Co-authored-by: Nic Crane <th...@gmail.com>
    Signed-off-by: Dewey Dunnington <de...@voltrondata.com>
---
 r/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/r/README.md b/r/README.md
index 28f9e192f8..3551e92bff 100644
--- a/r/README.md
+++ b/r/README.md
@@ -22,7 +22,7 @@ The arrow package provides functionality for a wide range of data analysis
 tasks. It allows users to read and write data in a variety formats:
 
 -   Read and write Parquet files, an efficient and widely used columnar format
--   Read and write Feather files, a format optimized for speed and
+-   Read and write Arrow (formerly known as Feather) files, a format optimized for speed and
     interoperability
 -   Read and write CSV files with excellent speed and efficiency
 -   Read and write multi-file and larger-than-memory datasets


[arrow] 10/10: GH-14997: [Release] Ensure archery release tasks works with both new style GitHub issues and old style JIRA issues (#33615)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit ccd169caed4b5b63d3f60ddc8c7fb45a14e32857
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Wed Jan 18 09:16:31 2023 +0100

    GH-14997: [Release] Ensure archery release tasks works with both new style GitHub issues and old style JIRA issues (#33615)
    
    I've decided to do all the archery release tasks in a single PR:
    * Closes: #14997
    * Closes: #14999
    * Closes: #15002
    
    Authored-by: Raúl Cumplido <ra...@gmail.com>
    Signed-off-by: Raúl Cumplido <ra...@gmail.com>
---
 dev/archery/archery/release/cli.py                 |  41 ++--
 dev/archery/archery/release/core.py                | 258 ++++++++++++++-------
 dev/archery/archery/release/reports.py             |   7 +-
 dev/archery/archery/release/tests/test_release.py  |  91 +++++---
 .../archery/templates/release_changelog.md.j2      |   4 +
 .../archery/templates/release_curation.txt.j2      |  20 +-
 dev/archery/setup.py                               |   2 +-
 7 files changed, 278 insertions(+), 145 deletions(-)

diff --git a/dev/archery/archery/release/cli.py b/dev/archery/archery/release/cli.py
index 4fbf93861e..ed15dcb1ed 100644
--- a/dev/archery/archery/release/cli.py
+++ b/dev/archery/archery/release/cli.py
@@ -20,34 +20,33 @@ import pathlib
 import click
 
 from ..utils.cli import validate_arrow_sources
-from .core import Jira, CachedJira, Release
+from .core import IssueTracker, Release
 
 
 @click.group('release')
 @click.option("--src", metavar="<arrow_src>", default=None,
               callback=validate_arrow_sources,
               help="Specify Arrow source directory.")
-@click.option("--jira-cache", type=click.Path(), default=None,
-              help="File path to cache queried JIRA issues per version.")
+@click.option('--github-token', '-t', default=None,
+              envvar="CROSSBOW_GITHUB_TOKEN",
+              help='OAuth token for GitHub authentication')
 @click.pass_obj
-def release(obj, src, jira_cache):
+def release(obj, src, github_token):
     """Release releated commands."""
-    jira = Jira()
-    if jira_cache is not None:
-        jira = CachedJira(jira_cache, jira=jira)
 
-    obj['jira'] = jira
+    obj['issue_tracker'] = IssueTracker(github_token=github_token)
     obj['repo'] = src.path
 
 
-@release.command('curate', help="Lists release related Jira issues.")
+@release.command('curate', help="Lists release related issues.")
 @click.argument('version')
 @click.option('--minimal/--full', '-m/-f',
-              help="Only show actionable Jira issues.", default=False)
+              help="Only show actionable issues.", default=False)
 @click.pass_obj
 def release_curate(obj, version, minimal):
     """Release curation."""
-    release = Release.from_jira(version, jira=obj['jira'], repo=obj['repo'])
+    release = Release(version, repo=obj['repo'],
+                      issue_tracker=obj['issue_tracker'])
     curation = release.curate(minimal)
 
     click.echo(curation.render('console'))
@@ -64,10 +63,10 @@ def release_changelog():
 @click.pass_obj
 def release_changelog_add(obj, version):
     """Prepend the changelog with the current release"""
-    jira, repo = obj['jira'], obj['repo']
+    repo, issue_tracker = obj['repo'], obj['issue_tracker']
 
     # just handle the current version
-    release = Release.from_jira(version, jira=jira, repo=repo)
+    release = Release(version, repo=repo, issue_tracker=issue_tracker)
     if release.is_released:
         raise ValueError('This version has been already released!')
 
@@ -87,10 +86,10 @@ def release_changelog_add(obj, version):
 @click.pass_obj
 def release_changelog_generate(obj, version, output):
     """Generate the changelog of a specific release."""
-    jira, repo = obj['jira'], obj['repo']
+    repo, issue_tracker = obj['repo'], obj['issue_tracker']
 
     # just handle the current version
-    release = Release.from_jira(version, jira=jira, repo=repo)
+    release = Release(version, repo=repo, issue_tracker=issue_tracker)
 
     changelog = release.changelog()
     output.write(changelog.render('markdown'))
@@ -100,13 +99,15 @@ def release_changelog_generate(obj, version, output):
 @click.pass_obj
 def release_changelog_regenerate(obj):
     """Regeneretate the whole CHANGELOG.md file"""
-    jira, repo = obj['jira'], obj['repo']
+    issue_tracker, repo = obj['issue_tracker'], obj['repo']
     changelogs = []
+    issue_tracker = IssueTracker(issue_tracker=issue_tracker)
 
-    for version in jira.project_versions('ARROW'):
+    for version in issue_tracker.project_versions():
         if not version.released:
             continue
-        release = Release.from_jira(version, jira=jira, repo=repo)
+        release = Release(version, repo=repo,
+                          issue_tracker=issue_tracker)
         click.echo('Querying changelog for version: {}'.format(version))
         changelogs.append(release.changelog())
 
@@ -129,7 +130,9 @@ def release_cherry_pick(obj, version, dry_run, recreate):
     """
     Cherry pick commits.
     """
-    release = Release.from_jira(version, jira=obj['jira'], repo=obj['repo'])
+    issue_tracker = obj['issue_tracker']
+    release = Release(version,
+                      repo=obj['repo'], issue_tracker=issue_tracker)
 
     if not dry_run:
         release.cherry_pick_commits(recreate_branch=recreate)
diff --git a/dev/archery/archery/release/core.py b/dev/archery/archery/release/core.py
index 03eceb80a1..822d408f88 100644
--- a/dev/archery/archery/release/core.py
+++ b/dev/archery/archery/release/core.py
@@ -21,16 +21,16 @@ import functools
 import os
 import pathlib
 import re
-import shelve
 import warnings
 
 from git import Repo
+from github import Github
 from jira import JIRA
 from semver import VersionInfo as SemVer
 
 from ..utils.source import ArrowSources
 from ..utils.logger import logger
-from .reports import ReleaseCuration, JiraChangelog
+from .reports import ReleaseCuration, ReleaseChangelog
 
 
 def cached_property(fn):
@@ -58,13 +58,29 @@ class Version(SemVer):
             release_date=getattr(jira_version, 'releaseDate', None)
         )
 
+    @classmethod
+    def from_milestone(cls, milestone):
+        return cls.parse(
+            milestone.title,
+            released=milestone.state == "closed",
+            release_date=milestone.due_on
+        )
+
+
+ORIGINAL_ARROW_REGEX = re.compile(
+    r"\*This issue was originally created as " +
+    r"\[(?P<issue>ARROW\-(?P<issue_id>(\d+)))\]"
+)
+
 
 class Issue:
 
-    def __init__(self, key, type, summary):
+    def __init__(self, key, type, summary, github_issue=None):
         self.key = key
         self.type = type
         self.summary = summary
+        self.github_issue_id = getattr(github_issue, "number", None)
+        self._github_issue = github_issue
 
     @classmethod
     def from_jira(cls, jira_issue):
@@ -74,13 +90,49 @@ class Issue:
             summary=jira_issue.fields.summary
         )
 
+    @classmethod
+    def from_github(cls, github_issue):
+        original_jira = cls.original_jira_id(github_issue)
+        key = original_jira or github_issue.number
+        return cls(
+            key=key,
+            type=next(
+                iter(
+                    [
+                        label.name for label in github_issue.labels
+                        if label.name.startswith("Type:")
+                    ]
+                ), None),
+            summary=github_issue.title,
+            github_issue=github_issue
+        )
+
     @property
     def project(self):
+        if isinstance(self.key, int):
+            return 'GH'
         return self.key.split('-')[0]
 
     @property
     def number(self):
-        return int(self.key.split('-')[1])
+        if isinstance(self.key, str):
+            return int(self.key.split('-')[1])
+        else:
+            return self.key
+
+    @cached_property
+    def is_pr(self):
+        return bool(self._github_issue and self._github_issue.pull_request)
+
+    @classmethod
+    def original_jira_id(cls, github_issue):
+        # All migrated issues contain body
+        if not github_issue.body:
+            return None
+        matches = ORIGINAL_ARROW_REGEX.search(github_issue.body)
+        if matches:
+            values = matches.groupdict()
+            return values['issue']
 
 
 class Jira(JIRA):
@@ -88,54 +140,54 @@ class Jira(JIRA):
     def __init__(self, url='https://issues.apache.org/jira'):
         super().__init__(url)
 
-    def project_version(self, version_string, project='ARROW'):
-        # query version from jira to populated with additional metadata
-        versions = {str(v): v for v in self.project_versions(project)}
-        return versions[version_string]
+    def issue(self, key):
+        return Issue.from_jira(super().issue(key))
+
+
+class IssueTracker:
+
+    def __init__(self, github_token=None):
+        github = Github(github_token)
+        self.github_repo = github.get_repo('apache/arrow')
 
-    def project_versions(self, project):
+    def project_version(self, version_string):
+        for milestone in self.project_versions():
+            if milestone == version_string:
+                return milestone
+
+    def project_versions(self):
         versions = []
-        for v in super().project_versions(project):
+        milestones = self.github_repo.get_milestones(state="all")
+        for milestone in milestones:
             try:
-                versions.append(Version.from_jira(v))
+                versions.append(Version.from_milestone(milestone))
             except ValueError:
                 # ignore invalid semantic versions like JS-0.4.0
                 continue
         return sorted(versions, reverse=True)
 
-    def issue(self, key):
-        return Issue.from_jira(super().issue(key))
-
-    def project_issues(self, version, project='ARROW'):
-        query = "project={} AND fixVersion={}".format(project, version)
-        issues = super().search_issues(query, maxResults=False)
-        return list(map(Issue.from_jira, issues))
-
-
-class CachedJira:
-
-    def __init__(self, cache_path, jira=None):
-        self.jira = jira or Jira()
-        self.cache_path = cache_path
+    def _milestone_from_semver(self, semver):
+        milestones = self.github_repo.get_milestones(state="all")
+        for milestone in milestones:
+            try:
+                if milestone.title == semver:
+                    return milestone
+            except ValueError:
+                # ignore invalid semantic versions like JS-0.3.0
+                continue
 
-    def __getattr__(self, name):
-        attr = getattr(self.jira, name)
-        return self._cached(name, attr) if callable(attr) else attr
+    def project_issues(self, version):
+        issues = self.github_repo.get_issues(
+            milestone=self._milestone_from_semver(version),
+            state="all")
+        return list(map(Issue.from_github, issues))
 
-    def _cached(self, name, method):
-        def wrapper(*args, **kwargs):
-            key = str((name, args, kwargs))
-            with shelve.open(self.cache_path) as cache:
-                try:
-                    result = cache[key]
-                except KeyError:
-                    cache[key] = result = method(*args, **kwargs)
-            return result
-        return wrapper
+    def issue(self, key):
+        return Issue.from_github(self.github_repo.get_issue(key))
 
 
 _TITLE_REGEX = re.compile(
-    r"(?P<issue>(?P<project>(ARROW|PARQUET))\-\d+)?\s*:?\s*"
+    r"(?P<issue>(?P<project>(ARROW|PARQUET|GH))\-(?P<issue_id>(\d+)))?\s*:?\s*"
     r"(?P<minor>(MINOR))?\s*:?\s*"
     r"(?P<components>\[.*\])?\s*(?P<summary>.*)"
 )
@@ -145,9 +197,10 @@ _COMPONENT_REGEX = re.compile(r"\[([^\[\]]+)\]")
 class CommitTitle:
 
     def __init__(self, summary, project=None, issue=None, minor=None,
-                 components=None):
+                 components=None, issue_id=None):
         self.project = project
         self.issue = issue
+        self.issue_id = issue_id
         self.components = components or []
         self.summary = summary
         self.minor = bool(minor)
@@ -186,6 +239,7 @@ class CommitTitle:
             values['summary'],
             project=values.get('project'),
             issue=values.get('issue'),
+            issue_id=values.get('issue_id'),
             minor=values.get('minor'),
             components=components
         )
@@ -230,7 +284,8 @@ class Commit:
 
 class Release:
 
-    def __new__(self, version, jira=None, repo=None):
+    def __new__(self, version, repo=None, github_token=None,
+                issue_tracker=None):
         if isinstance(version, str):
             version = Version.parse(version)
         elif not isinstance(version, Version):
@@ -250,15 +305,7 @@ class Release:
 
         return super().__new__(klass)
 
-    def __init__(self, version, jira, repo):
-        if jira is None:
-            jira = Jira()
-        elif isinstance(jira, str):
-            jira = Jira(jira)
-        elif not isinstance(jira, (Jira, CachedJira)):
-            raise TypeError("`jira` argument must be a server url or a valid "
-                            "Jira instance")
-
+    def __init__(self, version, repo, issue_tracker):
         if repo is None:
             arrow = ArrowSources.find()
             repo = Repo(arrow.path)
@@ -269,13 +316,14 @@ class Release:
                             "instance")
 
         if isinstance(version, str):
-            version = jira.project_version(version, project='ARROW')
+            version = issue_tracker.project_version(version)
+
         elif not isinstance(version, Version):
             raise TypeError(version)
 
         self.version = version
-        self.jira = jira
         self.repo = repo
+        self.issue_tracker = issue_tracker
 
     def __repr__(self):
         if self.version.released:
@@ -284,10 +332,6 @@ class Release:
             status = "pending"
         return f"<{self.__class__.__name__} {self.version!r} {status}>"
 
-    @staticmethod
-    def from_jira(version, jira=None, repo=None):
-        return Release(version, jira, repo)
-
     @property
     def is_released(self):
         return self.version.released
@@ -322,7 +366,8 @@ class Release:
             # first release doesn't have a previous one
             return None
         else:
-            return Release.from_jira(previous, jira=self.jira, repo=self.repo)
+            return Release(previous, repo=self.repo,
+                           issue_tracker=self.issue_tracker)
 
     @cached_property
     def next(self):
@@ -332,13 +377,21 @@ class Release:
             raise ValueError("There is no upcoming release set in JIRA after "
                              f"version {self.version}")
         upcoming = self.siblings[position - 1]
-        return Release.from_jira(upcoming, jira=self.jira, repo=self.repo)
+        return Release(upcoming, repo=self.repo,
+                       issue_tracker=self.issue_tracker)
 
     @cached_property
     def issues(self):
-        issues = self.jira.project_issues(self.version, project='ARROW')
+        issues = self.issue_tracker.project_issues(
+            self.version
+        )
         return {i.key: i for i in issues}
 
+    @cached_property
+    def github_issue_ids(self):
+        return {v.github_issue_id for v in self.issues.values()
+                if v.github_issue_id}
+
     @cached_property
     def commits(self):
         """
@@ -351,7 +404,11 @@ class Release:
             lower = self.repo.tags[self.previous.tag]
 
         if self.version.released:
-            upper = self.repo.tags[self.tag]
+            try:
+                upper = self.repo.tags[self.tag]
+            except IndexError:
+                warnings.warn(f"Release tag `{self.tag}` doesn't exist.")
+                return []
         else:
             try:
                 upper = self.repo.branches[self.branch]
@@ -362,6 +419,10 @@ class Release:
         commit_range = f"{lower}..{upper}"
         return list(map(Commit, self.repo.iter_commits(commit_range)))
 
+    @cached_property
+    def jira_instance(self):
+        return Jira()
+
     @cached_property
     def default_branch(self):
         default_branch_name = os.getenv("ARCHERY_DEFAULT_BRANCH")
@@ -388,7 +449,7 @@ class Release:
 
                 # The last token is the default branch name
                 default_branch_name = origin_head_name_tokenized[-1]
-            except KeyError:
+            except (KeyError, IndexError):
                 # Use a hard-coded default value to set default_branch_name
                 # TODO: ARROW-18011 to track changing the hard coded default
                 # value from "master" to "main".
@@ -403,29 +464,43 @@ class Release:
         return default_branch_name
 
     def curate(self, minimal=False):
-        # handle commits with parquet issue key specially and query them from
-        # jira and add it to the issues
+        # handle commits with parquet issue key specially
         release_issues = self.issues
-
-        within, outside, nojira, parquet = [], [], [], []
+        within, outside, noissue, parquet, minor = [], [], [], [], []
         for c in self.commits:
             if c.issue is None:
-                nojira.append(c)
-            elif c.issue in release_issues:
-                within.append((release_issues[c.issue], c))
+                if c.title.minor:
+                    minor.append(c)
+                else:
+                    noissue.append(c)
+            elif c.project == 'GH':
+                if int(c.issue_id) in release_issues:
+                    within.append((release_issues[int(c.issue_id)], c))
+                else:
+                    outside.append(
+                        (self.issue_tracker.issue(int(c.issue_id)), c))
+            elif c.project == 'ARROW':
+                if c.issue in release_issues:
+                    within.append((release_issues[c.issue], c))
+                else:
+                    outside.append((self.jira_instance.issue(c.issue), c))
             elif c.project == 'PARQUET':
-                parquet.append((self.jira.issue(c.issue), c))
+                parquet.append((self.jira_instance.issue(c.issue), c))
             else:
-                outside.append((self.jira.issue(c.issue), c))
+                warnings.warn(
+                    f'Issue {c.issue} is not MINOR nor pertains to GH' +
+                    ', ARROW or PARQUET')
+                outside.append((c.issue, c))
 
         # remaining jira tickets
         within_keys = {i.key for i, c in within}
+        # Take into account that some issues milestoned are prs
         nopatch = [issue for key, issue in release_issues.items()
-                   if key not in within_keys]
+                   if key not in within_keys and issue.is_pr is False]
 
         return ReleaseCuration(release=self, within=within, outside=outside,
-                               nojira=nojira, parquet=parquet, nopatch=nopatch,
-                               minimal=minimal)
+                               noissue=noissue, parquet=parquet,
+                               nopatch=nopatch, minimal=minimal, minor=minor)
 
     def changelog(self):
         issue_commit_pairs = []
@@ -451,16 +526,26 @@ class Release:
             'Task': 'New Features and Improvements',
             'Test': 'Bug Fixes',
             'Wish': 'New Features and Improvements',
+            'Type: bug': 'Bug Fixes',
+            'Type: enhancement': 'New Features and Improvements',
+            'Type: task': 'New Features and Improvements',
+            'Type: test': 'Bug Fixes',
+            'Type: usage': 'New Features and Improvements',
         }
         categories = defaultdict(list)
         for issue, commit in issue_commit_pairs:
-            categories[issue_types[issue.type]].append((issue, commit))
+            try:
+                categories[issue_types[issue.type]].append((issue, commit))
+            except KeyError:
+                # If issue or pr don't have a type assume task.
+                # Currently the label for type is not mandatory on GitHub.
+                categories[issue_types['Type: task']].append((issue, commit))
 
         # sort issues by the issue key in ascending order
         for issues in categories.values():
             issues.sort(key=lambda pair: (pair[0].project, pair[0].number))
 
-        return JiraChangelog(release=self, categories=categories)
+        return ReleaseChangelog(release=self, categories=categories)
 
     def commits_to_pick(self, exclude_already_applied=True):
         # collect commits applied on the default branch since the root of the
@@ -481,10 +566,18 @@ class Release:
 
         # iterate over the commits applied on the main branch and filter out
         # the ones that are included in the jira release
-        patches_to_pick = [c for c in commits if
-                           c.issue in self.issues and
-                           c.title not in already_applied]
-
+        patches_to_pick = []
+        for c in commits:
+            key = c.issue
+            # For the release we assume all issues that have to be
+            # cherry-picked are merged with the GH issue id instead of the
+            # JIRA ARROW one. That's why we use github_issues along with
+            # issues. This is only to correct the mapping for migrated issues.
+            if c.issue and c.issue.startswith("GH-"):
+                key = int(c.issue_id)
+            if ((key in self.github_issue_ids or key in self.issues) and
+                    c.title not in already_applied):
+                patches_to_pick.append(c)
         return reversed(patches_to_pick)
 
     def cherry_pick_commits(self, recreate_branch=True):
@@ -525,7 +618,7 @@ class MajorRelease(Release):
         Filter only the major releases.
         """
         # handle minor releases before 1.0 as major releases
-        return [v for v in self.jira.project_versions('ARROW')
+        return [v for v in self.issue_tracker.project_versions()
                 if v.patch == 0 and (v.major == 0 or v.minor == 0)]
 
 
@@ -544,7 +637,8 @@ class MinorRelease(Release):
         """
         Filter the major and minor releases.
         """
-        return [v for v in self.jira.project_versions('ARROW') if v.patch == 0]
+        return [v for v in self.issue_tracker.project_versions()
+                if v.patch == 0]
 
 
 class PatchRelease(Release):
@@ -562,4 +656,4 @@ class PatchRelease(Release):
         """
         No filtering, consider all releases.
         """
-        return self.jira.project_versions('ARROW')
+        return self.issue_tracker.project_versions()
diff --git a/dev/archery/archery/release/reports.py b/dev/archery/archery/release/reports.py
index 43093487c0..4299eaa7ed 100644
--- a/dev/archery/archery/release/reports.py
+++ b/dev/archery/archery/release/reports.py
@@ -27,14 +27,15 @@ class ReleaseCuration(JinjaReport):
         'release',
         'within',
         'outside',
-        'nojira',
+        'noissue',
         'parquet',
         'nopatch',
-        'minimal'
+        'minimal',
+        'minor'
     ]
 
 
-class JiraChangelog(JinjaReport):
+class ReleaseChangelog(JinjaReport):
     templates = {
         'markdown': 'release_changelog.md.j2',
         'html': 'release_changelog.html.j2'
diff --git a/dev/archery/archery/release/tests/test_release.py b/dev/archery/archery/release/tests/test_release.py
index 1283b4bcb4..22b43c7cb3 100644
--- a/dev/archery/archery/release/tests/test_release.py
+++ b/dev/archery/archery/release/tests/test_release.py
@@ -19,13 +19,29 @@ import pytest
 
 from archery.release.core import (
     Release, MajorRelease, MinorRelease, PatchRelease,
-    Jira, Version, Issue, CommitTitle, Commit
+    IssueTracker, Version, Issue, CommitTitle, Commit
 )
 from archery.testing import DotDict
 
 
 # subset of issues per revision
 _issues = {
+    "3.0.0": [
+        Issue("GH-9784", type="Bug", summary="[C++] Title"),
+        Issue("GH-9767", type="New Feature", summary="[Crossbow] Title"),
+        Issue("GH-1231", type="Bug", summary="[Java] Title"),
+        Issue("GH-1244", type="Bug", summary="[C++] Title"),
+        Issue("GH-1301", type="Bug", summary="[Python][Archery] Title")
+    ],
+    "2.0.0": [
+        Issue("ARROW-9784", type="Bug", summary="[Java] Title"),
+        Issue("ARROW-9767", type="New Feature", summary="[Crossbow] Title"),
+        Issue("GH-1230", type="Bug", summary="[Dev] Title"),
+        Issue("ARROW-9694", type="Bug", summary="[Release] Title"),
+        Issue("ARROW-5643", type="Bug", summary="[Go] Title"),
+        Issue("GH-1243", type="Bug", summary="[Python] Title"),
+        Issue("GH-1300", type="Bug", summary="[CI][Archery] Title")
+    ],
     "1.0.1": [
         Issue("ARROW-9684", type="Bug", summary="[C++] Title"),
         Issue("ARROW-9667", type="New Feature", summary="[Crossbow] Title"),
@@ -62,13 +78,14 @@ _issues = {
 }
 
 
-class FakeJira(Jira):
+class FakeIssueTracker(IssueTracker):
 
     def __init__(self):
         pass
 
-    def project_versions(self, project='ARROW'):
+    def project_versions(self):
         return [
+            Version.parse("4.0.0", released=False),
             Version.parse("3.0.0", released=False),
             Version.parse("2.0.0", released=False),
             Version.parse("1.1.0", released=False),
@@ -82,16 +99,16 @@ class FakeJira(Jira):
             Version.parse("0.15.0", released=True),
         ]
 
-    def project_issues(self, version, project='ARROW'):
+    def project_issues(self, version):
         return _issues[str(version)]
 
 
 @pytest.fixture
-def fake_jira():
-    return FakeJira()
+def fake_issue_tracker():
+    return FakeIssueTracker()
 
 
-def test_version(fake_jira):
+def test_version(fake_issue_tracker):
     v = Version.parse("1.2.5")
     assert str(v) == "1.2.5"
     assert v.major == 1
@@ -109,7 +126,7 @@ def test_version(fake_jira):
     assert v.release_date == "2020-01-01"
 
 
-def test_issue(fake_jira):
+def test_issue(fake_issue_tracker):
     i = Issue("ARROW-1234", type='Bug', summary="title")
     assert i.key == "ARROW-1234"
     assert i.type == "Bug"
@@ -212,78 +229,78 @@ def test_commit_title():
     assert t.minor is False
 
 
-def test_release_basics(fake_jira):
-    r = Release.from_jira("1.0.0", jira=fake_jira)
+def test_release_basics(fake_issue_tracker):
+    r = Release("1.0.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r, MajorRelease)
     assert r.is_released is True
     assert r.branch == 'maint-1.0.0'
     assert r.tag == 'apache-arrow-1.0.0'
 
-    r = Release.from_jira("1.1.0", jira=fake_jira)
+    r = Release("1.1.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r, MinorRelease)
     assert r.is_released is False
     assert r.branch == 'maint-1.x.x'
     assert r.tag == 'apache-arrow-1.1.0'
 
     # minor releases before 1.0 are treated as major releases
-    r = Release.from_jira("0.17.0", jira=fake_jira)
+    r = Release("0.17.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r, MajorRelease)
     assert r.is_released is True
     assert r.branch == 'maint-0.17.0'
     assert r.tag == 'apache-arrow-0.17.0'
 
-    r = Release.from_jira("0.17.1", jira=fake_jira)
+    r = Release("0.17.1", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r, PatchRelease)
     assert r.is_released is True
     assert r.branch == 'maint-0.17.x'
     assert r.tag == 'apache-arrow-0.17.1'
 
 
-def test_previous_and_next_release(fake_jira):
-    r = Release.from_jira("3.0.0", jira=fake_jira)
+def test_previous_and_next_release(fake_issue_tracker):
+    r = Release("4.0.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, MajorRelease)
-    assert r.previous.version == Version.parse("2.0.0")
+    assert r.previous.version == Version.parse("3.0.0")
     with pytest.raises(ValueError, match="There is no upcoming release set"):
         assert r.next
 
-    r = Release.from_jira("2.0.0", jira=fake_jira)
+    r = Release("3.0.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, MajorRelease)
     assert isinstance(r.next, MajorRelease)
-    assert r.previous.version == Version.parse("1.0.0")
-    assert r.next.version == Version.parse("3.0.0")
+    assert r.previous.version == Version.parse("2.0.0")
+    assert r.next.version == Version.parse("4.0.0")
 
-    r = Release.from_jira("1.1.0", jira=fake_jira)
+    r = Release("1.1.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, MajorRelease)
     assert isinstance(r.next, MajorRelease)
     assert r.previous.version == Version.parse("1.0.0")
     assert r.next.version == Version.parse("2.0.0")
 
-    r = Release.from_jira("1.0.0", jira=fake_jira)
+    r = Release("1.0.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.next, MajorRelease)
     assert isinstance(r.previous, MajorRelease)
     assert r.previous.version == Version.parse("0.17.0")
     assert r.next.version == Version.parse("2.0.0")
 
-    r = Release.from_jira("0.17.0", jira=fake_jira)
+    r = Release("0.17.0", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, MajorRelease)
     assert r.previous.version == Version.parse("0.16.0")
 
-    r = Release.from_jira("0.15.2", jira=fake_jira)
+    r = Release("0.15.2", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, PatchRelease)
     assert isinstance(r.next, MajorRelease)
     assert r.previous.version == Version.parse("0.15.1")
     assert r.next.version == Version.parse("0.16.0")
 
-    r = Release.from_jira("0.15.1", jira=fake_jira)
+    r = Release("0.15.1", repo=None, issue_tracker=fake_issue_tracker)
     assert isinstance(r.previous, MajorRelease)
     assert isinstance(r.next, PatchRelease)
     assert r.previous.version == Version.parse("0.15.0")
     assert r.next.version == Version.parse("0.15.2")
 
 
-def test_release_issues(fake_jira):
+def test_release_issues(fake_issue_tracker):
     # major release issues
-    r = Release.from_jira("1.0.0", jira=fake_jira)
+    r = Release("1.0.0", repo=None, issue_tracker=fake_issue_tracker)
     assert r.issues.keys() == set([
         "ARROW-300",
         "ARROW-4427",
@@ -295,7 +312,7 @@ def test_release_issues(fake_jira):
         "ARROW-8973"
     ])
     # minor release issues
-    r = Release.from_jira("0.17.0", jira=fake_jira)
+    r = Release("0.17.0", repo=None, issue_tracker=fake_issue_tracker)
     assert r.issues.keys() == set([
         "ARROW-2882",
         "ARROW-2587",
@@ -305,7 +322,7 @@ def test_release_issues(fake_jira):
         "ARROW-1636",
     ])
     # patch release issues
-    r = Release.from_jira("1.0.1", jira=fake_jira)
+    r = Release("1.0.1", repo=None, issue_tracker=fake_issue_tracker)
     assert r.issues.keys() == set([
         "ARROW-9684",
         "ARROW-9667",
@@ -315,6 +332,16 @@ def test_release_issues(fake_jira):
         "ARROW-9609",
         "ARROW-9606"
     ])
+    r = Release("2.0.0", repo=None, issue_tracker=fake_issue_tracker)
+    assert r.issues.keys() == set([
+        "ARROW-9784",
+        "ARROW-9767",
+        "GH-1230",
+        "ARROW-9694",
+        "ARROW-5643",
+        "GH-1243",
+        "GH-1300"
+    ])
 
 
 @pytest.mark.parametrize(('version', 'ncommits'), [
@@ -323,8 +350,8 @@ def test_release_issues(fake_jira):
     ("0.17.0", 569),
     ("0.15.1", 41)
 ])
-def test_release_commits(fake_jira, version, ncommits):
-    r = Release.from_jira(version, jira=fake_jira)
+def test_release_commits(fake_issue_tracker, version, ncommits):
+    r = Release(version, repo=None, issue_tracker=fake_issue_tracker)
     assert len(r.commits) == ncommits
     for c in r.commits:
         assert isinstance(c, Commit)
@@ -332,8 +359,8 @@ def test_release_commits(fake_jira, version, ncommits):
         assert c.url.endswith(c.hexsha)
 
 
-def test_maintenance_patch_selection(fake_jira):
-    r = Release.from_jira("0.17.1", jira=fake_jira)
+def test_maintenance_patch_selection(fake_issue_tracker):
+    r = Release("0.17.1", repo=None, issue_tracker=fake_issue_tracker)
 
     shas_to_pick = [
         c.hexsha for c in r.commits_to_pick(exclude_already_applied=False)
diff --git a/dev/archery/archery/templates/release_changelog.md.j2 b/dev/archery/archery/templates/release_changelog.md.j2
index 0c9efbc42f..0eedb217a8 100644
--- a/dev/archery/archery/templates/release_changelog.md.j2
+++ b/dev/archery/archery/templates/release_changelog.md.j2
@@ -23,7 +23,11 @@
 ## {{ category }}
 
 {% for issue, commit in issue_commit_pairs -%}
+{% if issue.project in ('ARROW', 'PARQUET') -%}
 * [{{ issue.key }}](https://issues.apache.org/jira/browse/{{ issue.key }}) - {{ commit.title.to_string(with_issue=False) if commit else issue.summary | md }}
+{% else -%}
+* [GH-{{ issue.key }}](https://github.com/apache/arrow/issues/{{ issue.key }}) - {{ commit.title.to_string(with_issue=False) if commit else issue.summary | md }}
+{% endif -%}
 {% endfor %}
 
 {% endfor %}
diff --git a/dev/archery/archery/templates/release_curation.txt.j2 b/dev/archery/archery/templates/release_curation.txt.j2
index 4f524d001c..0796f45162 100644
--- a/dev/archery/archery/templates/release_curation.txt.j2
+++ b/dev/archery/archery/templates/release_curation.txt.j2
@@ -17,26 +17,30 @@
 # under the License.
 #}
 {%- if not minimal -%}
-Total number of JIRA tickets assigned to version {{ release.version }}: {{ release.issues|length }}
+### Total number of GitHub tickets assigned to version {{ release.version }}: {{ release.issues|length }}
 
-Total number of applied patches since version {{ release.previous.version }}: {{ release.commits|length }}
+### Total number of applied patches since version {{ release.previous.version }}: {{ release.commits|length }}
 
-Patches with assigned issue in version {{ release.version }}:
+### Patches with assigned issue in version {{ release.version }}: {{ within|length }}
 {% for issue, commit in within -%}
  - {{ commit.url }} {{ commit.title }}
 {% endfor %}
 {% endif -%}
-Patches with assigned issue outside of version {{ release.version }}:
+### Patches with assigned issue outside of version {{ release.version }}: {{ outside|length }}
 {% for issue, commit in outside -%}
  - {{ commit.url }} {{ commit.title }}
 {% endfor %}
 {% if not minimal -%}
-Patches in version {{ release.version }} without a linked issue:
-{% for commit in nojira -%}
+### Minor patches in version {{ release.version }}: {{ minor|length }}
+{% for commit in minor -%}
  - {{ commit.url }} {{ commit.title }}
 {% endfor %}
-JIRA issues in version {{ release.version }} without a linked patch:
+### Patches in version {{ release.version }} without a linked issue:
+{% for commit in noissue -%}
+ - {{ commit.url }} {{ commit.title }}
+{% endfor %}
+### JIRA issues in version {{ release.version }} without a linked patch: {{ nopatch|length }}
 {% for issue in nopatch -%}
- - https://issues.apache.org/jira/browse/{{ issue.key }}
+ - https://github.com/apache/arrow/issues/{{ issue.key }}
 {% endfor %}
 {%- endif -%}
\ No newline at end of file
diff --git a/dev/archery/setup.py b/dev/archery/setup.py
index 4b13608cf8..51f066c9ed 100755
--- a/dev/archery/setup.py
+++ b/dev/archery/setup.py
@@ -31,7 +31,7 @@ extras = {
     'lint': ['numpydoc==1.1.0', 'autopep8', 'flake8', 'cmake_format==0.6.13'],
     'benchmark': ['pandas'],
     'docker': ['ruamel.yaml', 'python-dotenv'],
-    'release': [jinja_req, 'jira', 'semver', 'gitpython'],
+    'release': ['pygithub', jinja_req, 'jira', 'semver', 'gitpython'],
     'crossbow': ['github3.py', jinja_req, 'pygit2>=1.6.0', 'requests',
                  'ruamel.yaml', 'setuptools_scm'],
     'crossbow-upload': ['github3.py', jinja_req, 'ruamel.yaml',


[arrow] 04/10: GH-33666: [R] Remove extraneous argument to semi_join (#33693)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 057f291eff7a7a52cbc58ee08a23e47e9460e47d
Author: Nic Crane <th...@gmail.com>
AuthorDate: Tue Jan 17 02:45:57 2023 +0000

    GH-33666: [R] Remove extraneous argument to semi_join (#33693)
    
    This PR removes the `keep` argument from the test for `semi_join()`, which was causing the unit tests to fail. It also removes the `suffix` argument (which is not part of the dplyr function signature) from the function signature here.
    
    Closes: #33666
    
    Authored-by: Nic Crane <th...@gmail.com>
    Signed-off-by: Dewey Dunnington <de...@fishandwhistle.net>
---
 r/tests/testthat/test-dplyr-join.R | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/r/tests/testthat/test-dplyr-join.R b/r/tests/testthat/test-dplyr-join.R
index 3470a886b3..2520d561cf 100644
--- a/r/tests/testthat/test-dplyr-join.R
+++ b/r/tests/testthat/test-dplyr-join.R
@@ -82,7 +82,7 @@ test_that("left_join with join_by", {
       left_join(
         to_join %>%
           rename(the_grouping = some_grouping),
-          join_by(some_grouping == the_grouping)
+        join_by(some_grouping == the_grouping)
       ) %>%
       collect(),
     left
@@ -240,14 +240,7 @@ test_that("full_join", {
 test_that("semi_join", {
   compare_dplyr_binding(
     .input %>%
-      semi_join(to_join, by = "some_grouping", keep = TRUE) %>%
-      collect(),
-    left
-  )
-
-  compare_dplyr_binding(
-    .input %>%
-      semi_join(to_join, by = "some_grouping", keep = FALSE) %>%
+      semi_join(to_join, by = "some_grouping") %>%
       collect(),
     left
   )


[arrow] 01/10: GH-33687: [Dev] Fix commit message generation in merge script (#33691)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 461c17d0da010770d00597b14efcf2d311ff3167
Author: Antoine Pitrou <an...@python.org>
AuthorDate: Mon Jan 16 14:59:02 2023 +0100

    GH-33687: [Dev] Fix commit message generation in merge script (#33691)
    
    * Regex for removing HTML comments was pathologically slow because of greedy pattern matching
    * Output of regex replacement was ignored (!)
    * Collapse extraneous newlines in generated commit message
    * Improve debugging output
    
    * Closes: #33687
    
    Authored-by: Antoine Pitrou <an...@python.org>
    Signed-off-by: Antoine Pitrou <an...@python.org>
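
For reference, a small standalone sketch of the regex changes described above. The sample PR body text and variable names are illustrative only; the substitution patterns themselves are the ones used in the patched merge script.

```python
import re

# An illustrative PR description with an HTML comment, a user mention and
# extra blank lines, roughly what GitHub issue templates produce.
body = ("Fix crash in reader.\n\n"
        "<!--\nPlease describe your changes.\n-->\n\n\n"
        "Thanks @someuser for the report.")

# Non-greedy match plus re.DOTALL removes each comment without the
# pathological backtracking of the old greedy `(.|\s)*` pattern.
body = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL)

# Insert a space after '@' so the commit message does not ping the user.
body = re.sub(r"@(\w+)", "@ \\1", body)

# Normalize line ends and collapse runs of blank lines into a single
# paragraph break, as the merge script now does.
body = "\n".join(body.splitlines())
body = re.sub("\n{2,}", "\n\n", body)

print(body)
# Fix crash in reader.
#
# Thanks @ someuser for the report.
```
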
---
 dev/merge_arrow_pr.py | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py
index 54729c0a52..352befc328 100755
--- a/dev/merge_arrow_pr.py
+++ b/dev/merge_arrow_pr.py
@@ -575,10 +575,10 @@ class PullRequest(object):
         commit_title = f'{self.title} (#{self.number})'
         commit_message_chunks = []
         if self.body is not None:
-            # Remove comments (i.e. <-- comment -->) from the pull request description.
-            body = re.sub(r"<!--(.|\s)*-->", "", self.body)
+            # Remove comments (i.e. <-- comment -->) from the PR description.
+            body = re.sub(r"<!--.*?-->", "", self.body, flags=re.DOTALL)
             # avoid github user name references by inserting a space after @
-            body = re.sub(r"@(\w+)", "@ \\1", self.body)
+            body = re.sub(r"@(\w+)", "@ \\1", body)
             commit_message_chunks.append(body)
 
         committer_name = run_cmd("git config --get user.name").strip()
@@ -596,9 +596,16 @@ class PullRequest(object):
 
         commit_message = "\n\n".join(commit_message_chunks)
 
+        # Normalize line ends and collapse extraneous newlines. We allow two
+        # consecutive newlines for paragraph breaks but not more.
+        commit_message = "\n".join(commit_message.splitlines())
+        commit_message = re.sub("\n{2,}", "\n\n", commit_message)
+
         if DEBUG:
+            print("*** Commit title ***")
             print(commit_title)
             print()
+            print("*** Commit message ***")
             print(commit_message)
 
         if DEBUG:


[arrow] 09/10: GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow (#33614)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 66332515d0fae4529be85351948a2e19f7336f49
Author: Nic Crane <th...@gmail.com>
AuthorDate: Tue Jan 17 18:55:00 2023 +0000

    GH-33526: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow (#33614)
    
    This PR implements a wrapper around `open_dataset()` specifically for value-delimited files. It takes the parameters from `open_dataset()` and appends the parameters of `read_csv_arrow()` which are compatible with `open_dataset()`. This should make it easier for users to switch between the two, e.g.:
    
    ``` r
    library(arrow)
    library(dplyr)
    
    # Set up directory for examples
    tf <- tempfile()
    dir.create(tf)
    on.exit(unlink(tf))
    df <- data.frame(x = c("1", "2", "NULL"))
    
    file_path <- file.path(tf, "file1.txt")
    write.table(df, file_path, sep = ",", row.names = FALSE)
    
    read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
    #> # A tibble: 3 × 1
    #>       y
    #>   <int>
    #> 1     1
    #> 2     2
    #> 3    NA
    
    open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1) %>% collect()
    #> # A tibble: 3 × 1
    #>       y
    #>   <int>
    #> 1     1
    #> 2     2
    #> 3    NA
    ```
    
    This PR also hooks up the "na" (readr-style) parameter to "null_values" (i.e. the CSVConvertOptions parameter).
    
    In the process of making this PR, I also refactored `CsvFileFormat$create()`. Unfortunately, many changes needed to be made at once, which has considerably increased the size/complexity of this PR.
    
    Authored-by: Nic Crane <th...@gmail.com>
    Signed-off-by: Nic Crane <th...@gmail.com>
---
 r/NAMESPACE                         |   3 +
 r/R/csv.R                           |   2 +-
 r/R/dataset-format.R                | 252 ++++++++++++++++++++++++------------
 r/R/dataset.R                       | 124 ++++++++++++++++++
 r/_pkgdown.yml                      |   4 +
 r/man/CsvFileFormat.Rd              |  41 ++++++
 r/man/FileFormat.Rd                 |   3 +-
 r/man/acero.Rd                      |   4 +-
 r/man/open_delim_dataset.Rd         | 216 +++++++++++++++++++++++++++++++
 r/tests/testthat/test-dataset-csv.R |  90 ++++++++++++-
 10 files changed, 648 insertions(+), 91 deletions(-)

diff --git a/r/NAMESPACE b/r/NAMESPACE
index cde81d977b..3df107a2d8 100644
--- a/r/NAMESPACE
+++ b/r/NAMESPACE
@@ -348,7 +348,10 @@ export(new_extension_type)
 export(null)
 export(num_range)
 export(one_of)
+export(open_csv_dataset)
 export(open_dataset)
+export(open_delim_dataset)
+export(open_tsv_dataset)
 export(read_csv_arrow)
 export(read_delim_arrow)
 export(read_feather)
diff --git a/r/R/csv.R b/r/R/csv.R
index 08f30fdefd..135394b967 100644
--- a/r/R/csv.R
+++ b/r/R/csv.R
@@ -500,7 +500,7 @@ CsvWriteOptions$create <- function(include_header = TRUE, batch_size = 1024L, nu
   )
 }
 
-readr_to_csv_read_options <- function(skip = 0, col_names = TRUE, col_types = NULL) {
+readr_to_csv_read_options <- function(skip = 0, col_names = TRUE) {
   if (isTRUE(col_names)) {
     # C++ default to parse is 0-length string array
     col_names <- character(0)
diff --git a/r/R/dataset-format.R b/r/R/dataset-format.R
index 7b88d9b8e0..0912941e64 100644
--- a/r/R/dataset-format.R
+++ b/r/R/dataset-format.R
@@ -53,7 +53,7 @@
 #' It returns the appropriate subclass of `FileFormat` (e.g. `ParquetFileFormat`)
 #' @rdname FileFormat
 #' @name FileFormat
-#' @examplesIf arrow_with_dataset() && tolower(Sys.info()[["sysname"]]) != "windows"
+#' @examplesIf arrow_with_dataset()
 #' ## Semi-colon delimited files
 #' # Set up directory for examples
 #' tf <- tempfile()
@@ -113,107 +113,105 @@ ParquetFileFormat$create <- function(...,
 #' @export
 IpcFileFormat <- R6Class("IpcFileFormat", inherit = FileFormat)
 
-#' @usage NULL
-#' @format NULL
-#' @rdname FileFormat
+#' CSV dataset file format
+#'
+#' @description
+#' A `CSVFileFormat` is a [FileFormat] subclass which holds information about how to
+#' read and parse the files included in a CSV `Dataset`.
+#'
+#' @section Factory:
+#' `CSVFileFormat$create()` can take options in the form of lists passed through as `parse_options`,
+#'  `read_options`, or `convert_options` parameters.  Alternatively, readr-style options can be passed
+#'  through individually.  While it is possible to pass in `CSVReadOptions`, `CSVConvertOptions`, and `CSVParseOptions`
+#'  objects, this is not recommended as options set in these objects are not validated for compatibility.
+#'
+#' @return A `CsvFileFormat` object
+#' @rdname CsvFileFormat
+#' @name CsvFileFormat
+#' @seealso [FileFormat]
+#' @examplesIf arrow_with_dataset()
+#' # Set up directory for examples
+#' tf <- tempfile()
+#' dir.create(tf)
+#' on.exit(unlink(tf))
+#' df <- data.frame(x = c("1", "2", "NULL"))
+#' write.table(df, file.path(tf, "file1.txt"), sep = ",", row.names = FALSE)
+#'
+#' # Create CsvFileFormat object with Arrow-style null_values option
+#' format <- CsvFileFormat$create(convert_options = list(null_values = c("", "NA", "NULL")))
+#' open_dataset(tf, format = format)
+#'
+#' # Use readr-style options
+#' format <- CsvFileFormat$create(na = c("", "NA", "NULL"))
+#' open_dataset(tf, format = format)
+#'
 #' @export
 CsvFileFormat <- R6Class("CsvFileFormat", inherit = FileFormat)
-CsvFileFormat$create <- function(...,
-                                 opts = csv_file_format_parse_options(...),
-                                 convert_options = csv_file_format_convert_opts(...),
-                                 read_options = csv_file_format_read_opts(...)) {
-  check_csv_file_format_args(...)
-  # Evaluate opts first to catch any unsupported arguments
-  force(opts)
-
-  options <- list(...)
-  schema <- options[["schema"]]
-  if (!is.null(schema) && !inherits(schema, "Schema")) {
-    abort(paste0(
-      "`schema` must be an object of class 'Schema' not '",
-      class(schema)[1],
-      "'."
-    ))
-  }
-
-  if (!inherits(read_options, "CsvReadOptions")) {
-    read_options <- do.call(CsvReadOptions$create, read_options)
-  }
+CsvFileFormat$create <- function(...) {
+  dots <- list(...)
+  options <- check_csv_file_format_args(dots)
+  check_schema(options[["schema"]], options[["read_options"]]$column_names)
 
-  if (!inherits(convert_options, "CsvConvertOptions")) {
-    convert_options <- do.call(CsvConvertOptions$create, convert_options)
-  }
-
-  if (!inherits(opts, "CsvParseOptions")) {
-    opts <- do.call(CsvParseOptions$create, opts)
-  }
-
-  column_names <- read_options$column_names
-  schema_names <- names(schema)
+  dataset___CsvFileFormat__Make(options$parse_options, options$convert_options, options$read_options)
+}
 
-  if (!is.null(schema) && !identical(schema_names, column_names)) {
-    missing_from_schema <- setdiff(column_names, schema_names)
-    missing_from_colnames <- setdiff(schema_names, column_names)
-    message_colnames <- NULL
-    message_schema <- NULL
-    message_order <- NULL
+# Check all arguments are valid
+check_csv_file_format_args <- function(args) {
+  options <- list(
+    parse_options = args$parse_options,
+    convert_options = args$convert_options,
+    read_options = args$read_options,
+    schema = args$schema
+  )
 
-    if (length(missing_from_colnames) > 0) {
-      message_colnames <- paste(
-        oxford_paste(missing_from_colnames, quote_symbol = "`"),
-        "not present in `column_names`"
-      )
-    }
+  check_unsupported_args(args)
+  check_unrecognised_args(args)
 
-    if (length(missing_from_schema) > 0) {
-      message_schema <- paste(
-        oxford_paste(missing_from_schema, quote_symbol = "`"),
-        "not present in `schema`"
-      )
-    }
+  # Evaluate parse_options first to catch any unsupported arguments
+  if (is.null(args$parse_options)) {
+    options$parse_options <- do.call(csv_file_format_parse_opts, args)
+  } else if (is.list(args$parse_options)) {
+    options$parse_options <- do.call(CsvParseOptions$create, args$parse_options)
+  }
 
-    if (length(missing_from_schema) == 0 && length(missing_from_colnames) == 0) {
-      message_order <- "`column_names` and `schema` field names match but are not in the same order"
-    }
+  if (is.null(args$convert_options)) {
+    options$convert_options <- do.call(csv_file_format_convert_opts, args)
+  } else if (is.list(args$convert_options)) {
+    options$convert_options <- do.call(CsvConvertOptions$create, args$convert_options)
+  }
 
-    abort(
-      c(
-        "Values in `column_names` must match `schema` field names",
-        x = message_order,
-        x = message_schema,
-        x = message_colnames
-      )
-    )
+  if (is.null(args$read_options)) {
+    options$read_options <- do.call(csv_file_format_read_opts, args)
+  } else if (is.list(args$read_options)) {
+    options$read_options <- do.call(CsvReadOptions$create, args$read_options)
   }
 
-  dataset___CsvFileFormat__Make(opts, convert_options, read_options)
+  options
 }
 
-# Check all arguments are valid
-check_csv_file_format_args <- function(...) {
-  opts <- list(...)
+check_unsupported_args <- function(args) {
+  opt_names <- get_opt_names(args)
+
   # Filter out arguments meant for CsvConvertOptions/CsvReadOptions
-  convert_opts <- c(names(formals(CsvConvertOptions$create)))
+  supported_convert_opts <- c(names(formals(CsvConvertOptions$create)), "na")
 
-  read_opts <- c(
+  supported_read_opts <- c(
     names(formals(CsvReadOptions$create)),
     names(formals(readr_to_csv_read_options))
   )
 
   # We only currently support all of the readr options for parseoptions
-  parse_opts <- c(
+  supported_parse_opts <- c(
     names(formals(CsvParseOptions$create)),
     names(formals(readr_to_csv_parse_options))
   )
 
-  opt_names <- names(opts)
-
   # Catch any readr-style options specified with full option names that are
   # supported by read_delim_arrow() (and its wrappers) but are not yet
   # supported here
   unsup_readr_opts <- setdiff(
     names(formals(read_delim_arrow)),
-    c(convert_opts, read_opts, parse_opts, "schema")
+    c(supported_convert_opts, supported_read_opts, supported_parse_opts, "schema")
   )
 
   is_unsup_opt <- opt_names %in% unsup_readr_opts
@@ -228,9 +226,36 @@ check_csv_file_format_args <- function(...) {
       call. = FALSE
     )
   }
+}
+
+# unlists "parse_options", "convert_options", "read_options" and returns them along with
+# names of options passed in individually via args.  `get_opt_names()` ignores any
+# CSV*Options objects passed in as these are not validated - users must ensure they've
+# chosen reasonable values in this case.
+get_opt_names <- function(args) {
+  opt_names <- names(args)
+
+  # extract names of parse_options, read_options, and convert_options
+  if ("parse_options" %in% names(args) && is.list(args[["parse_options"]])) {
+    opt_names <- c(opt_names, names(args[["parse_options"]]))
+  }
+
+  if ("read_options" %in% names(args) && is.list(args[["read_options"]])) {
+    opt_names <- c(opt_names, names(args[["read_options"]]))
+  }
 
+  if ("convert_options" %in% names(args) && is.list(args[["convert_options"]])) {
+    opt_names <- c(opt_names, names(args[["convert_options"]]))
+  }
+
+  setdiff(opt_names, c("parse_options", "read_options", "convert_options"))
+}
+
+check_unrecognised_args <- function(opts) {
   # Catch any options with full or partial names that do not match any of the
   # recognized Arrow C++ option names or readr-style option names
+  opt_names <- get_opt_names(opts)
+
   arrow_opts <- c(
     names(formals(CsvParseOptions$create)),
     names(formals(CsvReadOptions$create)),
@@ -240,7 +265,8 @@ check_csv_file_format_args <- function(...) {
 
   readr_opts <- c(
     names(formals(readr_to_csv_parse_options)),
-    names(formals(readr_to_csv_read_options))
+    names(formals(readr_to_csv_read_options)),
+    "na"
   )
 
   is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
@@ -271,26 +297,74 @@ check_ambiguous_options <- function(passed_opts, opts1, opts2) {
   }
 }
 
+check_schema <- function(schema, column_names) {
+  if (!is.null(schema) && !inherits(schema, "Schema")) {
+    abort(paste0(
+      "`schema` must be an object of class 'Schema' not '",
+      class(schema)[1],
+      "'."
+    ))
+  }
+
+  schema_names <- names(schema)
+
+  if (!is.null(schema) && !identical(schema_names, column_names)) {
+    missing_from_schema <- setdiff(column_names, schema_names)
+    missing_from_colnames <- setdiff(schema_names, column_names)
+    message_colnames <- NULL
+    message_schema <- NULL
+    message_order <- NULL
+
+    if (length(missing_from_colnames) > 0) {
+      message_colnames <- paste(
+        oxford_paste(missing_from_colnames, quote_symbol = "`"),
+        "not present in `column_names`"
+      )
+    }
+
+    if (length(missing_from_schema) > 0) {
+      message_schema <- paste(
+        oxford_paste(missing_from_schema, quote_symbol = "`"),
+        "not present in `schema`"
+      )
+    }
+
+    if (length(missing_from_schema) == 0 && length(missing_from_colnames) == 0) {
+      message_order <- "`column_names` and `schema` field names match but are not in the same order"
+    }
+
+    abort(
+      c(
+        "Values in `column_names` must match `schema` field names",
+        x = message_order,
+        x = message_schema,
+        x = message_colnames
+      )
+    )
+  }
+}
+
 # Support both readr-style option names and Arrow C++ option names
-csv_file_format_parse_options <- function(...) {
+csv_file_format_parse_opts <- function(...) {
   opts <- list(...)
   # Filter out arguments meant for CsvConvertOptions/CsvReadOptions
-  convert_opts <- names(formals(CsvConvertOptions$create))
+  convert_opts <- c(names(formals(CsvConvertOptions$create)), "na", "convert_options")
   read_opts <- c(
     names(formals(CsvReadOptions$create)),
-    names(formals(readr_to_csv_read_options))
+    names(formals(readr_to_csv_read_options)),
+    "read_options"
   )
   opts[convert_opts] <- NULL
   opts[read_opts] <- NULL
   opts[["schema"]] <- NULL
-  opt_names <- names(opts)
+  opts[["parse_options"]] <- NULL
+  opt_names <- get_opt_names(opts)
 
   arrow_opts <- c(names(formals(CsvParseOptions$create)))
   readr_opts <- c(names(formals(readr_to_csv_parse_options)))
 
   is_arrow_opt <- !is.na(pmatch(opt_names, arrow_opts))
   is_readr_opt <- !is.na(pmatch(opt_names, readr_opts))
-
   # Catch options with ambiguous partial names (such as "del") that make it
   # unclear whether the user is specifying Arrow C++ options ("delimiter") or
   # readr-style options ("delim")
@@ -313,28 +387,38 @@ csv_file_format_parse_options <- function(...) {
 csv_file_format_convert_opts <- function(...) {
   opts <- list(...)
   # Filter out arguments meant for CsvParseOptions/CsvReadOptions
-  arrow_opts <- names(formals(CsvParseOptions$create))
+  arrow_opts <- c(names(formals(CsvParseOptions$create)), "parse_options")
   readr_opts <- names(formals(readr_to_csv_parse_options))
   read_opts <- c(
     names(formals(CsvReadOptions$create)),
-    names(formals(readr_to_csv_read_options))
+    names(formals(readr_to_csv_read_options)),
+    "read_options"
   )
   opts[arrow_opts] <- NULL
   opts[readr_opts] <- NULL
   opts[read_opts] <- NULL
   opts[["schema"]] <- NULL
+  opts[["convert_options"]] <- NULL
+
+  # map "na" to "null_values"
+  if ("na" %in% names(opts)) {
+    opts[["null_values"]] <- opts[["na"]]
+    opts[["na"]] <- NULL
+  }
+
   do.call(CsvConvertOptions$create, opts)
 }
 
 csv_file_format_read_opts <- function(schema = NULL, ...) {
   opts <- list(...)
   # Filter out arguments meant for CsvParseOptions/CsvConvertOptions
-  arrow_opts <- names(formals(CsvParseOptions$create))
+  arrow_opts <- c(names(formals(CsvParseOptions$create)), "parse_options")
   readr_opts <- names(formals(readr_to_csv_parse_options))
-  convert_opts <- names(formals(CsvConvertOptions$create))
+  convert_opts <- c(names(formals(CsvConvertOptions$create)), "na", "convert_options")
   opts[arrow_opts] <- NULL
   opts[readr_opts] <- NULL
   opts[convert_opts] <- NULL
+  opts[["read_options"]] <- NULL
 
   opt_names <- names(opts)
   arrow_opts <- c(names(formals(CsvReadOptions$create)))
diff --git a/r/R/dataset.R b/r/R/dataset.R
index 732c05ecb0..71247b3581 100644
--- a/r/R/dataset.R
+++ b/r/R/dataset.R
@@ -228,6 +228,130 @@ open_dataset <- function(sources,
   )
 }
 
+#' Open a multi-file dataset of CSV or other delimiter-separated format
+#'
+#' A wrapper around [open_dataset] which explicitly includes parameters mirroring [read_csv_arrow()],
+#' [read_delim_arrow()], and [read_tsv_arrow()] to allows for easy switching between functions
+#' for opening single files and functions for opening datasets.
+#'
+#' @inheritParams open_dataset
+#' @inheritParams read_delim_arrow
+#'
+#' @section Options currently supported by [read_delim_arrow()] which are not supported here:
+#' * `file` (instead, please specify files in `sources`)
+#' * `col_select` (instead, subset columns after dataset creation)
+#' * `quoted_na`
+#' * `as_data_frame` (instead, convert to data frame after dataset creation)
+#' * `parse_options`
+#'
+#' @examplesIf arrow_with_dataset()
+#' # Set up directory for examples
+#' tf <- tempfile()
+#' dir.create(tf)
+#' df <- data.frame(x = c("1", "2", "NULL"))
+#'
+#' file_path <- file.path(tf, "file1.txt")
+#' write.table(df, file_path, sep = ",", row.names = FALSE)
+#'
+#' read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
+#' open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
+#'
+#' unlink(tf)
+#' @seealso [open_dataset()]
+#' @export
+open_delim_dataset <- function(sources,
+                               schema = NULL,
+                               partitioning = hive_partition(),
+                               hive_style = NA,
+                               unify_schemas = NULL,
+                               factory_options = list(),
+                               delim = ",",
+                               quote = "\"",
+                               escape_double = TRUE,
+                               escape_backslash = FALSE,
+                               col_names = TRUE,
+                               col_types = NULL,
+                               na = c("", "NA"),
+                               skip_empty_rows = TRUE,
+                               skip = 0L,
+                               convert_options = NULL,
+                               read_options = NULL,
+                               timestamp_parsers = NULL) {
+  open_dataset(
+    sources = sources,
+    schema = schema,
+    partitioning = partitioning,
+    hive_style = hive_style,
+    unify_schemas = unify_schemas,
+    factory_options = factory_options,
+    format = "text",
+    delim = delim,
+    quote = quote,
+    escape_double = escape_double,
+    escape_backslash = escape_backslash,
+    col_names = col_names,
+    col_types = col_types,
+    na = na,
+    skip_empty_rows = skip_empty_rows,
+    skip = skip,
+    convert_options = convert_options,
+    read_options = read_options,
+    timestamp_parsers = timestamp_parsers
+  )
+}
+
+#' @rdname open_delim_dataset
+#' @export
+open_csv_dataset <- function(sources,
+                             schema = NULL,
+                             partitioning = hive_partition(),
+                             hive_style = NA,
+                             unify_schemas = NULL,
+                             factory_options = list(),
+                             quote = "\"",
+                             escape_double = TRUE,
+                             escape_backslash = FALSE,
+                             col_names = TRUE,
+                             col_types = NULL,
+                             na = c("", "NA"),
+                             skip_empty_rows = TRUE,
+                             skip = 0L,
+                             convert_options = NULL,
+                             read_options = NULL,
+                             timestamp_parsers = NULL) {
+  mc <- match.call()
+  mc$delim <- ","
+  mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow"))
+  eval.parent(mc)
+}
+
+#' @rdname open_delim_dataset
+#' @export
+open_tsv_dataset <- function(sources,
+                             schema = NULL,
+                             partitioning = hive_partition(),
+                             hive_style = NA,
+                             unify_schemas = NULL,
+                             factory_options = list(),
+                             quote = "\"",
+                             escape_double = TRUE,
+                             escape_backslash = FALSE,
+                             col_names = TRUE,
+                             col_types = NULL,
+                             na = c("", "NA"),
+                             skip_empty_rows = TRUE,
+                             skip = 0L,
+                             convert_options = NULL,
+                             read_options = NULL,
+                             timestamp_parsers = NULL) {
+  mc <- match.call()
+  mc$delim <- "\t"
+  mc[[1]] <- get("open_delim_dataset", envir = asNamespace("arrow"))
+  eval.parent(mc)
+}
+
+
+
 #' Multi-file datasets
 #'
 #' @description
diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml
index 3d5dc2d1f2..391d340769 100644
--- a/r/_pkgdown.yml
+++ b/r/_pkgdown.yml
@@ -141,6 +141,9 @@ reference:
   - title: Multi-file datasets
     contents:
       - open_dataset
+      - open_delim_dataset
+      - open_csv_dataset
+      - open_tsv_dataset
       - write_dataset
       - dataset_factory
       - hive_partition
@@ -149,6 +152,7 @@ reference:
       - Expression
       - Scanner
       - FileFormat
+      - CsvFileFormat
       - FileWriteOptions
       - FragmentScanOptions
       - map_batches
diff --git a/r/man/CsvFileFormat.Rd b/r/man/CsvFileFormat.Rd
new file mode 100644
index 0000000000..aa368b8f29
--- /dev/null
+++ b/r/man/CsvFileFormat.Rd
@@ -0,0 +1,41 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset-format.R
+\name{CsvFileFormat}
+\alias{CsvFileFormat}
+\title{CSV dataset file format}
+\value{
+A \code{CsvFileFormat} object
+}
+\description{
+A \code{CSVFileFormat} is a \link{FileFormat} subclass which holds information about how to
+read and parse the files included in a CSV \code{Dataset}.
+}
+\section{Factory}{
+
+\code{CSVFileFormat$create()} can take options in the form of lists passed through as \code{parse_options},
+\code{read_options}, or \code{convert_options} parameters.  Alternatively, readr-style options can be passed
+through individually.  While it is possible to pass in \code{CSVReadOptions}, \code{CSVConvertOptions}, and \code{CSVParseOptions}
+objects, this is not recommended as options set in these objects are not validated for compatibility.
+}
+
+\examples{
+\dontshow{if (arrow_with_dataset()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
+# Set up directory for examples
+tf <- tempfile()
+dir.create(tf)
+on.exit(unlink(tf))
+df <- data.frame(x = c("1", "2", "NULL"))
+write.table(df, file.path(tf, "file1.txt"), sep = ",", row.names = FALSE)
+
+# Create CsvFileFormat object with Arrow-style null_values option
+format <- CsvFileFormat$create(convert_options = list(null_values = c("", "NA", "NULL")))
+open_dataset(tf, format = format)
+
+# Use readr-style options
+format <- CsvFileFormat$create(na = c("", "NA", "NULL"))
+open_dataset(tf, format = format)
+\dontshow{\}) # examplesIf}
+}
+\seealso{
+\link{FileFormat}
+}
diff --git a/r/man/FileFormat.Rd b/r/man/FileFormat.Rd
index 3c6fd330b0..296de02ead 100644
--- a/r/man/FileFormat.Rd
+++ b/r/man/FileFormat.Rd
@@ -4,7 +4,6 @@
 \alias{FileFormat}
 \alias{ParquetFileFormat}
 \alias{IpcFileFormat}
-\alias{CsvFileFormat}
 \title{Dataset file formats}
 \description{
 A \code{FileFormat} holds information about how to read and parse the files
@@ -52,7 +51,7 @@ It returns the appropriate subclass of \code{FileFormat} (e.g. \code{ParquetFile
 }
 
 \examples{
-\dontshow{if (arrow_with_dataset() && tolower(Sys.info()[["sysname"]]) != "windows") (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
+\dontshow{if (arrow_with_dataset()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
 ## Semi-colon delimited files
 # Set up directory for examples
 tf <- tempfile()
diff --git a/r/man/acero.Rd b/r/man/acero.Rd
index a54b210e18..b8aed28825 100644
--- a/r/man/acero.Rd
+++ b/r/man/acero.Rd
@@ -53,9 +53,9 @@ Table into an R \code{data.frame}.
 \item \code{\link[dplyr:slice]{slice_tail()}}: slicing within groups not supported; Arrow datasets do not have row order, so tail is non-deterministic; \code{prop} only supported on queries where \code{nrow()} is knowable without evaluating
 \item \code{\link[dplyr:summarise]{summarise()}}: window functions not currently supported; arguments \code{.drop = FALSE} and `.groups = "rowwise" not supported
 \item \code{\link[dplyr:count]{tally()}}
-\item \code{\link[dplyr:mutate]{transmute()}}
+\item \code{\link[dplyr:transmute]{transmute()}}
 \item \code{\link[dplyr:group_by]{ungroup()}}
-\item \code{\link[dplyr:reexports]{union()}}
+\item \code{\link[dplyr:setops]{union()}}
 \item \code{\link[dplyr:setops]{union_all()}}
 }
 }
diff --git a/r/man/open_delim_dataset.Rd b/r/man/open_delim_dataset.Rd
new file mode 100644
index 0000000000..d127f772c6
--- /dev/null
+++ b/r/man/open_delim_dataset.Rd
@@ -0,0 +1,216 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/dataset.R
+\name{open_delim_dataset}
+\alias{open_delim_dataset}
+\alias{open_csv_dataset}
+\alias{open_tsv_dataset}
+\title{Open a multi-file dataset of CSV or other delimiter-separated format}
+\usage{
+open_delim_dataset(
+  sources,
+  schema = NULL,
+  partitioning = hive_partition(),
+  hive_style = NA,
+  unify_schemas = NULL,
+  factory_options = list(),
+  delim = ",",
+  quote = "\\"",
+  escape_double = TRUE,
+  escape_backslash = FALSE,
+  col_names = TRUE,
+  col_types = NULL,
+  na = c("", "NA"),
+  skip_empty_rows = TRUE,
+  skip = 0L,
+  convert_options = NULL,
+  read_options = NULL,
+  timestamp_parsers = NULL
+)
+
+open_csv_dataset(
+  sources,
+  schema = NULL,
+  partitioning = hive_partition(),
+  hive_style = NA,
+  unify_schemas = NULL,
+  factory_options = list(),
+  quote = "\\"",
+  escape_double = TRUE,
+  escape_backslash = FALSE,
+  col_names = TRUE,
+  col_types = NULL,
+  na = c("", "NA"),
+  skip_empty_rows = TRUE,
+  skip = 0L,
+  convert_options = NULL,
+  read_options = NULL,
+  timestamp_parsers = NULL
+)
+
+open_tsv_dataset(
+  sources,
+  schema = NULL,
+  partitioning = hive_partition(),
+  hive_style = NA,
+  unify_schemas = NULL,
+  factory_options = list(),
+  quote = "\\"",
+  escape_double = TRUE,
+  escape_backslash = FALSE,
+  col_names = TRUE,
+  col_types = NULL,
+  na = c("", "NA"),
+  skip_empty_rows = TRUE,
+  skip = 0L,
+  convert_options = NULL,
+  read_options = NULL,
+  timestamp_parsers = NULL
+)
+}
+\arguments{
+\item{sources}{One of:
+\itemize{
+\item a string path or URI to a directory containing data files
+\item a \link{FileSystem} that references a directory containing data files
+(such as what is returned by \code{\link[=s3_bucket]{s3_bucket()}})
+\item a string path or URI to a single file
+\item a character vector of paths or URIs to individual data files
+\item a list of \code{Dataset} objects as created by this function
+\item a list of \code{DatasetFactory} objects as created by \code{\link[=dataset_factory]{dataset_factory()}}.
+}
+
+When \code{sources} is a vector of file URIs, they must all use the same protocol
+and point to files located in the same file system and having the same
+format.}
+
+\item{schema}{\link{Schema} for the \code{Dataset}. If \code{NULL} (the default), the schema
+will be inferred from the data sources.}
+
+\item{partitioning}{When \code{sources} is a directory path/URI, one of:
+\itemize{
+\item a \code{Schema}, in which case the file paths relative to \code{sources} will be
+parsed, and path segments will be matched with the schema fields.
+\item a character vector that defines the field names corresponding to those
+path segments (that is, you're providing the names that would correspond
+to a \code{Schema} but the types will be autodetected)
+\item a \code{Partitioning} or \code{PartitioningFactory}, such as returned
+by \code{\link[=hive_partition]{hive_partition()}}
+\item \code{NULL} for no partitioning
+}
+
+The default is to autodetect Hive-style partitions unless
+\code{hive_style = FALSE}. See the "Partitioning" section for details.
+When \code{sources} is not a directory path/URI, \code{partitioning} is ignored.}
+
+\item{hive_style}{Logical: should \code{partitioning} be interpreted as
+Hive-style? Default is \code{NA}, which means to inspect the file paths for
+Hive-style partitioning and behave accordingly.}
+
+\item{unify_schemas}{logical: should all data fragments (files, \code{Dataset}s)
+be scanned in order to create a unified schema from them? If \code{FALSE}, only
+the first fragment will be inspected for its schema. Use this fast path
+when you know and trust that all fragments have an identical schema.
+The default is \code{FALSE} when creating a dataset from a directory path/URI or
+vector of file paths/URIs (because there may be many files and scanning may
+be slow) but \code{TRUE} when \code{sources} is a list of \code{Dataset}s (because there
+should be few \code{Dataset}s in the list and their \code{Schema}s are already in
+memory).}
+
+\item{factory_options}{list of optional FileSystemFactoryOptions:
+\itemize{
+\item \code{partition_base_dir}: string path segment prefix to ignore when
+discovering partition information with DirectoryPartitioning. Not
+meaningful (ignored with a warning) for HivePartitioning, nor is it
+valid when providing a vector of file paths.
+\item \code{exclude_invalid_files}: logical: should files that are not valid data
+files be excluded? Default is \code{FALSE} because checking all files up
+front incurs I/O and thus will be slower, especially on remote
+filesystems. If false and there are invalid files, there will be an
+error at scan time. This is the only FileSystemFactoryOption that is
+valid for both when providing a directory path in which to discover
+files and when providing a vector of file paths.
+\item \code{selector_ignore_prefixes}: character vector of file prefixes to ignore
+when discovering files in a directory. If invalid files can be excluded
+by a common filename prefix this way, you can avoid the I/O cost of
+\code{exclude_invalid_files}. Not valid when providing a vector of file paths
+(but if you're providing the file list, you can filter invalid files
+yourself).
+}}
+
+\item{delim}{Single character used to separate fields within a record.}
+
+\item{quote}{Single character used to quote strings.}
+
+\item{escape_double}{Does the file escape quotes by doubling them?
+i.e. If this option is \code{TRUE}, the value \verb{""""} represents
+a single quote, \verb{\\"}.}
+
+\item{escape_backslash}{Does the file use backslashes to escape special
+characters? This is more general than \code{escape_double} as backslashes
+can be used to escape the delimiter character, the quote character, or
+to add special characters like \verb{\\\\n}.}
+
+\item{col_names}{If \code{TRUE}, the first row of the input will be used as the
+column names and will not be included in the data frame. If \code{FALSE}, column
+names will be generated by Arrow, starting with "f0", "f1", ..., "fN".
+Alternatively, you can specify a character vector of column names.}
+
+\item{col_types}{A compact string representation of the column types,
+an Arrow \link{Schema}, or \code{NULL} (the default) to infer types from the data.}
+
+\item{na}{A character vector of strings to interpret as missing values.}
+
+\item{skip_empty_rows}{Should blank rows be ignored altogether? If
+\code{TRUE}, blank rows will not be represented at all. If \code{FALSE}, they will be
+filled with missings.}
+
+\item{skip}{Number of lines to skip before reading data.}
+
+\item{convert_options}{see \link[=CsvReadOptions]{file reader options}}
+
+\item{read_options}{see \link[=CsvReadOptions]{file reader options}}
+
+\item{timestamp_parsers}{User-defined timestamp parsers. If more than one
+parser is specified, the CSV conversion logic will try parsing values
+starting from the beginning of this vector. Possible values are:
+\itemize{
+\item \code{NULL}: the default, which uses the ISO-8601 parser
+\item a character vector of \link[base:strptime]{strptime} parse strings
+\item a list of \link{TimestampParser} objects
+}}
+}
+\description{
+A wrapper around \link{open_dataset} which explicitly includes parameters mirroring \code{\link[=read_csv_arrow]{read_csv_arrow()}},
+\code{\link[=read_delim_arrow]{read_delim_arrow()}}, and \code{\link[=read_tsv_arrow]{read_tsv_arrow()}} to allows for easy switching between functions
+for opening single files and functions for opening datasets.
+}
+\section{Options currently supported by \code{\link[=read_delim_arrow]{read_delim_arrow()}} which are not supported here}{
+
+\itemize{
+\item \code{file} (instead, please specify files in \code{sources})
+\item \code{col_select} (instead, subset columns after dataset creation)
+\item \code{quoted_na}
+\item \code{as_data_frame} (instead, convert to data frame after dataset creation)
+\item \code{parse_options}
+}
+}
+
+\examples{
+\dontshow{if (arrow_with_dataset()) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
+# Set up directory for examples
+tf <- tempfile()
+dir.create(tf)
+df <- data.frame(x = c("1", "2", "NULL"))
+
+file_path <- file.path(tf, "file1.txt")
+write.table(df, file_path, sep = ",", row.names = FALSE)
+
+read_csv_arrow(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
+open_csv_dataset(file_path, na = c("", "NA", "NULL"), col_names = "y", skip = 1)
+
+unlink(tf)
+\dontshow{\}) # examplesIf}
+}
+\seealso{
+\code{\link[=open_dataset]{open_dataset()}}
+}
diff --git a/r/tests/testthat/test-dataset-csv.R b/r/tests/testthat/test-dataset-csv.R
index 436db985fb..b25c57b2ba 100644
--- a/r/tests/testthat/test-dataset-csv.R
+++ b/r/tests/testthat/test-dataset-csv.R
@@ -218,9 +218,9 @@ test_that("readr parse options", {
     character(0)
   )
 
-  # With not yet supported readr parse options (ARROW-8631)
+  # With not yet supported readr parse options
   expect_error(
-    open_dataset(tsv_dir, partitioning = "part", delim = "\t", na = "\\N"),
+    open_dataset(tsv_dir, partitioning = "part", delim = "\t", quoted_na = TRUE),
     "supported"
   )
 
@@ -476,3 +476,89 @@ test_that("CSV reading/parsing/convert options can be passed in as lists", {
 
   expect_equal(ds1, ds2)
 })
+
+test_that("open_delim_dataset params passed through to open_dataset", {
+  ds <- open_delim_dataset(csv_dir, delim = ",", partitioning = "part")
+  expect_r6_class(ds$format, "CsvFileFormat")
+  expect_r6_class(ds$filesystem, "LocalFileSystem")
+  expect_identical(names(ds), c(names(df1), "part"))
+  expect_identical(dim(ds), c(20L, 7L))
+
+  # quote
+  dst_dir <- make_temp_dir()
+  dst_file <- file.path(dst_dir, "data.csv")
+
+  df <- data.frame(a = c(1, 2), b = c("'abc'", "'def'"))
+  write.csv(df, dst_file, row.names = FALSE, quote = FALSE)
+
+  ds_quote <- open_csv_dataset(dst_dir, quote = "'") %>% collect()
+  expect_equal(ds_quote$b, c("abc", "def"))
+
+  # na
+  ds <- open_csv_dataset(csv_dir, partitioning = "part", na = c("", "NA", "FALSE")) %>% collect()
+  expect_identical(ds$lgl, c(
+    TRUE, NA, NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA, NA,
+    TRUE, NA, TRUE, NA, NA, TRUE, NA
+  ))
+
+  # col_names and skip
+  ds <- open_csv_dataset(
+    csv_dir,
+    partitioning = "part",
+    col_names = paste0("col_", 1:6),
+    skip = 1
+  ) %>% collect()
+
+  expect_named(ds, c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "part"))
+  expect_equal(nrow(ds), 20)
+
+  # col_types
+  dst_dir <- make_temp_dir()
+  dst_file <- file.path(dst_dir, "data.csv")
+
+  df <- data.frame(a = c(1, NA, 2), b = c("'abc'", NA, "'def'"))
+  write.csv(df, dst_file, row.names = FALSE, quote = FALSE)
+
+  data_schema <- schema(a = string(), b = string())
+  ds_strings <- open_csv_dataset(dst_dir, col_types = data_schema)
+  expect_equal(ds_strings$schema, schema(a = string(), b = string()))
+
+  # skip_empty_rows
+  tf <- tempfile()
+  writeLines('"x"\n"y"\nNA\nNA\n"NULL"\n\n\n', tf)
+
+  ds <- open_csv_dataset(tf, skip_empty_rows = FALSE) %>% collect()
+  expect_equal(nrow(ds), 7)
+
+  # convert_options
+  ds <- open_csv_dataset(
+    csv_dir,
+    convert_options = list(null_values = c("NA", "", "FALSE"), strings_can_be_null = TRUE)
+  ) %>% collect()
+
+  expect_equal(
+    ds$lgl,
+    c(TRUE, NA, NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA, NA, TRUE, NA, TRUE, NA, NA, TRUE, NA)
+  )
+
+  # read_options
+  ds <- open_csv_dataset(
+    csv_dir,
+    read_options = list(column_names = paste0("col_", 1:6))
+  ) %>% collect()
+
+  expect_named(ds, c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6"))
+
+  # timestamp_parsers
+  skip("GH-33708: timestamp_parsers don't appear to be working properly")
+
+  dst_dir <- make_temp_dir()
+  dst_file <- file.path(dst_dir, "data.csv")
+
+  df <- data.frame(time = "2023-01-16 19:47:57")
+  write.csv(df, dst_file, row.names = FALSE, quote = FALSE)
+
+  ds <- open_csv_dataset(dst_dir, timestamp_parsers = c(TimestampParser$create(format = "%d-%m-%y"))) %>% collect()
+
+  expect_equal(ds$time, "16-01-2023")
+})


[arrow] 08/10: GH-20512: [Python] Numpy conversion doesn't account for ListArray offset (#15210)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit c2199dc4902d200d186f0d23513689686e91fdce
Author: Will Jones <wi...@gmail.com>
AuthorDate: Tue Jan 17 10:15:08 2023 -0800

    GH-20512: [Python] Numpy conversion doesn't account for ListArray offset (#15210)
    
    
    * Closes: #20512
    
    Lead-authored-by: Will Jones <wi...@gmail.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Signed-off-by: Jacob Wujciak-Jens <ja...@wujciak.de>
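
The offset issue is easiest to see from Python. The sketch below is adapted from the new tests in test_pandas.py in the diff that follows; it assumes a pyarrow build containing this patch, plus numpy. Chunks are built from slices of a list array, so each chunk's value offsets no longer start at zero, and the numpy conversion must account for that.

```python
import numpy as np
import numpy.testing as npt
import pyarrow as pa

# Build a list array, then chunk it from slices so each chunk carries a
# non-zero offset into the shared values buffer.
arr = pa.array([[1, 2], [3, 4, 5], None, [6, None], [7, 8]])
chunked = pa.chunked_array([arr.slice(0, 3), arr.slice(3, 1)])

# With the fix, each element is taken from the chunk's own offset range
# rather than from the start of the underlying values array.
np_arr = chunked.to_numpy()
npt.assert_array_equal(np_arr[0], [1.0, 2.0])
npt.assert_array_equal(np_arr[3], [6.0, np.nan])

# Slicing the chunked array again still lines up with the right values.
np_sliced = chunked.slice(1, 3).to_numpy()
npt.assert_array_equal(np_sliced[0], [3.0, 4.0, 5.0])
```
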
---
 cpp/src/arrow/array/array_nested.h                 |  4 +-
 python/pyarrow/src/arrow/python/arrow_to_pandas.cc | 23 ++++++++---
 python/pyarrow/tests/test_pandas.py                | 45 ++++++++++++++++++++++
 3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/cpp/src/arrow/array/array_nested.h b/cpp/src/arrow/array/array_nested.h
index 489a7a3a3c..6fb3fd3c91 100644
--- a/cpp/src/arrow/array/array_nested.h
+++ b/cpp/src/arrow/array/array_nested.h
@@ -69,9 +69,11 @@ class BaseListArray : public Array {
   const TypeClass* list_type() const { return list_type_; }
 
   /// \brief Return array object containing the list's values
+  ///
+  /// Note that this buffer does not account for any slice offset or length.
   std::shared_ptr<Array> values() const { return values_; }
 
-  /// Note that this buffer does not account for any slice offset
+  /// Note that this buffer does not account for any slice offset or length.
   std::shared_ptr<Buffer> value_offsets() const { return data_->buffers[1]; }
 
   std::shared_ptr<DataType> value_type() const { return list_type_->value_type(); }
diff --git a/python/pyarrow/src/arrow/python/arrow_to_pandas.cc b/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
index f58c151ea6..2faf7d381a 100644
--- a/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
+++ b/python/pyarrow/src/arrow/python/arrow_to_pandas.cc
@@ -738,11 +738,17 @@ Status ConvertListsLike(PandasOptions options, const ChunkedArray& data,
   ArrayVector value_arrays;
   for (int c = 0; c < data.num_chunks(); c++) {
     const auto& arr = checked_cast<const ListArrayT&>(*data.chunk(c));
+    // values() does not account for offsets, so we need to slice into it.
+    // We can't use Flatten(), because it removes the values behind a null list
+    // value, and that makes the offsets into original list values and our
+    // flattened_values array different.
+    std::shared_ptr<Array> flattened_values = arr.values()->Slice(
+        arr.value_offset(0), arr.value_offset(arr.length()) - arr.value_offset(0));
     if (arr.value_type()->id() == Type::EXTENSION) {
-      const auto& arr_ext = checked_cast<const ExtensionArray&>(*arr.values());
+      const auto& arr_ext = checked_cast<const ExtensionArray&>(*flattened_values);
       value_arrays.emplace_back(arr_ext.storage());
     } else {
-      value_arrays.emplace_back(arr.values());
+      value_arrays.emplace_back(flattened_values);
     }
   }
 
@@ -772,8 +778,12 @@ Status ConvertListsLike(PandasOptions options, const ChunkedArray& data,
         Py_INCREF(Py_None);
         *out_values = Py_None;
       } else {
-        OwnedRef start(PyLong_FromLongLong(arr.value_offset(i) + chunk_offset));
-        OwnedRef end(PyLong_FromLongLong(arr.value_offset(i + 1) + chunk_offset));
+        // Need to subtract value_offset(0) since the original chunk might be a slice
+        // into another array.
+        OwnedRef start(PyLong_FromLongLong(arr.value_offset(i) + chunk_offset -
+                                           arr.value_offset(0)));
+        OwnedRef end(PyLong_FromLongLong(arr.value_offset(i + 1) + chunk_offset -
+                                         arr.value_offset(0)));
         OwnedRef slice(PySlice_New(start.obj(), end.obj(), nullptr));
 
         if (ARROW_PREDICT_FALSE(slice.obj() == nullptr)) {
@@ -791,7 +801,7 @@ Status ConvertListsLike(PandasOptions options, const ChunkedArray& data,
     }
     RETURN_IF_PYERROR();
 
-    chunk_offset += arr.values()->length();
+    chunk_offset += arr.value_offset(arr.length()) - arr.value_offset(0);
   }
 
   return Status::OK();
@@ -1083,7 +1093,8 @@ struct ObjectWriterVisitor {
       OwnedRef keywords(PyDict_New());
       PyDict_SetItemString(keywords.obj(), "tzinfo", PyDateTime_TimeZone_UTC);
       OwnedRef naive_datetime_replace(PyObject_GetAttrString(naive_datetime, "replace"));
-      OwnedRef datetime_utc(PyObject_Call(naive_datetime_replace.obj(), args.obj(), keywords.obj()));
+      OwnedRef datetime_utc(
+          PyObject_Call(naive_datetime_replace.obj(), args.obj(), keywords.obj()));
       // second step: adjust the datetime to tzinfo timezone (astimezone method)
       *out = PyObject_CallMethod(datetime_utc.obj(), "astimezone", "O", tzinfo.obj());
 
diff --git a/python/pyarrow/tests/test_pandas.py b/python/pyarrow/tests/test_pandas.py
index 729a4122c0..4d0ddf8754 100644
--- a/python/pyarrow/tests/test_pandas.py
+++ b/python/pyarrow/tests/test_pandas.py
@@ -2308,6 +2308,51 @@ class TestConvertListTypes:
         actual = arr.to_pandas()
         tm.assert_series_equal(actual, expected, check_names=False)
 
+    def test_list_no_duplicate_base(self):
+        # ARROW-18400
+        arr = pa.array([[1, 2], [3, 4, 5], None, [6, None], [7, 8]])
+        chunked_arr = pa.chunked_array([arr.slice(0, 3), arr.slice(3, 1)])
+
+        np_arr = chunked_arr.to_numpy()
+
+        expected = np.array([[1., 2.], [3., 4., 5.], None,
+                            [6., np.NaN]], dtype="object")
+        for left, right in zip(np_arr, expected):
+            if right is None:
+                assert left == right
+            else:
+                npt.assert_array_equal(left, right)
+
+        expected_base = np.array([[1., 2., 3., 4., 5., 6., np.NaN]])
+        npt.assert_array_equal(np_arr[0].base, expected_base)
+
+        np_arr_sliced = chunked_arr.slice(1, 3).to_numpy()
+
+        expected = np.array([[3, 4, 5], None, [6, np.NaN]], dtype="object")
+        for left, right in zip(np_arr_sliced, expected):
+            if right is None:
+                assert left == right
+            else:
+                npt.assert_array_equal(left, right)
+
+        expected_base = np.array([[3., 4., 5., 6., np.NaN]])
+        npt.assert_array_equal(np_arr_sliced[0].base, expected_base)
+
+    def test_list_values_behind_null(self):
+        arr = pa.ListArray.from_arrays(
+            offsets=pa.array([0, 2, 4, 6]),
+            values=pa.array([1, 2, 99, 99, 3, None]),
+            mask=pa.array([False, True, False])
+        )
+        np_arr = arr.to_numpy(zero_copy_only=False)
+
+        expected = np.array([[1., 2.], None, [3., np.NaN]], dtype="object")
+        for left, right in zip(np_arr, expected):
+            if right is None:
+                assert left == right
+            else:
+                npt.assert_array_equal(left, right)
+
 
 class TestConvertStructTypes:
     """

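Editorial note on the change above (not part of the patch): each ListArray chunk stores its row boundaries in a shared offsets buffer, and a chunk produced by slice() generally does not begin at value_offset(0) == 0. The corrected arrow_to_pandas.cc therefore subtracts arr.value_offset(0) when computing each row's slice bounds and advances chunk_offset by value_offset(length) - value_offset(0) rather than by values()->length(). A minimal sketch that exercises the same scenario as the new test, assuming pyarrow and numpy are installed; variable names are illustrative only:

    import numpy as np
    import pyarrow as pa

    # A chunked list array whose chunks are slices of one parent array,
    # so the second chunk's value offsets do not start at zero.
    arr = pa.array([[1, 2], [3, 4, 5], None, [6, None], [7, 8]])
    chunked = pa.chunked_array([arr.slice(0, 3), arr.slice(3, 1)])

    converted = chunked.to_numpy()
    print(converted[0])  # expected row values: 1.0, 2.0
    print(converted[3])  # expected row values: 6.0, nan (correct once offsets are honored)
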

[arrow] 06/10: GH-15265: [Java] Publish SBOM artifacts (#15267)

Posted by ra...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-11.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 7b759bcb1ff0ebb064865b74d2afcd546667b169
Author: Dongjoon Hyun <do...@apache.org>
AuthorDate: Mon Jan 16 22:08:05 2023 -0800

    GH-15265: [Java] Publish SBOM artifacts (#15267)
    
    This closes #15265
    * Closes: #15265
    
    Authored-by: Dongjoon Hyun <do...@apache.org>
    Signed-off-by: Jacob Wujciak-Jens <ja...@wujciak.de>
---
 .github/workflows/java_nightly.yml |  2 +-
 ci/scripts/java_full_build.sh      |  8 ++++++-
 dev/tasks/java-jars/github.yml     |  2 ++
 dev/tasks/tasks.yml                | 48 ++++++++++++++++++++++++++++++++++++++
 java/pom.xml                       | 13 +++++++++++
 5 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/java_nightly.yml b/.github/workflows/java_nightly.yml
index d4f2ca9517..24d8c7c54e 100644
--- a/.github/workflows/java_nightly.yml
+++ b/.github/workflows/java_nightly.yml
@@ -107,7 +107,7 @@ jobs:
           fi
           PATTERN_TO_GET_LIB_AND_VERSION='([a-z].+)-([0-9]+.[0-9]+.[0-9]+-SNAPSHOT)'
           mkdir -p repo/org/apache/arrow/
-          for LIBRARY in $(ls binaries/$PREFIX/java-jars | grep -E '.jar|.pom' | grep SNAPSHOT); do
+          for LIBRARY in $(ls binaries/$PREFIX/java-jars | grep -E '.jar|.json|.pom|.xml' | grep SNAPSHOT); do
             [[ $LIBRARY =~ $PATTERN_TO_GET_LIB_AND_VERSION ]]
             mkdir -p repo/org/apache/arrow/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}
             mkdir -p repo/org/apache/arrow/${BASH_REMATCH[1]}/${DATE}
diff --git a/ci/scripts/java_full_build.sh b/ci/scripts/java_full_build.sh
index 1c07971bcc..2734f3e9db 100755
--- a/ci/scripts/java_full_build.sh
+++ b/ci/scripts/java_full_build.sh
@@ -65,7 +65,13 @@ find . \
      -exec echo {} ";" \
      -exec cp {} $dist_dir ";"
 find ~/.m2/repository/org/apache/arrow \
-     "(" -name "*.jar" -o -name "*.zip" -o -name "*.pom" ")" \
+     "(" \
+     -name "*.jar" -o \
+     -name "*.json" -o \
+     -name "*.pom" -o \
+     -name "*.xml" -o \
+     -name "*.zip" \
+     ")" \
      -exec echo {} ";" \
      -exec cp {} $dist_dir ";"
 
diff --git a/dev/tasks/java-jars/github.yml b/dev/tasks/java-jars/github.yml
index 3dcce6d950..290f198b4f 100644
--- a/dev/tasks/java-jars/github.yml
+++ b/dev/tasks/java-jars/github.yml
@@ -211,5 +211,7 @@ jobs:
             $GITHUB_WORKSPACE/arrow \
             $GITHUB_WORKSPACE/arrow/java-dist
       {{ macros.github_upload_releases(["arrow/java-dist/*.jar",
+                                        "arrow/java-dist/*.json",
                                         "arrow/java-dist/*.pom",
+                                        "arrow/java-dist/*.xml",
                                         "arrow/java-dist/*.zip"])|indent }}
diff --git a/dev/tasks/tasks.yml b/dev/tasks/tasks.yml
index ed75166536..8459fa381f 100644
--- a/dev/tasks/tasks.yml
+++ b/dev/tasks/tasks.yml
@@ -801,91 +801,131 @@ tasks:
     ci: github
     template: java-jars/github.yml
     artifacts:
+      - arrow-algorithm-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-algorithm-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-algorithm-{no_rc_snapshot_version}-javadoc.jar
       - arrow-algorithm-{no_rc_snapshot_version}-sources.jar
       - arrow-algorithm-{no_rc_snapshot_version}-tests.jar
       - arrow-algorithm-{no_rc_snapshot_version}.jar
       - arrow-algorithm-{no_rc_snapshot_version}.pom
+      - arrow-avro-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-avro-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-avro-{no_rc_snapshot_version}-javadoc.jar
       - arrow-avro-{no_rc_snapshot_version}-sources.jar
       - arrow-avro-{no_rc_snapshot_version}-tests.jar
       - arrow-avro-{no_rc_snapshot_version}.jar
       - arrow-avro-{no_rc_snapshot_version}.pom
+      - arrow-c-data-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-c-data-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-c-data-{no_rc_snapshot_version}-javadoc.jar
       - arrow-c-data-{no_rc_snapshot_version}-sources.jar
       - arrow-c-data-{no_rc_snapshot_version}-tests.jar
       - arrow-c-data-{no_rc_snapshot_version}.jar
       - arrow-c-data-{no_rc_snapshot_version}.pom
+      - arrow-compression-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-compression-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-compression-{no_rc_snapshot_version}-javadoc.jar
       - arrow-compression-{no_rc_snapshot_version}-sources.jar
       - arrow-compression-{no_rc_snapshot_version}-tests.jar
       - arrow-compression-{no_rc_snapshot_version}.jar
       - arrow-compression-{no_rc_snapshot_version}.pom
+      - arrow-dataset-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-dataset-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-dataset-{no_rc_snapshot_version}-javadoc.jar
       - arrow-dataset-{no_rc_snapshot_version}-sources.jar
       - arrow-dataset-{no_rc_snapshot_version}-tests.jar
       - arrow-dataset-{no_rc_snapshot_version}.jar
       - arrow-dataset-{no_rc_snapshot_version}.pom
+      - arrow-flight-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-flight-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-flight-{no_rc_snapshot_version}.pom
+      - arrow-format-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-format-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-format-{no_rc_snapshot_version}-javadoc.jar
       - arrow-format-{no_rc_snapshot_version}-sources.jar
       - arrow-format-{no_rc_snapshot_version}-tests.jar
       - arrow-format-{no_rc_snapshot_version}.jar
       - arrow-format-{no_rc_snapshot_version}.pom
+      - arrow-gandiva-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-gandiva-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-gandiva-{no_rc_snapshot_version}-javadoc.jar
       - arrow-gandiva-{no_rc_snapshot_version}-sources.jar
       - arrow-gandiva-{no_rc_snapshot_version}-tests.jar
       - arrow-gandiva-{no_rc_snapshot_version}.jar
       - arrow-gandiva-{no_rc_snapshot_version}.pom
+      - arrow-java-root-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-java-root-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-java-root-{no_rc_snapshot_version}-source-release.zip
       - arrow-java-root-{no_rc_snapshot_version}.pom
+      - arrow-jdbc-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-jdbc-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-jdbc-{no_rc_snapshot_version}-javadoc.jar
       - arrow-jdbc-{no_rc_snapshot_version}-sources.jar
       - arrow-jdbc-{no_rc_snapshot_version}-tests.jar
       - arrow-jdbc-{no_rc_snapshot_version}.jar
       - arrow-jdbc-{no_rc_snapshot_version}.pom
+      - arrow-memory-core-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-memory-core-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-memory-core-{no_rc_snapshot_version}-javadoc.jar
       - arrow-memory-core-{no_rc_snapshot_version}-sources.jar
       - arrow-memory-core-{no_rc_snapshot_version}-tests.jar
       - arrow-memory-core-{no_rc_snapshot_version}.jar
       - arrow-memory-core-{no_rc_snapshot_version}.pom
+      - arrow-memory-netty-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-memory-netty-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-memory-netty-{no_rc_snapshot_version}-javadoc.jar
       - arrow-memory-netty-{no_rc_snapshot_version}-sources.jar
       - arrow-memory-netty-{no_rc_snapshot_version}-tests.jar
       - arrow-memory-netty-{no_rc_snapshot_version}.jar
       - arrow-memory-netty-{no_rc_snapshot_version}.pom
+      - arrow-memory-unsafe-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-memory-unsafe-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-memory-unsafe-{no_rc_snapshot_version}-javadoc.jar
       - arrow-memory-unsafe-{no_rc_snapshot_version}-sources.jar
       - arrow-memory-unsafe-{no_rc_snapshot_version}-tests.jar
       - arrow-memory-unsafe-{no_rc_snapshot_version}.jar
       - arrow-memory-unsafe-{no_rc_snapshot_version}.pom
+      - arrow-memory-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-memory-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-memory-{no_rc_snapshot_version}.pom
+      - arrow-orc-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-orc-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-orc-{no_rc_snapshot_version}-javadoc.jar
       - arrow-orc-{no_rc_snapshot_version}-sources.jar
       - arrow-orc-{no_rc_snapshot_version}-tests.jar
       - arrow-orc-{no_rc_snapshot_version}.jar
       - arrow-orc-{no_rc_snapshot_version}.pom
+      - arrow-performance-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-performance-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-performance-{no_rc_snapshot_version}-sources.jar
       - arrow-performance-{no_rc_snapshot_version}-tests.jar
       - arrow-performance-{no_rc_snapshot_version}.jar
       - arrow-performance-{no_rc_snapshot_version}.pom
+      - arrow-plasma-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-plasma-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-plasma-{no_rc_snapshot_version}-javadoc.jar
       - arrow-plasma-{no_rc_snapshot_version}-sources.jar
       - arrow-plasma-{no_rc_snapshot_version}-tests.jar
       - arrow-plasma-{no_rc_snapshot_version}.jar
       - arrow-plasma-{no_rc_snapshot_version}.pom
+      - arrow-tools-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-tools-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-tools-{no_rc_snapshot_version}-jar-with-dependencies.jar
       - arrow-tools-{no_rc_snapshot_version}-javadoc.jar
       - arrow-tools-{no_rc_snapshot_version}-sources.jar
       - arrow-tools-{no_rc_snapshot_version}-tests.jar
       - arrow-tools-{no_rc_snapshot_version}.jar
       - arrow-tools-{no_rc_snapshot_version}.pom
+      - arrow-vector-{no_rc_snapshot_version}-cyclonedx.json
+      - arrow-vector-{no_rc_snapshot_version}-cyclonedx.xml
       - arrow-vector-{no_rc_snapshot_version}-javadoc.jar
       - arrow-vector-{no_rc_snapshot_version}-shade-format-flatbuffers.jar
       - arrow-vector-{no_rc_snapshot_version}-sources.jar
       - arrow-vector-{no_rc_snapshot_version}-tests.jar
       - arrow-vector-{no_rc_snapshot_version}.jar
       - arrow-vector-{no_rc_snapshot_version}.pom
+      - flight-core-{no_rc_snapshot_version}-cyclonedx.json
+      - flight-core-{no_rc_snapshot_version}-cyclonedx.xml
       - flight-core-{no_rc_snapshot_version}-jar-with-dependencies.jar
       - flight-core-{no_rc_snapshot_version}-javadoc.jar
       - flight-core-{no_rc_snapshot_version}-shaded-ext.jar
@@ -894,22 +934,30 @@ tasks:
       - flight-core-{no_rc_snapshot_version}-tests.jar
       - flight-core-{no_rc_snapshot_version}.jar
       - flight-core-{no_rc_snapshot_version}.pom
+      - flight-grpc-{no_rc_snapshot_version}-cyclonedx.json
+      - flight-grpc-{no_rc_snapshot_version}-cyclonedx.xml
       - flight-grpc-{no_rc_snapshot_version}-javadoc.jar
       - flight-grpc-{no_rc_snapshot_version}-sources.jar
       - flight-grpc-{no_rc_snapshot_version}-tests.jar
       - flight-grpc-{no_rc_snapshot_version}.jar
       - flight-grpc-{no_rc_snapshot_version}.pom
+      - flight-integration-tests-{no_rc_snapshot_version}-cyclonedx.json
+      - flight-integration-tests-{no_rc_snapshot_version}-cyclonedx.xml
       - flight-integration-tests-{no_rc_snapshot_version}-jar-with-dependencies.jar
       - flight-integration-tests-{no_rc_snapshot_version}-javadoc.jar
       - flight-integration-tests-{no_rc_snapshot_version}-sources.jar
       - flight-integration-tests-{no_rc_snapshot_version}-tests.jar
       - flight-integration-tests-{no_rc_snapshot_version}.jar
       - flight-integration-tests-{no_rc_snapshot_version}.pom
+      - flight-sql-{no_rc_snapshot_version}-cyclonedx.json
+      - flight-sql-{no_rc_snapshot_version}-cyclonedx.xml
       - flight-sql-{no_rc_snapshot_version}-javadoc.jar
       - flight-sql-{no_rc_snapshot_version}-sources.jar
       - flight-sql-{no_rc_snapshot_version}-tests.jar
       - flight-sql-{no_rc_snapshot_version}.jar
       - flight-sql-{no_rc_snapshot_version}.pom
+      - flight-sql-jdbc-driver-{no_rc_snapshot_version}-cyclonedx.json
+      - flight-sql-jdbc-driver-{no_rc_snapshot_version}-cyclonedx.xml
       - flight-sql-jdbc-driver-{no_rc_snapshot_version}-javadoc.jar
       - flight-sql-jdbc-driver-{no_rc_snapshot_version}-sources.jar
       - flight-sql-jdbc-driver-{no_rc_snapshot_version}-tests.jar
diff --git a/java/pom.xml b/java/pom.xml
index 44137d819e..3c34eb4558 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -355,6 +355,19 @@
           </execution>
         </executions>
       </plugin>
+      <plugin>
+        <groupId>org.cyclonedx</groupId>
+        <artifactId>cyclonedx-maven-plugin</artifactId>
+        <version>2.7.3</version>
+        <executions>
+          <execution>
+            <phase>package</phase>
+            <goals>
+              <goal>makeBom</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
     </plugins>
 
     <pluginManagement>
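
Editorial note on the plugin added above: the cyclonedx-maven-plugin's makeBom goal, bound here to the package phase, writes a CycloneDX software bill of materials for each module in both JSON and XML, which is why the *-cyclonedx.json and *-cyclonedx.xml artifacts now appear in tasks.yml. A minimal sketch of inspecting one of the generated JSON BOMs (the file name is illustrative, following the artifact naming pattern listed above):

    import json

    # CycloneDX JSON BOMs list a module's dependencies under "components".
    with open("arrow-vector-11.0.0-cyclonedx.json") as f:
        bom = json.load(f)

    for component in bom.get("components", []):
        print(component.get("group"), component.get("name"), component.get("version"))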