Posted to commits@arrow.apache.org by ks...@apache.org on 2022/07/27 12:27:29 UTC

[arrow] branch maint-9.0.0 updated (74a4a0244e -> 0d8c1d5d98)

This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a change to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git


    from 74a4a0244e ARROW-15591: [C++] Add support for aggregation to the Substrait consumer (#13130)
     new c8ac3690bb ARROW-17051: [C++] Link Flight/gRPC/Protobuf consistently (#13599)
     new 6d524780db ARROW-17213: [C++] Fix for valgrind issue in test-r-linux-valgrind crossbow build (#13715)
     new 107163fec8 ARROW-16612: [R] Fix compression inference from filename (#13625)
     new 5839e594b5 ARROW-17211: [Java] Fix java-jar nightly on gh & self-hosted runners (#13712)
     new 5564777f2e ARROW-17206: [R] Skip test to fix snappy sanitizer issue (#13704)
     new 0d8c1d5d98 ARROW-17100: [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353 (#13665)

The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .travis.yml                                       |  2 -
 ci/docker/ubuntu-18.04-cpp.dockerfile             |  5 +-
 ci/docker/ubuntu-20.04-cpp.dockerfile             |  5 +-
 ci/docker/ubuntu-22.04-cpp.dockerfile             |  5 +-
 ci/scripts/java_full_build.sh                     | 11 ++--
 cpp/CMakeLists.txt                                |  2 +-
 cpp/src/arrow/compute/kernels/scalar_compare.cc   |  3 +-
 cpp/src/arrow/flight/CMakeLists.txt               | 42 ++++++-------
 cpp/src/parquet/arrow/arrow_reader_writer_test.cc | 16 +++++
 cpp/src/parquet/column_reader.cc                  | 19 ++++--
 cpp/src/parquet/column_reader.h                   |  4 +-
 cpp/src/parquet/file_reader.cc                    | 11 +++-
 cpp/src/parquet/metadata.cc                       |  9 +++
 cpp/src/parquet/metadata.h                        |  1 +
 docker-compose.yml                                |  1 -
 r/R/csv.R                                         | 40 ++++++------
 r/R/feather.R                                     | 21 ++++---
 r/R/io.R                                          | 76 +++++++----------------
 r/R/ipc-stream.R                                  | 10 ---
 r/R/json.R                                        |  5 ++
 r/R/parquet.R                                     |  9 +++
 r/man/make_readable_file.Rd                       | 11 +---
 r/man/read_feather.Rd                             |  6 +-
 r/man/read_ipc_stream.Rd                          |  6 --
 r/man/write_feather.Rd                            |  9 +--
 r/man/write_ipc_stream.Rd                         |  6 --
 r/tests/testthat/test-compressed.R                |  8 +++
 r/tests/testthat/test-compute.R                   |  2 +
 r/tests/testthat/test-csv.R                       | 25 +++++++-
 r/tests/testthat/test-feather.R                   | 16 +++++
 r/tests/testthat/test-parquet.R                   | 16 +++++
 testing                                           |  2 +-
 32 files changed, 241 insertions(+), 163 deletions(-)


[arrow] 01/06: ARROW-17051: [C++] Link Flight/gRPC/Protobuf consistently (#13599)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit c8ac3690bb694c1220cf975c5b82cdfad1bdb702
Author: David Li <li...@gmail.com>
AuthorDate: Mon Jul 25 20:47:29 2022 -0400

    ARROW-17051: [C++] Link Flight/gRPC/Protobuf consistently (#13599)
    
    If Protobuf/gRPC are used statically, Flight must be as well, or else we can get odd runtime behavior due to the global state in those libraries when Flight SQL is involved (as Flight SQL would then bundle a second copy of Protobuf into its shared library).
    
    Authored-by: David Li <li...@gmail.com>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 .travis.yml                           |  2 --
 ci/docker/ubuntu-18.04-cpp.dockerfile |  5 ++++-
 ci/docker/ubuntu-20.04-cpp.dockerfile |  5 ++++-
 ci/docker/ubuntu-22.04-cpp.dockerfile |  5 ++++-
 cpp/CMakeLists.txt                    |  2 +-
 cpp/src/arrow/flight/CMakeLists.txt   | 42 +++++++++++++++++------------------
 docker-compose.yml                    |  1 -
 7 files changed, 34 insertions(+), 28 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index b3aa724107..5038f66181 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -93,7 +93,6 @@ jobs:
         # aws-sdk-cpp.
         DOCKER_RUN_ARGS: >-
           "
-          -e ARROW_BUILD_STATIC=OFF
           -e ARROW_FLIGHT=ON
           -e ARROW_GCS=OFF
           -e ARROW_MIMALLOC=OFF
@@ -145,7 +144,6 @@ jobs:
         # aws-sdk-cpp.
         DOCKER_RUN_ARGS: >-
           "
-          -e ARROW_BUILD_STATIC=OFF
           -e ARROW_FLIGHT=ON
           -e ARROW_GCS=OFF
           -e ARROW_MIMALLOC=OFF
diff --git a/ci/docker/ubuntu-18.04-cpp.dockerfile b/ci/docker/ubuntu-18.04-cpp.dockerfile
index 16490845bd..0e20b7c6a8 100644
--- a/ci/docker/ubuntu-18.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-18.04-cpp.dockerfile
@@ -98,7 +98,10 @@ RUN apt-get update -y -q && \
 # - thrift is too old
 # - utf8proc is too old(v2.1.0)
 # - s3 tests would require boost-asio that is included since Boost 1.66.0
-ENV ARROW_BUILD_TESTS=ON \
+# ARROW-17051: this build uses static Protobuf, so we must also use
+# static Arrow to run Flight/Flight SQL tests
+ENV ARROW_BUILD_STATIC=ON \
+    ARROW_BUILD_TESTS=ON \
     ARROW_DATASET=ON \
     ARROW_DEPENDENCY_SOURCE=SYSTEM \
     ARROW_FLIGHT=OFF \
diff --git a/ci/docker/ubuntu-20.04-cpp.dockerfile b/ci/docker/ubuntu-20.04-cpp.dockerfile
index ae15835520..24d5f8e5da 100644
--- a/ci/docker/ubuntu-20.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-20.04-cpp.dockerfile
@@ -123,7 +123,10 @@ RUN /arrow/ci/scripts/install_ceph.sh
 # - flatbuffer is not packaged
 # - libgtest-dev only provide sources
 # - libprotobuf-dev only provide sources
-ENV ARROW_BUILD_TESTS=ON \
+# ARROW-17051: this build uses static Protobuf, so we must also use
+# static Arrow to run Flight/Flight SQL tests
+ENV ARROW_BUILD_STATIC=ON \
+    ARROW_BUILD_TESTS=ON \
     ARROW_DEPENDENCY_SOURCE=SYSTEM \
     ARROW_DATASET=ON \
     ARROW_FLIGHT=OFF \
diff --git a/ci/docker/ubuntu-22.04-cpp.dockerfile b/ci/docker/ubuntu-22.04-cpp.dockerfile
index e7d2842dfc..c2019df153 100644
--- a/ci/docker/ubuntu-22.04-cpp.dockerfile
+++ b/ci/docker/ubuntu-22.04-cpp.dockerfile
@@ -150,7 +150,10 @@ RUN /arrow/ci/scripts/install_gcs_testbench.sh default
 # - flatbuffer is not packaged
 # - libgtest-dev only provide sources
 # - libprotobuf-dev only provide sources
-ENV ARROW_BUILD_TESTS=ON \
+# ARROW-17051: this build uses static Protobuf, so we must also use
+# static Arrow to run Flight/Flight SQL tests
+ENV ARROW_BUILD_STATIC=ON \
+    ARROW_BUILD_TESTS=ON \
     ARROW_DEPENDENCY_SOURCE=SYSTEM \
     ARROW_DATASET=ON \
     ARROW_FLIGHT=ON \
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index b67f90e0bd..945ff7b6f8 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -862,7 +862,7 @@ add_dependencies(arrow_test_dependencies toolchain-tests)
 
 if(ARROW_STATIC_LINK_LIBS)
   add_dependencies(arrow_dependencies ${ARROW_STATIC_LINK_LIBS})
-  if(ARROW_ORC)
+  if(ARROW_HDFS OR ARROW_ORC)
     if(NOT MSVC_TOOLCHAIN)
       list(APPEND ARROW_STATIC_LINK_LIBS ${CMAKE_DL_LIBS})
       list(APPEND ARROW_STATIC_INSTALL_INTERFACE_LIBS ${CMAKE_DL_LIBS})
diff --git a/cpp/src/arrow/flight/CMakeLists.txt b/cpp/src/arrow/flight/CMakeLists.txt
index 39f2fecdde..a4bb287dfe 100644
--- a/cpp/src/arrow/flight/CMakeLists.txt
+++ b/cpp/src/arrow/flight/CMakeLists.txt
@@ -36,28 +36,28 @@ if(NOT ARROW_GRPC_USE_SHARED)
 endif()
 
 set(ARROW_FLIGHT_TEST_INTERFACE_LIBS)
-if(ARROW_FLIGHT_TEST_LINKAGE STREQUAL "static")
-  if(ARROW_BUILD_STATIC)
-    set(ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_static)
-  else()
-    set(ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_shared)
-  endif()
-  if(ARROW_FLIGHT_TESTING_BUILD_STATIC)
-    list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_testing_static)
+if(ARROW_BUILD_INTEGRATION OR ARROW_BUILD_TESTS)
+  if(ARROW_FLIGHT_TEST_LINKAGE STREQUAL "static")
+    if(NOT ARROW_BUILD_STATIC)
+      message(STATUS "If static Protobuf or gRPC are used, Arrow must be built statically"
+      )
+      message(STATUS "(These libraries have global state, and linkage must be consistent)"
+      )
+      message(FATAL_ERROR "Must build Arrow statically to link Flight tests statically")
+    endif()
+    set(ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_static arrow_flight_testing_static)
+    list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS ${ARROW_TEST_STATIC_LINK_LIBS})
+    if(ARROW_CUDA)
+      list(APPEND ARROW_FLIGHT_TEST_INTERFACE_LIBS arrow_cuda_static)
+      list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_cuda_static)
+    endif()
   else()
-    list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_testing_shared)
-  endif()
-  list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS ${ARROW_TEST_LINK_LIBS})
-  if(ARROW_CUDA)
-    list(APPEND ARROW_FLIGHT_TEST_INTERFACE_LIBS arrow_cuda_static)
-    list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_cuda_static)
-  endif()
-else()
-  set(ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_shared arrow_flight_testing_shared
-                                  ${ARROW_TEST_LINK_LIBS})
-  if(ARROW_CUDA)
-    list(APPEND ARROW_FLIGHT_TEST_INTERFACE_LIBS arrow_cuda_shared)
-    list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_cuda_shared)
+    set(ARROW_FLIGHT_TEST_LINK_LIBS arrow_flight_shared arrow_flight_testing_shared
+                                    ${ARROW_TEST_SHARED_LINK_LIBS})
+    if(ARROW_CUDA)
+      list(APPEND ARROW_FLIGHT_TEST_INTERFACE_LIBS arrow_cuda_shared)
+      list(APPEND ARROW_FLIGHT_TEST_LINK_LIBS arrow_cuda_shared)
+    endif()
   endif()
 endif()
 list(APPEND
diff --git a/docker-compose.yml b/docker-compose.yml
index c476797f22..13d7a4da4f 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -487,7 +487,6 @@ services:
       <<: *ccache
       CC: clang-${CLANG_TOOLS}
       CXX: clang++-${CLANG_TOOLS}
-      ARROW_BUILD_STATIC: "OFF"
       ARROW_ENABLE_TIMING_TESTS:  # inherit
       ARROW_FUZZING: "ON"  # Check fuzz regressions
       ARROW_JEMALLOC: "OFF"


[arrow] 04/06: ARROW-17211: [Java] Fix java-jar nightly on gh & self-hosted runners (#13712)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 5839e594b53deedee6c05f654a0692a3efd622b4
Author: Jacob Wujciak-Jens <ja...@wujciak.de>
AuthorDate: Wed Jul 27 14:02:00 2022 +0200

    ARROW-17211: [Java] Fix java-jar nightly on gh & self-hosted runners (#13712)
    
    Authored-by: Jacob Wujciak-Jens <ja...@wujciak.de>
    Signed-off-by: Krisztián Szűcs <sz...@gmail.com>
---
 ci/scripts/java_full_build.sh | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/ci/scripts/java_full_build.sh b/ci/scripts/java_full_build.sh
index 54800e7671..1c07971bcc 100755
--- a/ci/scripts/java_full_build.sh
+++ b/ci/scripts/java_full_build.sh
@@ -28,10 +28,13 @@ pushd ${arrow_dir}/java
 
 # Ensure that there is no old jar
 # inside the maven repository
-find ~/.m2/repository/org/apache/arrow \
-     "(" -name "*.jar" -o -name "*.zip" -o -name "*.pom" ")" \
-     -exec echo {} ";" \
-     -exec rm -rf {} ";"
+maven_repo=~/.m2/repository/org/apache/arrow
+if [ -d $maven_repo ]; then
+    find $maven_repo \
+      "(" -name "*.jar" -o -name "*.zip" -o -name "*.pom" ")" \
+      -exec echo {} ";" \
+      -exec rm -rf {} ";"
+fi
 
 # generate dummy GPG key for -Papache-release.
 # -Papache-release generates signs (*.asc) of artifacts.


[arrow] 06/06: ARROW-17100: [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353 (#13665)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 0d8c1d5d98be6ac38da42409a98c1f08b6f9db8c
Author: Will Jones <wi...@gmail.com>
AuthorDate: Wed Jul 27 08:11:01 2022 -0400

    ARROW-17100: [C++][Parquet] Fix backwards compatibility for ParquetV2 data pages written prior to 3.0.0 per ARROW-10353 (#13665)
    
    With these changes I can successfully read the parquet file provided in the original report.
    
    Parquet file: https://www.dropbox.com/s/portxgch3fpovnz/test2.parq?dl=0
    Gist to generate: https://gist.github.com/bivald/f93448eaf25808284c4029c691a58a6a
    Original report: https://lists.apache.org/thread/wtbqozdhj2hwn6f0sps2j70lr07grk06
    
    Based off of changes in ARROW-10353
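
    For illustration, a minimal R sketch of exercising this fix from the R bindings; the path is hypothetical (it mirrors the ARROW_TEST_DATA-relative file added for the new C++ test), and any DataPageV2 file written by parquet-cpp prior to Arrow 3.0.0 would do:

        library(arrow)

        # Hypothetical path: the new C++ test reads parquet/ARROW-17100.parquet
        # relative to an ARROW_TEST_DATA (arrow-testing) checkout.
        path <- file.path(Sys.getenv("ARROW_TEST_DATA"), "parquet", "ARROW-17100.parquet")

        # Before ARROW-17100, such pages were treated as uncompressed and the read
        # failed; the reader now checks the file's writer version and decompresses
        # DataPageV2 pages written by affected (pre-3.0.0) parquet-cpp versions.
        tab <- read_parquet(path, as_data_frame = FALSE)
        tab$num_rows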
    
    Authored-by: Will Jones <wi...@gmail.com>
    Signed-off-by: David Li <li...@gmail.com>
---
 cpp/src/parquet/arrow/arrow_reader_writer_test.cc | 16 ++++++++++++++++
 cpp/src/parquet/column_reader.cc                  | 19 ++++++++++++++-----
 cpp/src/parquet/column_reader.h                   |  4 +++-
 cpp/src/parquet/file_reader.cc                    | 11 ++++++++---
 cpp/src/parquet/metadata.cc                       |  9 +++++++++
 cpp/src/parquet/metadata.h                        |  1 +
 testing                                           |  2 +-
 7 files changed, 52 insertions(+), 10 deletions(-)

diff --git a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
index db8b685fa5..d719f0e642 100644
--- a/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
+++ b/cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -3943,6 +3943,22 @@ TEST(TestArrowReaderAdHoc, WriteBatchedNestedNullableStringColumn) {
   ::arrow::AssertTablesEqual(*expected, *actual, /*same_chunk_layout=*/false);
 }
 
+TEST(TestArrowReaderAdHoc, OldDataPageV2) {
+  // ARROW-17100
+#ifndef ARROW_WITH_SNAPPY
+  GTEST_SKIP() << "Test requires Snappy compression";
+#endif
+  const char* c_root = std::getenv("ARROW_TEST_DATA");
+  if (!c_root) {
+    GTEST_SKIP() << "ARROW_TEST_DATA not set.";
+  }
+  std::stringstream ss;
+  ss << c_root << "/"
+     << "parquet/ARROW-17100.parquet";
+  std::string path = ss.str();
+  TryReadDataFile(path);
+}
+
 class TestArrowReaderAdHocSparkAndHvr
     : public ::testing::TestWithParam<
           std::tuple<std::string, std::shared_ptr<DataType>>> {};
diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index b8d3b767b0..523030fd78 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -224,7 +224,7 @@ class SerializedPageReader : public PageReader {
  public:
   SerializedPageReader(std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
                        Compression::type codec, const ReaderProperties& properties,
-                       const CryptoContext* crypto_ctx)
+                       const CryptoContext* crypto_ctx, bool always_compressed)
       : properties_(properties),
         stream_(std::move(stream)),
         decompression_buffer_(AllocateBuffer(properties_.memory_pool(), 0)),
@@ -238,6 +238,7 @@ class SerializedPageReader : public PageReader {
     }
     max_page_header_size_ = kDefaultMaxPageHeaderSize;
     decompressor_ = GetCodec(codec);
+    always_compressed_ = always_compressed;
   }
 
   // Implement the PageReader interface
@@ -265,6 +266,8 @@ class SerializedPageReader : public PageReader {
   std::unique_ptr<::arrow::util::Codec> decompressor_;
   std::shared_ptr<ResizableBuffer> decompression_buffer_;
 
+  bool always_compressed_;
+
   // The fields below are used for calculation of AAD (additional authenticated data)
   // suffix which is part of the Parquet Modular Encryption.
   // The AAD suffix for a parquet module is built internally by
@@ -449,7 +452,10 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() {
           header.repetition_levels_byte_length < 0) {
         throw ParquetException("Invalid page header (negative levels byte length)");
       }
-      bool is_compressed = header.__isset.is_compressed ? header.is_compressed : false;
+      // Arrow prior to 3.0.0 set is_compressed to false but still compressed.
+      bool is_compressed =
+          (header.__isset.is_compressed ? header.is_compressed : false) ||
+          always_compressed_;
       EncodedStatistics page_statistics = ExtractStatsFromHeader(header);
       seen_num_rows_ += header.num_values;
 
@@ -516,18 +522,21 @@ std::unique_ptr<PageReader> PageReader::Open(std::shared_ptr<ArrowInputStream> s
                                              int64_t total_num_rows,
                                              Compression::type codec,
                                              const ReaderProperties& properties,
+                                             bool always_compressed,
                                              const CryptoContext* ctx) {
   return std::unique_ptr<PageReader>(new SerializedPageReader(
-      std::move(stream), total_num_rows, codec, properties, ctx));
+      std::move(stream), total_num_rows, codec, properties, ctx, always_compressed));
 }
 
 std::unique_ptr<PageReader> PageReader::Open(std::shared_ptr<ArrowInputStream> stream,
                                              int64_t total_num_rows,
                                              Compression::type codec,
+                                             bool always_compressed,
                                              ::arrow::MemoryPool* pool,
                                              const CryptoContext* ctx) {
-  return std::unique_ptr<PageReader>(new SerializedPageReader(
-      std::move(stream), total_num_rows, codec, ReaderProperties(pool), ctx));
+  return std::unique_ptr<PageReader>(
+      new SerializedPageReader(std::move(stream), total_num_rows, codec,
+                               ReaderProperties(pool), ctx, always_compressed));
 }
 
 namespace {
diff --git a/cpp/src/parquet/column_reader.h b/cpp/src/parquet/column_reader.h
index c22f9f2fc7..1d35e3988c 100644
--- a/cpp/src/parquet/column_reader.h
+++ b/cpp/src/parquet/column_reader.h
@@ -105,11 +105,13 @@ class PARQUET_EXPORT PageReader {
 
   static std::unique_ptr<PageReader> Open(
       std::shared_ptr<ArrowInputStream> stream, int64_t total_num_rows,
-      Compression::type codec, ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
+      Compression::type codec, bool always_compressed = false,
+      ::arrow::MemoryPool* pool = ::arrow::default_memory_pool(),
       const CryptoContext* ctx = NULLPTR);
   static std::unique_ptr<PageReader> Open(std::shared_ptr<ArrowInputStream> stream,
                                           int64_t total_num_rows, Compression::type codec,
                                           const ReaderProperties& properties,
+                                          bool always_compressed = false,
                                           const CryptoContext* ctx = NULLPTR);
 
   // @returns: shared_ptr<Page>(nullptr) on EOS, std::shared_ptr<Page>
diff --git a/cpp/src/parquet/file_reader.cc b/cpp/src/parquet/file_reader.cc
index 8086b0a280..90e19e594e 100644
--- a/cpp/src/parquet/file_reader.cc
+++ b/cpp/src/parquet/file_reader.cc
@@ -208,10 +208,15 @@ class SerializedRowGroup : public RowGroupReader::Contents {
 
     std::unique_ptr<ColumnCryptoMetaData> crypto_metadata = col->crypto_metadata();
 
+    // Prior to Arrow 3.0.0, is_compressed was always set to false in column headers,
+    // even if compression was used. See ARROW-17100.
+    bool always_compressed = file_metadata_->writer_version().VersionLt(
+        ApplicationVersion::PARQUET_CPP_10353_FIXED_VERSION());
+
     // Column is encrypted only if crypto_metadata exists.
     if (!crypto_metadata) {
       return PageReader::Open(stream, col->num_values(), col->compression(),
-                              properties_.memory_pool());
+                              always_compressed, properties_.memory_pool());
     }
 
     if (file_decryptor_ == nullptr) {
@@ -233,7 +238,7 @@ class SerializedRowGroup : public RowGroupReader::Contents {
       CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_,
                         static_cast<int16_t>(i), meta_decryptor, data_decryptor);
       return PageReader::Open(stream, col->num_values(), col->compression(),
-                              properties_.memory_pool(), &ctx);
+                              always_compressed, properties_.memory_pool(), &ctx);
     }
 
     // The column is encrypted with its own key
@@ -248,7 +253,7 @@ class SerializedRowGroup : public RowGroupReader::Contents {
     CryptoContext ctx(col->has_dictionary_page(), row_group_ordinal_,
                       static_cast<int16_t>(i), meta_decryptor, data_decryptor);
     return PageReader::Open(stream, col->num_values(), col->compression(),
-                            properties_.memory_pool(), &ctx);
+                            always_compressed, properties_.memory_pool(), &ctx);
   }
 
  private:
diff --git a/cpp/src/parquet/metadata.cc b/cpp/src/parquet/metadata.cc
index 6226c3ad09..1b2a3df9c4 100644
--- a/cpp/src/parquet/metadata.cc
+++ b/cpp/src/parquet/metadata.cc
@@ -58,6 +58,15 @@ const ApplicationVersion& ApplicationVersion::PARQUET_MR_FIXED_STATS_VERSION() {
   return version;
 }
 
+const ApplicationVersion& ApplicationVersion::PARQUET_CPP_10353_FIXED_VERSION() {
+  // parquet-cpp versions released prior to Arrow 3.0 would write DataPageV2 pages
+  // with is_compressed==0 but still write compressed data. (See: ARROW-10353).
+  // Parquet 1.5.1 had this problem, and after that we switched to the
+  // application name "parquet-cpp-arrow", so this version is fake.
+  static ApplicationVersion version("parquet-cpp", 2, 0, 0);
+  return version;
+}
+
 std::string ParquetVersionToString(ParquetVersion::type ver) {
   switch (ver) {
     case ParquetVersion::PARQUET_1_0:
diff --git a/cpp/src/parquet/metadata.h b/cpp/src/parquet/metadata.h
index 89dca5667b..bd59c628dc 100644
--- a/cpp/src/parquet/metadata.h
+++ b/cpp/src/parquet/metadata.h
@@ -57,6 +57,7 @@ class PARQUET_EXPORT ApplicationVersion {
   static const ApplicationVersion& PARQUET_816_FIXED_VERSION();
   static const ApplicationVersion& PARQUET_CPP_FIXED_STATS_VERSION();
   static const ApplicationVersion& PARQUET_MR_FIXED_STATS_VERSION();
+  static const ApplicationVersion& PARQUET_CPP_10353_FIXED_VERSION();
 
   // Application that wrote the file. e.g. "IMPALA"
   std::string application_;
diff --git a/testing b/testing
index 53b4980471..5bab2f264a 160000
--- a/testing
+++ b/testing
@@ -1 +1 @@
-Subproject commit 53b498047109d9940fcfab388bd9d6aeb8c57425
+Subproject commit 5bab2f264a23f5af68f69ea93d24ef1e8e77fc88


[arrow] 02/06: ARROW-17213: [C++] Fix for valgrind issue in test-r-linux-valgrind crossbow build (#13715)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 6d524780db653f073f11ba8a20fba0143187ba06
Author: Wes McKinney <we...@users.noreply.github.com>
AuthorDate: Tue Jul 26 20:12:41 2022 -0600

    ARROW-17213: [C++] Fix for valgrind issue in test-r-linux-valgrind crossbow build (#13715)
    
    Authored-by: Wes McKinney <we...@apache.org>
    Signed-off-by: Wes McKinney <we...@apache.org>
---
 cpp/src/arrow/compute/kernels/scalar_compare.cc | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/cpp/src/arrow/compute/kernels/scalar_compare.cc b/cpp/src/arrow/compute/kernels/scalar_compare.cc
index f071986dd2..cfe1085531 100644
--- a/cpp/src/arrow/compute/kernels/scalar_compare.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_compare.cc
@@ -271,8 +271,7 @@ struct CompareKernel {
     if (out_is_byte_aligned) {
       out_buffer = out_arr->buffers[1].data + out_arr->offset / 8;
     } else {
-      ARROW_ASSIGN_OR_RAISE(out_buffer_tmp,
-                            ctx->Allocate(bit_util::BytesForBits(batch.length)));
+      ARROW_ASSIGN_OR_RAISE(out_buffer_tmp, ctx->AllocateBitmap(batch.length));
       out_buffer = out_buffer_tmp->mutable_data();
     }
     if (batch[0].is_array() && batch[1].is_array()) {


[arrow] 05/06: ARROW-17206: [R] Skip test to fix snappy sanitizer issue (#13704)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 5564777f2e163d3b1f9d2d3c81693a94024d796e
Author: Jacob Wujciak-Jens <ja...@wujciak.de>
AuthorDate: Wed Jul 27 14:03:19 2022 +0200

    ARROW-17206: [R] Skip test to fix snappy sanitizer issue (#13704)
    
    https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=30020&view=logs&j=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb&t=d9b15392-e4ce-5e4c-0c8c-b69645229181&l=7964
    
    
    
    Authored-by: Jacob Wujciak-Jens <ja...@wujciak.de>
    Signed-off-by: Krisztián Szűcs <sz...@gmail.com>
---
 r/tests/testthat/test-compute.R | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/r/tests/testthat/test-compute.R b/r/tests/testthat/test-compute.R
index 946583ae00..9e487169f4 100644
--- a/r/tests/testthat/test-compute.R
+++ b/r/tests/testthat/test-compute.R
@@ -208,6 +208,8 @@ test_that("register_user_defined_function() errors for unsupported specification
 test_that("user-defined functions work during multi-threaded execution", {
   skip_if_not(CanRunWithCapturedR())
   skip_if_not_available("dataset")
+  # Snappy has a UBSan issue: https://github.com/google/snappy/pull/148
+  skip_on_linux_devel()
 
   n_rows <- 10000
   n_partitions <- 10


[arrow] 03/06: ARROW-16612: [R] Fix compression inference from filename (#13625)

Posted by ks...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

kszucs pushed a commit to branch maint-9.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit 107163fec888e36a2d576d1f992f0e6f41ef7ad1
Author: Neal Richardson <ne...@gmail.com>
AuthorDate: Wed Jul 27 07:02:05 2022 -0400

    ARROW-16612: [R] Fix compression inference from filename (#13625)
    
    This is actually a much larger change than the original issue.
    
    * ~Infer compression from the file extension in `write_parquet()` and pass it to ParquetFileWriter rather than writing to a CompressedOutputStream, and don't wrap the file in a CompressedInputStream in `read_parquet()` because that doesn't work (and isn't how compression works for Parquet). Previously, reading from a file with extension `.parquet.gz` etc. would error unless you opened an input stream yourself. This is the original report from ARROW-16612.~ Cut and moved to [ARROW-17221](http [...]
    * Likewise for `read_feather()` and `write_feather()`, which also support compression within the file itself and not around it.
    * Since the whole "detect compression and wrap in a compressed stream" feature seems limited to CSV and JSON, and the changes here required hacking around it, I refactored it out of the internal functions `make_readable_file()` and `make_output_stream()` and now do it only in the csv/json functions.
    * In the process of refactoring, I noticed and fixed two bugs: (1) no matter what compression extension you provided to `make_output_stream()`, you would get a gzip-compressed stream because we weren't actually passing the codec to `CompressedOutputStream$create()`; (2) `.lz4` actually needs to be mapped to the "lz4_frame" codec; attempting to write a CSV to a stream created with `CompressedOutputStream$create(codec = "lz4")` raises an error. Neither was caught because our tests for this feature only te [...]
    * The refactoring should also mean that ARROW-16619 (inferring compression from a URL), as well as from a SubTreeFileSystem (S3 buckets, etc.), is now supported.
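
    A minimal R sketch of the user-facing behavior after this change, assuming the gzip codec is available in the build (it mirrors the new tests below):

        library(arrow)

        tbl <- data.frame(x = 1:5, y = letters[1:5])

        # write_csv_arrow() now infers the codec from the ".gz" extension and wraps
        # the sink in a CompressedOutputStream; read_csv_arrow() does the reverse.
        tfgz <- tempfile(fileext = ".csv.gz")
        write_csv_arrow(tbl, tfgz)
        read_csv_arrow(tfgz)  # decompressed transparently; .zst/.lz4 also work if those codecs are built

    Feather and Parquet are unchanged in this respect: compression lives inside those formats, so their writers still ignore the file extension (see ARROW-17221).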
    
    Authored-by: Neal Richardson <ne...@gmail.com>
    Signed-off-by: Neal Richardson <ne...@gmail.com>
---
 r/R/csv.R                          | 40 +++++++++++---------
 r/R/feather.R                      | 21 +++++++----
 r/R/io.R                           | 76 ++++++++++++--------------------------
 r/R/ipc-stream.R                   | 10 -----
 r/R/json.R                         |  5 +++
 r/R/parquet.R                      |  9 +++++
 r/man/make_readable_file.Rd        | 11 +-----
 r/man/read_feather.Rd              |  6 +--
 r/man/read_ipc_stream.Rd           |  6 ---
 r/man/write_feather.Rd             |  9 +++--
 r/man/write_ipc_stream.Rd          |  6 ---
 r/tests/testthat/test-compressed.R |  8 ++++
 r/tests/testthat/test-csv.R        | 25 ++++++++++++-
 r/tests/testthat/test-feather.R    | 16 ++++++++
 r/tests/testthat/test-parquet.R    | 16 ++++++++
 15 files changed, 145 insertions(+), 119 deletions(-)

diff --git a/r/R/csv.R b/r/R/csv.R
index 32ed0e4bee..6adbb40219 100644
--- a/r/R/csv.R
+++ b/r/R/csv.R
@@ -188,7 +188,12 @@ read_delim_arrow <- function(file,
   }
 
   if (!inherits(file, "InputStream")) {
+    compression <- detect_compression(file)
     file <- make_readable_file(file)
+    if (compression != "uncompressed") {
+      # TODO: accept compression and compression_level as args
+      file <- CompressedInputStream$create(file, compression)
+    }
     on.exit(file$close())
   }
   reader <- CsvTableReader$create(
@@ -699,7 +704,6 @@ write_csv_arrow <- function(x,
     )
   }
 
-  # default values are considered missing by base R
   if (missing(include_header) && !missing(col_names)) {
     include_header <- col_names
   }
@@ -712,16 +716,27 @@ write_csv_arrow <- function(x,
   }
 
   x_out <- x
-  if (is.data.frame(x)) {
-    x <- Table$create(x)
-  }
-
-  if (inherits(x, c("Dataset", "arrow_dplyr_query"))) {
-    x <- Scanner$create(x)$ToRecordBatchReader()
+  if (!inherits(x, "ArrowTabular")) {
+    tryCatch(
+      x <- as_record_batch_reader(x),
+      error = function(e) {
+        abort(
+          paste0(
+            "x must be an object of class 'data.frame', 'RecordBatch', ",
+            "'Dataset', 'Table', or 'RecordBatchReader' not '", class(x)[1], "'."
+          )
+        )
+      }
+    )
   }
 
   if (!inherits(sink, "OutputStream")) {
+    compression <- detect_compression(sink)
     sink <- make_output_stream(sink)
+    if (compression != "uncompressed") {
+      # TODO: accept compression and compression_level as args
+      sink <- CompressedOutputStream$create(sink, codec = compression)
+    }
     on.exit(sink$close())
   }
 
@@ -731,17 +746,6 @@ write_csv_arrow <- function(x,
     csv___WriteCSV__Table(x, write_options, sink)
   } else if (inherits(x, c("RecordBatchReader"))) {
     csv___WriteCSV__RecordBatchReader(x, write_options, sink)
-  } else {
-    abort(
-      c(
-        paste0(
-          paste(
-            "x must be an object of class 'data.frame', 'RecordBatch',",
-            "'Dataset', 'Table', or 'RecordBatchReader' not '"
-          ), class(x)[[1]], "'."
-        )
-      )
-    )
   }
 
   invisible(x_out)
diff --git a/r/R/feather.R b/r/R/feather.R
index 03c8a7b5f0..4e2e9947cb 100644
--- a/r/R/feather.R
+++ b/r/R/feather.R
@@ -38,8 +38,9 @@
 #' @param compression Name of compression codec to use, if any. Default is
 #' "lz4" if LZ4 is available in your build of the Arrow C++ library, otherwise
 #' "uncompressed". "zstd" is the other available codec and generally has better
-#' compression ratios in exchange for slower read and write performance
-#' See [codec_is_available()]. This option is not supported for V1.
+#' compression ratios in exchange for slower read and write performance.
+#' "lz4" is shorthand for the "lz4_frame" codec.
+#' See [codec_is_available()] for details. This option is not supported for V1.
 #' @param compression_level If `compression` is "zstd", you may
 #' specify an integer compression level. If omitted, the compression codec's
 #' default compression level is used.
@@ -67,11 +68,13 @@ write_feather <- function(x,
                           sink,
                           version = 2,
                           chunk_size = 65536L,
-                          compression = c("default", "lz4", "uncompressed", "zstd"),
+                          compression = c("default", "lz4", "lz4_frame", "uncompressed", "zstd"),
                           compression_level = NULL) {
   # Handle and validate options before touching data
   version <- as.integer(version)
   assert_that(version %in% 1:2)
+
+  # TODO(ARROW-17221): if (missing(compression)), we could detect_compression(sink) here
   compression <- match.arg(compression)
   chunk_size <- as.integer(chunk_size)
   assert_that(chunk_size > 0)
@@ -128,7 +131,7 @@ write_feather <- function(x,
 write_ipc_file <- function(x,
                            sink,
                            chunk_size = 65536L,
-                           compression = c("default", "lz4", "uncompressed", "zstd"),
+                           compression = c("default", "lz4", "lz4_frame", "uncompressed", "zstd"),
                            compression_level = NULL) {
   mc <- match.call()
   mc$version <- 2
@@ -147,7 +150,7 @@ write_ipc_file <- function(x,
 #'
 #' @inheritParams read_ipc_stream
 #' @inheritParams read_delim_arrow
-#' @param ... additional parameters, passed to [make_readable_file()].
+#' @inheritParams make_readable_file
 #'
 #' @return A `data.frame` if `as_data_frame` is `TRUE` (the default), or an
 #' Arrow [Table] otherwise
@@ -163,9 +166,13 @@ write_ipc_file <- function(x,
 #' dim(df)
 #' # Can select columns
 #' df <- read_feather(tf, col_select = starts_with("d"))
-read_feather <- function(file, col_select = NULL, as_data_frame = TRUE, ...) {
+read_feather <- function(file, col_select = NULL, as_data_frame = TRUE, mmap = TRUE) {
   if (!inherits(file, "RandomAccessFile")) {
-    file <- make_readable_file(file, ...)
+    # Compression is handled inside the IPC file format, so we don't need
+    # to detect from the file extension and wrap in a CompressedInputStream
+    # TODO: Why is this the only read_format() functions that allows passing
+    # mmap to make_readable_file?
+    file <- make_readable_file(file, mmap)
     on.exit(file$close())
   }
   reader <- FeatherReader$create(file)
diff --git a/r/R/io.R b/r/R/io.R
index 82e3847df5..fc664ed386 100644
--- a/r/R/io.R
+++ b/r/R/io.R
@@ -229,52 +229,31 @@ mmap_open <- function(path, mode = c("read", "write", "readwrite")) {
 #' Handle a range of possible input sources
 #' @param file A character file name, `raw` vector, or an Arrow input stream
 #' @param mmap Logical: whether to memory-map the file (default `TRUE`)
-#' @param compression If the file is compressed, created a [CompressedInputStream]
-#' with this compression codec, either a [Codec] or the string name of one.
-#' If `NULL` (default) and `file` is a string file name, the function will try
-#' to infer compression from the file extension.
-#' @param filesystem If not `NULL`, `file` will be opened via the
-#' `filesystem$OpenInputFile()` filesystem method, rather than the `io` module's
-#' `MemoryMappedFile` or `ReadableFile` constructors.
 #' @return An `InputStream` or a subclass of one.
 #' @keywords internal
-make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem = NULL) {
+make_readable_file <- function(file, mmap = TRUE) {
   if (inherits(file, "SubTreeFileSystem")) {
     filesystem <- file$base_fs
-    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject file names
-    # with trailing slashes, so we need to remove it here.
-    file <- sub("/$", "", file$base_path)
-  }
-  if (is.string(file)) {
+    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject
+    # file names with trailing slashes, so we need to remove it here.
+    path <- sub("/$", "", file$base_path)
+    file <- filesystem$OpenInputFile(path)
+  } else if (is.string(file)) {
     if (is_url(file)) {
       file <- tryCatch(
         {
           fs_and_path <- FileSystem$from_uri(file)
-          filesystem <- fs_and_path$fs
-          fs_and_path$path
+          fs_and_path$fs$OpenInputFile(fs_and_path$path)
         },
         error = function(e) {
           MakeRConnectionInputStream(url(file, open = "rb"))
         }
       )
-    }
-
-    if (is.null(compression)) {
-      # Infer compression from the file path
-      compression <- detect_compression(file)
-    }
-
-    if (!is.null(filesystem)) {
-      file <- filesystem$OpenInputFile(file)
-    } else if (is.string(file) && isTRUE(mmap)) {
+    } else if (isTRUE(mmap)) {
       file <- mmap_open(file)
-    } else if (is.string(file)) {
+    } else {
       file <- ReadableFile$create(file)
     }
-
-    if (is_compressed(compression)) {
-      file <- CompressedInputStream$create(file, compression)
-    }
   } else if (inherits(file, c("raw", "Buffer"))) {
     file <- BufferReader$create(file)
   } else if (inherits(file, "connection")) {
@@ -294,7 +273,7 @@ make_readable_file <- function(file, mmap = TRUE, compression = NULL, filesystem
   file
 }
 
-make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
+make_output_stream <- function(x) {
   if (inherits(x, "connection")) {
     if (!isOpen(x)) {
       open(x, "wb")
@@ -305,45 +284,36 @@ make_output_stream <- function(x, filesystem = NULL, compression = NULL) {
 
   if (inherits(x, "SubTreeFileSystem")) {
     filesystem <- x$base_fs
-    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject file names
-    # with trailing slashes, so we need to remove it here.
-    x <- sub("/$", "", x$base_path)
+    # SubTreeFileSystem adds a slash to base_path, but filesystems will reject
+    # file names with trailing slashes, so we need to remove it here.
+    path <- sub("/$", "", x$base_path)
+    filesystem$OpenOutputStream(path)
   } else if (is_url(x)) {
     fs_and_path <- FileSystem$from_uri(x)
-    filesystem <- fs_and_path$fs
-    x <- fs_and_path$path
-  }
-
-  if (is.null(compression)) {
-    # Infer compression from sink
-    compression <- detect_compression(x)
-  }
-
-  assert_that(is.string(x))
-  if (is.null(filesystem) && is_compressed(compression)) {
-    CompressedOutputStream$create(x) ## compressed local
-  } else if (is.null(filesystem) && !is_compressed(compression)) {
-    FileOutputStream$create(x) ## uncompressed local
-  } else if (!is.null(filesystem) && is_compressed(compression)) {
-    CompressedOutputStream$create(filesystem$OpenOutputStream(x)) ## compressed remote
+    fs_and_path$fs$OpenOutputStream(fs_and_path$path)
   } else {
-    filesystem$OpenOutputStream(x) ## uncompressed remote
+    assert_that(is.string(x))
+    FileOutputStream$create(x)
   }
 }
 
 detect_compression <- function(path) {
+  if (inherits(path, "SubTreeFileSystem")) {
+    path <- path$base_path
+  }
   if (!is.string(path)) {
     return("uncompressed")
   }
 
-  # Remove any trailing slashes, which FileSystem$from_uri may add
+  # Remove any trailing slashes, which SubTreeFileSystem may add
   path <- sub("/$", "", path)
 
   switch(tools::file_ext(path),
     bz2 = "bz2",
     gz = "gzip",
-    lz4 = "lz4",
+    lz4 = "lz4_frame",
     zst = "zstd",
+    snappy = "snappy",
     "uncompressed"
   )
 }
diff --git a/r/R/ipc-stream.R b/r/R/ipc-stream.R
index 9fea0f9e52..dd59d0f4df 100644
--- a/r/R/ipc-stream.R
+++ b/r/R/ipc-stream.R
@@ -23,11 +23,6 @@
 #' a "stream" format and a "file" format, known as Feather. `write_ipc_stream()`
 #' and [write_feather()] write those formats, respectively.
 #'
-#' `write_arrow()`, a wrapper around `write_ipc_stream()` and `write_feather()`
-#' with some nonstandard behavior, is deprecated. You should explicitly choose
-#' the function that will write the desired IPC format (stream or file) since
-#' either can be written to a file or `OutputStream`.
-#'
 #' @inheritParams write_feather
 #' @param ... extra parameters passed to `write_feather()`.
 #'
@@ -87,11 +82,6 @@ write_to_raw <- function(x, format = c("stream", "file")) {
 #' a "stream" format and a "file" format, known as Feather. `read_ipc_stream()`
 #' and [read_feather()] read those formats, respectively.
 #'
-#' `read_arrow()`, a wrapper around `read_ipc_stream()` and `read_feather()`,
-#' is deprecated. You should explicitly choose
-#' the function that will read the desired IPC format (stream or file) since
-#' a file or `InputStream` may contain either.
-#'
 #' @param file A character file name or URI, `raw` vector, an Arrow input stream,
 #' or a `FileSystem` with path (`SubTreeFileSystem`).
 #' If a file name or URI, an Arrow [InputStream] will be opened and
diff --git a/r/R/json.R b/r/R/json.R
index 19cf6a9299..2b1f4916cb 100644
--- a/r/R/json.R
+++ b/r/R/json.R
@@ -44,7 +44,12 @@ read_json_arrow <- function(file,
                             schema = NULL,
                             ...) {
   if (!inherits(file, "InputStream")) {
+    compression <- detect_compression(file)
     file <- make_readable_file(file)
+    if (compression != "uncompressed") {
+      # TODO: accept compression and compression_level as args
+      file <- CompressedInputStream$create(file, compression)
+    }
     on.exit(file$close())
   }
   tab <- JsonTableReader$create(file, schema = schema, ...)$Read()
diff --git a/r/R/parquet.R b/r/R/parquet.R
index 8cd9daa857..0b3f93b20e 100644
--- a/r/R/parquet.R
+++ b/r/R/parquet.R
@@ -36,9 +36,17 @@
 read_parquet <- function(file,
                          col_select = NULL,
                          as_data_frame = TRUE,
+                         # TODO: for consistency with other readers/writers,
+                         # these properties should be enumerated as args here,
+                         # and ParquetArrowReaderProperties$create() should
+                         # accept them, as with ParquetWriterProperties.
+                         # Assembling `props` yourself is something you do with
+                         # ParquetFileReader but not here.
                          props = ParquetArrowReaderProperties$create(),
                          ...) {
   if (!inherits(file, "RandomAccessFile")) {
+    # Compression is handled inside the parquet file format, so we don't need
+    # to detect from the file extension and wrap in a CompressedInputStream
     file <- make_readable_file(file)
     on.exit(file$close())
   }
@@ -156,6 +164,7 @@ write_parquet <- function(x,
   x <- as_writable_table(x)
 
   if (!inherits(sink, "OutputStream")) {
+    # TODO(ARROW-17221): if (missing(compression)), we could detect_compression(sink) here
     sink <- make_output_stream(sink)
     on.exit(sink$close())
   }
diff --git a/r/man/make_readable_file.Rd b/r/man/make_readable_file.Rd
index fe2e298261..1544815211 100644
--- a/r/man/make_readable_file.Rd
+++ b/r/man/make_readable_file.Rd
@@ -4,21 +4,12 @@
 \alias{make_readable_file}
 \title{Handle a range of possible input sources}
 \usage{
-make_readable_file(file, mmap = TRUE, compression = NULL, filesystem = NULL)
+make_readable_file(file, mmap = TRUE)
 }
 \arguments{
 \item{file}{A character file name, \code{raw} vector, or an Arrow input stream}
 
 \item{mmap}{Logical: whether to memory-map the file (default \code{TRUE})}
-
-\item{compression}{If the file is compressed, created a \link{CompressedInputStream}
-with this compression codec, either a \link{Codec} or the string name of one.
-If \code{NULL} (default) and \code{file} is a string file name, the function will try
-to infer compression from the file extension.}
-
-\item{filesystem}{If not \code{NULL}, \code{file} will be opened via the
-\code{filesystem$OpenInputFile()} filesystem method, rather than the \code{io} module's
-\code{MemoryMappedFile} or \code{ReadableFile} constructors.}
 }
 \value{
 An \code{InputStream} or a subclass of one.
diff --git a/r/man/read_feather.Rd b/r/man/read_feather.Rd
index 07d20b8e01..218a163b99 100644
--- a/r/man/read_feather.Rd
+++ b/r/man/read_feather.Rd
@@ -5,9 +5,9 @@
 \alias{read_ipc_file}
 \title{Read a Feather file (an Arrow IPC file)}
 \usage{
-read_feather(file, col_select = NULL, as_data_frame = TRUE, ...)
+read_feather(file, col_select = NULL, as_data_frame = TRUE, mmap = TRUE)
 
-read_ipc_file(file, col_select = NULL, as_data_frame = TRUE, ...)
+read_ipc_file(file, col_select = NULL, as_data_frame = TRUE, mmap = TRUE)
 }
 \arguments{
 \item{file}{A character file name or URI, \code{raw} vector, an Arrow input stream,
@@ -24,7 +24,7 @@ of columns, as used in \code{dplyr::select()}.}
 \item{as_data_frame}{Should the function return a \code{data.frame} (default) or
 an Arrow \link{Table}?}
 
-\item{...}{additional parameters, passed to \code{\link[=make_readable_file]{make_readable_file()}}.}
+\item{mmap}{Logical: whether to memory-map the file (default \code{TRUE})}
 }
 \value{
 A \code{data.frame} if \code{as_data_frame} is \code{TRUE} (the default), or an
diff --git a/r/man/read_ipc_stream.Rd b/r/man/read_ipc_stream.Rd
index 567ee9882b..63b50e7c1b 100644
--- a/r/man/read_ipc_stream.Rd
+++ b/r/man/read_ipc_stream.Rd
@@ -27,12 +27,6 @@ Apache Arrow defines two formats for \href{https://arrow.apache.org/docs/format/
 a "stream" format and a "file" format, known as Feather. \code{read_ipc_stream()}
 and \code{\link[=read_feather]{read_feather()}} read those formats, respectively.
 }
-\details{
-\code{read_arrow()}, a wrapper around \code{read_ipc_stream()} and \code{read_feather()},
-is deprecated. You should explicitly choose
-the function that will read the desired IPC format (stream or file) since
-a file or \code{InputStream} may contain either.
-}
 \seealso{
 \code{\link[=write_feather]{write_feather()}} for writing IPC files. \link{RecordBatchReader} for a
 lower-level interface.
diff --git a/r/man/write_feather.Rd b/r/man/write_feather.Rd
index 85c83ff04b..2d8a86f969 100644
--- a/r/man/write_feather.Rd
+++ b/r/man/write_feather.Rd
@@ -10,7 +10,7 @@ write_feather(
   sink,
   version = 2,
   chunk_size = 65536L,
-  compression = c("default", "lz4", "uncompressed", "zstd"),
+  compression = c("default", "lz4", "lz4_frame", "uncompressed", "zstd"),
   compression_level = NULL
 )
 
@@ -18,7 +18,7 @@ write_ipc_file(
   x,
   sink,
   chunk_size = 65536L,
-  compression = c("default", "lz4", "uncompressed", "zstd"),
+  compression = c("default", "lz4", "lz4_frame", "uncompressed", "zstd"),
   compression_level = NULL
 )
 }
@@ -37,8 +37,9 @@ random row access. Default is 64K. This option is not supported for V1.}
 \item{compression}{Name of compression codec to use, if any. Default is
 "lz4" if LZ4 is available in your build of the Arrow C++ library, otherwise
 "uncompressed". "zstd" is the other available codec and generally has better
-compression ratios in exchange for slower read and write performance
-See \code{\link[=codec_is_available]{codec_is_available()}}. This option is not supported for V1.}
+compression ratios in exchange for slower read and write performance.
+"lz4" is shorthand for the "lz4_frame" codec.
+See \code{\link[=codec_is_available]{codec_is_available()}} for details. This option is not supported for V1.}
 
 \item{compression_level}{If \code{compression} is "zstd", you may
 specify an integer compression level. If omitted, the compression codec's
diff --git a/r/man/write_ipc_stream.Rd b/r/man/write_ipc_stream.Rd
index 60c3197732..094e3ad11a 100644
--- a/r/man/write_ipc_stream.Rd
+++ b/r/man/write_ipc_stream.Rd
@@ -22,12 +22,6 @@ Apache Arrow defines two formats for \href{https://arrow.apache.org/docs/format/
 a "stream" format and a "file" format, known as Feather. \code{write_ipc_stream()}
 and \code{\link[=write_feather]{write_feather()}} write those formats, respectively.
 }
-\details{
-\code{write_arrow()}, a wrapper around \code{write_ipc_stream()} and \code{write_feather()}
-with some nonstandard behavior, is deprecated. You should explicitly choose
-the function that will write the desired IPC format (stream or file) since
-either can be written to a file or \code{OutputStream}.
-}
 \examples{
 tf <- tempfile()
 on.exit(unlink(tf))
diff --git a/r/tests/testthat/test-compressed.R b/r/tests/testthat/test-compressed.R
index 485e16769f..7d1c1cfd39 100644
--- a/r/tests/testthat/test-compressed.R
+++ b/r/tests/testthat/test-compressed.R
@@ -40,6 +40,14 @@ test_that("Codec attributes", {
   expect_error(cod$level)
 })
 
+test_that("Default compression_level for zstd", {
+  skip_if_not_available("zstd")
+  cod <- Codec$create("zstd")
+  expect_equal(cod$name, "zstd")
+  # TODO: implement $level
+  expect_error(cod$level)
+})
+
 test_that("can write Buffer to CompressedOutputStream and read back in CompressedInputStream", {
   skip_if_not_available("gzip")
   buf <- buffer(as.raw(sample(0:255, size = 1024, replace = TRUE)))
diff --git a/r/tests/testthat/test-csv.R b/r/tests/testthat/test-csv.R
index d4878e6d67..cd8da2625c 100644
--- a/r/tests/testthat/test-csv.R
+++ b/r/tests/testthat/test-csv.R
@@ -566,8 +566,6 @@ test_that("read/write compressed file successfully", {
   skip_if_not_available("gzip")
   tfgz <- tempfile(fileext = ".csv.gz")
   tf <- tempfile(fileext = ".csv")
-  on.exit(unlink(tf))
-  on.exit(unlink(tfgz))
 
   write_csv_arrow(tbl, tf)
   write_csv_arrow(tbl, tfgz)
@@ -577,6 +575,29 @@ test_that("read/write compressed file successfully", {
     read_csv_arrow(tfgz),
     tbl
   )
+  skip_if_not_available("lz4")
+  tflz4 <- tempfile(fileext = ".csv.lz4")
+  write_csv_arrow(tbl, tflz4)
+  expect_false(file.size(tfgz) == file.size(tflz4))
+  expect_identical(
+    read_csv_arrow(tflz4),
+    tbl
+  )
+})
+
+test_that("read/write compressed filesystem path", {
+  skip_if_not_available("zstd")
+  tfzst <- tempfile(fileext = ".csv.zst")
+  fs <- LocalFileSystem$create()$path(tfzst)
+  write_csv_arrow(tbl, fs)
+
+  tf <- tempfile(fileext = ".csv")
+  write_csv_arrow(tbl, tf)
+  expect_lt(file.size(tfzst), file.size(tf))
+  expect_identical(
+    read_csv_arrow(fs),
+    tbl
+  )
 })
 
 test_that("read_csv_arrow() can read sub-second timestamps with col_types T setting (ARROW-15599)", {
diff --git a/r/tests/testthat/test-feather.R b/r/tests/testthat/test-feather.R
index 1ef2ecf3e9..8d7a43ad06 100644
--- a/r/tests/testthat/test-feather.R
+++ b/r/tests/testthat/test-feather.R
@@ -207,6 +207,22 @@ test_that("read_feather requires RandomAccessFile and errors nicely otherwise (A
   )
 })
 
+test_that("write_feather() does not detect compression from filename", {
+  # TODO(ARROW-17221): should this be supported?
+  without <- tempfile(fileext = ".arrow")
+  with_zst <- tempfile(fileext = ".arrow.zst")
+  write_feather(mtcars, without)
+  write_feather(mtcars, with_zst)
+  expect_equal(file.size(without), file.size(with_zst))
+})
+
+test_that("read_feather() handles (ignores) compression in filename", {
+  df <- tibble::tibble(x = 1:5)
+  f <- tempfile(fileext = ".parquet.zst")
+  write_feather(df, f)
+  expect_equal(read_feather(f), df)
+})
+
 test_that("read_feather() and write_feather() accept connection objects", {
   skip_if_not(CanRunWithCapturedR())
 
diff --git a/r/tests/testthat/test-parquet.R b/r/tests/testthat/test-parquet.R
index b75892bc84..32170534a4 100644
--- a/r/tests/testthat/test-parquet.R
+++ b/r/tests/testthat/test-parquet.R
@@ -185,6 +185,22 @@ test_that("write_parquet() defaults to snappy compression", {
   expect_equal(file.size(tmp1), file.size(tmp2))
 })
 
+test_that("write_parquet() does not detect compression from filename", {
+  # TODO(ARROW-17221): should this be supported?
+  without <- tempfile(fileext = ".parquet")
+  with_gz <- tempfile(fileext = ".parquet.gz")
+  write_parquet(mtcars, without)
+  write_parquet(mtcars, with_gz)
+  expect_equal(file.size(with_gz), file.size(without))
+})
+
+test_that("read_parquet() handles (ignores) compression in filename", {
+  df <- tibble::tibble(x = 1:5)
+  f <- tempfile(fileext = ".parquet.gz")
+  write_parquet(df, f)
+  expect_equal(read_parquet(f), df)
+})
+
 test_that("Factors are preserved when writing/reading from Parquet", {
   fct <- factor(c("a", "b"), levels = c("c", "a", "b"))
   ord <- factor(c("a", "b"), levels = c("c", "a", "b"), ordered = TRUE)