Posted to commits@arrow.apache.org by ko...@apache.org on 2022/12/24 00:13:19 UTC

[arrow] branch master updated: GH-14968: [Python] Fix segfault for dataset ORC write (#15049)

This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 5feabc57bd GH-14968: [Python] Fix segfault for dataset ORC write (#15049)
5feabc57bd is described below

commit 5feabc57bd75a54b5fe003988a14394aa621df05
Author: Dr. Jan-Philip Gehrcke <jg...@googlemail.com>
AuthorDate: Sat Dec 24 01:13:12 2022 +0100

    GH-14968: [Python] Fix segfault for dataset ORC write (#15049)
    
    This is my attempt to address https://github.com/apache/arrow/issues/14968
    pragmatically: OrcFileFormat::DefaultWriteOptions() returns a null pointer
    because dataset writing is not yet implemented for ORC, and wrapping that
    null pointer on the Python side later led to a segfault. This change
    documents the nullable return in the C++ header and makes
    make_write_options() raise NotImplementedError instead.

    This is my first PR for Arrow, so I'd appreciate a careful look and all
    the pointers :).
    * Closes: #14968
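
    For illustration, a minimal reproducer (paths here are hypothetical;
    before this change the final call crashed the interpreter, afterwards
    it raises NotImplementedError):

        import pyarrow as pa
        import pyarrow.dataset as ds

        table = pa.table({"a": range(10)})
        # Parquet implements dataset writing, so this succeeds:
        ds.write_dataset(table, "/tmp/demo_parquet", format="parquet")
        # ORC does not yet; this call used to segfault:
        ds.write_dataset(table, "/tmp/demo_orc", format="orc")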
    
    Authored-by: Dr. Jan-Philip Gehrcke <jg...@googlemail.com>
    Signed-off-by: Sutou Kouhei <ko...@clear-code.com>
---
 cpp/src/arrow/dataset/file_base.h    |  3 +++
 docs/source/cpp/dataset.rst          |  7 ++++---
 docs/source/python/dataset.rst       | 12 ++++++------
 python/pyarrow/_dataset.pyx          |  9 ++++++++-
 python/pyarrow/tests/test_dataset.py | 18 ++++++++++++++++++
 5 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/cpp/src/arrow/dataset/file_base.h b/cpp/src/arrow/dataset/file_base.h
index dab7510d5b..2b8421ce16 100644
--- a/cpp/src/arrow/dataset/file_base.h
+++ b/cpp/src/arrow/dataset/file_base.h
@@ -200,6 +200,9 @@ class ARROW_DS_EXPORT FileFormat : public std::enable_shared_from_this<FileForma
       fs::FileLocator destination_locator) const = 0;
 
   /// \brief Get default write options for this format.
+  ///
+  /// May return a null shared_ptr if this file format does not yet support
+  /// writing datasets.
   virtual std::shared_ptr<FileWriteOptions> DefaultWriteOptions() = 0;
 
  protected:
diff --git a/docs/source/cpp/dataset.rst b/docs/source/cpp/dataset.rst
index 6a7d7cfb3f..1f5d0476c2 100644
--- a/docs/source/cpp/dataset.rst
+++ b/docs/source/cpp/dataset.rst
@@ -35,14 +35,15 @@ Tabular Datasets
 The Arrow Datasets library provides functionality to efficiently work with
 tabular, potentially larger than memory, and multi-file datasets. This includes:
 
-* A unified interface that supports different sources and file formats
-  (currently, Parquet, ORC, Feather / Arrow IPC, and CSV files) and different
-  file systems (local, cloud).
+* A unified interface that supports different sources and file formats and
+  different file systems (local, cloud).
 * Discovery of sources (crawling directories, handling partitioned datasets with
   various partitioning schemes, basic schema normalization, ...)
 * Optimized reading with predicate pushdown (filtering rows), projection
   (selecting and deriving columns), and optionally parallel reading.
 
+The currently supported file formats are Parquet, Feather / Arrow IPC, CSV,
+and ORC (note that ORC datasets can currently only be read, not yet written).
 The goal is to expand support to other file formats and data sources
 (e.g. database connections) in the future.
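
As a sketch of what the predicate pushdown and projection mentioned above look
like through the Python bindings (the path and column name below are
hypothetical):

    import pyarrow.dataset as ds

    dataset = ds.dataset("data/", format="parquet")
    # Only rows with a > 5 are read; only column "a" is materialized:
    table = dataset.to_table(columns=["a"], filter=ds.field("a") > 5)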
 
diff --git a/docs/source/python/dataset.rst b/docs/source/python/dataset.rst
index 2ac592d8d0..6be5a800a5 100644
--- a/docs/source/python/dataset.rst
+++ b/docs/source/python/dataset.rst
@@ -41,17 +41,17 @@ Tabular Datasets
 The ``pyarrow.dataset`` module provides functionality to efficiently work with
 tabular, potentially larger than memory, and multi-file datasets. This includes:
 
-* A unified interface that supports different sources and file formats
-  (Parquet, ORC, Feather / Arrow IPC, and CSV files) and different file systems
-  (local, cloud).
+* A unified interface that supports different sources and file formats and
+  different file systems (local, cloud).
 * Discovery of sources (crawling directories, handling directory-based
   partitioned datasets, basic schema normalization, ...)
 * Optimized reading with predicate pushdown (filtering rows), projection
   (selecting and deriving columns), and optionally parallel reading.
 
-Currently, only Parquet, ORC, Feather / Arrow IPC, and CSV files are
-supported. The goal is to expand this in the future to other file formats and
-data sources (e.g. database connections).
+The currently supported file formats are Parquet, Feather / Arrow IPC, CSV,
+and ORC (note that ORC datasets can currently only be read, not yet written).
+The goal is to expand support to other file formats and data sources
+(e.g. database connections) in the future.
 
 For those familiar with the existing :class:`pyarrow.parquet.ParquetDataset` for
 reading Parquet datasets: ``pyarrow.dataset``'s goal is similar but not specific
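
The source discovery described in this section can, for example, crawl a
hive-style partitioned layout (the directory layout below is hypothetical):

    import pyarrow.dataset as ds

    # e.g. data/year=2021/part-0.parquet, data/year=2022/part-0.parquet
    dataset = ds.dataset("data/", format="parquet", partitioning="hive")
    # "year" is discovered as a partition column and usable in filters:
    table = dataset.to_table(filter=ds.field("year") == 2022)
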
diff --git a/python/pyarrow/_dataset.pyx b/python/pyarrow/_dataset.pyx
index 7c504775e7..42781ff2aa 100644
--- a/python/pyarrow/_dataset.pyx
+++ b/python/pyarrow/_dataset.pyx
@@ -916,7 +916,14 @@ cdef class FileFormat(_Weakrefable):
         return Fragment.wrap(move(c_fragment))
 
     def make_write_options(self):
-        return FileWriteOptions.wrap(self.format.DefaultWriteOptions())
+        sp_write_options = self.format.DefaultWriteOptions()
+        if sp_write_options.get() == nullptr:
+            # DefaultWriteOptions() may return `nullptr` which means that
+            # the format does not yet support writing datasets.
+            raise NotImplementedError(
+                "Writing datasets not yet implemented for this file format."
+            )
+        return FileWriteOptions.wrap(sp_write_options)
 
     @property
     def default_extname(self):
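
With this guard, callers get a clean Python exception instead of a crash.
Roughly (a sketch, assuming an ORC-enabled build, with ParquetFileFormat as
an example of a format that does implement writing):

    import pyarrow.dataset as ds

    ds.ParquetFileFormat().make_write_options()  # returns FileWriteOptions
    ds.OrcFileFormat().make_write_options()      # raises NotImplementedError
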
diff --git a/python/pyarrow/tests/test_dataset.py b/python/pyarrow/tests/test_dataset.py
index ecac5211a4..27edc8afad 100644
--- a/python/pyarrow/tests/test_dataset.py
+++ b/python/pyarrow/tests/test_dataset.py
@@ -3074,6 +3074,24 @@ def test_orc_format_not_supported():
             ds.dataset(".", format="orc")
 
 
+@pytest.mark.orc
+def test_orc_writer_not_implemented_for_dataset():
+    with pytest.raises(
+        NotImplementedError,
+        match="Writing datasets not yet implemented for this file format"
+    ):
+        ds.write_dataset(
+            pa.table({"a": range(10)}), format='orc', base_dir='/tmp'
+        )
+
+    of = ds.OrcFileFormat()
+    with pytest.raises(
+        NotImplementedError,
+        match="Writing datasets not yet implemented for this file format"
+    ):
+        of.make_write_options()
+
+
 @pytest.mark.pandas
 def test_csv_format(tempdir, dataset_reader):
     table = pa.table({'a': pa.array([1, 2, 3], type="int64"),