You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/01 14:00:47 UTC

[GitHub] [arrow] sanjibansg opened a new pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

sanjibansg opened a new pull request #12530:
URL: https://github.com/apache/arrow/pull/12530


   This PR adds support for Filename-based partitioning. With this change, dataset partitions will be formed by the filenames separated by `_`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821919985



##########
File path: cpp/src/arrow/filesystem/path_util.cc
##########
@@ -33,6 +33,7 @@ namespace internal {
 std::vector<std::string> SplitAbstractPath(const std::string& path) {
   std::vector<std::string> parts;
   auto v = util::string_view(path);
+

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791280



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing

Review comment:
       Removed backwards compatibility.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821004375



##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -249,6 +249,16 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
         write_options_(write_options),
         writer_state_(writer_state) {}
 
+  DatasetWriterDirectoryQueue(std::string directory, std::string prefix,

Review comment:
       This should replace the existing constructor.  This class is only used internally (e.g. it isn't exposed via dataset_writer.h) and so it only ever gets constructed in exactly one place so we only need to support one type of constructor.

##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -342,11 +357,12 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
   Make(util::AsyncTaskGroup* task_group,
        const FileSystemDatasetWriteOptions& write_options,
        DatasetWriterState* writer_state, std::shared_ptr<Schema> schema,
-       std::string dir) {
+       std::string directory, std::string prefix) {
     auto dir_queue = util::MakeUniqueAsync<DatasetWriterDirectoryQueue>(
-        std::move(dir), std::move(schema), write_options, writer_state);
+        std::move(directory), std::move(prefix), std::move(schema), write_options,
+        writer_state);
     RETURN_NOT_OK(task_group->AddTask(dir_queue->on_closed()));
-    dir_queue->PrepareDirectory();
+    dir_queue->PrepareDirectory(prefix);

Review comment:
       By this point you have already called `std::move` on `prefix` so that variable is no longer valid.  You get away with it when the prefix is small because C++'s small string optimization means that a move on a small string is a no-op.  However, with a large string, this could lead to errors.
   
   Also, since we've already stored it in `prefix_` we can just use `prefix_` directly and `PrepareDirectory` doesn't need to take any arguments.

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -61,12 +61,12 @@ class TestPartitioning : public ::testing::Test {
     // formatted partition expressions are bound to the schema of the dataset being

Review comment:
       This method should be updated to:
   
   ```
   void AssertFormat(compute::Expression expr, const std::string& expected_directory, const std::string& expected_prefix  = "")
   ```
   
   Then it should check both `first` and `second`.  Finally, we should have at least one test case for filename partitioning that uses `AssertFormat`.

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -561,6 +624,73 @@ class DirectoryPartitioningFactory : public KeyValuePartitioningFactory {
   std::vector<std::string> field_names_;
 };
 
+class FilenamePartitioningFactory : public KeyValuePartitioningFactory {
+ public:
+  FilenamePartitioningFactory(std::vector<std::string> field_names,
+                              PartitioningFactoryOptions options)
+      : KeyValuePartitioningFactory(options), field_names_(std::move(field_names)) {
+    Reset();
+    util::InitializeUTF8();
+  }
+
+  std::string type_name() const override { return "filename"; }
+
+  Result<std::shared_ptr<Schema>> Inspect(
+      const std::vector<std::string>& paths) override {
+    for (auto path : paths) {

Review comment:
       ```suggestion
       for (const auto& path : paths) {
   ```

##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -326,9 +336,14 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
 
   uint64_t rows_written() const { return rows_written_; }
 
-  void PrepareDirectory() {
-    init_future_ = DeferNotOk(write_options_.filesystem->io_context().executor()->Submit(
-        [this]() { return write_options_.filesystem->CreateDir(directory_); }));
+  void PrepareDirectory(std::string& prefix) {
+    if (prefix.empty()) {

Review comment:
       ```suggestion
       if (!directory_.empty()) {
   ```
   At the moment, `!directory_.empty() == prefix.empty()` so this works but in the future I think it would be very possible for some custom partitioning to manipulate both the prefix and the directory (e.g. maybe the user wants to partition directories by year but files by month).

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -55,6 +60,11 @@ Status ValidateAbstractPathParts(const std::vector<std::string>& parts);
 ARROW_EXPORT
 std::string ConcatAbstractPath(const std::string& base, const std::string& stem);
 
+// Append a non-empty stem to an abstract path with a filename prefix.
+ARROW_EXPORT
+std::string ConcatAbstractPath(const std::string& base, const std::string& prefix,

Review comment:
       I think this method is a little too single purpose.  This file is in the `filesystem` module and is meant to contain general purpose utilities that are common for anyone working with files.  There is no concept of "partitions" or "prefixes" in the domain of this module.  Perhaps you could instead create a more general version:
   
   ```
   std::string ConcatAbstractPaths(const std::vector<std::string>& parts);
   ```

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -38,6 +39,10 @@ constexpr char kSep = '/';
 ARROW_EXPORT
 std::vector<std::string> SplitAbstractPath(const std::string& s);
 
+// Split a filename into its individual partitions.
+ARROW_EXPORT
+std::vector<std::string> SplitFilename(const std::string& s);

Review comment:
       Similar to some of the comments I made below, this method should either move to `partitioning.h` or be made more generic.  Users that are using `path_util.h` aren't going to be aware of the fact that we are splitting on `_` and even if we document it better they might be confused why such a method would exist since its purpose is very unique to partitioning.

##########
File path: cpp/src/arrow/filesystem/path_util.cc
##########
@@ -33,6 +33,7 @@ namespace internal {
 std::vector<std::string> SplitAbstractPath(const std::string& path) {
   std::vector<std::string> parts;
   auto v = util::string_view(path);
+

Review comment:
       ```suggestion
   ```

##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -468,13 +485,15 @@ class DatasetWriter::DatasetWriterImpl : public util::AsyncDestroyable {
   }
 
   Future<> DoWriteRecordBatch(std::shared_ptr<RecordBatch> batch,
-                              const std::string& directory) {
+                              const std::string& directory, const std::string& prefix) {
     ARROW_ASSIGN_OR_RAISE(
         auto dir_queue_itr,
         ::arrow::internal::GetOrInsertGenerated(
-            &directory_queues_, directory, [this, &batch](const std::string& dir) {
-              return DatasetWriterDirectoryQueue::Make(
-                  &task_group_, write_options_, &writer_state_, batch->schema(), dir);
+            &directory_queues_, directory + prefix,
+            [this, &batch, &directory, &prefix](const std::string& dir) {

Review comment:
       ```suggestion
               [this, &batch, &directory, &prefix](const std::string& key) {
   ```
   A small change that may help future readers understand why we need to capture `directory` instead of using `dir`.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):

Review comment:
       We should add some python tests which cover this logic.  Hopefully we can adapt some of the existing partitioning tests pretty easily.

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -113,6 +123,27 @@ std::string JoinAbstractPath(const StringRange& range) {
   return JoinAbstractPath(range.begin(), range.end());
 }
 
+// Join the components of filename partitions
+template <class StringIt>
+std::string JoinFilenamePartitions(StringIt it, StringIt end) {

Review comment:
       Same concern here.  This class is meant to be more generic so we can't refer to things like "partitions" and `_` which are concepts that only make sense within the domain of `partitioning.h`.
   
   Perhaps move this method inside of `partitioning.cc`?  Or you could make a generic `JoinPaths(const StringRange& range, const std::string& sep)` but I wouldn't expect it to put `sep` at the end of the path.

##########
File path: cpp/src/arrow/dataset/dataset_writer_test.cc
##########
@@ -206,6 +206,14 @@ TEST_F(DatasetWriterTestFixture, Basic) {
   AssertCreatedData({{"testdir/chunk-0.arrow", 0, 100}});
 }
 
+TEST_F(DatasetWriterTestFixture, BasicFilePrefix) {

Review comment:
       Can you add a test that uses both a directory and a prefix?  We don't have any partitioning schemes (yet) that can generate this sort of thing but it should be perfectly valid as far as this class is concerned.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838000260



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))

Review comment:
       Corrected that, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r831765088



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -61,6 +61,16 @@ Result<std::string> SafeUriUnescape(util::string_view encoded) {
   }
   return decoded;
 }
+
+std::string StripNonPrefix(const std::string& path) {
+  std::string v;
+  auto non_prefix_index = path.rfind(kFilenamePartitionSep);
+  if (path.length() > 0 && non_prefix_index != std::string::npos) {

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791453



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       Moved to base class.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791368



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833816568



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       I am not very sure of this. `dictionaries` is defined inside KeyValuePartitioning which I don't think has a `cdef cppclass` implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r834091440



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -76,7 +78,11 @@ class ARROW_DS_EXPORT Partitioning {
   /// \brief Parse a path into a partition expression
   virtual Result<compute::Expression> Parse(const std::string& path) const = 0;
 
-  virtual Result<std::string> Format(const compute::Expression& expr) const = 0;
+  struct PartitionPathFormat {
+    std::string directory, prefix;
+  };
+
+  virtual Result<PartitionPathFormat> Format(const compute::Expression& expr) const = 0;

Review comment:
       Added docstring for `PartitionPathFormat`

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -329,6 +342,44 @@ Result<std::vector<KeyValuePartitioning::Key>> DirectoryPartitioning::ParseKeys(
   return keys;
 }
 
+FilenamePartitioning::FilenamePartitioning(std::shared_ptr<Schema> schema,
+                                           ArrayVector dictionaries,
+                                           KeyValuePartitioningOptions options)
+    : KeyValuePartitioning(std::move(schema), std::move(dictionaries), options) {
+  util::InitializeUTF8();
+}
+
+Result<std::vector<KeyValuePartitioning::Key>> FilenamePartitioning::ParseKeys(
+    const std::string& path) const {
+  std::vector<Key> keys;

Review comment:
       Made the change.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1440,7 +1441,6 @@ cdef class DirectoryPartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
-

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821921308



##########
File path: cpp/src/arrow/dataset/dataset_writer_test.cc
##########
@@ -206,6 +206,14 @@ TEST_F(DatasetWriterTestFixture, Basic) {
   AssertCreatedData({{"testdir/chunk-0.arrow", 0, 100}});
 }
 
+TEST_F(DatasetWriterTestFixture, BasicFilePrefix) {

Review comment:
       Added the test.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650407



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {
   std::string path;
   for (; it != end; ++it) {
     if (it->empty()) continue;
 
     if (!path.empty()) {
-      path += kSep;
+      path += sep;
     }
     path += *it;
   }
   return path;
 }
 
 template <class StringRange>
-std::string JoinAbstractPath(const StringRange& range) {
-  return JoinAbstractPath(range.begin(), range.end());
+std::string JoinAbstractPath(const StringRange& range, const char& sep = kSep) {

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838225172



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -406,6 +445,18 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
            equal(field_ref("beta"), literal("hello"))));
 }
 
+TEST_F(TestPartitioning, FilenamePartitioningFormat) {
+  partitioning_ = std::make_shared<FilenamePartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+
+  written_schema_ = partitioning_->schema();
+
+  AssertFormat(and_(equal(field_ref("alpha"), literal(0)),
+                    equal(field_ref("beta"), literal("hello"))),
+               "", "0_hello_");

Review comment:
       This case is actually not handled anywhere as of now. As suggested by @westonpace, we can URI encode the segments here to handle this case, I will create a JIRA issue for a follow-up PR on this.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833977881



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -76,7 +78,11 @@ class ARROW_DS_EXPORT Partitioning {
   /// \brief Parse a path into a partition expression
   virtual Result<compute::Expression> Parse(const std::string& path) const = 0;
 
-  virtual Result<std::string> Format(const compute::Expression& expr) const = 0;
+  struct PartitionPathFormat {
+    std::string directory, prefix;
+  };
+
+  virtual Result<PartitionPathFormat> Format(const compute::Expression& expr) const = 0;

Review comment:
       This looks nice to me, perhaps you can add a short docstring too?

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +412,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return PartitionPathFormat{fs::internal::JoinAbstractPath(std::move(segments)), ""};
+}
+
+Result<Partitioning::PartitionPathFormat> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));

Review comment:
       This method is all the same as `DirectoryPartitioning::FormatValues` except for the last line. Can you factor out the `segments` computation?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       Well, you could add an intermediate class for it in the hierarchy :-)

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -561,6 +636,74 @@ class DirectoryPartitioningFactory : public KeyValuePartitioningFactory {
   std::vector<std::string> field_names_;
 };
 
+class FilenamePartitioningFactory : public KeyValuePartitioningFactory {
+ public:
+  FilenamePartitioningFactory(std::vector<std::string> field_names,
+                              PartitioningFactoryOptions options)
+      : KeyValuePartitioningFactory(options), field_names_(std::move(field_names)) {
+    Reset();
+    util::InitializeUTF8();
+  }
+
+  std::string type_name() const override { return "filename"; }
+
+  Result<std::shared_ptr<Schema>> Inspect(
+      const std::vector<std::string>& paths) override {
+    for (const auto& path : paths) {
+      size_t field_index = 0;

Review comment:
       Oh, I see this is essentially copy-pasted from `DirectoryPartitioningFactory::Inspect`. Is it possible to factor this out to avoid significant repetitions of code?

##########
File path: python/pyarrow/dataset.py
##########
@@ -117,6 +118,11 @@ def partitioning(schema=None, field_names=None, flavor=None,
       For example, given schema<year:int16, month:int8, day:int8>, a possible
       path would be "/year=2009/month=11/day=15" (but the field order does not
       need to match).
+    - "FilenamePartitioning": this scheme expects the partitions will have
+      filenames containing the field values separated by "_".
+      For example, given schema<year:int16, month:int8, day:int8>, a possible
+      partition filename "2009_11_part-0.parquet" would be parsed
+      to ("year"_ == 2009 and "month"_ == 11).

Review comment:
       Can you also update the description of the "flavor" parameter below?

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -321,6 +330,30 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
   std::string name_;
 };
 
+class ARROW_DS_EXPORT FilenamePartitioning : public KeyValuePartitioning {
+ public:
+  /// If a field in schema is of dictionary type, the corresponding element of
+  /// dictionaries must be contain the dictionary of values for that field.

Review comment:
       It seems you didn't commit the change, though?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1440,7 +1441,6 @@ cdef class DirectoryPartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
-

Review comment:
       I think we need to keep two spaces between class definitions as per [PEP 8](https://peps.python.org/pep-0008/#blank-lines).

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -329,6 +342,44 @@ Result<std::vector<KeyValuePartitioning::Key>> DirectoryPartitioning::ParseKeys(
   return keys;
 }
 
+FilenamePartitioning::FilenamePartitioning(std::shared_ptr<Schema> schema,
+                                           ArrayVector dictionaries,
+                                           KeyValuePartitioningOptions options)
+    : KeyValuePartitioning(std::move(schema), std::move(dictionaries), options) {
+  util::InitializeUTF8();
+}
+
+Result<std::vector<KeyValuePartitioning::Key>> FilenamePartitioning::ParseKeys(
+    const std::string& path) const {
+  std::vector<Key> keys;

Review comment:
       Tiny improvement:
   ```suggestion
     std::vector<Key> keys;
     keys.reserve(schema_.num_fields());
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821921733



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -55,6 +60,11 @@ Status ValidateAbstractPathParts(const std::vector<std::string>& parts);
 ARROW_EXPORT
 std::string ConcatAbstractPath(const std::string& base, const std::string& stem);
 
+// Append a non-empty stem to an abstract path with a filename prefix.
+ARROW_EXPORT
+std::string ConcatAbstractPath(const std::string& base, const std::string& prefix,

Review comment:
       Made a generic ConcatAbstractPaths()




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r831708034



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -61,6 +61,16 @@ Result<std::string> SafeUriUnescape(util::string_view encoded) {
   }
   return decoded;
 }
+
+std::string StripNonPrefix(const std::string& path) {
+  std::string v;
+  auto non_prefix_index = path.rfind(kFilenamePartitionSep);
+  if (path.length() > 0 && non_prefix_index != std::string::npos) {

Review comment:
       ```suggestion
     if (non_prefix_index != std::string::npos) {
   ```
   Minor nit: If `length() <= 0` then `non_prefix_index != std::string::npos` will always be false so you can do away with the first check.

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +411,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return std::make_pair(fs::internal::JoinAbstractPath(std::move(segments)), "");
+}
+
+Result<std::pair<std::string, std::string>> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));
+
+  for (int i = 0; i < schema_->num_fields(); ++i) {
+    if (values[i] != nullptr && values[i]->is_valid) {
+      segments[i] = values[i]->ToString();
+      continue;
+    }
+
+    if (auto illegal_index = NextValid(values, i)) {
+      // XXX maybe we should just ignore keys provided after the first absent one?
+      return Status::Invalid("No partition key for ", schema_->field(i)->name(),
+                             " but a key was provided subsequently for ",
+                             schema_->field(*illegal_index)->name(), ".");
+    }
+
+    // if all subsequent keys are absent we'll just print the available keys
+    break;

Review comment:
       That should be fine for this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650343



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';
 
 // Computations on abstract paths (not local paths with system-dependent behaviour).
 // Abstract paths are typically used in URIs.
 
 // Split an abstract path into its individual components.
 ARROW_EXPORT
-std::vector<std::string> SplitAbstractPath(const std::string& s);
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep);
+ARROW_EXPORT
+std::vector<std::string> SplitAbstractPath(const std::string& path);

Review comment:
       Redefined `SplitAbstractPath()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791453



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       Moved to base class.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r834092452



##########
File path: python/pyarrow/dataset.py
##########
@@ -117,6 +118,11 @@ def partitioning(schema=None, field_names=None, flavor=None,
       For example, given schema<year:int16, month:int8, day:int8>, a possible
       path would be "/year=2009/month=11/day=15" (but the field order does not
       need to match).
+    - "FilenamePartitioning": this scheme expects the partitions will have
+      filenames containing the field values separated by "_".
+      For example, given schema<year:int16, month:int8, day:int8>, a possible
+      partition filename "2009_11_part-0.parquet" would be parsed
+      to ("year"_ == 2009 and "month"_ == 11).

Review comment:
       Updated desc of `flavor` for FilenamePartitioning




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r836480624



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))

Review comment:
       The variable name doesn't match what's below:
   ```suggestion
       >>> partitioning = FilenamePartitioning(
       ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
   ```

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -406,6 +445,18 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
            equal(field_ref("beta"), literal("hello"))));
 }
 
+TEST_F(TestPartitioning, FilenamePartitioningFormat) {
+  partitioning_ = std::make_shared<FilenamePartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+
+  written_schema_ = partitioning_->schema();
+
+  AssertFormat(and_(equal(field_ref("alpha"), literal(0)),
+                    equal(field_ref("beta"), literal("hello"))),
+               "", "0_hello_");

Review comment:
       What happens if instead of `literal("hello")` I pass `literal("foo_bar")` or `literal("foo/bar")`?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))

Review comment:
       Actually, this gives an error:
   ```
   >>> partitioning.parse("2009_11")
   Traceback (most recent call last):
     Input In [11] in <cell line: 1>
       partitioning.parse("2009_11")
     File pyarrow/_dataset.pyx:1252 in pyarrow._dataset.Partitioning.parse
       return Expression.wrap(GetResultValue(result))
     File pyarrow/error.pxi:143 in pyarrow.lib.pyarrow_internal_check_status
       return check_status(status)
     File pyarrow/error.pxi:99 in pyarrow.lib.check_status
       raise ArrowInvalid(message)
   ArrowInvalid: error parsing '' as scalar of type int8
   ```

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).

Review comment:
       Judging by the exemple, probably this should be `"2009_11_"`?

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review comment:
       It would be nice IMHO.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r835553081



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       Added the cython implementation of `KeyValuePartitioning` to act as a base class for Directory, Hive and FilenamePartitioning which now contains the dictionaries method.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083121452


   Thanks a lot for your contribution, @sanjibansg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r829848259



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';
 
 // Computations on abstract paths (not local paths with system-dependent behaviour).
 // Abstract paths are typically used in URIs.
 
 // Split an abstract path into its individual components.
 ARROW_EXPORT
-std::vector<std::string> SplitAbstractPath(const std::string& s);
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep);
+ARROW_EXPORT
+std::vector<std::string> SplitAbstractPath(const std::string& path);

Review comment:
       Do we need two methods or could we get away with:
   
   ```
   ARROW_EXPORT
   std::vector<std::string> SplitAbstractPath(util::string_view path, char sep = kSep)
   ```
   
   Also, you don't need to pass primitives by const reference (it is generally harmless and happens sometimes with templates but if we know the type is a primitive then it's clearer do away with it).

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +411,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return std::make_pair(fs::internal::JoinAbstractPath(std::move(segments)), "");
+}
+
+Result<std::pair<std::string, std::string>> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));
+
+  for (int i = 0; i < schema_->num_fields(); ++i) {
+    if (values[i] != nullptr && values[i]->is_valid) {
+      segments[i] = values[i]->ToString();
+      continue;
+    }
+
+    if (auto illegal_index = NextValid(values, i)) {
+      // XXX maybe we should just ignore keys provided after the first absent one?
+      return Status::Invalid("No partition key for ", schema_->field(i)->name(),
+                             " but a key was provided subsequently for ",
+                             schema_->field(*illegal_index)->name(), ".");
+    }
+
+    // if all subsequent keys are absent we'll just print the available keys
+    break;

Review comment:
       Are there tests in place for this case?  In other words, do we have a test where the schema is something like `schema({field("a", int32()), field("b", int32())}` and we only pass in one field (or no fields)?

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {
   std::string path;
   for (; it != end; ++it) {
     if (it->empty()) continue;
 
     if (!path.empty()) {
-      path += kSep;
+      path += sep;
     }
     path += *it;
   }
   return path;
 }
 
 template <class StringRange>
-std::string JoinAbstractPath(const StringRange& range) {
-  return JoinAbstractPath(range.begin(), range.end());
+std::string JoinAbstractPath(const StringRange& range, const char& sep = kSep) {

Review comment:
       ```suggestion
   std::string JoinAbstractPath(const StringRange& range, char sep = kSep) {
   ```

##########
File path: cpp/src/arrow/filesystem/path_util.cc
##########
@@ -30,28 +30,23 @@ namespace internal {
 
 // XXX How does this encode Windows UNC paths?
 
-std::vector<std::string> SplitAbstractPath(const std::string& path) {
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep) {
   std::vector<std::string> parts;
-  auto v = util::string_view(path);
-  // Strip trailing slash
-  if (v.length() > 0 && v.back() == kSep) {
-    v = v.substr(0, v.length() - 1);
-  }

Review comment:
       We would still need to strip the trailing slash.  If you need to keep the trailing slash to make the partitioning code work then I would stick with two methods but have the second method (that doesn't strip the trailing separator) be in partition.cc.

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';

Review comment:
       We should move this constant into partition.h and rename it to something like `kFilenamePartitionSep`.  When I see `kFilenameSep` I think of `/` or `.`

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {

Review comment:
       ```suggestion
   std::string JoinAbstractPath(StringIt it, StringIt end, char sep = kSep) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r822306683



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -61,12 +61,12 @@ class TestPartitioning : public ::testing::Test {
     // formatted partition expressions are bound to the schema of the dataset being

Review comment:
       Added test for FilenamePartitioningFormat.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833819829



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -291,7 +293,23 @@ TEST_F(TestPartitioning, DiscoverSchema) {
   AssertInspect({"/0/1", "/hello"}, {Str("alpha"), Int("beta")});
 }
 
-TEST_F(TestPartitioning, DictionaryInference) {
+TEST_F(TestPartitioning, DiscoverSchemaFilename) {
+  factory_ = FilenamePartitioning::MakeFactory({"alpha", "beta"});
+
+  // type is int32 if possible
+  AssertInspect({"0_1_"}, {Int("alpha"), Int("beta")});
+
+  // extra segments are ignored
+  AssertInspect({"0_1_what_"}, {Int("alpha"), Int("beta")});
+
+  // fall back to string if any segment for field alpha is not parseable as int
+  AssertInspect({"0_1_", "hello_1_"}, {Str("alpha"), Int("beta")});
+
+  // If there are too many digits fall back to string
+  AssertInspect({"3760212050_1_"}, {Str("alpha"), Int("beta")});
+}

Review comment:
       Modified tests to check for invalid syntax and incomplete fields.
   I think we only have tests on URI encoding in HivePartitioning




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791797



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -321,6 +330,30 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
   std::string name_;
 };
 
+class ARROW_DS_EXPORT FilenamePartitioning : public KeyValuePartitioning {
+ public:
+  /// If a field in schema is of dictionary type, the corresponding element of
+  /// dictionaries must be contain the dictionary of values for that field.

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r835552303



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +412,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return PartitionPathFormat{fs::internal::JoinAbstractPath(std::move(segments)), ""};
+}
+
+Result<Partitioning::PartitionPathFormat> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));

Review comment:
       Yes, I thought the way to perform FilenamePartitioning should be very similar to DirectoryPartitioning. I have changed the implementation of `FormatValues`, `Inspect` and `ParseKeys` methods of these two classes to use functions present in the parent class KeyValuePartitioning which have the common operation so as to reduce code repetitions. Does this approach looks good?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/402| `c515a692` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833816568



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       I am not very sure of this. `dictionaries` is defined inside KeyValuePartitioning which I don't think has a cdef cppclass implementation.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled :warning: ursa-thinkcentre-m75q is offline.] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650296



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838333982



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1299,8 +1300,43 @@ cdef vector[shared_ptr[CArray]] _partitioning_dictionaries(
 
     return c_dictionaries
 
+cdef class KeyValuePartitioning(Partitioning):
 
-cdef class DirectoryPartitioning(Partitioning):
+    cdef:
+        CKeyValuePartitioning* keyvalue_partitioning
+
+    def __init__(self):
+        _forbid_instantiation(self.__class__)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.keyvalue_partitioning = <CKeyValuePartitioning*> sp.get()
+        self.wrapped = sp
+        self.partitioning = sp.get()
+
+    @property
+    def dictionaries(self):
+        """
+        The unique values for each partition field, if available.
+
+        Those values are only available if the Partitioning object was
+        created through dataset discovery from a PartitioningFactory, or
+        if the dictionaries were manually specified in the constructor.
+        If not available, this returns None.

Review comment:
       Corrected that, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650465



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +411,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return std::make_pair(fs::internal::JoinAbstractPath(std::move(segments)), "");
+}
+
+Result<std::pair<std::string, std::string>> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));
+
+  for (int i = 0; i < schema_->num_fields(); ++i) {
+    if (values[i] != nullptr && values[i]->is_valid) {
+      segments[i] = values[i]->ToString();
+      continue;
+    }
+
+    if (auto illegal_index = NextValid(values, i)) {
+      // XXX maybe we should just ignore keys provided after the first absent one?
+      return Status::Invalid("No partition key for ", schema_->field(i)->name(),
+                             " but a key was provided subsequently for ",
+                             schema_->field(*illegal_index)->name(), ".");
+    }
+
+    // if all subsequent keys are absent we'll just print the available keys
+    break;

Review comment:
       Added a test case with only one field, do we need any more changes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821919737



##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -468,13 +485,15 @@ class DatasetWriter::DatasetWriterImpl : public util::AsyncDestroyable {
   }
 
   Future<> DoWriteRecordBatch(std::shared_ptr<RecordBatch> batch,
-                              const std::string& directory) {
+                              const std::string& directory, const std::string& prefix) {
     ARROW_ASSIGN_OR_RAISE(
         auto dir_queue_itr,
         ::arrow::internal::GetOrInsertGenerated(
-            &directory_queues_, directory, [this, &batch](const std::string& dir) {
-              return DatasetWriterDirectoryQueue::Make(
-                  &task_group_, write_options_, &writer_state_, batch->schema(), dir);
+            &directory_queues_, directory + prefix,
+            [this, &batch, &directory, &prefix](const std::string& dir) {

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r822306230



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -561,6 +624,73 @@ class DirectoryPartitioningFactory : public KeyValuePartitioningFactory {
   std::vector<std::string> field_names_;
 };
 
+class FilenamePartitioningFactory : public KeyValuePartitioningFactory {
+ public:
+  FilenamePartitioningFactory(std::vector<std::string> field_names,
+                              PartitioningFactoryOptions options)
+      : KeyValuePartitioningFactory(options), field_names_(std::move(field_names)) {
+    Reset();
+    util::InitializeUTF8();
+  }
+
+  std::string type_name() const override { return "filename"; }
+
+  Result<std::shared_ptr<Schema>> Inspect(
+      const std::vector<std::string>& paths) override {
+    for (auto path : paths) {

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821921095



##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -342,11 +357,12 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
   Make(util::AsyncTaskGroup* task_group,
        const FileSystemDatasetWriteOptions& write_options,
        DatasetWriterState* writer_state, std::shared_ptr<Schema> schema,
-       std::string dir) {
+       std::string directory, std::string prefix) {
     auto dir_queue = util::MakeUniqueAsync<DatasetWriterDirectoryQueue>(
-        std::move(dir), std::move(schema), write_options, writer_state);
+        std::move(directory), std::move(prefix), std::move(schema), write_options,
+        writer_state);
     RETURN_NOT_OK(task_group->AddTask(dir_queue->on_closed()));
-    dir_queue->PrepareDirectory();
+    dir_queue->PrepareDirectory(prefix);

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838000345



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))

Review comment:
       Corrected the parsing string, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838160065



##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -406,6 +445,18 @@ TEST_F(TestPartitioning, HivePartitioningFormat) {
            equal(field_ref("beta"), literal("hello"))));
 }
 
+TEST_F(TestPartitioning, FilenamePartitioningFormat) {
+  partitioning_ = std::make_shared<FilenamePartitioning>(
+      schema({field("alpha", int32()), field("beta", utf8())}));
+
+  written_schema_ = partitioning_->schema();
+
+  AssertFormat(and_(equal(field_ref("alpha"), literal(0)),
+                    equal(field_ref("beta"), literal("hello"))),
+               "", "0_hello_");

Review comment:
       With this code, if we have `literal(foo/bar)`,  then the prefix string is expected to be `"0_foo/bar_"`,  with `literal(foo_bar)`, the prefix string is expected to be `"0_foo_bar_"`. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r822306113



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):

Review comment:
       Added test in Python for FilenamePartitioning




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1062040407


   > @westonpace asked me to review this as I opened the ticket originally based on a user-request. My main criteria for "does this do what the original user had in mind" is "can we **read** from a directory of files in which sections of the filenames are variables we want to analyse in our data" - and it looks like this both does that and enables us to write these files as well, which is really cool!
   > 
   > One thing I do want to check though - if I have a load of files called, e.g. `foo_bar_whatever_month_year.csv`, is there a way I can just have `month` and `year` as variables without the `foo`, `bar`, and `whatever` or would I have to read them in as variables and then just drop those columns later?
   
   Yes, we would have to read them in as variables and then drop those columns later. Currently, with this PR, the entire filename(excluding the last part for eg. `part-0.parquet` or `chunk-0.parquet`) is expected to have the partitioning values separated by `_`. In the future, we may need to add the functionality to allow custom name separator then just only using the underscore.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r829848259



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';
 
 // Computations on abstract paths (not local paths with system-dependent behaviour).
 // Abstract paths are typically used in URIs.
 
 // Split an abstract path into its individual components.
 ARROW_EXPORT
-std::vector<std::string> SplitAbstractPath(const std::string& s);
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep);
+ARROW_EXPORT
+std::vector<std::string> SplitAbstractPath(const std::string& path);

Review comment:
       Do we need two methods or could we get away with:
   
   ```
   ARROW_EXPORT
   std::vector<std::string> SplitAbstractPath(util::string_view path, char sep = kSep)
   ```
   
   Also, you don't need to pass primitives by const reference (it is generally harmless and happens sometimes with templates but if we know the type is a primitive then it's clearer do away with it).

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -361,7 +411,32 @@ Result<std::string> DirectoryPartitioning::FormatValues(
     break;
   }
 
-  return fs::internal::JoinAbstractPath(std::move(segments));
+  return std::make_pair(fs::internal::JoinAbstractPath(std::move(segments)), "");
+}
+
+Result<std::pair<std::string, std::string>> FilenamePartitioning::FormatValues(
+    const ScalarVector& values) const {
+  std::vector<std::string> segments(static_cast<size_t>(schema_->num_fields()));
+
+  for (int i = 0; i < schema_->num_fields(); ++i) {
+    if (values[i] != nullptr && values[i]->is_valid) {
+      segments[i] = values[i]->ToString();
+      continue;
+    }
+
+    if (auto illegal_index = NextValid(values, i)) {
+      // XXX maybe we should just ignore keys provided after the first absent one?
+      return Status::Invalid("No partition key for ", schema_->field(i)->name(),
+                             " but a key was provided subsequently for ",
+                             schema_->field(*illegal_index)->name(), ".");
+    }
+
+    // if all subsequent keys are absent we'll just print the available keys
+    break;

Review comment:
       Are there tests in place for this case?  In other words, do we have a test where the schema is something like `schema({field("a", int32()), field("b", int32())}` and we only pass in one field (or no fields)?

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {
   std::string path;
   for (; it != end; ++it) {
     if (it->empty()) continue;
 
     if (!path.empty()) {
-      path += kSep;
+      path += sep;
     }
     path += *it;
   }
   return path;
 }
 
 template <class StringRange>
-std::string JoinAbstractPath(const StringRange& range) {
-  return JoinAbstractPath(range.begin(), range.end());
+std::string JoinAbstractPath(const StringRange& range, const char& sep = kSep) {

Review comment:
       ```suggestion
   std::string JoinAbstractPath(const StringRange& range, char sep = kSep) {
   ```

##########
File path: cpp/src/arrow/filesystem/path_util.cc
##########
@@ -30,28 +30,23 @@ namespace internal {
 
 // XXX How does this encode Windows UNC paths?
 
-std::vector<std::string> SplitAbstractPath(const std::string& path) {
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep) {
   std::vector<std::string> parts;
-  auto v = util::string_view(path);
-  // Strip trailing slash
-  if (v.length() > 0 && v.back() == kSep) {
-    v = v.substr(0, v.length() - 1);
-  }

Review comment:
       We would still need to strip the trailing slash.  If you need to keep the trailing slash to make the partitioning code work then I would stick with two methods but have the second method (that doesn't strip the trailing separator) be in partition.cc.

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -30,13 +30,16 @@ namespace fs {
 namespace internal {
 
 constexpr char kSep = '/';
+constexpr char kFilenameSep = '_';

Review comment:
       We should move this constant into partition.h and rename it to something like `kFilenamePartitionSep`.  When I see `kFilenameSep` I think of `/` or `.`

##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {

Review comment:
       ```suggestion
   std::string JoinAbstractPath(StringIt it, StringIt end, char sep = kSep) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650524



##########
File path: cpp/src/arrow/filesystem/path_util.cc
##########
@@ -30,28 +30,23 @@ namespace internal {
 
 // XXX How does this encode Windows UNC paths?
 
-std::vector<std::string> SplitAbstractPath(const std::string& path) {
+std::vector<std::string> SplitAbstractPath(util::string_view path, const char& sep) {
   std::vector<std::string> parts;
-  auto v = util::string_view(path);
-  // Strip trailing slash
-  if (v.length() > 0 && v.back() == kSep) {
-    v = v.substr(0, v.length() - 1);
-  }

Review comment:
       Redefined the `SplitAbstractPath()`, I think that should work, do we need anymore changes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1062040407


   > @westonpace asked me to review this as I opened the ticket originally based on a user-request. My main criteria for "does this do what the original user had in mind" is "can we **read** from a directory of files in which sections of the filenames are variables we want to analyse in our data" - and it looks like this both does that and enables us to write these files as well, which is really cool!
   > 
   > One thing I do want to check though - if I have a load of files called, e.g. `foo_bar_whatever_month_year.csv`, is there a way I can just have `month` and `year` as variables without the `foo`, `bar`, and `whatever` or would I have to read them in as variables and then just drop those columns later?
   
   Yes, we would have to read them in as variables and then drop those columns later. Currently, with this PR, the entire filename(discarding the last part for eg. `part-0.parquet` or `chunk-0.parquet`) is expected to have the partitioning values separated by `_`. In the future, we may need to add the functionality to allow custom name separator then just only using the underscore.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833794071



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -561,6 +636,74 @@ class DirectoryPartitioningFactory : public KeyValuePartitioningFactory {
   std::vector<std::string> field_names_;
 };
 
+class FilenamePartitioningFactory : public KeyValuePartitioningFactory {
+ public:
+  FilenamePartitioningFactory(std::vector<std::string> field_names,
+                              PartitioningFactoryOptions options)
+      : KeyValuePartitioningFactory(options), field_names_(std::move(field_names)) {
+    Reset();
+    util::InitializeUTF8();
+  }
+
+  std::string type_name() const override { return "filename"; }
+
+  Result<std::shared_ptr<Schema>> Inspect(
+      const std::vector<std::string>& paths) override {
+    for (const auto& path : paths) {
+      size_t field_index = 0;

Review comment:
       We compare this with the size of `field_names_`, so defining this as a `size_t` should be appropriate?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/403| `33df4c78` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/402| `c515a692` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1055470522


   https://issues.apache.org/jira/browse/ARROW-14612


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821922420



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -113,6 +123,27 @@ std::string JoinAbstractPath(const StringRange& range) {
   return JoinAbstractPath(range.begin(), range.end());
 }
 
+// Join the components of filename partitions
+template <class StringIt>
+std::string JoinFilenamePartitions(StringIt it, StringIt end) {

Review comment:
       Used a generic Join function and concatenated the separator in the end of the path for `FilenamePartitioning` in `partition.cc`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821923116



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -38,6 +39,10 @@ constexpr char kSep = '/';
 ARROW_EXPORT
 std::vector<std::string> SplitAbstractPath(const std::string& s);
 
+// Split a filename into its individual partitions.
+ARROW_EXPORT
+std::vector<std::string> SplitFilename(const std::string& s);

Review comment:
       Made a generic `SplitAbstractPath()`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r830650377



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -95,22 +98,22 @@ std::vector<std::string> MinimalCreateDirSet(std::vector<std::string> dirs);
 
 // Join the components of an abstract path.
 template <class StringIt>
-std::string JoinAbstractPath(StringIt it, StringIt end) {
+std::string JoinAbstractPath(StringIt it, StringIt end, const char& sep = kSep) {

Review comment:
       Made the change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791048



##########
File path: python/pyarrow/dataset.py
##########
@@ -210,6 +211,26 @@ def partitioning(schema=None, field_names=None, flavor=None,
             raise ValueError(
                 "For the default directory flavor, need to specify "
                 "a Schema or a list of field names")
+    if flavor == "filename":

Review comment:
       Made the change.

##########
File path: python/pyarrow/dataset.py
##########
@@ -210,6 +211,26 @@ def partitioning(schema=None, field_names=None, flavor=None,
             raise ValueError(
                 "For the default directory flavor, need to specify "
                 "a Schema or a list of field names")
+    if flavor == "filename":
+        # default flavor
+        if schema is not None:
+            if field_names is not None:
+                raise ValueError(
+                    "Cannot specify both 'schema' and 'field_names'")
+            if dictionaries == 'infer':
+                return FilenamePartitioning.discover(schema=schema)
+            return FilenamePartitioning(schema, dictionaries)
+        elif field_names is not None:
+            if isinstance(field_names, list):
+                return FilenamePartitioning.discover(field_names)
+            else:
+                raise ValueError(
+                    "Expected list of field names, got {}".format(
+                        type(field_names)))
+        else:
+            raise ValueError(
+                "For the default filename flavor, need to specify "

Review comment:
       Corrected that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833791743



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -76,7 +78,8 @@ class ARROW_DS_EXPORT Partitioning {
   /// \brief Parse a path into a partition expression
   virtual Result<compute::Expression> Parse(const std::string& path) const = 0;
 
-  virtual Result<std::string> Format(const compute::Expression& expr) const = 0;
+  virtual Result<std::pair<std::string, std::string>> Format(
+      const compute::Expression& expr) const = 0;

Review comment:
       Changed function return signature to structure. I am not very sure of the name of the struct use, does that sound okay?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833796094



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review comment:
       I think we have some dictionaries check like this for Hive Partitioning,
   https://github.com/apache/arrow/blob/68e48cbf808806eacd7439c25fd116111d291b94/python/pyarrow/tests/test_dataset.py#L3280
   Should we have something similar for FilenamePartitioning as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r833794245



##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -61,6 +61,16 @@ Result<std::string> SafeUriUnescape(util::string_view encoded) {
   }
   return decoded;
 }
+
+std::string StripNonPrefix(const std::string& path) {

Review comment:
       Added a comment explaining the function.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #12530:
URL: https://github.com/apache/arrow/pull/12530


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r832240231



##########
File path: python/pyarrow/dataset.py
##########
@@ -210,6 +211,26 @@ def partitioning(schema=None, field_names=None, flavor=None,
             raise ValueError(
                 "For the default directory flavor, need to specify "
                 "a Schema or a list of field names")
+    if flavor == "filename":
+        # default flavor
+        if schema is not None:
+            if field_names is not None:
+                raise ValueError(
+                    "Cannot specify both 'schema' and 'field_names'")
+            if dictionaries == 'infer':
+                return FilenamePartitioning.discover(schema=schema)
+            return FilenamePartitioning(schema, dictionaries)
+        elif field_names is not None:
+            if isinstance(field_names, list):
+                return FilenamePartitioning.discover(field_names)
+            else:
+                raise ValueError(
+                    "Expected list of field names, got {}".format(
+                        type(field_names)))
+        else:
+            raise ValueError(
+                "For the default filename flavor, need to specify "

Review comment:
       Why "default"?

##########
File path: python/pyarrow/dataset.py
##########
@@ -210,6 +211,26 @@ def partitioning(schema=None, field_names=None, flavor=None,
             raise ValueError(
                 "For the default directory flavor, need to specify "
                 "a Schema or a list of field names")
+    if flavor == "filename":

Review comment:
       Can you update the docstring to mention this possibility?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(

Review comment:
       This should be `TypeError`.

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing
+            infer_dictionary=True.
+        schema : Schema, default None
+            Use this schema instead of inferring a schema from partition
+            values. Partition values will be validated against this schema
+            before accumulation into the Partitioning's dictionary.
+        segment_encoding : str, default "uri"
+            After splitting paths into segments, decode the segments. Valid
+            values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+        Returns
+        -------
+        PartitioningFactory
+            To be used in the FileSystemFactoryOptions.
+        """
+        cdef:
+            CPartitioningFactoryOptions c_options
+            vector[c_string] c_field_names
+
+        if max_partition_dictionary_size in {-1, None}:
+            infer_dictionary = True
+        elif max_partition_dictionary_size != 0:
+            raise NotImplementedError("max_partition_dictionary_size must be "
+                                      "0, -1, or None")
+
+        if infer_dictionary:
+            c_options.infer_dictionary = True
+
+        if schema:
+            c_options.schema = pyarrow_unwrap_schema(schema)
+            c_field_names = [tobytes(f.name) for f in schema]
+        elif not field_names:
+            raise ValueError(
+                "Neither field_names nor schema was passed; "
+                "cannot infer field_names")
+        else:
+            c_field_names = [tobytes(s) for s in field_names]
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+
+        return PartitioningFactory.wrap(
+            CFilenamePartitioning.MakeFactory(c_field_names, c_options))
+
+    @property
+    def dictionaries(self):

Review comment:
       Is it possible to avoid copy-pasting this, and instead move it to the base class?

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -321,6 +330,30 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
   std::string name_;
 };
 
+class ARROW_DS_EXPORT FilenamePartitioning : public KeyValuePartitioning {
+ public:
+  /// If a field in schema is of dictionary type, the corresponding element of
+  /// dictionaries must be contain the dictionary of values for that field.

Review comment:
       ```suggestion
     /// \brief Construct a FilenamePartitioning from its components.
     ///
     /// If a field in the `schema` is of dictionary type, the corresponding element of
     /// `dictionaries` must contain the dictionary of values for that field.
   ```

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -61,6 +61,16 @@ Result<std::string> SafeUriUnescape(util::string_view encoded) {
   }
   return decoded;
 }
+
+std::string StripNonPrefix(const std::string& path) {

Review comment:
       Add a short comment describing this?

##########
File path: cpp/src/arrow/dataset/partition.cc
##########
@@ -561,6 +636,74 @@ class DirectoryPartitioningFactory : public KeyValuePartitioningFactory {
   std::vector<std::string> field_names_;
 };
 
+class FilenamePartitioningFactory : public KeyValuePartitioningFactory {
+ public:
+  FilenamePartitioningFactory(std::vector<std::string> field_names,
+                              PartitioningFactoryOptions options)
+      : KeyValuePartitioningFactory(options), field_names_(std::move(field_names)) {
+    Reset();
+    util::InitializeUTF8();
+  }
+
+  std::string type_name() const override { return "filename"; }
+
+  Result<std::shared_ptr<Schema>> Inspect(
+      const std::vector<std::string>& paths) override {
+    for (const auto& path : paths) {
+      size_t field_index = 0;

Review comment:
       Why not make this an `int`?

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -291,7 +293,23 @@ TEST_F(TestPartitioning, DiscoverSchema) {
   AssertInspect({"/0/1", "/hello"}, {Str("alpha"), Int("beta")});
 }
 
-TEST_F(TestPartitioning, DictionaryInference) {
+TEST_F(TestPartitioning, DiscoverSchemaFilename) {
+  factory_ = FilenamePartitioning::MakeFactory({"alpha", "beta"});
+
+  // type is int32 if possible
+  AssertInspect({"0_1_"}, {Int("alpha"), Int("beta")});
+
+  // extra segments are ignored
+  AssertInspect({"0_1_what_"}, {Int("alpha"), Int("beta")});
+
+  // fall back to string if any segment for field alpha is not parseable as int
+  AssertInspect({"0_1_", "hello_1_"}, {Str("alpha"), Int("beta")});
+
+  // If there are too many digits fall back to string
+  AssertInspect({"3760212050_1_"}, {Str("alpha"), Int("beta")});
+}

Review comment:
       Also is URI-encoding tested somewhere?

##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review comment:
       Are dictionaries tested somewhere else?

##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1592,6 +1593,145 @@ cdef class HivePartitioning(Partitioning):
             res.append(pyarrow_wrap_array(arr))
         return res
 
+cdef class FilenamePartitioning(Partitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).
+
+    Parameters
+    ----------
+    schema : Schema
+        The schema that describes the partitions present in the file path.
+    dictionaries : dict[str, Array]
+        If the type of any field of `schema` is a dictionary type, the
+        corresponding entry of `dictionaries` must be an array containing
+        every value which may be taken by the corresponding column or an
+        error will be raised in parsing.
+    segment_encoding : str, default "uri"
+        After splitting paths into segments, decode the segments. Valid
+        values are "uri" (URI-decode segments) and "none" (leave as-is).
+
+    Returns
+    -------
+    FilenamePartitioning
+
+    Examples
+    --------
+    >>> from pyarrow.dataset import FilenamePartitioning
+    >>> partition = FilenamePartitioning(
+    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]))
+    >>> print(partitioning.parse("2009_11"))
+    ((year == 2009:int16) and (month == 11:int8))
+    """
+
+    cdef:
+        CFilenamePartitioning* filename_partitioning
+
+    def __init__(self, Schema schema not None, dictionaries=None,
+                 segment_encoding="uri"):
+        cdef:
+            shared_ptr[CFilenamePartitioning] c_partitioning
+            CKeyValuePartitioningOptions c_options
+
+        c_options.segment_encoding = _get_segment_encoding(segment_encoding)
+        c_partitioning = make_shared[CFilenamePartitioning](
+            pyarrow_unwrap_schema(schema),
+            _partitioning_dictionaries(schema, dictionaries),
+            c_options,
+        )
+        self.init(<shared_ptr[CPartitioning]> c_partitioning)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.filename_partitioning = <CFilenamePartitioning*> sp.get()
+
+    @staticmethod
+    def discover(field_names=None, infer_dictionary=False,
+                 max_partition_dictionary_size=0,
+                 schema=None, segment_encoding="uri"):
+        """
+        Discover a FilenamePartitioning.
+
+        Parameters
+        ----------
+        field_names : list of str
+            The names to associate with the values from the subdirectory names.
+            If schema is given, will be populated from the schema.
+        infer_dictionary : bool, default False
+            When inferring a schema for partition fields, yield dictionary
+            encoded types instead of plain types. This can be more efficient
+            when materializing virtual columns, and Expressions parsed by the
+            finished Partitioning will include dictionaries of all unique
+            inspected values for each field.
+        max_partition_dictionary_size : int, default 0
+            Synonymous with infer_dictionary for backwards compatibility with
+            1.0: setting this to -1 or None is equivalent to passing

Review comment:
       There is no reason to try to enforce backwards compatibility since this is a new API, is there?

##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -76,7 +78,8 @@ class ARROW_DS_EXPORT Partitioning {
   /// \brief Parse a path into a partition expression
   virtual Result<compute::Expression> Parse(const std::string& path) const = 0;
 
-  virtual Result<std::string> Format(const compute::Expression& expr) const = 0;
+  virtual Result<std::pair<std::string, std::string>> Format(
+      const compute::Expression& expr) const = 0;

Review comment:
       The new signature isn't immediately understandable (what does the pair represent?).
   Can you instead make it return a locally-defined struct e.g.:
   ```suggestion
     struct SomeStruct { std::string foo, bar };
     virtual Result<SomeStruct> Format(const compute::Expression& expr) const = 0;
   ```

##########
File path: cpp/src/arrow/dataset/partition_test.cc
##########
@@ -291,7 +293,23 @@ TEST_F(TestPartitioning, DiscoverSchema) {
   AssertInspect({"/0/1", "/hello"}, {Str("alpha"), Int("beta")});
 }
 
-TEST_F(TestPartitioning, DictionaryInference) {
+TEST_F(TestPartitioning, DiscoverSchemaFilename) {
+  factory_ = FilenamePartitioning::MakeFactory({"alpha", "beta"});
+
+  // type is int32 if possible
+  AssertInspect({"0_1_"}, {Int("alpha"), Int("beta")});
+
+  // extra segments are ignored
+  AssertInspect({"0_1_what_"}, {Int("alpha"), Int("beta")});
+
+  // fall back to string if any segment for field alpha is not parseable as int
+  AssertInspect({"0_1_", "hello_1_"}, {Str("alpha"), Int("beta")});
+
+  // If there are too many digits fall back to string
+  AssertInspect({"3760212050_1_"}, {Str("alpha"), Int("beta")});
+}

Review comment:
       Is it possible to test for errors here? (not enough fields for example, or bad syntax or bad utf-8 or...)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r834094321



##########
File path: cpp/src/arrow/dataset/partition.h
##########
@@ -321,6 +330,30 @@ class ARROW_DS_EXPORT FunctionPartitioning : public Partitioning {
   std::string name_;
 };
 
+class ARROW_DS_EXPORT FilenamePartitioning : public KeyValuePartitioning {
+ public:
+  /// If a field in schema is of dictionary type, the corresponding element of
+  /// dictionaries must be contain the dictionary of values for that field.

Review comment:
       Yes, sorry, missed that. Mistakenly, did the change in the file present in the build include dir. Committed a change now. Thanks for pointing that out!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838000160



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1571,26 +1586,114 @@ cdef class HivePartitioning(Partitioning):
         return PartitioningFactory.wrap(
             CHivePartitioning.MakeFactory(c_options))
 
-    @property
-    def dictionaries(self):
+
+cdef class FilenamePartitioning(KeyValuePartitioning):
+    """
+    A Partitioning based on a specified Schema.
+
+    The FilenamePartitioning expects one segment in the file name for each
+    field in the schema (all fields are required to be present) separated
+    by '_'. For example given schema<year:int16, month:int8> the name
+    "2009_11" would be parsed to ("year"_ == 2009 and "month"_ == 11).

Review comment:
       Yes, corrected that, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838161825



##########
File path: python/pyarrow/tests/test_dataset.py
##########
@@ -569,6 +570,22 @@ def test_partitioning():
         with pytest.raises(pa.ArrowInvalid):
             partitioning.parse(shouldfail)
 
+    partitioning = ds.FilenamePartitioning(
+        pa.schema([
+            pa.field('group', pa.int64()),
+            pa.field('key', pa.float64())
+        ])
+    )
+    assert partitioning.dictionaries is None

Review comment:
       Added the test. I noticed that the `dictionaries()` method might have an issue where it returned `None` object even if the Partitioning object has a dictionary field.
   
   Thanks to @westonpace for finding the cause and suggesting the fix.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r838321106



##########
File path: python/pyarrow/_dataset.pyx
##########
@@ -1299,8 +1300,43 @@ cdef vector[shared_ptr[CArray]] _partitioning_dictionaries(
 
     return c_dictionaries
 
+cdef class KeyValuePartitioning(Partitioning):
 
-cdef class DirectoryPartitioning(Partitioning):
+    cdef:
+        CKeyValuePartitioning* keyvalue_partitioning
+
+    def __init__(self):
+        _forbid_instantiation(self.__class__)
+
+    cdef init(self, const shared_ptr[CPartitioning]& sp):
+        Partitioning.init(self, sp)
+        self.keyvalue_partitioning = <CKeyValuePartitioning*> sp.get()
+        self.wrapped = sp
+        self.partitioning = sp.get()
+
+    @property
+    def dictionaries(self):
+        """
+        The unique values for each partition field, if available.
+
+        Those values are only available if the Partitioning object was
+        created through dataset discovery from a PartitioningFactory, or
+        if the dictionaries were manually specified in the constructor.
+        If not available, this returns None.

Review comment:
       Is it still the case that this returns None?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1083123344


   Benchmark runs are scheduled for baseline = c515a6924f6002a70f1046b1642629867bacabd2 and contender = 33df4c789e7dd6b4f590783ae69d4faa49e47e04. 33df4c789e7dd6b4f590783ae69d4faa49e47e04 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/242e107bbf34435ba8ea69836cf9c8c8...35cdc2052c39400db3fc51d171f8eb25/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/f25752a7c18748dc8586c46911737467...ce32f95df73b4759b668bef06018ca90/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/41c5cede84e54d76bd8055e6efd4f0eb...a50b2230f3c943f0b56759087897f2d6/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fc26f86fa55248f4a6e0ba1326f5915b...37e675332ed7490eb237055b1b9ee450/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/417| `33df4c78` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/403| `33df4c78` test-mac-arm>
   [Scheduled] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/403| `33df4c78` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/413| `33df4c78` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/416| `c515a692` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/402| `c515a692` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/402| `c515a692` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/412| `c515a692` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821919339



##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -326,9 +336,14 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
 
   uint64_t rows_written() const { return rows_written_; }
 
-  void PrepareDirectory() {
-    init_future_ = DeferNotOk(write_options_.filesystem->io_context().executor()->Submit(
-        [this]() { return write_options_.filesystem->CreateDir(directory_); }));
+  void PrepareDirectory(std::string& prefix) {
+    if (prefix.empty()) {

Review comment:
       Made the change. Yes, we may need to have custom partitioning having both prefix and directory in the future.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821922420



##########
File path: cpp/src/arrow/filesystem/path_util.h
##########
@@ -113,6 +123,27 @@ std::string JoinAbstractPath(const StringRange& range) {
   return JoinAbstractPath(range.begin(), range.end());
 }
 
+// Join the components of filename partitions
+template <class StringIt>
+std::string JoinFilenamePartitions(StringIt it, StringIt end) {

Review comment:
       Used a generic Join function and concatenated the separator in the end of the path for FilenamePartitioning in partition.cc




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sanjibansg commented on a change in pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

Posted by GitBox <gi...@apache.org>.
sanjibansg commented on a change in pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#discussion_r821917951



##########
File path: cpp/src/arrow/dataset/dataset_writer.cc
##########
@@ -249,6 +249,16 @@ class DatasetWriterDirectoryQueue : public util::AsyncDestroyable {
         write_options_(write_options),
         writer_state_(writer_state) {}
 
+  DatasetWriterDirectoryQueue(std::string directory, std::string prefix,

Review comment:
       Made the change. Removed the additional constructor.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org