You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "HaochengLIU (via GitHub)" <gi...@apache.org> on 2023/04/08 04:55:06 UTC

[GitHub] [arrow] HaochengLIU opened a new pull request, #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

HaochengLIU opened a new pull request, #34984:
URL: https://github.com/apache/arrow/pull/34984

   ### Rationale for this change
   
   Existing basename_template will only use a monotonically increasing int as new filenames. when there is needs for custom filenames(left padding, hash-code), downstream users must rename the files in a post-processing step.
   
   ### What changes are included in this PR?
   
   A new functor is added to FileSystemDatasetWriteOptions which allows users to provide a custom name for dataset_writer.
   
   ### Are these changes tested?
   
   Yes. Unit tests are added for normal and ill-formed lambdas.
   
   ### Are there any user-facing changes?
   
   Yes. It allows users to customize output file names.
   * Closes: #34565
   
   \
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1500791414

   @westonpace Mind taking a look when you have time? Ty


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1500791411

   * Closes: #34565


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1504389057

   > A few small suggestions but this looks good. Thank you for the contribution!
   PR updated.
   Happy to pick up my long-last love of open-source and Arrow is a fantastic project. Ty for taking the time reviewing :) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1505413840

   SGTM. Thanks Raul


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1500791418

   :warning: GitHub issue #34565 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on a diff in pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on code in PR #34984:
URL: https://github.com/apache/arrow/pull/34984#discussion_r1163463791


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -404,6 +404,10 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
   /// {i} will be replaced by an auto incremented integer.
   std::string basename_template;
 
+  /// A functor which will be applied on basename_template to replace auto incremented
+  /// integer only if provided.

Review Comment:
   updated, your version is much easier to digest



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on a diff in pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on code in PR #34984:
URL: https://github.com/apache/arrow/pull/34984#discussion_r1163461925


##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -308,11 +309,24 @@ class DatasetWriterDirectoryQueue {
   }
 
   Result<std::string> GetNextFilename() {
-    auto basename = ::arrow::internal::Replace(write_options_.basename_template,
-                                               kIntegerToken, ToChars(file_counter_++));
+    std::optional<std::string> basename;
+    if (write_options_.basename_template_functor == nullptr) {
+      basename = ::arrow::internal::Replace(write_options_.basename_template,
+                                            kIntegerToken, ToChars(file_counter_++));
+    } else {
+      basename = ::arrow::internal::Replace(

Review Comment:
   I realize after using the suggested code, it allows users to pass in a basename_template_functor which applies on `basename_template` rather than `{i}`, it will make "ValidateBasenameTemplate(...)" obsolete.  I suggest we keep the existing code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1509324635

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6a8e3bdf7372444a840a6c8d3f646d2d...3fb7cb48bed14aa8a80b3a48b5274f53/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #34984:
URL: https://github.com/apache/arrow/pull/34984


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on a diff in pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on code in PR #34984:
URL: https://github.com/apache/arrow/pull/34984#discussion_r1163461925


##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -308,11 +309,24 @@ class DatasetWriterDirectoryQueue {
   }
 
   Result<std::string> GetNextFilename() {
-    auto basename = ::arrow::internal::Replace(write_options_.basename_template,
-                                               kIntegerToken, ToChars(file_counter_++));
+    std::optional<std::string> basename;
+    if (write_options_.basename_template_functor == nullptr) {
+      basename = ::arrow::internal::Replace(write_options_.basename_template,
+                                            kIntegerToken, ToChars(file_counter_++));
+    } else {
+      basename = ::arrow::internal::Replace(

Review Comment:
   good point, updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34984:
URL: https://github.com/apache/arrow/pull/34984#discussion_r1164404088


##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -308,11 +309,24 @@ class DatasetWriterDirectoryQueue {
   }
 
   Result<std::string> GetNextFilename() {
-    auto basename = ::arrow::internal::Replace(write_options_.basename_template,
-                                               kIntegerToken, ToChars(file_counter_++));
+    std::optional<std::string> basename;
+    if (write_options_.basename_template_functor == nullptr) {
+      basename = ::arrow::internal::Replace(write_options_.basename_template,
+                                            kIntegerToken, ToChars(file_counter_++));
+    } else {
+      basename = ::arrow::internal::Replace(

Review Comment:
   Ok.  I am happy keeping it how it is.  If a user wants to avoid basename_template they can just set it to `"{i}"`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1505347665

   It seems weston is on west coast. @raulcd Can you guide if we could wait until he is online? Ty


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1505330207

   > One quick question: Do you happen to know when Arrow 12.0.0 will be released? Would like to get this PR in so I could use at work
   
   We have a feature freeze policy for our Releases, see: https://arrow.apache.org/docs/dev/developers/release.html#creating-a-release-candidate
   The code freeze is scheduled to be performed today. I plan to do it in the next couple of hours.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1505335182

   Thanks for the info Raul. @westonpace If possible can I catch the last train :) ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1509322428

   Benchmark runs are scheduled for baseline = 1912fb9a6a652001e17bfa08552309196d76b8ff and contender = 88fd5237591c6eb521a8eea1bbe1eb1507b61c3d. 88fd5237591c6eb521a8eea1bbe1eb1507b61c3d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/857147a210bc40c19c228bac6050cc5a...a4f691a7777145308bd62db0b7d25bfe/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e8ba6328386e4fc3bd3f5e475084133a...311053a8d2fc4dd99d65fd6de434b19d/)
   [Finished :arrow_down:3.57% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/6a8e3bdf7372444a840a6c8d3f646d2d...3fb7cb48bed14aa8a80b3a48b5274f53/)
   [Finished :arrow_down:0.34% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/cbb72b870c514fbb9f9103f93986ef8a...94bdb64d476e47ca9774e524ff110ef7/)
   Buildkite builds:
   [Finished] [`88fd5237` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2698)
   [Failed] [`88fd5237` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2732)
   [Finished] [`88fd5237` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2696)
   [Finished] [`88fd5237` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2723)
   [Finished] [`1912fb9a` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2697)
   [Failed] [`1912fb9a` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2731)
   [Finished] [`1912fb9a` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2695)
   [Finished] [`1912fb9a` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2722)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] HaochengLIU commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "HaochengLIU (via GitHub)" <gi...@apache.org>.
HaochengLIU commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1504395168

   One quick question: Do you happen to know when Arrow 12.0.0 will be released?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34984:
URL: https://github.com/apache/arrow/pull/34984#discussion_r1163385209


##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -308,11 +309,24 @@ class DatasetWriterDirectoryQueue {
   }
 
   Result<std::string> GetNextFilename() {
-    auto basename = ::arrow::internal::Replace(write_options_.basename_template,
-                                               kIntegerToken, ToChars(file_counter_++));
+    std::optional<std::string> basename;
+    if (write_options_.basename_template_functor == nullptr) {
+      basename = ::arrow::internal::Replace(write_options_.basename_template,
+                                            kIntegerToken, ToChars(file_counter_++));
+    } else {
+      basename = ::arrow::internal::Replace(

Review Comment:
   If a user is providing a `basename_template_functor` do we need to use `basename_template` at all?  Can this just be...
   
   ```
   basename = write_options_.basename_template_functor(file_counter_++);
   ```
   
   I don't think it matters too much but I think it would be simpler to say "either you use the template OR you use a custom function".



##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -308,11 +309,24 @@ class DatasetWriterDirectoryQueue {
   }
 
   Result<std::string> GetNextFilename() {
-    auto basename = ::arrow::internal::Replace(write_options_.basename_template,
-                                               kIntegerToken, ToChars(file_counter_++));
+    std::optional<std::string> basename;
+    if (write_options_.basename_template_functor == nullptr) {
+      basename = ::arrow::internal::Replace(write_options_.basename_template,
+                                            kIntegerToken, ToChars(file_counter_++));
+    } else {
+      basename = ::arrow::internal::Replace(
+          write_options_.basename_template, kIntegerToken,
+          write_options_.basename_template_functor(file_counter_++));
+    }
     if (!basename) {
       return Status::Invalid("string interpolation of basename template failed");
     }
+    if (used_filenames_.find(*basename) != used_filenames_.end()) {
+      return Status::Invalid("filename ", *basename,
+                             " is already used before. Check basename_template_functor");
+    } else {
+      used_filenames_.insert(*basename);
+    }

Review Comment:
   Small nit, not a big deal, but you could combine the `find` and `insert` calls...
   
   ```
   if (!used_filenames_.insert(*basename).second) {
     return Status::Invalid(...);
   }
   ```



##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -404,6 +404,10 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
   /// {i} will be replaced by an auto incremented integer.
   std::string basename_template;
 
+  /// A functor which will be applied on basename_template to replace auto incremented
+  /// integer only if provided.

Review Comment:
   ```suggestion
     /// A functor which will be applied on an incremented counter.  The result will be inserted into
     /// the basename_template in place of {i}.
     ///
     /// This can be used, for example, to left-pad the file counter.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] raulcd commented on pull request #34984: GH-34565: [C++] Teach dataset_writer to accept custom filename functor

Posted by "raulcd (via GitHub)" <gi...@apache.org>.
raulcd commented on PR #34984:
URL: https://github.com/apache/arrow/pull/34984#issuecomment-1505376574

   > It seems weston is on west coast. @raulcd Can you guide if we could wait until he is online? Ty
   
   Sure, I can wait slightly longer. I'll try to ping him if there's no answer but in order to follow the current schedule for the release I'll branch the 12.0.0 release today (CEST timezone).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org