You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/14 08:46:54 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #12701: ARROW-15892: [C++] Dataset APIs require s3:ListBucket Permissions

pitrou commented on code in PR #12701:
URL: https://github.com/apache/arrow/pull/12701#discussion_r850219507


##########
cpp/src/arrow/dataset/file_base.h:
##########
@@ -387,6 +387,8 @@ struct ARROW_DS_EXPORT FileSystemDatasetWriteOptions {
   /// Controls what happens if an output directory already exists.
   ExistingDataBehavior existing_data_behavior = ExistingDataBehavior::kError;
 
+  bool create_dir = true;

Review Comment:
   Can you add a docstring for this?



##########
python/pyarrow/dataset.py:
##########
@@ -753,7 +753,7 @@ def write_dataset(data, base_dir, basename_template=None, format=None,
                   max_partitions=None, max_open_files=None,
                   max_rows_per_file=None, min_rows_per_group=None,
                   max_rows_per_group=None, file_visitor=None,
-                  existing_data_behavior='error'):
+                  existing_data_behavior='error', create_dir=True):

Review Comment:
   Can you add the corresponding parameter description in the docstring below?



##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -476,7 +478,8 @@ class DatasetWriter::DatasetWriterImpl : public util::AsyncDestroyable {
   }
 
   Future<> DoWriteRecordBatch(std::shared_ptr<RecordBatch> batch,
-                              const std::string& directory, const std::string& prefix) {
+                              const std::string& directory, const std::string& prefix,
+                              const bool& create_dir) {

Review Comment:
   No need to pass primitive types by reference.
   ```suggestion
                                 bool create_dir) {
   ```



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -4286,6 +4287,55 @@ def test_write_dataset_s3(s3_example_simple):
     assert result.equals(table)
 
 
+_minio_put_only_policy = """{
+    "Version": "2012-10-17",
+    "Statement": [
+        {
+            "Effect": "Allow",
+            "Action": [
+                "s3:PutObject",
+                "s3:ListBucket",
+                "s3:GetObjectVersion"
+            ],
+            "Resource": [
+                "arn:aws:s3:::*"
+            ]
+        }
+    ]
+}"""
+
+
+@pytest.mark.parquet
+@pytest.mark.s3
+def test_write_dataset_s3_put_only(s3_server, limited_s3_user):

Review Comment:
   Can you add a comment explaining what this test is for, and referencing the JIRA issue number?



##########
cpp/src/arrow/dataset/dataset_writer.cc:
##########
@@ -476,7 +478,8 @@ class DatasetWriter::DatasetWriterImpl : public util::AsyncDestroyable {
   }
 
   Future<> DoWriteRecordBatch(std::shared_ptr<RecordBatch> batch,
-                              const std::string& directory, const std::string& prefix) {
+                              const std::string& directory, const std::string& prefix,
+                              const bool& create_dir) {

Review Comment:
   Also, I may be misunderstanding, but this parameter doesn't seem actually used?



##########
python/pyarrow/tests/conftest.py:
##########
@@ -311,3 +313,23 @@ def s3_server(s3_connection):
         finally:
             if proc is not None:
                 proc.kill()
+
+
+@pytest.fixture(scope='session')
+def limited_s3_user(request, s3_server):

Review Comment:
   Also, why is this a fixture if it's invoked as a regular function?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org