You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/08 13:53:00 UTC

[GitHub] [arrow] AlenkaF opened a new pull request, #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

AlenkaF opened a new pull request, #12838:
URL: https://github.com/apache/arrow/pull/12838

   This PR tries to pass `existing_data_behavior` to `write_to_dataset` in case of the new dataset implementation.
   
   Connected to https://github.com/apache/arrow/pull/12811.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r854964604


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2928,8 +2928,10 @@ def write_to_dataset(table, root_path, partition_cols=None,
                      partition_filename_cb=None, filesystem=None,
                      use_legacy_dataset=None, schema=None,
                      partitioning=None, basename_template=None,
-                     use_threads=None, file_visitor=None, **kwargs):
-    """Wrapper around parquet.write_dataset for writing a Table to
+                     use_threads=None, file_visitor=None,
+                     existing_data_behavior='overwrite_or_ignore',

Review Comment:
   Set the default to `overwrite_or_ignore` because of the specification of `basename_template`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r855485487


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2928,8 +2928,10 @@ def write_to_dataset(table, root_path, partition_cols=None,
                      partition_filename_cb=None, filesystem=None,
                      use_legacy_dataset=None, schema=None,
                      partitioning=None, basename_template=None,
-                     use_threads=None, file_visitor=None, **kwargs):
-    """Wrapper around parquet.write_dataset for writing a Table to
+                     use_threads=None, file_visitor=None,
+                     existing_data_behavior='overwrite_or_ignore',

Review Comment:
   Changed it to None.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12838:
URL: https://github.com/apache/arrow/pull/12838#issuecomment-1110589470

   Benchmark runs are scheduled for baseline = 916112cfa01c61c869ed951e47205db83c167e57 and contender = fed115fd9f13774b436176ed0f5a492c749359c8. fed115fd9f13774b436176ed0f5a492c749359c8 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c42df1cb06d84967b27c49f88564e709...19ed7e8059d94ab1bd835b58c4cc892b/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/e5fe8e27eb3c4ce2baeb59d050c11c30...effde027019c4f84bd048a7bc5862ff5/)
   [Finished :arrow_down:7.86% :arrow_up:2.14%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/dd0dae5a44ef4f7995f9173159f340e5...f17c9aa4fe1343a68ea89e73a877c3d5/)
   [Finished :arrow_down:0.8% :arrow_up:0.04%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/1274d247ffc34b64b2cac01191bb17bc...00c02e724b4b4db78d9f0c3bf292b452/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/598| `fed115fd` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/586| `fed115fd` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/585| `fed115fd` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/596| `fed115fd` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/597| `916112cf` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/585| `916112cf` test-mac-arm>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/584| `916112cf` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/595| `916112cf` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12838:
URL: https://github.com/apache/arrow/pull/12838#issuecomment-1092883870

   https://issues.apache.org/jira/browse/ARROW-15757


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r855285568


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:
+        warnings.warn(
+            _DEPR_MSG.format("existing_data_behavior", " Specify "
+                             "'use_legacy_dataset=False' while constructing "
+                             "the ParquetDataset, and then use the "
+                             "'basename_template' parameter instead. For "

Review Comment:
   Sorry I was careless, will correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r855156736


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:

Review Comment:
   ```suggestion
       if existing_data_behavior != "overwrite_or_ignore":
   ```
   
   ? (since that is the default value. Or otherwise we need to change the default value to None and turn that into "overwrite_or_ignore" in the use_legacy_dataset=False code path)



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:
+        warnings.warn(
+            _DEPR_MSG.format("existing_data_behavior", " Specify "
+                             "'use_legacy_dataset=False' while constructing "
+                             "the ParquetDataset, and then use the "
+                             "'basename_template' parameter instead. For "

Review Comment:
   this last part is not relevant here I think (copy paste from partition_filename_cb?)



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:
+        warnings.warn(
+            _DEPR_MSG.format("existing_data_behavior", " Specify "
+                             "'use_legacy_dataset=False' while constructing "
+                             "the ParquetDataset, and then use the "
+                             "'basename_template' parameter instead. For "

Review Comment:
   This could also be an error like the schema/partitioning/use_threads handling above?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r855283459


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:

Review Comment:
   Yes, sounds good



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12838:
URL: https://github.com/apache/arrow/pull/12838#discussion_r855282124


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3116,6 +3136,14 @@ def file_visitor(written_file):
                              "'basename_template' parameter instead. For "
                              "usage see `pyarrow.dataset.write_dataset`"),
             FutureWarning, stacklevel=2)
+    if existing_data_behavior is not None:

Review Comment:
   Oh, of course. Then I think it is better to set the default to None and set it in the new API separately to "overwrite_or_ignore". What do u think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12838:
URL: https://github.com/apache/arrow/pull/12838#issuecomment-1110589587

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/dd0dae5a44ef4f7995f9173159f340e5...f17c9aa4fe1343a68ea89e73a877c3d5/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kszucs closed pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior

Posted by GitBox <gi...@apache.org>.
kszucs closed pull request #12838: ARROW-15757: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
URL: https://github.com/apache/arrow/pull/12838


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org