Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/06 10:58:03 UTC

[GitHub] [arrow] AlenkaF opened a new pull request, #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

AlenkaF opened a new pull request, #12811:
URL: https://github.com/apache/arrow/pull/12811

   This PR tries to amend `pq.write_to_dataset` to:
   
   1. raise a deprecation warning for `use_legacy_dataset=True` and already switch the default to `False`.
   2. raise deprecation warnings for all keywords (when `use_legacy_dataset=True`) that won't be supported in the new implementation (see the sketch below).
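   A rough sketch of what (1) and (2) could look like; the helper name `_DEPR_MSG`, the message texts, and the exact keyword handling are illustrative assumptions, not the code in this PR:

   ```python
   import warnings

   _DEPR_MSG = "'{}' is deprecated and will be removed in a future version. {}"


   def write_to_dataset(table, root_path, use_legacy_dataset=False,
                        partition_filename_cb=None, **kwargs):
       if use_legacy_dataset:
           # (1) the legacy implementation itself is deprecated
           warnings.warn(
               _DEPR_MSG.format("use_legacy_dataset",
                                "Use the new Arrow Dataset API instead."),
               FutureWarning, stacklevel=2)
           # (2) keywords that the new implementation will not support
           if partition_filename_cb is not None:
               warnings.warn(
                   _DEPR_MSG.format("partition_filename_cb",
                                    "Specify 'basename_template' instead."),
                   FutureWarning, stacklevel=2)
           ...  # legacy code path
       else:
           ...  # new code path based on pyarrow.dataset.write_dataset
   ```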


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r846131904


##########
python/pyarrow/parquet.py:
##########
@@ -2233,11 +2236,48 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    format : FileFormat or str
+        The format in which to write the dataset. Currently supported:
+        "parquet", "ipc"/"arrow"/"feather", and "csv". If a FileSystemDataset
+        is being written and `format` is not specified, it defaults to the
+        same format as the specified FileSystemDataset. When writing a
+        Table or RecordBatch, this keyword is required.
+    file_options : pyarrow.dataset.FileWriteOptions, optional
+        FileFormat specific write options, created using the
+        ``FileFormat.make_write_options()`` function.
+    use_threads : bool, default True
+        Write files in parallel. If enabled, then maximum parallelism will be
+        used determined by the number of available CPU cores.
+    schema : Schema, optional
+    partitioning : Partitioning or list[str], optional
+        The partitioning scheme specified with the ``partitioning()``
+        function or a list of field names. When providing a list of
+        field names, you can use ``partitioning_flavor`` to drive which
+        partitioning type should be used.
+    file_visitor : function
+        If set, this function will be called with a WrittenFile instance
+        for each file created during the call.  This object will have both
+        a path attribute and a metadata attribute.
+
+        The path attribute will be a string containing the path to
+        the created file.
+
+        The metadata attribute will be the parquet metadata of the file.
+        This metadata will have the file path attribute set and can be used
+        to build a _metadata file.  The metadata attribute will be None if
+        the format is not parquet.
+
+        Example visitor which simply collects the filenames created::
+
+            visited_paths = []
+
+            def file_visitor(written_file):
+                visited_paths.append(written_file.path)

Review Comment:
   @jorisvandenbossche should other keywords also be passed? (partitioning_flavor, max_*)





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845864090


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1531,8 +1531,8 @@ def test_dataset_read_dictionary(tempdir, use_legacy_dataset):
     t1 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     t2 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     # TODO pass use_legacy_dataset (need to fix unique names)
-    pq.write_to_dataset(t1, root_path=str(path))
-    pq.write_to_dataset(t2, root_path=str(path))
+    pq.write_to_dataset(t1, root_path=str(path), use_legacy_dataset=True)
+    pq.write_to_dataset(t2, root_path=str(path), use_legacy_dataset=True)

Review Comment:
   Not sure why I didn't get this yesterday 🤦‍♀️ 





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r853351929


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -3011,7 +3012,8 @@ def _create_parquet_dataset_simple(root_path):
     for i in range(4):
         table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
         pq.write_to_dataset(
-            table, str(root_path), metadata_collector=metadata_collector
+            table, str(root_path), use_legacy_dataset=True,
+            metadata_collector=metadata_collector

Review Comment:
   Not sure right now, changing to `use_legacy_dataset=False` seems to be working 👍 





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845863555


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   Of course, makes sense.
   Yes, I think we should add what you are suggesting to `pq.write_to_dataset`. Will try it and commit for review!





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845909851


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   Hm, thinking aloud: `existing_data_behavior` controls how the dataset handles data that already exists. If I implement a unique way of writing parquet files when using the new API in `write_to_dataset`, I will also have to set `existing_data_behavior` to `overwrite_or_ignore`. That will then cause trouble when exposing the same parameter in https://issues.apache.org/jira/browse/ARROW-15757.

   I could check whether the parameter is specified or not, but I am not sure if there will be additional complications.
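   For reference, a minimal illustration of the trade-off (assuming a pyarrow version where `ds.write_dataset` supports `existing_data_behavior`):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"a": [1, 2, 3]})

   # 'overwrite_or_ignore' keeps existing files and only replaces files with
   # colliding names, mimicking the legacy appending behaviour; the
   # write_dataset default ('error') refuses to write into a non-empty
   # directory instead.
   ds.write_dataset(table, "dataset_root", format="parquet",
                    existing_data_behavior="overwrite_or_ignore")
   ```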





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851129892


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   Thinking more about this: if we switch the default as we are now doing, I think we _should_ try to preserve the current behaviour of overwriting/adding data (otherwise it would be quite a breaking change for people using `pq.write_to_dataset` this way). We can still try to deprecate this and later move towards the same default as the `dataset.write_dataset` implementation.
   But that can be done in a later stage with a proper deprecation warning (eg detect if the directory already exists and is not empty, and in that case indicate this will start raising an error in the future).
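   A minimal sketch of the check suggested here (a hypothetical helper, not part of this PR):

   ```python
   import os
   import warnings


   def _warn_if_existing_data(root_path):
       # Keep the overwriting/appending behaviour for now, but warn that
       # writing into a non-empty directory may start raising in the future.
       if os.path.isdir(root_path) and os.listdir(root_path):
           warnings.warn(
               "Writing into a non-empty directory currently adds to or "
               "overwrites existing files; this may raise an error in a "
               "future version.",
               FutureWarning, stacklevel=3)
   ```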





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851226263


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   Follow up on this thread: I kept the new implementation as the default and added the use of `basename_template` to mimic the legacy behaviour.
   
   At a later stage I think it will be important to rearrange the keywords for `write_to_dataset` to first list the keywords connected to the new API.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845291213


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   So the reason this is otherwise failing is that with the new dataset implementation we use a fixed file name (`part-0.parquet`), whereas before we were using a uuid filename. Therefore, with the non-legacy writer, each iteration of the loop overwrites the same file.

   To what extent could users also bump into this? We could in theory use the `basename_template` argument to replicate this "uuid" filename behaviour inside `pq.write_to_dataset`.
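   Roughly, the idea would look like the following (an illustration, not the exact change in this PR):

   ```python
   import uuid

   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"f0": ["a", "b", "c"]})

   # A uuid-based basename_template gives every call unique file names, so
   # repeated writes append new files instead of overwriting "part-0.parquet".
   for _ in range(2):
       ds.write_dataset(
           table, "dataset_root", format="parquet",
           basename_template=uuid.uuid4().hex + "-{i}.parquet",
           existing_data_behavior="overwrite_or_ignore")
   ```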







[GitHub] [arrow] jorisvandenbossche closed pull request #12811: ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #12811: ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset
URL: https://github.com/apache/arrow/pull/12811




[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851112317


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,48 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    format : FileFormat or str

Review Comment:
   Oh, you are absolutely right - this doesn't make any sense (it was added in an attempt to expose all `write_dataset` keywords). Will remove it.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851234914


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,47 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    file_options : pyarrow.dataset.FileWriteOptions, optional

Review Comment:
   I think `file_options` is similar to `format`? (only used to have it raise for the legacy version, and so can also be removed)



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,47 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    file_options : pyarrow.dataset.FileWriteOptions, optional
+        FileFormat specific write options, created using the
+        ``FileFormat.make_write_options()`` function.
+    use_threads : bool, default True
+        Write files in parallel. If enabled, then maximum parallelism will be
+        used determined by the number of available CPU cores.
+    schema : Schema, optional
+    partitioning : Partitioning or list[str], optional
+        The partitioning scheme specified with the ``partitioning()``
+        function or a list of field names. When providing a list of
+        field names, you can use ``partitioning_flavor`` to drive which
+        partitioning type should be used.
+    basename_template : str, optional
+        A template string used to generate basenames of written data files.
+        The token '{i}' will be replaced with an automatically incremented
+        integer. If not specified, it defaults to
+        "part-{i}." + format.default_extname

Review Comment:
   ```suggestion
           integer. If not specified, it defaults to "guid-{i}.parquet"
   ```



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3042,13 +3079,50 @@ def file_visitor(written_file):
             part_schema = table.select(partition_cols).schema
             partitioning = ds.partitioning(part_schema, flavor="hive")
 
+        if basename_template is None:
+            basename_template = guid() + '{i}.parquet'

Review Comment:
   ```suggestion
               basename_template = guid() + '-{i}.parquet'
   ```



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1531,8 +1531,8 @@ def test_dataset_read_dictionary(tempdir, use_legacy_dataset):
     t1 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     t2 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     # TODO pass use_legacy_dataset (need to fix unique names)
-    pq.write_to_dataset(t1, root_path=str(path))
-    pq.write_to_dataset(t2, root_path=str(path))
+    pq.write_to_dataset(t1, root_path=str(path), use_legacy_dataset=True)
+    pq.write_to_dataset(t2, root_path=str(path), use_legacy_dataset=True)

Review Comment:
   Same here, this test can now be updated?



##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   With the latest changes, it should be possible to now change the hardcoded `use_legacy_dataset=True` to `use_legacy_dataset=use_legacy_dataset`? 





[GitHub] [arrow] github-actions[bot] commented on pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1090179382

   https://issues.apache.org/jira/browse/ARROW-16122




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851085571


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3082,6 +3155,11 @@ def file_visitor(written_file):
             _mkdir_if_not_exists(fs, '/'.join([root_path, subdir]))
             if partition_filename_cb:
                 outfile = partition_filename_cb(keys)
+
+                # raise for unsupported keywords
+                warnings.warn(
+                    _DEPR_MSG.format("partition_filename_cb", msg3),
+                    FutureWarning, stacklevel=2)

Review Comment:
   Since this is in a for loop, we would raise the warning multiple times. I think you can check for the keyword and raise the warning once before the actual writing code (eg where you now define the `msg3`)
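   For example, something along these lines (a self-contained sketch; the actual message text is up to the PR):

   ```python
   import warnings


   def _check_partition_filename_cb(partition_filename_cb):
       # Emit the deprecation warning once, before the writing loop,
       # instead of on every iteration.
       if partition_filename_cb is not None:
           warnings.warn(
               "'partition_filename_cb' is deprecated and will be removed in "
               "a future version; use 'basename_template' with the new "
               "dataset implementation instead.",
               FutureWarning, stacklevel=3)
   ```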



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,48 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    format : FileFormat or str

Review Comment:
   I would leave out `format`, since this should always be "parquet" (so we can hardcode that value). We don't want users to be able to use `pyarrow.parquet.write_to_dataset` to write e.g. Feather files.



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,48 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    format : FileFormat or str

Review Comment:
   And it is actually still hardcoded in the code below, so passing this keyword currently wouldn't have any effect. Or is it only added to raise a warning in the legacy case?





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r853271087


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2964,43 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    use_threads : bool, default True
+        Write files in parallel. If enabled, then maximum parallelism will be
+        used determined by the number of available CPU cores.
+    schema : Schema, optional
+    partitioning : Partitioning or list[str], optional
+        The partitioning scheme specified with the ``partitioning()``

Review Comment:
   ```suggestion
           The partitioning scheme specified with the ``pyarrow.dataset.partitioning()``
   ```



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -3011,7 +3012,8 @@ def _create_parquet_dataset_simple(root_path):
     for i in range(4):
         table = pa.table({'f1': [i] * 10, 'f2': np.random.randn(10)})
         pq.write_to_dataset(
-            table, str(root_path), metadata_collector=metadata_collector
+            table, str(root_path), use_legacy_dataset=True,
+            metadata_collector=metadata_collector

Review Comment:
   Was there a reason for specifying True here specifically? (`metadata_collector` should be supported with `use_legacy_dataset=False` as well)
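   For example, this pattern should work on the non-legacy path as well (assuming a pyarrow version where `metadata_collector` is forwarded there):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"f1": [0, 1, 2, 3], "f2": [0.1, 0.2, 0.3, 0.4]})
   metadata_collector = []

   # Collect per-file metadata while writing with the new implementation,
   # then build a _metadata sidecar file from it.
   pq.write_to_dataset(table, "dataset_root", use_legacy_dataset=False,
                       metadata_collector=metadata_collector)
   pq.write_metadata(table.schema, "dataset_root/_metadata",
                     metadata_collector=metadata_collector)
   ```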



##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
     path = str(tempdir / "test_parquet_dataset")
 
     # write_to_dataset currently requires pandas
-    pq.write_to_dataset(table, path,
+    pq.write_to_dataset(table, path, use_legacy_dataset=True,
                         partition_cols=["part"], chunk_size=chunk_size)

Review Comment:
   Maybe we can open a follow-up JIRA for this one?





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845311683


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
     path = str(tempdir / "test_parquet_dataset")
 
     # write_to_dataset currently requires pandas
-    pq.write_to_dataset(table, path,
+    pq.write_to_dataset(table, path, use_legacy_dataset=True,
                         partition_cols=["part"], chunk_size=chunk_size)

Review Comment:
   The dataset API now has a `max_rows_per_group`, but that doesn't necessarily directly relate to Parquet row groups? 
   
   It's more generic, about how many rows are written in one go, but effectively it is therefore also a maximum Parquet row group size? (since those need to be written in one go)






[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845292683


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1531,8 +1531,8 @@ def test_dataset_read_dictionary(tempdir, use_legacy_dataset):
     t1 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     t2 = pa.table([[util.rands(10) for i in range(5)] * 10], names=['f0'])
     # TODO pass use_legacy_dataset (need to fix unique names)
-    pq.write_to_dataset(t1, root_path=str(path))
-    pq.write_to_dataset(t2, root_path=str(path))
+    pq.write_to_dataset(t1, root_path=str(path), use_legacy_dataset=True)
+    pq.write_to_dataset(t2, root_path=str(path), use_legacy_dataset=True)

Review Comment:
   Same here as my comment above (https://github.com/apache/arrow/pull/12811/files#r845291213), as the TODO comment also indicates: calling this twice with `use_legacy_dataset=False` would overwrite the same file.





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851133972


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   Yes, good point.
   I will step back and redo the issue in a way that the default stays `True` and it raises a deprecation warning in this case.
   
   For the later stage, is it worth creating a JIRA already?





[GitHub] [arrow] jorisvandenbossche commented on pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1102864577

   We should also make sure to suppress the warnings when running the tests (when we are explicitly testing `use_legacy_dataset=True` and thus want to ignore the warning), but that could maybe be done as a follow-up as well.
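   For example, tests that deliberately exercise the legacy path could filter the warning (filtering on the category here, since the exact message is defined by the PR):

   ```python
   import pytest

   import pyarrow as pa
   import pyarrow.parquet as pq


   @pytest.mark.filterwarnings("ignore::FutureWarning")
   def test_write_to_dataset_legacy(tmp_path):
       # Explicitly testing the deprecated path, so the FutureWarning is
       # expected and silenced.
       table = pa.table({"a": [1, 2, 3]})
       pq.write_to_dataset(table, str(tmp_path), use_legacy_dataset=True)
   ```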




[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r853350416


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
     path = str(tempdir / "test_parquet_dataset")
 
     # write_to_dataset currently requires pandas
-    pq.write_to_dataset(table, path,
+    pq.write_to_dataset(table, path, use_legacy_dataset=True,
                         partition_cols=["part"], chunk_size=chunk_size)

Review Comment:
   Created https://issues.apache.org/jira/browse/ARROW-16240





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r845304218


##########
python/pyarrow/tests/test_dataset.py:
##########
@@ -937,7 +937,7 @@ def _create_dataset_for_fragments(tempdir, chunk_size=None, filesystem=None):
     path = str(tempdir / "test_parquet_dataset")
 
     # write_to_dataset currently requires pandas
-    pq.write_to_dataset(table, path,
+    pq.write_to_dataset(table, path, use_legacy_dataset=True,
                         partition_cols=["part"], chunk_size=chunk_size)

Review Comment:
   So here this fails when using the new dataset implementation, because `dataset.write_dataset(..)` doesn't support the parquet `row_group_size` keyword (to which `chunk_size` gets translated); `ParquetFileWriteOptions` doesn't support this keyword.
   
   On the parquet side, this is also the only keyword that is not passed to the `ParquetWriter` init (and thus to parquet's `WriterProperties` or `ArrowWriterProperties`), but to the actual `write_table` call. In C++ this can be seen at
   
   https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71
   
   cc @westonpace do you remember if this has been discussed before how the `row_group_size`/`chunk_size` setting from Parquet fits into the dataset API?
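   For comparison, the closest knob in the dataset API today (not a direct replacement for `chunk_size`/`row_group_size`, as discussed above):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"f1": list(range(100))})

   # max_rows_per_group caps how many rows are written in one go, which in
   # practice also caps the Parquet row group size.
   ds.write_dataset(table, "dataset_root", format="parquet",
                    max_rows_per_group=10)
   ```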





[GitHub] [arrow] AlenkaF commented on pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1100070916

   @jorisvandenbossche the PR should be ready now. I exposed `basename_template` and it now mimics the legacy behaviour.




[GitHub] [arrow] ursabot commented on pull request #12811: ARROW-16122: [Python] Change use_legacy_dataset default and deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1107428720

   Benchmark runs are scheduled for baseline = 4f08a9b6d0f1249f3f3246167e18360da52a6f0d and contender = 1763622b6f60e974e495b3349cc2f4b7caaf1951. 1763622b6f60e974e495b3349cc2f4b7caaf1951 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c2985d33043949998de77c5a2b94a057...dd19e80e53df418eacc6c1e9e94360ab/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/065cf6a190fd497988f302078c3cefee...152f7dc5024a44e8801f1fefef725a7e/)
   [Failed :arrow_down:0.38% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/790e57bcf69749e8ab9e1a6e83ce6dd1...2816745be03f4b59b109515ff85b8735/)
   [Finished :arrow_down:0.55% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/20ba331f0dd24ab8996f30f01335d3ef...73f4539b4a5f4dd0a83cc70b26dce91c/)
   Buildkite builds:
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/565| `1763622b` ec2-t3-xlarge-us-east-2>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/553| `1763622b` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/551| `1763622b` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/563| `1763622b` ursa-thinkcentre-m75q>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/564| `4f08a9b6` ec2-t3-xlarge-us-east-2>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/552| `4f08a9b6` test-mac-arm>
   [Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/550| `4f08a9b6` ursa-i9-9960x>
   [Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/562| `4f08a9b6` ursa-thinkcentre-m75q>
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] AlenkaF commented on pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1100094017

   @jorisvandenbossche I applied the suggestions 👍 




[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r844824953


##########
python/pyarrow/parquet.py:
##########
@@ -2236,26 +2236,18 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
     **kwargs : dict,
         Additional kwargs for write_table function. See docstring for
         `write_table` or `ParquetWriter` for more information.
         Using `metadata_collector` in kwargs allows one to collect the
         file metadata instances of dataset pieces. The file paths in the
         ColumnChunkMetaData will be set relative to `root_path`.
     """
-    if use_legacy_dataset is None:
-        # if a new filesystem is passed -> default to new implementation
-        if isinstance(filesystem, FileSystem):

Review Comment:
   I was thinking that we could still let it default to None, and only interpret that as True _if_ `partition_filename_cb` is specified.

   Because currently, if you were using `partition_filename_cb` (but didn't specify `use_legacy_dataset=True`, since that was not needed until now), this would immediately start raising an error, while we could maybe start by having it raise a deprecation warning.
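   In other words, something like this (a hypothetical helper, for illustration only):

   ```python
   def _resolve_use_legacy_dataset(use_legacy_dataset, partition_filename_cb):
       # Default of None: only fall back to the (deprecated) legacy path when
       # a legacy-only keyword such as partition_filename_cb is actually
       # passed; otherwise use the new dataset implementation.
       if use_legacy_dataset is None:
           return partition_filename_cb is not None
       return use_legacy_dataset
   ```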



##########
python/pyarrow/parquet.py:
##########
@@ -2341,6 +2340,11 @@ def file_visitor(written_file):
     else:
         if partition_filename_cb:
             outfile = partition_filename_cb(None)
+
+            # raise for unsupported keywords
+            warnings.warn(
+                _DEPR_MSG.format("partition_filename_cb", ""),
+                FutureWarning, stacklevel=2)

Review Comment:
   I would maybe try to provide a customized deprecation message in this case, so we can give some more detail on how to replace `partition_filename_cb` in the new way (with `basename_template`).
   
   Also, the `partition_filename_cb` seems to be used above as well (in the `if` branch), so we would want to deprecate it in that case as well?





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851135533


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   In this case, where the default will still be True, does exposing the `write_dataset` keywords still make sense, or should we wait until the default changes?





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851128672


##########
python/pyarrow/tests/parquet/test_dataset.py:
##########
@@ -1290,7 +1290,7 @@ def _test_write_to_dataset_no_partitions(base_path,
     # Without partitions, append files to root_path
     n = 5
     for i in range(n):
-        pq.write_to_dataset(output_table, base_path,
+        pq.write_to_dataset(output_table, base_path, use_legacy_dataset=True,

Review Comment:
   So what I don't understand here is that the `dataset.write_dataset` function by default raises an error if there is existing data? But then why doesn't the above test fail with that error? (instead of failing in the test because we now overwrote the files)





[GitHub] [arrow] AlenkaF commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r851251586


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -2962,11 +2965,47 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
+    file_options : pyarrow.dataset.FileWriteOptions, optional

Review Comment:
   Yes, thanks again!





[GitHub] [arrow] AlenkaF commented on pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #12811:
URL: https://github.com/apache/arrow/pull/12811#issuecomment-1102945814

   Created https://issues.apache.org/jira/browse/ARROW-16241 for the follow-up on the warnings when explicitly using `use_legacy_dataset=True` in the tests.

