Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/07 07:51:43 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12811: ARROW-16122: [Python] Deprecate no-longer supported keywords in parquet.write_to_dataset

jorisvandenbossche commented on code in PR #12811:
URL: https://github.com/apache/arrow/pull/12811#discussion_r844824953


##########
python/pyarrow/parquet.py:
##########
@@ -2236,26 +2236,18 @@ def write_to_dataset(table, root_path, partition_cols=None,
         and allow you to override the partition filename. If nothing is
         passed, the filename will consist of a uuid.
     use_legacy_dataset : bool
-        Default is True unless a ``pyarrow.fs`` filesystem is passed.
-        Set to False to enable the new code path (experimental, using the
-        new Arrow Dataset API). This is more efficient when using partition
-        columns, but does not (yet) support `partition_filename_cb` and
-        `metadata_collector` keywords.
+        Default is False. Set to True to use the legacy behaviour
+        (this option is deprecated, and the legacy implementation will be
+        removed in a future version). The legacy implementation still
+        supports `partition_filename_cb` and `metadata_collector` keywords
+        but is less efficient when using partition columns.
     **kwargs : dict,
         Additional kwargs for write_table function. See docstring for
         `write_table` or `ParquetWriter` for more information.
         Using `metadata_collector` in kwargs allows one to collect the
         file metadata instances of dataset pieces. The file paths in the
         ColumnChunkMetaData will be set relative to `root_path`.
     """
-    if use_legacy_dataset is None:
-        # if a new filesystem is passed -> default to new implementation
-        if isinstance(filesystem, FileSystem):

Review Comment:
   I was thinking that we could still let it default to None, and only interpret that as True _if_ `partition_filename_cb` is specified.
   
   Because currently, if you were using `partition_filename_cb` (but didn't specify `use_legacy_dataset=True`, since that was not needed until now), the code will immediately start raising an error, whereas we could instead start by raising a deprecation warning.
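   A minimal sketch of the resolution logic suggested here (the helper name `_resolve_use_legacy_dataset` and the `_DEPR_MSG` wording are hypothetical, illustrating the idea rather than the code merged in the PR):

   ```python
   import warnings

   # Hypothetical message template, mirroring pyarrow's internal _DEPR_MSG.
   _DEPR_MSG = (
       "'{}' is deprecated and will be removed in a future version.{}"
   )

   def _resolve_use_legacy_dataset(use_legacy_dataset, partition_filename_cb):
       """Interpret use_legacy_dataset=None as True only when the
       legacy-only partition_filename_cb keyword was passed, emitting a
       deprecation warning instead of immediately raising an error."""
       if use_legacy_dataset is None:
           if partition_filename_cb is not None:
               warnings.warn(
                   _DEPR_MSG.format("partition_filename_cb", ""),
                   FutureWarning, stacklevel=3)
               return True
           # No legacy-only keywords -> default to the new implementation.
           return False
       return use_legacy_dataset
   ```

   With this, existing callers who pass only `partition_filename_cb` keep working (on the legacy path) and see a `FutureWarning`, while everyone else gets the new dataset implementation by default.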



##########
python/pyarrow/parquet.py:
##########
@@ -2341,6 +2340,11 @@ def file_visitor(written_file):
     else:
         if partition_filename_cb:
             outfile = partition_filename_cb(None)
+
+            # warn for deprecated keywords
+            warnings.warn(
+                _DEPR_MSG.format("partition_filename_cb", ""),
+                FutureWarning, stacklevel=2)

Review Comment:
   I would maybe try to provide a customized deprecation message in this case, so we can give some more detail on how you can replace `partition_filename_cb` in the new way (with the `basename_template` keyword)
   
   Also, the `partition_filename_cb` seems to be used above as well (in the `if` branch), so we would want to deprecate it in that case as well?
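   A sketch of what such a customized message could look like (the helper name and the exact wording are illustrative, not the message merged in the PR; `basename_template` is the keyword the new `pyarrow.dataset.write_dataset` API uses to control output file names):

   ```python
   import warnings

   def _warn_partition_filename_cb_deprecated():
       """Emit a customized deprecation warning that points users at the
       replacement keyword in the new dataset API."""
       warnings.warn(
           "The 'partition_filename_cb' keyword is deprecated and will be "
           "removed in a future version. Use 'basename_template' (passed "
           "through to pyarrow.dataset.write_dataset) to control output "
           "file names instead.",
           FutureWarning, stacklevel=2)
   ```

   Calling this helper from both branches that consume `partition_filename_cb` would cover the `if` branch mentioned above as well.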



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org