Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/13 07:21:23 UTC

[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r919713913


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   I would suggest moving the info in the parentheses to the end of the argument description and removing the parentheses. Similar for other arguments.
   
   Since information is being added for arguments that are only used in the new dataset implementation, would it be good to add it to `schema` and `use_threads` as well?
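
   A short, hedged sketch of the distinction being documented (column names and values are made up for illustration): `partition_cols` is honored by both implementations, while `partitioning` only takes effect with the new dataset implementation.

   ```python
   import tempfile

   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"year": [2020, 2021, 2021], "value": [1, 2, 3]})

   with tempfile.TemporaryDirectory() as root:
       # Partition by the "year" column; this argument works with both
       # the legacy and the new dataset implementation.
       pq.write_to_dataset(table, root, partition_cols=["year"])
       round_tripped = pq.read_table(root)

   n_rows = round_tripped.num_rows
   ```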



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3049,6 +3049,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
         Column names by which to partition the dataset.
         Columns are partitioned in the order they are given
     partition_filename_cb : callable,
+        (This option is used only when `use_legacy_dataset` is True.)

Review Comment:
   In the case of `partition_filename_cb`:
    - if `use_legacy_dataset=None` and `partition_filename_cb` is specified by the user, the implementation used will be set to legacy implementation, so no actions are needed by the user.
    - if `use_legacy_dataset=False`, the user will receive a ValueError suggesting to specify `use_legacy_dataset=True` to be able to use `partition_filename_cb`.
   
   So in this case I would maybe add slightly different information: either suggest also specifying `use_legacy_dataset=True`, or note that the new dataset implementation doesn't support this argument.
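
   For reference, the new implementation controls output file names through `basename_template` on `pyarrow.dataset.write_dataset` (the template must contain `"{i}"`), rather than through `partition_filename_cb`. A minimal sketch, with made-up table contents:

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"year": [2020, 2021], "value": [1, 2]})

   with tempfile.TemporaryDirectory() as root:
       # Customize the written file names with basename_template;
       # "{i}" is replaced by the file's index within the dataset.
       ds.write_dataset(
           table, root, format="parquet",
           basename_template="chunk-{i}.parquet",
       )
       files = os.listdir(root)

   found = "chunk-0.parquet" in files
   ```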



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3114,8 +3114,11 @@ def file_visitor(written_file):
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When `use_legacy_dataset` is False, used as additional kwargs for 
+        `pyarrow.dataset.ParquetFileFormat.make_write_options` function.

Review Comment:
   I think this should be ...
   
   ```suggestion
           `pyarrow.dataset.write_dataset` function.
   ```
   Then something like this can be added: (See docstring for `write_dataset` or `ParquetFileFormat` for more information).
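   
   To make the kwargs forwarding concrete: with the new implementation, extra keyword arguments such as `compression` end up as Parquet write options. The same options object can be built explicitly (a hedged illustration, not the exact internal call path):
   
   ```python
   import pyarrow.dataset as ds

   # Build the write options that **kwargs would be turned into when
   # using the new dataset implementation.
   opts = ds.ParquetFileFormat().make_write_options(compression="gzip")
   kind = type(opts).__name__
   ```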
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org