Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/13 05:10:21 UTC

[GitHub] [arrow] mirkhosro opened a new pull request, #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

mirkhosro opened a new pull request, #13591:
URL: https://github.com/apache/arrow/pull/13591

   This patch is an attempt to make the documentation of the `pyarrow.parquet.write_to_dataset` function clearer, so that the user can easily learn
   - Which parameters are used by the new code path and which ones are used by the legacy code path
   - How `**kwargs` are handled, i.e., to which of the underlying functions that `pyarrow.parquet.write_to_dataset` wraps they are passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r922559056


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   How about I change it to:
   ```
   When use_legacy_dataset=False, passed as `file_options` to 
           `pyarrow.dataset.write_dataset` function (See docstring for 
           `write_dataset` or `ParquetFileFormat` for more information).
   ```
   because the kwargs are ultimately passed as `file_options` to `write_dataset`:
   ```
           # map format arguments
           parquet_format = ds.ParquetFileFormat()
           write_options = parquet_format.make_write_options(**kwargs)
   ...
           ds.write_dataset(
               table, root_path, filesystem=filesystem,
               format=parquet_format, file_options=write_options, schema=schema,
               ...)
   ```
   





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920712317


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3049,6 +3049,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
         Column names by which to partition the dataset.
         Columns are partitioned in the order they are given
     partition_filename_cb : callable,
+        (This option is used only when `use_legacy_dataset` is True.)

Review Comment:
   I tried to add something to that effect. Please check if it looks OK.





[GitHub] [arrow] github-actions[bot] commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1182773191

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.




[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r921636046


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Right. I was confusing passing `schema` to `Table.from_pandas` (which is what I did in 7.0.0) with passing it to this function. You're right that the schema is inferred from the table. I undid my last change.
   Also added the legacy info to the `schema` parameter, and updated the function description to clarify which functions it wraps.





[GitHub] [arrow] mirkhosro commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1183999734

   Also a general question: is there any functionality in `parquet.write_to_dataset` that the underlying `dataset.write_dataset` cannot provide?
   If not, we could update the function description and suggest that users call that function directly when they don't need the legacy code path.
   Regardless, we could still update the function description to clarify that it is a wrapper around both `parquet.write_table` (when using the legacy code path) and `dataset.write_dataset` (when using the new code path).
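
   As an aside, here is a minimal sketch of the two entry points side by side (illustrative only, not code from the PR; it assumes a pyarrow installation with the `dataset` module available):

   ```python
   import os
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   table = pa.table({"year": [2020, 2021, 2021], "n": [1, 2, 3]})
   root = tempfile.mkdtemp()

   # The wrapper: partitions by column and writes hive-style directories
   pq.write_to_dataset(table, os.path.join(root, "wrapper"),
                       partition_cols=["year"])

   # A roughly equivalent call to the underlying dataset API directly
   ds.write_dataset(
       table,
       os.path.join(root, "direct"),
       format="parquet",
       partitioning=ds.partitioning(
           pa.schema([("year", pa.int64())]), flavor="hive"
       ),
   )
   ```

   Both calls produce `year=...` partition directories under their respective roots.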




[GitHub] [arrow] AlenkaF commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1185211658

   Thanks for the changes @mirkhosro.
   Only the linter errors need to be corrected and then I think the PR is ready:
   
   ```
   INFO:archery:Running Python linter (flake8)
   /arrow/python/pyarrow/parquet/__init__.py:3021:75: W291 trailing whitespace
   /arrow/python/pyarrow/parquet/__init__.py:3056:72: W291 trailing whitespace
   /arrow/python/pyarrow/parquet/__init__.py:3057:80: E501 line too long (80 > 79 characters)
   /arrow/python/pyarrow/parquet/__init__.py:3122:69: W291 trailing whitespace
   /arrow/python/pyarrow/parquet/__init__.py:3123:68: W291 trailing whitespace
   /arrow/python/pyarrow/parquet/__init__.py:3125:68: W291 trailing whitespace
   ```




[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920747511


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Oh, I think you are right! Great catch!
   Agree with the change +1





[GitHub] [arrow] github-actions[bot] commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1182773179

   https://issues.apache.org/jira/browse/ARROW-17046






[GitHub] [arrow] ursabot commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1192645308

   Benchmark runs are scheduled for baseline = 25757cbc01bb52d0c69d75f6156ba3f2bc71a186 and contender = 433f79526bd21cd2d6cc1832294f70ff4e6cce53. 433f79526bd21cd2d6cc1832294f70ff4e6cce53 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/82db5195382c441e92f2a420bb6a5245...634223f472164419b73dfcef74d2155c/)
   [Failed :arrow_down:0.14% :arrow_up:0.03%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/aea3b2b016a5400e86c28fb210f1d5eb...7840041aca1246299e40705329534cda/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/7be8b6843b4947f48c413fd7e3e60a39...d3c31e296a494880a1a6e536ed47cb02/)
   [Finished :arrow_down:0.36% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f71831c3635747ccb89ef54d5599378c...db8b871f5ea448218a41ff00b65f1a56/)
   Buildkite builds:
   [Failed] [`433f7952` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1166)
   [Failed] [`433f7952` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1175)
   [Failed] [`433f7952` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1160)
   [Finished] [`433f7952` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1183)
   [Failed] [`25757cbc` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1165)
   [Failed] [`25757cbc` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1172)
   [Failed] [`25757cbc` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1159)
   [Finished] [`25757cbc` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1182)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920725733


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3114,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        write_dataset or ParquetFileFormat for more information).

Review Comment:
   ```suggestion
           `write_dataset` or `ParquetFileFormat` for more information).
   ```



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Hm, I do not think so but can be wrong:
   
   https://github.com/apache/arrow/blob/69b06ecd7da4b4d58a0bc9858068b99bfdfa4876/python/pyarrow/parquet/__init__.py#L3243-L3261





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920712031


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3114,8 +3114,11 @@ def file_visitor(written_file):
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When `use_legacy_dataset` is False, used as additional kwargs for 
+        `pyarrow.dataset.ParquetFileFormat.make_write_options` function.

Review Comment:
   Done.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r921918829


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   They are actually passed to `ParquetFileFormat.make_write_options` I think, and not `write_dataset`





[GitHub] [arrow] AlenkaF commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1184030781

   > Also a general question: is there any functionality in `parquet.write_to_dataset` that the underlying `dataset.write_dataset` cannot provide? 
   
   `dataset.write_dataset` will soon be the only API used, so I think that would be a no.
   
   > If no, we could update the function description and suggest that users use that function instead if they don't need the legacy code path. Regardless, we could still update the function description and clarify that it is a wrapper around both `parquet.write_table` (when using legacy code path) and `dataset.write_dataset` (when using the new code path).
   
   I think that is added already. Not in the arg description, but in the warnings shown when using the legacy implementation.
   




[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r924249456


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   cc @jorisvandenbossche 





[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r923043560


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   The CI is green, thanks! (one fail is not related).
   
   Regarding the change in this discussion: the kwargs are passed first to `ParquetFileFormat.make_write_options` and then to `write_dataset` so we could also mention both?





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r925252401


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   > the kwargs are passed first to `ParquetFileFormat.make_write_options` and then to `write_dataset` so we could also mention both?
   
   Yes, that sounds good (but I would certainly mention `ParquetFileFormat.make_write_options`, as that is where the actual args are passed to)
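
   The flow being discussed can be sketched as follows (illustrative only, not the PR's code; `compression` is just an example option):

   ```python
   import tempfile

   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"x": [1, 2, 3]})
   path = tempfile.mkdtemp()

   # kwargs such as compression= are first mapped to Parquet write options...
   parquet_format = ds.ParquetFileFormat()
   write_options = parquet_format.make_write_options(compression="snappy")

   # ...which write_dataset then receives as file_options
   ds.write_dataset(
       table, path, format=parquet_format, file_options=write_options,
       basename_template="part-{i}.parquet",
   )
   ```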





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r925255405


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for
+        `pyarrow.dataset.write_dataset` function (See docstring for
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   ```suggestion
           `pyarrow.dataset.write_dataset` function (passed to
           `ParquetFileFormat.make_write_options`). See the docstring
           of `write_table` for the available options.
   ```





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r922560664


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   The linter errors should also be fixed now. Sorry if I still missed any; I'm using the web editor.





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920713377


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Moving it to the end and removing the parentheses is done.
   Aren't `schema` and `use_threads` also used by the legacy code path? At least that's what I see when I look at the code.





[GitHub] [arrow] AlenkaF merged pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF merged PR #13591:
URL: https://github.com/apache/arrow/pull/13591




[GitHub] [arrow] AlenkaF commented on pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on PR #13591:
URL: https://github.com/apache/arrow/pull/13591#issuecomment-1182861088

   Linter is complaining about trailing whitespace on line 3117 and 3119.






[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920777542


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Looking at it again, _I think_ that in the legacy implementation the schema is inferred from the table.
   I guess that in 7.0.0 when you added a schema as an argument, it was ignored. And the current solution errors at `ParquetWriter` as multiple values for `schema` are supplied.
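The "multiple values for `schema`" failure described above can be sketched with stand-ins (hypothetical, simplified versions — not the real pyarrow signatures): `write_table` always forwards the table's own schema positionally, so a user-supplied `schema` arriving through `**kwargs` collides.

```python
def parquet_writer(where, schema, **options):
    """Stand-in for ParquetWriter(where, schema, ...)."""
    return {"where": where, "schema": schema, **options}

def write_table(table_schema, where, **kwargs):
    """Stand-in for pq.write_table: it forwards the table's inferred
    schema positionally, so a `schema` key in **kwargs collides."""
    return parquet_writer(where, table_schema, **kwargs)

# Works: no user-supplied schema among the kwargs.
write_table("inferred", "/tmp/part-0.parquet", compression="snappy")

# Fails with TypeError ("got multiple values for argument 'schema'").
try:
    write_table("inferred", "/tmp/part-0.parquet", schema="user-supplied")
except TypeError as exc:
    print(exc)
```

This mirrors why a `schema` keyword that was silently ignored in 7.0.0 now errors once it is actually forwarded down to `ParquetWriter`.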





[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r919713913


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   I would suggest moving the info in the parentheses to the end of the argument description and removing the (). Similar for other arguments.
   
   As information for arguments that are only used in the new dataset implementation is added, would it be good to add it also to `schema` and `use_threads`? 



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3049,6 +3049,7 @@ def write_to_dataset(table, root_path, partition_cols=None,
         Column names by which to partition the dataset.
         Columns are partitioned in the order they are given
     partition_filename_cb : callable,
+        (This option is used only when `use_legacy_dataset` is True.)

Review Comment:
   In the case of `partition_filename_cb`:
    - if `use_legacy_dataset=None` and `partition_filename_cb` is specified by the user, the implementation will be set to the legacy one, so no action is needed by the user.
    - if `use_legacy_dataset=False`, the user will receive a ValueError suggesting to specify `use_legacy_dataset=True` to be able to use `partition_filename_cb`.
   
   So in this case I would maybe add slightly different information: either to also specify `use_legacy_dataset=True`, or to note that the new dataset implementation doesn't support this argument.
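The two dispatch cases above can be sketched as plain Python (a hypothetical helper name — the real logic lives inline in `write_to_dataset`):

```python
def resolve_code_path(use_legacy_dataset=None, partition_filename_cb=None):
    """Hypothetical sketch of how write_to_dataset picks an implementation
    when the legacy-only partition_filename_cb is involved."""
    if partition_filename_cb is not None:
        if use_legacy_dataset is None:
            # User didn't choose a code path: silently fall back to legacy.
            return "legacy"
        if use_legacy_dataset is False:
            # New path can't honor the callback: fail loudly with guidance.
            raise ValueError(
                "partition_filename_cb is not supported by the new dataset "
                "implementation; specify use_legacy_dataset=True to use it")
    return "legacy" if use_legacy_dataset else "new"

resolve_code_path(partition_filename_cb=lambda keys: "part.parquet")  # 'legacy'
resolve_code_path(use_legacy_dataset=False)                           # 'new'
```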



##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3114,8 +3114,11 @@ def file_visitor(written_file):
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When `use_legacy_dataset` is False, used as additional kwargs for 
+        `pyarrow.dataset.ParquetFileFormat.make_write_options` function.

Review Comment:
   I think this should be ...
   
   ```suggestion
           `pyarrow.dataset.write_dataset` function.
   ```
   Then something like this can be added: (See docstring for `write_dataset` or  `ParquetFileFormat` for more information).
   





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r926091484


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   great! thank you for your suggestion, it's now committed.
   I also simplified the function names in the docstring by removing the `pyarrow.` prefix, since it was not present elsewhere in the docstrings.
   I hope this looks good now, but please let me know if any further corrections are needed.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r925254106


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   Actually, looking at our docs of this, `make_write_options` isn't that informative, as that docstring is non-existent (updating this would also be a welcome improvement!). 
   In practice, though, these are the keywords that are available in `pq.write_table`/`ParquetWriter`, and those keywords are documented in those functions. So it's probably more useful to also mention those. 





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r922559056


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   how about I change it to 
   ```
   When use_legacy_dataset=False, passed as `file_options` to 
           `pyarrow.dataset.write_dataset` function (See docstring for 
           `write_dataset` or `ParquetFileFormat` for more information).
   ```
   because it's passed as file_options to write_dataset first
   ```
           # map format arguments
           parquet_format = ds.ParquetFileFormat()
           write_options = parquet_format.make_write_options(**kwargs)
   ...
           ds.write_dataset(
               table, root_path, filesystem=filesystem,
               format=parquet_format, file_options=write_options, schema=schema,
               ...)
   ```
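The plumbing in that snippet can be mimicked end-to-end with stand-ins (hypothetical, simplified versions of the dataset API), showing that a `write_table`-style keyword such as `compression` ends up in the `file_options` handed to `write_dataset`:

```python
class ParquetFileFormatStub:
    """Stand-in for ds.ParquetFileFormat (simplified)."""
    def make_write_options(self, **kwargs):
        # In pyarrow these become a FileWriteOptions object; a plain
        # dict is enough to show the plumbing here.
        return dict(kwargs)

def write_dataset_stub(table, root_path, file_options=None, **_ignored):
    """Stand-in for ds.write_dataset: just surfaces the options it got."""
    return file_options

def write_to_dataset(table, root_path, **kwargs):
    # **kwargs are mapped to format-level write options first ...
    parquet_format = ParquetFileFormatStub()
    write_options = parquet_format.make_write_options(**kwargs)
    # ... and then handed to write_dataset as file_options.
    return write_dataset_stub(table, root_path, file_options=write_options)

write_to_dataset("table", "/data/ds", compression="snappy")
# → {'compression': 'snappy'}
```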
   





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r921636046


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Right. I was confusing passing `schema` to `Table.from_pandas` with passing it to this function, which is what I did in 7.0.0. You're right that the schema is inferred from the table. I undid my last change.
   I also added the legacy info to the `schema` parameter, and updated the function description to clarify which functions it's a wrapper around.





[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r921922183


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3113,9 +3117,14 @@ def file_visitor(written_file):
         dataset.  The first time each partition directory is encountered
         the entire directory will be deleted.  This allows you to overwrite
         old partitions completely.
+        This option is only supported for use_legacy_dataset=False.
     **kwargs : dict,
-        Additional kwargs for write_table function. See docstring for
-        `write_table` or `ParquetWriter` for more information.
+        When use_legacy_dataset=False, used as additional kwargs for 
+        `pyarrow.dataset.write_dataset` function (See docstring for 
+        `write_dataset` or `ParquetFileFormat` for more information).

Review Comment:
   Sorry, my mistake:
   https://github.com/apache/arrow/pull/13591#discussion_r919724340





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920744447


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   You're right about `use_threads`. However, we should be able to pass `schema`, which gets passed to `ParquetWriter` through the kwargs of `write_table` in the legacy code path. In fact my code relied on that behavior in 7.0.0, as I was specifying the schema through that option. But now it's erroring out in 8.0.0.
   I think one fix would be to remove this check and pass `schema` to `write_table` (as a keyword argument), which would in turn pass it to `ParquetWriter`. I can make that change if you agree that that's the correct behavior.





[GitHub] [arrow] mirkhosro commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
mirkhosro commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920757395


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   Thanks! I added the change, but I'm not able to test it. Also I'm not sure whether it's going to clash with `subschema`, which is defined in the same portion of the function.





[GitHub] [arrow] AlenkaF commented on a diff in pull request #13591: ARROW-17046 [Python] improve documentation of pyarrow.parquet.write_to_dataset function

Posted by GitBox <gi...@apache.org>.
AlenkaF commented on code in PR #13591:
URL: https://github.com/apache/arrow/pull/13591#discussion_r920759034


##########
python/pyarrow/parquet/__init__.py:
##########
@@ -3063,16 +3064,19 @@ def write_to_dataset(table, root_path, partition_cols=None,
         used determined by the number of available CPU cores.
     schema : Schema, optional
     partitioning : Partitioning or list[str], optional
+        (This option is used only when `use_legacy_dataset` is False.)

Review Comment:
   This can get a bit tricky, I will run the tests and have a better look at the code asap.


