Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/12/16 20:53:16 UTC

[GitHub] [spark] bjornjorgensen opened a new pull request, #39098: [SPARK-41553][PS] Change `num_files` to `repartition`

bjornjorgensen opened a new pull request, #39098:
URL: https://github.com/apache/spark/pull/39098

   ### What changes were proposed in this pull request?
   Change `num_files` to `repartition`.
   
   ### Why are the changes needed?
   
   `num_files` has been deprecated and might be removed in a future version. The deprecation warning tells users to use `DataFrame.spark.repartition` instead.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Pass GA




[GitHub] [spark] bjornjorgensen commented on pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #39098:
URL: https://github.com/apache/spark/pull/39098#issuecomment-1356274657

   @dongjoon-hyun Thank you.
   
   "And the num_files argument doesn't actually manage the number of files, but specifying the partition number."
   https://github.com/apache/spark/pull/33379
   
   But the documentation still describes it as controlling the number of files:
   https://github.com/apache/spark/blob/d95fb4c33f6f061190fae091868117d182659147/python/pyspark/pandas/generic.py#L674
   
   "So, we should deprecate the num_files argument and encourage users to use DataFrame.spark.repartition API instead."
   https://github.com/apache/spark/pull/33379
   
   The `DataFrame.spark.repartition` API, however, is not mentioned in that documentation.
   
   So if we change back to `num_files`, I suggest we:
   
   - Change "The number of files can be controlled by `num_files`." to "The number of partitions can be controlled by `num_files`."
   - Add a note that `DataFrame.spark.repartition` can be used instead of `num_files` to manage the number of files (see the sketch below).
   - Remove the `if num_files is not None:` branch.
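   
   A rough sketch of what such a note could show (the path and data are made up for the example):
   
   ```python
   import pyspark.pandas as ps
   
   df = ps.range(10)
   
   # Instead of: df.to_csv(path="/tmp/out", num_files=1)
   # repartition first, so a single part-... file is written into the directory:
   df.spark.repartition(1).to_csv(path="/tmp/out")
   ```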




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1051506753


##########
python/pyspark/pandas/generic.py:
##########
@@ -694,7 +694,7 @@ def to_csv(
         escapechar: str, default None
             String of length 1. Character used to escape `sep` and `quotechar`
             when appropriate.
-        num_files: the number of files to be written in `path` directory when
+        repartition: the number of files to be written in `path` directory when

Review Comment:
   I think we should just remove this parameter in a future release and let users use DataFrame.spark.repartition instead, rather than adding a new renamed parameter.





[GitHub] [spark] HyukjinKwon closed pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Fix the documentation for `num_files`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Fix the documentation for `num_files` 
URL: https://github.com/apache/spark/pull/39098




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1058130837


##########
python/pyspark/pandas/generic.py:
##########
@@ -748,7 +748,7 @@ def to_csv(
         2012-02-29 12:00:00,US,2
         2012-03-31 12:00:00,JP,3
 
-        >>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, num_files=1)
+        >>> df.cummax().to_csv(path=r'%s/to_csv/foo.csv' % path, repartition=1)

Review Comment:
   repartition -> num_files?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1056874918


##########
python/pyspark/pandas/generic.py:
##########
@@ -900,8 +900,8 @@ def to_json(
 
         .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and writes
             multiple `part-...` files in the directory when `path` is specified.
-            This behavior was inherited from Apache Spark. The number of files can
-            be controlled by `num_files`.
+            This behavior was inherited from Apache Spark. The number of partitions can
+            be controlled by `num_files`. This is deprecated use `repartition`.

Review Comment:
   ditto.



##########
python/pyspark/pandas/generic.py:
##########
@@ -927,8 +927,8 @@ def to_json(
             A string representing the compression to use in the output file,
             only used when the first argument is a filename. By default, the
             compression is inferred from the filename.
-        num_files: the number of files to be written in `path` directory when
-            this is a path.
+        num_files: the number of partitions to be written in `path` directory when
+            this is a path. This is deprecated use `repartition`.

Review Comment:
   ditto





[GitHub] [spark] HyukjinKwon commented on pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Fix the documentation for `num_files`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #39098:
URL: https://github.com/apache/spark/pull/39098#issuecomment-1368141033

   Merged to master.
   
   Thanks guys for reviewing and addressing them.




[GitHub] [spark] bjornjorgensen commented on pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Fix the documentation for `num_files`

Posted by GitBox <gi...@apache.org>.
bjornjorgensen commented on PR #39098:
URL: https://github.com/apache/spark/pull/39098#issuecomment-1368205777

   Small improvements. 
   It's a pleasure to work with you. 
   Thanks.




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1056874804


##########
python/pyspark/pandas/generic.py:
##########
@@ -671,8 +671,8 @@ def to_csv(
 
         .. note:: pandas-on-Spark writes CSV files into the directory, `path`, and writes
             multiple `part-...` files in the directory when `path` is specified.
-            This behavior was inherited from Apache Spark. The number of files can
-            be controlled by `num_files`.
+            This behavior was inherited from Apache Spark. The number of partitions can
+            be controlled by `num_files`. This is deprecated use `repartition`.

Review Comment:
   ```python
   - This is deprecated use `repartition`.
   + This is deprecated in favor of `repartition`.
   ```
   
   Or,
   
   ```python
   - This is deprecated use `repartition`.
   + This is deprecated. Use `repartition` instead.
   ```





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1056874822


##########
python/pyspark/pandas/generic.py:
##########
@@ -694,8 +694,8 @@ def to_csv(
         escapechar: str, default None
             String of length 1. Character used to escape `sep` and `quotechar`
             when appropriate.
-        num_files: the number of files to be written in `path` directory when
-            this is a path.
+        num_files: the number of partitions to be written in `path` directory when
+            this is a path. This is deprecated use `repartition`.

Review Comment:
   ditto.





[GitHub] [spark] AmplabJenkins commented on pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #39098:
URL: https://github.com/apache/spark/pull/39098#issuecomment-1356332568

   Can one of the admins verify this patch?




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #39098: [SPARK-41553][PS][PYTHON][CORE] Change `num_files` to `repartition`

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #39098:
URL: https://github.com/apache/spark/pull/39098#discussion_r1059224962


##########
python/pyspark/pandas/generic.py:
##########
@@ -694,8 +694,8 @@ def to_csv(
         escapechar: str, default None
             String of length 1. Character used to escape `sep` and `quotechar`
             when appropriate.
-        num_files: the number of files to be written in `path` directory when
-            this is a path.
+        num_files: the number of partitions to be written in `path` directory when
+            this is a path. This is deprecated. Use `repartition` instead.

Review Comment:
   This could have different meanings. The original intention was `Use DataFrame.spark.repartition instead`, wasn't it?
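   
   To make the two readings concrete (the `repartition=` keyword below is the renamed parameter this PR initially proposed, not an existing API; the path is illustrative):
   
   ```python
   import pyspark.pandas as ps
   
   df = ps.range(3)
   
   # Reading 1: a renamed to_csv keyword (what this PR initially proposed; not an existing parameter)
   # df.to_csv(path="/tmp/out", repartition=1)
   
   # Reading 2, the original intention: the existing DataFrame.spark.repartition accessor method
   df.spark.repartition(1).to_csv(path="/tmp/out")
   ```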


