Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/10/26 13:58:53 UTC

[GitHub] [spark] soxofaan opened a new pull request, #38399: [SPARK-40922][PYTHON] document multiple path support in `pyspark.pandas.read_csv`

soxofaan opened a new pull request, #38399:
URL: https://github.com/apache/spark/pull/38399

   ### What changes were proposed in this pull request?
   
   As discussed in https://issues.apache.org/jira/browse/SPARK-40922:
   
   > The path argument of `pyspark.pandas.read_csv(path, ...)` currently has type annotation `str` and is documented as
   >
   >       path : str
   >           The path string storing the CSV file to be read.
   >
   > The implementation, however, uses `pyspark.sql.DataFrameReader.csv(path, ...)`, which does support multiple paths:
   >
   >        path : str or list
   >            string, or list of strings, for input path(s),
   >            or RDD of Strings storing CSV rows.
   >
   
   This PR updates the type annotation and documentation of the `path` argument of `pyspark.pandas.read_csv`.
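
A minimal sketch of the behaviour being documented, assuming a working `pyspark` installation and a local Spark session. The helper names (`write_sample_csvs`, `read_all`, `demo`), the file names, and the sample rows are hypothetical and only serve the illustration; the one call taken from the discussion above is `pyspark.pandas.read_csv` receiving a list of paths.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_csvs(directory):
    """Write two small CSV files sharing one header; return their paths."""
    # Hypothetical sample data, not taken from the PR.
    rows_by_file = {
        "part1.csv": [(1, "a"), (2, "b")],
        "part2.csv": [(3, "c")],
    }
    paths = []
    for name, rows in rows_by_file.items():
        path = Path(directory) / name
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "label"])
            writer.writerows(rows)
        paths.append(str(path))
    return paths


def read_all(paths):
    """Load every file in `paths` into one pandas-on-Spark DataFrame."""
    # Imported lazily so the helper above works without pyspark installed.
    import pyspark.pandas as ps

    # `path` may be a single string or a list of strings.
    return ps.read_csv(paths)


def demo():
    """Read both sample files in a single read_csv call and print the result."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_sample_csvs(tmp)
        psdf = read_all(paths)
        # Materialize inside the block, before the temporary files are removed.
        print(psdf.sort_values("id").to_pandas())
```

Calling `demo()` in an environment where `pyspark` is installed should print the rows of both files as a single frame. Passing a plain string instead of a list keeps the single-file behaviour.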
   
   ### Why are the changes needed?
   
   Loading multiple CSV files at once is a useful feature and should be documented.
   
   ### Does this PR introduce _any_ user-facing change?
   It documents an existing feature.
   
   ### How was this patch tested?
   No tests are needed (so far): only type annotations and docstrings were changed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] soxofaan commented on pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
soxofaan commented on PR #38399:
URL: https://github.com/apache/spark/pull/38399#issuecomment-1293115974

   I added an example.
   
   FYI: while looking around in the code, I came to suspect that support for multiple paths is also present in other `read_*` functions (such as `read_orc`, `read_json`, and probably some others), but I haven't experimented with that yet.




[GitHub] [spark] AmplabJenkins commented on pull request #38399: [SPARK-40922][PYTHON] document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #38399:
URL: https://github.com/apache/spark/pull/38399#issuecomment-1292521845

   Can one of the admins verify this patch?




[GitHub] [spark] HyukjinKwon commented on pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38399:
URL: https://github.com/apache/spark/pull/38399#issuecomment-1292834356

   cc @itholic 




[GitHub] [spark] itholic commented on a diff in pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
itholic commented on code in PR #38399:
URL: https://github.com/apache/spark/pull/38399#discussion_r1006328647


##########
python/pyspark/pandas/namespace.py:
##########
@@ -234,8 +234,8 @@ def read_csv(
 
     Parameters
     ----------
-    path : str
-        The path string storing the CSV file to be read.
+    path : str or list
+        path(s) of the CSV file(s) to be read.

Review Comment:
   Yeah, let's add at least one example to the docstring.





[GitHub] [spark] HyukjinKwon commented on pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #38399:
URL: https://github.com/apache/spark/pull/38399#issuecomment-1293346850

   Merged to master.




[GitHub] [spark] HyukjinKwon closed pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`
URL: https://github.com/apache/spark/pull/38399




[GitHub] [spark] itholic commented on pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
itholic commented on PR #38399:
URL: https://github.com/apache/spark/pull/38399#issuecomment-1292859548

   Looks good except https://github.com/apache/spark/pull/38399#discussion_r1006312293




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38399: [SPARK-40922][PYTHON] Document multiple path support in `pyspark.pandas.read_csv`

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #38399:
URL: https://github.com/apache/spark/pull/38399#discussion_r1006312293


##########
python/pyspark/pandas/namespace.py:
##########
@@ -234,8 +234,8 @@ def read_csv(
 
     Parameters
     ----------
-    path : str
-        The path string storing the CSV file to be read.
+    path : str or list
+        path(s) of the CSV file(s) to be read.

Review Comment:
   Can we add this example to the docstring?


