Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/17 07:17:20 UTC

[GitHub] [spark] deshanxiao opened a new pull request, #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

deshanxiao opened a new pull request, #37549:
URL: https://github.com/apache/spark/pull/37549

   ### What changes were proposed in this pull request?
   Support reading CSV files in SparkR.
   
   ### Why are the changes needed?
   Today, almost all the languages Spark supports have a DataFrameReader.csv() API except R. R users usually use read.df() to read CSV files, so a higher-level API for this is needed.
   
   Java:
   [DataFrameReader.csv()](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html)
   
   Scala:
   [DataFrameReader.csv()](https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrameReader.html#csv(paths:String*):org.apache.spark.sql.DataFrame)
   
   Python:
   [DataFrameReader.csv()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.csv.html#pyspark.sql.DataFrameReader.csv)
   
   The R base library "utils" already exports "read.csv", so this API is named "read.spark.csv" instead of "read.csv" to avoid the conflict.
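   For illustration, a rough sketch of how the proposed reader would compare with the existing `read.df` route in SparkR (note that `read.spark.csv` is only the name proposed in this PR and the example paths are placeholders):
   
   ```r
   library(SparkR)
   sparkR.session()
   
   # Current approach: the generic reader with an explicit source name.
   df1 <- read.df("path/to/people.csv", source = "csv",
                  header = "true", inferSchema = "true")
   
   # Proposed in this PR: a CSV-specific high-level reader, named
   # read.spark.csv to avoid clashing with utils::read.csv from base R.
   df2 <- read.spark.csv("path/to/people.csv",
                         header = "true", inferSchema = "true")
   ```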
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, read.spark.csv and write.spark.csv are introduced.
   
   
   ### How was this patch tested?
   Unit tests (TODO).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] zero323 commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
zero323 commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r947657570


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   I have the same feelings as @HyukjinKwon here.
   
   If it weren't for the name conflict, it would be a nice addition. However, adding another naming convention (`spark.read.csv` would be a better choice, but then it wouldn't be an "obvious" reading utility) just for the sake of omitting `csv` in a call doesn't seem worth it.





[GitHub] [spark] zero323 commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
zero323 commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r948338099


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   I don't see any clean solution that would follow existing SparkR conventions @deshanxiao. 





[GitHub] [spark] deshanxiao commented on pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
deshanxiao commented on PR #37549:
URL: https://github.com/apache/spark/pull/37549#issuecomment-1219025554

   Closing this PR. If anyone has a good solution or suggestion, please feel free to reopen.




[GitHub] [spark] AmplabJenkins commented on pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37549:
URL: https://github.com/apache/spark/pull/37549#issuecomment-1219288142

   Can one of the admins verify this patch?




[GitHub] [spark] deshanxiao closed pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
deshanxiao closed pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR
URL: https://github.com/apache/spark/pull/37549




[GitHub] [spark] deshanxiao commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
deshanxiao commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r947575505


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   The `format` function looks like it only works in Scala and Java:
   https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#manually-specifying-options





[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r947542357


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   The question is whether it's worthwhile to add this different signature, given that we're already able to do this with `format`.





[GitHub] [spark] deshanxiao commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
deshanxiao commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r947573399


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   Yes, we can read a CSV file with the following code:
   `df <- read.df("examples/src/main/resources/people.csv", "csv", sep = ";", inferSchema = TRUE, header = TRUE)`
   
   However, considering that other formats have corresponding high-level functions (read.text() etc.), it is worth adding a high-level API here as well.
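   As a rough sketch (not the actual patch, which was never merged), such a wrapper could follow the same pattern as the existing `read.text`/`read.json` functions in `R/pkg/R/SQLContext.R`, delegating to the session's DataFrameReader over the JVM bridge:
   
   ```r
   # Hypothetical sketch only; mirrors the structure of SparkR's read.text.
   read.spark.csv <- function(path, ...) {
     sparkSession <- getSparkSession()
     # Collect reader options passed as named arguments.
     options <- varargsToStrEnv(...)
     # Normalize the path(s); a vector of multiple paths is allowed.
     paths <- as.list(suppressWarnings(normalizePath(path)))
     read <- callJMethod(sparkSession, "read")
     read <- callJMethod(read, "options", options)
     # Invoke DataFrameReader.csv on the JVM side.
     sdf <- callJMethod(read, "csv", paths)
     dataFrame(sdf)
   }
   ```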





[GitHub] [spark] deshanxiao commented on a diff in pull request #37549: [SPARK-40103][R][FEATURE] Support read/write.csv() in SparkR

Posted by GitBox <gi...@apache.org>.
deshanxiao commented on code in PR #37549:
URL: https://github.com/apache/spark/pull/37549#discussion_r947701434


##########
R/pkg/R/SQLContext.R:
##########
@@ -492,6 +492,37 @@ read.text <- function(path, ...) {
   dataFrame(sdf)
 }
 
+#' Create a SparkDataFrame from a csv file.
+#'
+#' Loads a Parquet file, returning the result as a SparkDataFrame.
+#'
+#' @param path Path of file to read. A vector of multiple paths is allowed.
+#' @param ... additional external data source specific named properties.
+#'            You can find the csv-specific options for reading csv files in
+# nolint start
+#'            \url{https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option}{Data Source Option} in the version you use.
+# nolint end
+#' @return SparkDataFrame
+#' @rdname read.spark.csv
+#' @examples
+#'\dontrun{
+#' sparkR.session()
+#' path <- "path/to/file.csv"
+#' df <- read.spark.csv(path)
+#' }
+#' @name read.spark.csv
+#' @note read.spark.csv since 3.3.0
+read.spark.csv <- function(path, ...) {

Review Comment:
   @zero323 I totally agree. Do we have a more elegant way to support read.csv while maintaining compatibility? If not, I will close this PR.


