Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/15 17:22:05 UTC

[GitHub] [spark] santosh-d3vpl3x opened a new pull request, #37526: SPARK-40087 Support multiple "Column" drop in R

santosh-d3vpl3x opened a new pull request, #37526:
URL: https://github.com/apache/spark/pull/37526

   ### What changes were proposed in this pull request?
   This is a follow-up on SPARK-39895. That PR originally attempted to adjust the R implementation as well to match the new signatures, but that part was removed and we focused only on getting the Python implementation to behave correctly.
   
   This change supports the following operations:
   
   df <- select(read.json(jsonPath), "name", "age")
   
   df$age2 <- df$age
   
   df1 <- drop(df, df$age, df$name)
   expect_equal(columns(df1), c("age2"))
   
   df1 <- drop(df, list(df$age, column("random")))
   expect_equal(columns(df1), c("name", "age2"))
   
   df1 <- drop(df, list(df$age, df$name))
   expect_equal(columns(df1), c("age2"))
   
   
   ### Why are the changes needed?
   Follow-up on the previous PR for SPARK-39895.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, it adds support for dropping multiple "Column" objects in R.
   
   
   ### How was this patch tested?
   Added test cases for R
   




[GitHub] [spark] santosh-d3vpl3x commented on a diff in pull request #37526: [WIP][SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
santosh-d3vpl3x commented on code in PR #37526:
URL: https://github.com/apache/spark/pull/37526#discussion_r946754596


##########
R/pkg/R/DataFrame.R:
##########
@@ -3577,40 +3577,90 @@ setMethod("str",
 #' This is a no-op if schema doesn't contain column name(s).
 #'
 #' @param x a SparkDataFrame.
-#' @param col a character vector of column names or a Column.
-#' @param ... further arguments to be passed to or from other methods.
-#' @return A SparkDataFrame.
+#' @param col a list of columns or single Column or name.
+#' @param ... additional column(s) if only one column is specified in \code{col}.
+#'            If more than one column is assigned in \code{col}, \code{...}
+#'            should be left empty.
+#' @return A new SparkDataFrame with selected columns.
 #'
 #' @family SparkDataFrame functions
 #' @rdname drop
 #' @name drop
-#' @aliases drop,SparkDataFrame-method
+#' @aliases drop,SparkDataFrame,character-method
+#' @family subsetting functions
 #' @examples
-#'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' drop(df, "col1")
-#' drop(df, c("col1", "col2"))
-#' drop(df, df$col1)
+#' \dontrun{
+#'   drop(df, "*")
+#'   drop(df, "col1", "col2")
+#'   drop(df, df$name, df$age + 1)
+#'   drop(df, c("col1", "col2"))
+#'   drop(df, list(df$name, df$age + 1))
 #' }
-#' @note drop since 2.0.0
-setMethod("drop",
-          signature(x = "SparkDataFrame"),
-          function(x, col) {
-            stopifnot(class(col) == "character" || class(col) == "Column")
+#' @note drop(SparkDataFrame, character) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "character"),
+          function(x, col, ...) {
+            if (length(col) > 1) {
+              if (length(list(...)) > 0) {
+                stop("To drop multiple columns, use a character vector or list for col")
+              }
 
-            if (class(col) == "Column") {
-              sdf <- callJMethod(x@sdf, "drop", col@jc)
+              drop(x, as.list(col))
             } else {
-              sdf <- callJMethod(x@sdf, "drop", as.list(col))
+              sdf <- callJMethod(x@sdf, "drop", list(col, ...))
+              dataFrame(sdf)
             }
+          })
+
+#' @rdname drop
+#' @aliases drop,SparkDataFrame,Column-method
+#' @note drop(SparkDataFrame, Column) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "Column"),
+          function(x, col, ...) {
+            jcols <- lapply(list(col, ...), function(c) {
+              c@jc
+            })
+            sdf <- callJMethod(x@sdf, "drop", jcols[[1]], jcols[-1])
             dataFrame(sdf)
           })
 
-# Expose base::drop
-#' @name drop
 #' @rdname drop
+#' @aliases drop,SparkDataFrame,list-method
+#' @note drop(SparkDataFrame, list) since 3.4.0
+setMethod("drop",
+          signature(x = "SparkDataFrame", col = "list"),
+          function(x, col) {
+            cols <- lapply(col, function(c) {
+              if (class(c) == "Column") {
+                c@jc
+              } else {
+                col(c)@jc
+              }
+            })
+            sdf <- callJMethod(x@sdf, "drop", cols[[1]], cols[-1])
+            dataFrame(sdf)
+          })
+
+#' Expose base::drop which deletes the dimensions of an array which have only one level.

Review Comment:
   Now the [Run / Linters, licenses, dependencies and documentation generation](https://github.com/santosh-d3vpl3x/spark/runs/7857027573?check_suite_focus=true#logs) job errors out with the following:
   
   ```
   -- Building function reference -------------------------------------------------
   Error in check_missing_topics(rows, pkg) : 
     All topics must be included in reference index
   ✖ Missing topics: drop,ANY,ANY-method
   ℹ Either add to _pkgdown.yml or use @keyword internal
   Error: 
   ! error in callr subprocess
   Caused by error in `check_missing_topics(rows, pkg)`:
   ! All topics must be included in reference index
   ✖ Missing topics: drop,ANY,ANY-method
   ℹ Either add to _pkgdown.yml or use @keyword internal
   ```
   
   @HyukjinKwon would you know the best way to deal with this?
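   
   One of the two fixes the error itself suggests, sketched as roxygen tags (purely illustrative; the PR may have resolved this differently): marking the pass-through's documentation as internal so pkgdown no longer requires the topic in the reference index.
   
   ```r
   # Hypothetical roxygen tags (illustration only) for the base::drop
   # pass-through documented in R/pkg/R/DataFrame.R: @keywords internal
   # excludes the "drop,ANY,ANY-method" topic from pkgdown's
   # reference-index check, so no _pkgdown.yml entry is needed.
   #' Expose base::drop which deletes the dimensions of an array that have
   #' only one level.
   #' @aliases drop,ANY,ANY-method
   #' @keywords internal
   ```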





[GitHub] [spark] AmplabJenkins commented on pull request #37526: [SPARK-40087][R] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37526:
URL: https://github.com/apache/spark/pull/37526#issuecomment-1216315030

   Can one of the admins verify this patch?




[GitHub] [spark] santosh-d3vpl3x commented on a diff in pull request #37526: [SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
santosh-d3vpl3x commented on code in PR #37526:
URL: https://github.com/apache/spark/pull/37526#discussion_r948175216


##########
R/pkg/R/DataFrame.R:
##########
@@ -3577,40 +3577,90 @@ setMethod("str",
 #' This is a no-op if schema doesn't contain column name(s).
 #'
 #' @param x a SparkDataFrame.
-#' @param col a character vector of column names or a Column.
-#' @param ... further arguments to be passed to or from other methods.
-#' @return A SparkDataFrame.
+#' @param col a list of columns or single Column or name.
+#' @param ... additional column(s) if only one column is specified in \code{col}.
+#'            If more than one column is assigned in \code{col}, \code{...}
+#'            should be left empty.
+#' @return A new SparkDataFrame with selected columns.
 #'
 #' @family SparkDataFrame functions
 #' @rdname drop
 #' @name drop
-#' @aliases drop,SparkDataFrame-method
+#' @aliases drop,SparkDataFrame,character-method
+#' @family subsetting functions
 #' @examples
-#'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' drop(df, "col1")
-#' drop(df, c("col1", "col2"))
-#' drop(df, df$col1)
+#' \dontrun{
+#'   drop(df, "*")
+#'   drop(df, "col1", "col2")
+#'   drop(df, df$name, df$age + 1)
+#'   drop(df, c("col1", "col2"))
+#'   drop(df, list(df$name, df$age + 1))
 #' }
-#' @note drop since 2.0.0
-setMethod("drop",
-          signature(x = "SparkDataFrame"),
-          function(x, col) {
-            stopifnot(class(col) == "character" || class(col) == "Column")
+#' @note drop(SparkDataFrame, character) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "character"),

Review Comment:
   @HyukjinKwon Would you like to take another look?





[GitHub] [spark] santosh-d3vpl3x commented on a diff in pull request #37526: [WIP][SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
santosh-d3vpl3x commented on code in PR #37526:
URL: https://github.com/apache/spark/pull/37526#discussion_r946626739


##########
R/pkg/R/DataFrame.R:
##########
@@ -3577,40 +3577,90 @@ setMethod("str",
 #' This is a no-op if schema doesn't contain column name(s).
 #'
 #' @param x a SparkDataFrame.
-#' @param col a character vector of column names or a Column.
-#' @param ... further arguments to be passed to or from other methods.
-#' @return A SparkDataFrame.
+#' @param col a list of columns or single Column or name.
+#' @param ... additional column(s) if only one column is specified in \code{col}.
+#'            If more than one column is assigned in \code{col}, \code{...}
+#'            should be left empty.
+#' @return A new SparkDataFrame with selected columns.
 #'
 #' @family SparkDataFrame functions
 #' @rdname drop
 #' @name drop
-#' @aliases drop,SparkDataFrame-method
+#' @aliases drop,SparkDataFrame,character-method
+#' @family subsetting functions
 #' @examples
-#'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' drop(df, "col1")
-#' drop(df, c("col1", "col2"))
-#' drop(df, df$col1)
+#' \dontrun{
+#'   drop(df, "*")
+#'   drop(df, "col1", "col2")
+#'   drop(df, df$name, df$age + 1)
+#'   drop(df, c("col1", "col2"))
+#'   drop(df, list(df$name, df$age + 1))
 #' }
-#' @note drop since 2.0.0
-setMethod("drop",
-          signature(x = "SparkDataFrame"),
-          function(x, col) {
-            stopifnot(class(col) == "character" || class(col) == "Column")
+#' @note drop(SparkDataFrame, character) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "character"),
+          function(x, col, ...) {
+            if (length(col) > 1) {
+              if (length(list(...)) > 0) {
+                stop("To drop multiple columns, use a character vector or list for col")
+              }
 
-            if (class(col) == "Column") {
-              sdf <- callJMethod(x@sdf, "drop", col@jc)
+              drop(x, as.list(col))
             } else {
-              sdf <- callJMethod(x@sdf, "drop", as.list(col))
+              sdf <- callJMethod(x@sdf, "drop", list(col, ...))
+              dataFrame(sdf)
             }
+          })
+
+#' @rdname drop
+#' @aliases drop,SparkDataFrame,Column-method
+#' @note drop(SparkDataFrame, Column) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "Column"),
+          function(x, col, ...) {
+            jcols <- lapply(list(col, ...), function(c) {
+              c@jc
+            })
+            sdf <- callJMethod(x@sdf, "drop", jcols[[1]], jcols[-1])
             dataFrame(sdf)
           })
 
-# Expose base::drop
-#' @name drop
 #' @rdname drop
+#' @aliases drop,SparkDataFrame,list-method
+#' @note drop(SparkDataFrame, list) since 3.4.0
+setMethod("drop",
+          signature(x = "SparkDataFrame", col = "list"),
+          function(x, col) {
+            cols <- lapply(col, function(c) {
+              if (class(c) == "Column") {
+                c@jc
+              } else {
+                col(c)@jc
+              }
+            })
+            sdf <- callJMethod(x@sdf, "drop", cols[[1]], cols[-1])
+            dataFrame(sdf)
+          })
+
+#' Expose base::drop which deletes the dimensions of an array which have only one level.

Review Comment:
   The previous pipeline run threw:
   ```
   Undocumented S4 methods:
     generic 'drop' and siglist 'ANY,ANY'
   All user-level objects in a package (including S4 classes and methods)
   should have documentation entries.
   See chapter ‘Writing R documentation files’ in the ‘Writing R
   Extensions’ manual.
   * checking for code/documentation mismatches ... OK
   * checking Rd \usage sections ... WARNING
   Objects in \usage without \alias in documentation object 'drop':
     ‘\S4method{drop}{ANY,ANY}’
   
   Functions with \usage entries need to have the appropriate \alias
   entries, and all their arguments documented.
   The \usage entries must correspond to syntactically valid R code.
   See chapter ‘Writing R documentation files’ in the ‘Writing R
   Extensions’ manual.
   ```
   I have copied the documentation from `base::drop` to satisfy the pipeline, but I'm not sure whether this is the best approach.
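   
   For reference, the warning is asking for an `\alias` covering the `ANY,ANY` signature. A minimal sketch of a documented pass-through, assuming the `drop` generic now dispatches on `(x, col)` as the siglist in the warning implies (the tags and body are illustrative, not necessarily what was merged):
   
   ```r
   #' Expose base::drop which deletes the dimensions of an array that have
   #' only one level.
   #'
   #' @name drop
   #' @rdname drop
   #' @aliases drop,ANY,ANY-method
   setMethod("drop",
             signature(x = "ANY", col = "ANY"),
             function(x, col, ...) {
               # For plain R objects, defer to base::drop().
               base::drop(x)
             })
   ```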





[GitHub] [spark] HyukjinKwon commented on pull request #37526: [SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37526:
URL: https://github.com/apache/spark/pull/37526#issuecomment-1219004510

   cc @zero323 and @zhengruifeng in case you guys find some time for a second look.




[GitHub] [spark] HyukjinKwon commented on pull request #37526: [SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37526:
URL: https://github.com/apache/spark/pull/37526#issuecomment-1219467390

   Merged to master.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37526: [WIP][SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on code in PR #37526:
URL: https://github.com/apache/spark/pull/37526#discussion_r947633106


##########
R/pkg/R/DataFrame.R:
##########
@@ -3577,40 +3577,90 @@ setMethod("str",
 #' This is a no-op if schema doesn't contain column name(s).
 #'
 #' @param x a SparkDataFrame.
-#' @param col a character vector of column names or a Column.
-#' @param ... further arguments to be passed to or from other methods.
-#' @return A SparkDataFrame.
+#' @param col a list of columns or single Column or name.
+#' @param ... additional column(s) if only one column is specified in \code{col}.
+#'            If more than one column is assigned in \code{col}, \code{...}
+#'            should be left empty.
+#' @return A new SparkDataFrame with selected columns.
 #'
 #' @family SparkDataFrame functions
 #' @rdname drop
 #' @name drop
-#' @aliases drop,SparkDataFrame-method
+#' @aliases drop,SparkDataFrame,character-method
+#' @family subsetting functions
 #' @examples
-#'\dontrun{
-#' sparkR.session()
-#' path <- "path/to/file.json"
-#' df <- read.json(path)
-#' drop(df, "col1")
-#' drop(df, c("col1", "col2"))
-#' drop(df, df$col1)
+#' \dontrun{
+#'   drop(df, "*")
+#'   drop(df, "col1", "col2")
+#'   drop(df, df$name, df$age + 1)
+#'   drop(df, c("col1", "col2"))
+#'   drop(df, list(df$name, df$age + 1))
 #' }
-#' @note drop since 2.0.0
-setMethod("drop",
-          signature(x = "SparkDataFrame"),
-          function(x, col) {
-            stopifnot(class(col) == "character" || class(col) == "Column")
+#' @note drop(SparkDataFrame, character) since 2.0.0
+setMethod("drop", signature(x = "SparkDataFrame", col = "character"),

Review Comment:
   Hm, I think you can leverage a union type (see `setClassUnion("characterOrColumn", c("character", "Column"))` in `pkg/R/DataFrame.R`) instead of adding multiple overloaded methods.
   
   And we probably don't have to add the `list` version.
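   
   For reference, a rough sketch of that union-type shape (it leans on the existing `characterOrColumn` union and the calls already used in the diff above; it is not the code that ultimately landed):
   
   ```r
   # One method dispatching on the characterOrColumn union, replacing the
   # separate character/Column/list methods.
   setMethod("drop",
             signature(x = "SparkDataFrame", col = "characterOrColumn"),
             function(x, col, ...) {
               if (class(col) == "character" && length(list(...)) == 0) {
                 # A character vector of names maps to Dataset.drop(colNames: String*).
                 sdf <- callJMethod(x@sdf, "drop", as.list(col))
               } else {
                 # One or more Columns (or names) map to Dataset.drop(col, cols*).
                 jcols <- lapply(list(col, ...), function(c) {
                   if (class(c) == "Column") c@jc else column(c)@jc
                 })
                 sdf <- callJMethod(x@sdf, "drop", jcols[[1]], jcols[-1])
               }
               dataFrame(sdf)
             })
   ```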





[GitHub] [spark] HyukjinKwon closed pull request #37526: [SPARK-40087][R][SQL] Support multiple "Column" drop in R

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #37526: [SPARK-40087][R][SQL] Support multiple "Column" drop in R
URL: https://github.com/apache/spark/pull/37526


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org