You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/05 22:37:15 UTC

[GitHub] [arrow] amoeba opened a new pull request, #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

amoeba opened a new pull request, #14330:
URL: https://github.com/apache/arrow/pull/14330

   I could use some help with a couple of things:
   
   1. `devtools::check` warns about `pull.ArrowTabluar` being undocumented. I `@export`ed it to stay consistent with other ArrowTabular generics defined in `arrow-tabular.R` and don't understand why checking doesn't warn on all of these. Does this one just not need exporting?
       <details>
       <summary>Relevant devtools::check() output</summary>
   
       ```
       ❯ checking for missing documentation entries ... WARNING
         Undocumented code objects:
           ‘pull.ArrowTabular’
         All user-level objects in a package should have documentation entries.
         See chapter ‘Writing R documentation files’ in the ‘Writing R
         Extensions’ manual.
       ```
       </details>
   2. I found an inconsistency with `dplyr::pull` and Tables: Pulling an ungrouped Table produces a ChunkedArray whereas pulling a grouped Table produces a Table. This makes a subsequent call to `as.vector` produce an error of `Error in as.vector(x, mode) : cannot coerce type 'environment' to vector of type 'any'`
        <details>
       <summary>Example of the difference</summary>
   
       ```r
       > sw_table <- arrow_table(starwars)
   
       > sw_table |>
       +   filter(height > 100) |>
       +   group_by(homeworld) |>
       +   pull(name) |> 
       +   class()
       [1] "Table"        "ArrowTabular" "ArrowObject"  "R6"          
   
       > sw_table |>
       +   filter(height > 100) |>
       +   pull(name) |> 
       +   class()
       [1] "ChunkedArray" "ArrowDatum"   "ArrowObject"  "R6"
       ```
       </details>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1269059161

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1273591786

   > As I mentioned in JIRA, do you implement an argument like `as_data_frame`? For consistency with other functions, wouldn't it be better to return an R vector by default and an Arrow object optionally?
   
   Personally I don't think so, and as I previously commented on the jira, I don't think the consistency argument is quite so simple because arrow is different from querying remote databases with dbplyr. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1272197525

   Thanks for the link @eitsupi that change fixes the issue I was having.
   
   Re:  @wjones127's comment:
   
   > On the CHECK issue, IIRC we don't @export dplyr S3 methods. We do call s3_register on them here:
   
   I removed the export directive but I would eventually like to understand why the other exported functions in `arrow-tabular.R` (and probably elsewhere) don't need specific documentation (but I'll do that elsewhere).
   
   Both the issues I raised are resolved and this can be reviewed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1269059136

   https://issues.apache.org/jira/browse/ARROW-17439


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
eitsupi commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1269769097

   > 2\. Pulling an ungrouped Table produces a ChunkedArray whereas pulling a grouped Table produces a Table. This makes a subsequent call to `as.vector` produce an error of `Error in as.vector(x, mode) : cannot coerce type 'environment' to vector of type 'any'`
   
   This is due to ARROW-17738.
   You can use `as_arrow_table` instead of `compute` here I think. https://github.com/apache/arrow/pull/14160#issuecomment-1268404022


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
wjones127 commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1270238061

   On the CHECK issue, IIRC we don't `@export` dplyr S3 methods. We do call `s3_register` on them here:
   
   https://github.com/apache/arrow/blob/418f11558d561cfca700891efb2a53e478d26a8e/r/R/arrow-package.R#L75-L79


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on a diff in pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on code in PR #14330:
URL: https://github.com/apache/arrow/pull/14330#discussion_r991408093


##########
r/R/arrow-tabular.R:
##########
@@ -259,3 +259,7 @@ na.omit.ArrowTabular <- function(object, ...) {
 
 #' @export
 na.exclude.ArrowTabular <- na.omit.ArrowTabular
+
+pull.ArrowTabular <- function(x, var = -1) {

Review Comment:
   I'd move this to dplyr-collect.R to keep it with the other pull() method definitions



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
eitsupi commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1274639198

   Since the dplyr documentation says that pull returns a R vector, I think the documentation should at least explain that `pull` to arrow objects does not return a vector and using with `as.vector` to become a vector.
   Users likely know very little about Chuncked Arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson merged pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
nealrichardson merged PR #14330:
URL: https://github.com/apache/arrow/pull/14330


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1277975287

   Benchmark runs are scheduled for baseline = 093a4fe34609fa40edd1c8c8cb6b2875581a8c0f and contender = 20626f833be5c1241161054665ccc3906f3da1c3. 20626f833be5c1241161054665ccc3906f3da1c3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/6221bdbdf58a45acbe52a86e3d7f1e6d...c8e5c4fd35fe485cb9e189b1efb672c4/)
   [Failed :arrow_down:1.11% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/86bead39ab4547a0a0bceada3acef2a2...e9d5c2ba346d4f20b8ac362f68ccf3f3/)
   [Failed :arrow_down:2.74% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f9bef4540c5d404eb203addd85f43db0...d5151e89044d43e2b73f4231a2ae6c8e/)
   [Finished :arrow_down:0.14% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/f7fe1f48272744ed898af4a3ccfe94b9...a51e4d973b9e43de903b66508e01c62c/)
   Buildkite builds:
   [Finished] [`20626f83` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1679)
   [Failed] [`20626f83` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1698)
   [Failed] [`20626f83` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1681)
   [Finished] [`20626f83` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1692)
   [Finished] [`093a4fe3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/1678)
   [Failed] [`093a4fe3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/1697)
   [Failed] [`093a4fe3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/1680)
   [Finished] [`093a4fe3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/1691)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1277977074

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f9bef4540c5d404eb203addd85f43db0...d5151e89044d43e2b73f4231a2ae6c8e/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
eitsupi commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1273487425

   As I mentioned in JIRA, do you implement an argument like `as_data_frame`?
   For consistency with other functions, wouldn't it be better to return an R vector by default and an Arrow object optionally?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on a diff in pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
amoeba commented on code in PR #14330:
URL: https://github.com/apache/arrow/pull/14330#discussion_r992842243


##########
r/R/arrow-tabular.R:
##########
@@ -259,3 +259,7 @@ na.omit.ArrowTabular <- function(object, ...) {
 
 #' @export
 na.exclude.ArrowTabular <- na.omit.ArrowTabular
+
+pull.ArrowTabular <- function(x, var = -1) {

Review Comment:
   Sounds good. Done in https://github.com/apache/arrow/pull/14330/commits/9c8691bb0f430c1e4a0a62003547fa4e57d62d70.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amoeba commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
amoeba commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1275379898

   Thanks the look @nealrichardson. This should be good to merge now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] nealrichardson commented on pull request #14330: ARROW-17439: [R] Change behavior of pull to compute instead of collect

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on PR #14330:
URL: https://github.com/apache/arrow/pull/14330#issuecomment-1276472343

   > Since the dplyr documentation says that pull returns a R vector, I think the documentation should at least explain that `pull` to arrow objects does not return a vector and using with `as.vector` to become a vector. Users likely know very little about Chuncked Arrays.
   
   @eitsupi Good call, I'll add a note about this in #14387


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org