Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/09/16 17:19:00 UTC
[jira] [Created] (ARROW-17759) [R] Implement dplyr::slice_sample()
Neal Richardson created ARROW-17759:
---------------------------------------
Summary: [R] Implement dplyr::slice_sample()
Key: ARROW-17759
URL: https://issues.apache.org/jira/browse/ARROW-17759
Project: Apache Arrow
Issue Type: Sub-task
Components: R
Reporter: Neal Richardson
Fix For: 10.0.0
{code}
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
{code}
If {{n}} is provided, compute {{nrow(.data)}} and, if that is not NA, convert {{n}} to a {{prop}}. (Might want to use {{prop + .01}} or something and then do {{head(n)}} after, i.e. sample more than you need and then take {{n}}, just so randomness doesn't leave you with fewer than {{n}} rows.)
With prop, turn this into {{filter(arrow_random() < prop)}}. See ARROW-17572.
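A minimal sketch of the n-to-prop idea above, in base R on an in-memory data frame. The helper name {{slice_sample_sketch}} and its {{slack}} argument are hypothetical, and {{stats::runif()}} is only a stand-in for the {{arrow_random()}} expression mentioned in this issue; this is not the arrow implementation.

```r
# Hypothetical helper: approximate slice_sample(n = ...) with a random filter.
# stats::runif() stands in for the arrow_random() expression named in the issue.
slice_sample_sketch <- function(.data, n, slack = 0.01) {
  total <- nrow(.data)
  if (is.na(total)) {
    stop("row count unknown; cannot convert n to prop")
  }
  if (total == 0L || n <= 0) {
    return(.data[0, , drop = FALSE])
  }
  # Convert n to a proportion and pad it so the filter rarely keeps < n rows.
  prop <- min(1, n / total + slack)
  keep <- stats::runif(total) < prop
  # Trim the oversampled rows back down to at most n.
  utils::head(.data[keep, , drop = FALSE], n)
}

set.seed(42)
df <- data.frame(id = 1:1000)
out <- slice_sample_sketch(df, n = 50)
nrow(out)  # at most 50; can still be fewer if the filter under-samples
```

Note that padding only makes a short result less likely, it does not rule it out. Also, because {{head(n)}} drops the trailing surplus, rows near the end of the data are slightly less likely to be kept than rows near the start.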
Defer weight_by to a followup. It should be doable but might be expensive (need to scan everything to compute sum and ensure that all values are positive).
Defer replace = TRUE.
Also, we can probably only do this if {{.data}} is ungrouped; I think the dplyr methods do sampling within groups.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)