Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/09/16 17:19:00 UTC

[jira] [Created] (ARROW-17759) [R] Implement dplyr::slice_sample()

Neal Richardson created ARROW-17759:
---------------------------------------

             Summary: [R] Implement dplyr::slice_sample()
                 Key: ARROW-17759
                 URL: https://issues.apache.org/jira/browse/ARROW-17759
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: R
            Reporter: Neal Richardson
             Fix For: 10.0.0


{code}
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
{code}

If {{n}} is provided, compute {{nrow(.data)}} and, if that is not {{NA}}, convert {{n}} to a {{prop}}. (Might want to use {{prop + .01}} or something similar and then call {{head(n)}} afterwards, i.e. sample slightly more than you need and then take {{n}}, so that randomness doesn't leave you with fewer than {{n}} rows.)
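
The n-to-prop conversion described above could look roughly like the following. This is a minimal base-R sketch on a plain data.frame, not the Arrow implementation: the helper names {{n_to_padded_prop()}} and {{sample_n_rows()}} and the {{padding}} argument are hypothetical, and {{runif()}} stands in for whatever random expression the query engine ends up providing.

```r
# Hypothetical helper: turn a requested row count into a sampling
# proportion, padded slightly so a random filter rarely returns < n rows.
n_to_padded_prop <- function(n, total_rows, padding = 0.01) {
  if (is.na(total_rows)) {
    stop("row count is unknown; cannot convert n to prop")
  }
  if (total_rows == 0) {
    return(1)
  }
  min(n / total_rows + padding, 1)
}

# Hypothetical helper: oversample with the padded proportion, then trim
# to at most n rows with head().
sample_n_rows <- function(df, n, padding = 0.01) {
  prop <- n_to_padded_prop(n, nrow(df), padding)
  keep <- runif(nrow(df)) < prop
  head(df[keep, , drop = FALSE], n)
}

n_to_padded_prop(10, 1000)  # 0.02
nrow(sample_n_rows(mtcars, 5))  # at most 5; can still be fewer by chance
```

Two caveats with this approach: padding only lowers the chance of getting fewer than {{n}} rows, it does not eliminate it, and trimming with {{head(n)}} drops the later survivors whenever more than {{n}} rows pass the filter.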

With {{prop}}, turn this into {{filter(arrow_random() < prop)}}. See ARROW-17572.
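
To illustrate what the filter does, here is a small base-R stand-in that keeps each row independently with probability {{prop}}. It assumes a plain data.frame, uses {{runif()}} in place of the proposed {{arrow_random()}}, and the function name {{sample_rows_by_prop()}} is made up for the example.

```r
# Base-R stand-in for filter(arrow_random() < prop): draw one uniform
# number per row and keep the rows whose draw falls below prop.
sample_rows_by_prop <- function(df, prop) {
  stopifnot(is.numeric(prop), length(prop) == 1, prop >= 0, prop <= 1)
  keep <- runif(nrow(df)) < prop
  df[keep, , drop = FALSE]
}

set.seed(42)
sampled <- sample_rows_by_prop(mtcars, 0.25)
nrow(sampled)  # close to 0.25 * nrow(mtcars) on average, not exactly
```

Because every row is kept or dropped independently, the result size is random; {{prop}} is only matched in expectation, which is why the {{n}} case needs the oversample-then-trim step.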

Defer {{weight_by}} to a followup. It should be doable but might be expensive (need to scan everything to compute the sum and ensure that all values are positive).
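
For reference, the extra pass that makes {{weight_by}} expensive can be sketched in base R as below. The function name {{normalize_weights()}} is hypothetical, and the strict positivity check follows the wording of this issue rather than any confirmed dplyr or Arrow behavior.

```r
# Hypothetical helper: one full scan over the weights to validate them
# and compute the total needed to turn them into probabilities.
normalize_weights <- function(w) {
  if (anyNA(w)) {
    stop("weights must not contain NA")
  }
  if (any(w <= 0)) {
    stop("all weights must be positive")
  }
  w / sum(w)
}

normalize_weights(c(1, 3))  # 0.25 0.75
```

On an in-memory vector this is cheap; on a lazily evaluated dataset, both the check and the sum require reading the whole weight column before any sampling can start.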

Defer {{replace = TRUE}}.

Also, this can probably only be done if {{.data}} is ungrouped; I think the dplyr methods do sampling within groups.


