Posted to issues@madlib.apache.org by "Frank McQuillan (JIRA)" <ji...@apache.org> on 2017/06/28 01:19:01 UTC
[jira] [Resolved] (MADLIB-986) Stratified sampling
[ https://issues.apache.org/jira/browse/MADLIB-986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Frank McQuillan resolved MADLIB-986.
------------------------------------
Resolution: Fixed
> Stratified sampling
> -------------------
>
> Key: MADLIB-986
> URL: https://issues.apache.org/jira/browse/MADLIB-986
> Project: Apache MADlib
> Issue Type: New Feature
> Components: Module: Sampling
> Reporter: Frank McQuillan
> Assignee: Orhan Kislal
> Labels: starter
> Fix For: v1.12
>
>
> Story
> As a data scientist, I want to sample a data table in proportion to the number of rows in each group, so that I can do model building on the sampled data sets.
> The MVP for this story is:
> * the sample proportion is global, i.e., a single fractional value between 0 and 1 that applies to every group
> * allow the option to sample without replacement (default) or with replacement
> * allow the option to write only a subset of columns to the output table
> Proposed Interface
> {code}
> stratified_sample (
> source_table,
> output_table,
> proportion,
> grouping_col, -- optional
> with_replacement, -- optional
> target_cols -- optional
> )
> source_table
> TEXT. The name of the table containing the input data.
> output_table
> TEXT. Name of output table that contains the sampled data.
> The output table contains all the columns present in the source table
> unless otherwise specified in the 'target_cols' parameter below.
> proportion
> FLOAT8 in the range (0,1). The size of the sample in each stratum will
> be taken in proportion to the size of the stratum.
> grouping_col (optional)
> TEXT, default: NULL. A single column or a list of comma-separated columns
> that defines how to stratify. When this parameter is NULL,
> no grouping is used so the sampling is non-stratified.
> with_replacement (optional)
> BOOLEAN, default FALSE. Determines whether to sample with replacement
> or without replacement (default).
> target_cols (optional)
> TEXT, default NULL. A comma-separated list of columns to appear in the 'output_table'.
> If NULL, all columns from the 'source_table' will appear in the 'output_table'.
> {code}
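To make the intended semantics of the proposed parameters concrete, here is a minimal in-memory sketch in Python. It is not the MADlib implementation: the function operates on a list of dicts rather than database tables, the seed argument is a hypothetical addition for reproducibility, and rounding each stratum's sample size up is an assumption, since the proposal does not say how fractional sizes are handled.

```python
import math
import random
from collections import defaultdict


def stratified_sample(rows, proportion, grouping_col=None,
                      with_replacement=False, target_cols=None, seed=None):
    """Illustrative analogue of the proposed interface.

    rows stands in for source_table (a list of dicts); the returned
    list stands in for output_table.
    """
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must be in the open interval (0, 1)")
    rng = random.Random(seed)

    # Accept comma-separated column names, as in the proposal.
    group_keys = ([c.strip() for c in grouping_col.split(",")]
                  if grouping_col else [])
    out_cols = ([c.strip() for c in target_cols.split(",")]
                if target_cols else None)

    # Partition rows into strata; with no grouping_col there is a single
    # stratum, so the sampling is non-stratified.
    strata = defaultdict(list)
    for row in rows:
        strata[tuple(row[k] for k in group_keys)].append(row)

    sample = []
    for members in strata.values():
        # Assumption: round the per-stratum sample size up.
        k = math.ceil(proportion * len(members))
        if with_replacement:
            picked = rng.choices(members, k=k)   # rows may repeat
        else:
            picked = rng.sample(members, k)      # each row at most once
        sample.extend(picked)

    # Keep only the requested columns, if any were named.
    if out_cols is not None:
        sample = [{c: row[c] for c in out_cols} for row in sample]
    return sample


# Ten rows: four in group "a", six in group "b".
rows = [{"id": i, "grp": "a" if i < 4 else "b"} for i in range(10)]
picked = stratified_sample(rows, 0.5, grouping_col="grp", seed=0)
```

With a proportion of 0.5, this draws two rows from group "a" and three from group "b", so each stratum keeps its share of the sample. Passing grouping_col=None reduces to a plain random sample of the whole input.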
> Other notes
> PDL tools is one example implementation of stratified sampling to review [2].
> Please review existing MADlib sample functions [3] to see if these can be used as a basis, or built on, for this stratified sample story.
> References
> [2] PDL tools sampling modules, including stratified sampling
> http://pivotalsoftware.github.io/PDLTools/group__grp__sampling.html
> [3] Existing MADlib sample function
> http://madlib.incubator.apache.org/docs/latest/group__grp__sample.html
> [4] Pandas/Selecting Random Samples
> http://pandas.pydata.org/pandas-docs/stable/indexing.html#selecting-random-samples
> [5] General
> https://en.wikipedia.org/wiki/Stratified_sampling