You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@datafu.apache.org by Xiangrui Meng <me...@gmail.com> on 2014/01/15 19:14:03 UTC

Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/
-----------------------------------------------------------

Review request for DataFu and Matthew Hayes.


Repository: datafu


Description
-------

In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.

It is a rewrite of the UDF, so better check the new file directly instead of the diff.


Diffs
-----

  src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
  test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 

Diff: https://reviews.apache.org/r/16911/diff/


Testing
-------

unit tests


Thanks,

Xiangrui Meng


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Xiangrui Meng <me...@gmail.com>.

> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 163
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line163>
> >
> >     One thing I was curious about was whether Pig uses separate instances of Initial in a case like this:
> >     
> >     sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, 0.4);
> >     
> >     I couldn't find this documented but it did use separate instances in the test.  It may be worth adding a test case like this.

Added a test case. Actually, Pig may generate many Initial instances, one per record. So I changed the random number generator to static. Otherwise, the generated values are not "independent".


> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 223
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line223>
> >
> >     _totalNumItems isn't actually the real count right?  It's just the count for this particular map task.  Should update the comment.  I suppose if one's lower bound is low enough (relative to the true value) it's possible for this new count to be higher.

Changed the name to _localCount. Yes, this is for the case when the provided lower bound is not good enough.


> On Jan. 16, 2014, 2:07 a.m., Matthew Hayes wrote:
> > src/java/datafu/pig/sampling/SimpleRandomSample.java, line 410
> > <https://reviews.apache.org/r/16911/diff/1/?file=424023#file424023line410>
> >
> >     info is probably better here

Changed to System.out.


- Xiangrui


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review31989
-----------------------------------------------------------


On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2014, 6:14 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review31989
-----------------------------------------------------------


Overall this looks pretty good to me.  The one open question I have is about breaking API compatibility.  If we are going to release 2.0 with a org.apache.datafu packaging change then this seems okay to me.


src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60743>

    One thing I was curious about was whether Pig uses separate instances of Initial in a case like this:
    
    sampled = FOREACH (GROUP data ALL) GENERATE SRS(data, 0.3), SRS(data, 0.4);
    
    I couldn't find this documented but it did use separate instances in the test.  It may be worth adding a test case like this.



src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60741>

    _totalNumItems isn't actually the real count right?  It's just the count for this particular map task.  Should update the comment.  I suppose if one's lower bound is low enough (relative to the true value) it's possible for this new count to be higher.



src/java/datafu/pig/sampling/SimpleRandomSample.java
<https://reviews.apache.org/r/16911/#comment60745>

    info is probably better here


- Matthew Hayes


On Jan. 15, 2014, 6:14 p.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 15, 2014, 6:14 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review32156
-----------------------------------------------------------


The recent changes look good to me.  It seems like this change to the UDF could be made in a backwards compatible way though.  A probability passed into the constructor could be treated as a default value if no probability is passed when the UDF is invoked.  

- Matthew Hayes


On Jan. 17, 2014, 2:49 a.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 17, 2014, 2:49 a.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Matthew Hayes <ma...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review32840
-----------------------------------------------------------



test/pig/datafu/test/pig/sampling/SimpleRandomSampleTestOld.java
<https://reviews.apache.org/r/16911/#comment61845>

    Wrong header (run check-license-headers.sh to check).


- Matthew Hayes


On Jan. 25, 2014, 8:10 p.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 25, 2014, 8:10 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java aff088a 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 7a0ced2 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTestOld.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Xiangrui Meng <me...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/
-----------------------------------------------------------

(Updated Jan. 25, 2014, 8:10 p.m.)


Review request for DataFu and Matthew Hayes.


Changes
-------

Make the change backward compatible. The old tests were moved to SimpleRandomSampleTestOld and marked deprecated.


Repository: datafu


Description
-------

In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.

It is a rewrite of the UDF, so better check the new file directly instead of the diff.


Diffs (updated)
-----

  src/java/datafu/pig/sampling/SimpleRandomSample.java aff088a 
  test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 7a0ced2 
  test/pig/datafu/test/pig/sampling/SimpleRandomSampleTestOld.java PRE-CREATION 

Diff: https://reviews.apache.org/r/16911/diff/


Testing
-------

unit tests


Thanks,

Xiangrui Meng


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Sam Shah <sh...@umich.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/#review32199
-----------------------------------------------------------


+1 to Matt's comment. This could be made backwards compatible. Otherwise it looks good.

- Sam Shah


On Jan. 17, 2014, 2:49 a.m., Xiangrui Meng wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16911/
> -----------------------------------------------------------
> 
> (Updated Jan. 17, 2014, 2:49 a.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
> After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
> Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.
> 
> It is a rewrite of the UDF, so better check the new file directly instead of the diff.
> 
> 
> Diffs
> -----
> 
>   src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
>   test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 
> 
> Diff: https://reviews.apache.org/r/16911/diff/
> 
> 
> Testing
> -------
> 
> unit tests
> 
> 
> Thanks,
> 
> Xiangrui Meng
> 
>


Re: Review Request 16911: DATAFU-5: Update SimpleRandomSample (SRS) to be consistent with SimpleRandomSampleWithReplacement (SRSWR)

Posted by Xiangrui Meng <me...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16911/
-----------------------------------------------------------

(Updated Jan. 17, 2014, 2:49 a.m.)


Review request for DataFu and Matthew Hayes.


Changes
-------

Changed the random random generator to static to avoid creating many instances.
Minor updates based on Matt's comments.


Repository: datafu


Description
-------

In the current implementation, SRS takes the sampling probability in the constructor of the UDF, while SRSWR takes the sample size in the function call. The attached patch updates SRS to make it consistent with SRSWR.
After the patch, SRS takes a bag of items, a desired sampling probability, and optionally a lower bound of the size of the population as the inputs, while SRSWR takes a bag of items, a desired sample size, and optionally a lower bound of the size of the population as the inputs.
Another benefit of the patch is that user doesn't have to create multiple instances of the UDF to sample with different probabilities.

It is a rewrite of the UDF, so better check the new file directly instead of the diff.


Diffs (updated)
-----

  src/java/datafu/pig/sampling/SimpleRandomSample.java 98d4e58 
  test/pig/datafu/test/pig/sampling/SimpleRandomSampleTest.java 8ca9d5d 

Diff: https://reviews.apache.org/r/16911/diff/


Testing
-------

unit tests


Thanks,

Xiangrui Meng