You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/05/29 03:21:26 UTC

[GitHub] [beam] darshanj opened a new pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

darshanj opened a new pull request #11855:
URL: https://github.com/apache/beam/pull/11855


   Added api combineFn that can be used for inputs windowed with non global windows
   
   **Please** add a meaningful description for your change here
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/) | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
   XLang | --- | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) 
   Portable | --- | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robinyqiu edited a comment on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
robinyqiu edited a comment on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-644554122


   @darshanj FYI I found this discussion thread in dev@ may be relevant to the problem you want to solve. You can take a look while I review this PR: https://lists.apache.org/thread.html/79c8c04c3710bfc22a3b89a728567f4de92859f28219bcd87ae4ad49%40%3Cdev.beam.apache.org%3E


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robinyqiu commented on a change in pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
robinyqiu commented on a change in pull request #11855:
URL: https://github.com/apache/beam/pull/11855#discussion_r440609700



##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java
##########
@@ -54,7 +57,9 @@
  *     input.apply(HllCount.Init.forStrings().globally()).apply(HllCount.Extract.globally());
  * }</pre>
  *
- * For more details about using {@code HllCount} and the {@code zetasketch} extension module, see
+ * <p>{@link #combineFn} can also be used manually, with the {@link Combine} transform.

Review comment:
       Please add this new line of comment to the end of the block, because the next line is tied to the previous example.

##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateQuantiles.java
##########
@@ -54,12 +56,39 @@
 /**
  * {@code PTransform}s for getting an idea of a {@code PCollection}'s data distribution using
  * approximate {@code N}-tiles (e.g. quartiles, percentiles, etc.), either globally or per-key.
+ *
+ * <p>{@link #combineFn} can also be used manually, with the {@link Combine} transform.
  */
 public class ApproximateQuantiles {
   private ApproximateQuantiles() {
     // do not instantiate
   }
 
+  /**
+   * Returns a {@link CombineFn} that gives an idea of the distribution of a collection of values
+   * using approximate {@code N}-tiles.
+   *
+   * @param <T> the type of the elements in the input {@code PCollection}
+   * @param numQuantiles the number of elements in the resulting quantile values {@code List}
+   * @param compareFn the function to use to order the elements
+   */
+  public static <T, ComparatorT extends Comparator<T> & Serializable>
+      CombineFn<T, ?, List<T>> combineFn(int numQuantiles, ComparatorT compareFn) {
+    checkArgument(compareFn != null, "compareFn should not be null");

Review comment:
       Nit: you can use `checkNotNull` instead

##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/ApproximateUnique.java
##########
@@ -108,6 +120,38 @@
     return new Globally<>(maximumEstimationError);
   }
 
+  /**
+   * Returns a {@link Combine.Globally} that gives a single value that is an estimate of the number
+   * of distinct elements in the input {@code PCollection}.
+   *
+   * @param <T> the type of the elements in the input {@code PCollection}
+   * @param sampleSize the number of entries in the statistical sample; the higher this number, the
+   *     more accurate the estimate will be; should be {@code >= 16}
+   * @param inputCoder the coder for {@code PCollection<T>} where the combineFn will be applied on.
+   * @throws IllegalArgumentException if the {@code sampleSize} argument is too small
+   */
+  public static <T> Combine.Globally<T, Long> combineFn(int sampleSize, Coder<T> inputCoder) {

Review comment:
       The `combineFn()` function here does not actually return the `combineFn` (as you did in `ApproximateQuantile`), please make a change here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robinyqiu commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
robinyqiu commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-644554122


   @darshanj FYI I found this discussion thread in dev@ may be relevant to the problem you want to solve. PTAL: https://lists.apache.org/thread.html/79c8c04c3710bfc22a3b89a728567f4de92859f28219bcd87ae4ad49%40%3Cdev.beam.apache.org%3E


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] darshanj commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
darshanj commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-635981947


   Looks like test is failing which is not related:
   org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
   May be rerun checks will work.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] stale[bot] commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-678836076


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] iemejia removed a comment on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
iemejia removed a comment on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-635844667


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] stale[bot] commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-683484649


   This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] robinyqiu edited a comment on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
robinyqiu edited a comment on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-644554122


   @darshanj FYI I found this discussion thread in dev@ may be interesting to you. You can take a look while I review this PR: https://lists.apache.org/thread.html/79c8c04c3710bfc22a3b89a728567f4de92859f28219bcd87ae4ad49%40%3Cdev.beam.apache.org%3E


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] aaltay commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
aaltay commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-642985920


   R: @robinyqiu 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] stale[bot] closed pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
stale[bot] closed pull request #11855:
URL: https://github.com/apache/beam/pull/11855


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] iemejia commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
iemejia commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-635844667


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] iemejia commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
iemejia commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-636342018


   Run Java PreCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] darshanj commented on pull request #11855: [BEAM-10005] | combinefn for ApproximateQuantiles and ApproximateUnique

Posted by GitBox <gi...@apache.org>.
darshanj commented on pull request #11855:
URL: https://github.com/apache/beam/pull/11855#issuecomment-635734663


   R: @kennknowles


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org