Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/12/13 02:19:00 UTC
[GitHub] [flink-ml] yunfengzhou-hub commented on a diff in pull request #189: [FLINK-30348] Refine Transformer for RandomSplitter

yunfengzhou-hub commented on code in PR #189:
URL: https://github.com/apache/flink-ml/pull/189#discussion_r1046592801


##########
docs/content/docs/operators/feature/randomsplitter.md:
##########
@@ -34,6 +34,7 @@ An AlgoOperator which splits a table into N tables according to the given weight
 | Key     | Default      | Type     | Required | Description                    |
 |:--------|:-------------|:---------|:---------|:-------------------------------|
 | weights | `[1.0, 1.0]` | Double[] | no       | The weights of data splitting. |
+| seed    | `null`       | Long     | no       | The random seed.               |

Review Comment:
   It might be better to document the conditions under which a random split is reproducible. For example, if users keep the same random seed but change the parallelism of the upstream operator, can they still expect to get the same split result?



##########
flink-ml-lib/src/main/java/org/apache/flink/ml/feature/randomsplitter/RandomSplitter.java:
##########
@@ -83,11 +84,12 @@ public Table[] transform(Table... inputs) {
 
     private static class SplitterOperator extends AbstractStreamOperator<Row>
             implements OneInputStreamOperator<Row, Row> {
-        private final Random random = new Random(0);
+        private final Random random;
         OutputTag<Row>[] outputTag;
         final double[] fractions;
 
-        public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights) {
+        public SplitterOperator(OutputTag<Row>[] outputTag, Double[] weights, long seed) {
+            random = new Random(seed);

Review Comment:
   It might be better to avoid having the random values behave identically on every subtask, e.g., each subtask always assigning its first element to the first output table regardless of the subtask id. You may check [`RowGenerator.open()`](https://github.com/apache/flink-ml/blob/master/flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/datagenerator/common/RowGenerator.java#L52) for how to create a `Random` from both the initial seed and the subtask id.
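
For illustration only, here is a minimal sketch of one way to derive a per-subtask `Random` from the user seed and the subtask index. It is not necessarily what `RowGenerator.open()` does: the class `SubtaskSeedSketch` and its helpers are hypothetical names, and the mixing step is the SplitMix64 finalizer, swapped in here so that neighbouring subtask indices do not yield nearly identical first draws. In the operator, the index would presumably come from `getRuntimeContext().getIndexOfThisSubtask()` inside `open()`.

```java
import java.util.Random;

final class SubtaskSeedSketch {
    // SplitMix64 finalizer: spreads small input differences across all 64 bits.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Hypothetical helper: one Random per subtask, derived from the
    // user-provided seed and the index of the subtask.
    static Random createRandom(long seed, int subtaskIndex) {
        return new Random(mix(seed + (subtaskIndex + 1L) * 0x9E3779B97F4A7C15L));
    }

    // Hypothetical helper: returns the index of the first cumulative
    // fraction that exceeds a uniform draw, falling back to the last output.
    static int pickOutput(Random random, double[] cumulativeFractions) {
        double r = random.nextDouble();
        for (int i = 0; i < cumulativeFractions.length - 1; i++) {
            if (r < cumulativeFractions[i]) {
                return i;
            }
        }
        return cumulativeFractions.length - 1;
    }
}
```

With this shape, the same seed and subtask index always reproduce the same sequence, while different subtasks draw from different sequences. Results would still depend on how records are distributed across subtasks, so a change in parallelism could change the split.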



##########
flink-ml-lib/src/test/java/org/apache/flink/ml/feature/RandomSplitterTest.java:
##########
@@ -95,7 +96,7 @@ public void testOutputSchema() {
     @Test
     public void testWeights() throws Exception {
         Table data = getTable(1000);
-        RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0);
+        RandomSplitter splitter = new RandomSplitter().setWeights(2.0, 1.0, 2.0).setSeed(0);

Review Comment:
   It seems that the default seed does not produce split results that meet statistical expectations. Can we change the random splitting logic so that the default behavior does?
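
As a rough sketch of what "meets statistical expectations" could mean here, the check below compares each observed fraction with its expected value, using a tolerance of `k` standard deviations of a binomial proportion. The class `SplitRatioCheck` and its methods are hypothetical and use plain `java.util.Random` rather than the operator itself, so this only illustrates the check, not the actual test.

```java
import java.util.Random;

final class SplitRatioCheck {
    // Draws n samples and counts how many land in each weighted bucket.
    static int[] countSplits(double[] weights, long seed, int n) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        Random random = new Random(seed);
        int[] counts = new int[weights.length];
        for (int i = 0; i < n; i++) {
            double r = random.nextDouble() * total;
            int bucket = 0;
            double upper = weights[0];
            while (bucket < weights.length - 1 && r >= upper) {
                bucket++;
                upper += weights[bucket];
            }
            counts[bucket]++;
        }
        return counts;
    }

    // True if every observed fraction lies within k standard deviations
    // of the fraction implied by the weights.
    static boolean withinSigmas(int[] counts, double[] weights, double k) {
        double total = 0;
        for (double w : weights) {
            total += w;
        }
        int n = 0;
        for (int c : counts) {
            n += c;
        }
        for (int i = 0; i < counts.length; i++) {
            double p = weights[i] / total;
            double sigma = Math.sqrt(p * (1 - p) / n);
            if (Math.abs((double) counts[i] / n - p) > k * sigma) {
                return false;
            }
        }
        return true;
    }
}
```

Because the tolerance scales with the sample size, a check of this kind should depend less on picking a particular seed than a fixed absolute threshold would.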


