Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/24 03:48:14 UTC

[GitHub] [flink-ml] lindong28 commented on a diff in pull request #87: [FLINK-27096] Improve Benchmark Framework

lindong28 commented on code in PR #87:
URL: https://github.com/apache/flink-ml/pull/87#discussion_r855177774


##########
flink-ml-benchmark/README.md:
##########
@@ -63,59 +63,75 @@ Then in Flink ML's binary distribution's folder, execute the following command
 to run the default benchmarks.
 
 ```bash
-./bin/flink-ml-benchmark.sh ./conf/benchmark-conf.json --output-file results.json
+./bin/benchmark-run.sh ./conf/benchmark-conf.json --output-file results.json
 ```
 
 You will notice that some Flink jobs are submitted to your Flink cluster, and
 the following information is printed out in your terminal. This means that you
 have successfully executed the default benchmarks.
 
 ```
-Found benchmarks [KMeansModel-1, KMeans-1]
+Found 2 benchmarks.
 ...
 Benchmarks execution completed.
 Benchmark results summary:
-[ {
-  "name" : "KMeansModel-1",
-  "totalTimeMs" : 782.0,
-  "inputRecordNum" : 10000,
-  "inputThroughput" : 12787.723785166241,
-  "outputRecordNum" : 10000,
-  "outputThroughput" : 12787.723785166241
-}, {
-  "name" : "KMeans-1",
-  "totalTimeMs" : 7022.0,
-  "inputRecordNum" : 10000,
-  "inputThroughput" : 1424.0956992309884,
-  "outputRecordNum" : 1,
-  "outputThroughput" : 0.14240956992309883
-} ]
+{
+  "KMeansModel-1" : {
+    "stage" : {
+      "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+      "paramMap" : {
+        "k" : 2,
+        "distanceMeasure" : "euclidean",
+        "featuresCol" : "features",
+        "predictionCol" : "prediction"
+      }
+    },
+    ...
+    "modelData" : null,
+    "results" : {
+      "totalTimeMs" : 6596.0,
+      "inputRecordNum" : 10000,
+      "inputThroughput" : 1516.0703456640388,
+      "outputRecordNum" : 1,
+      "outputThroughput" : 0.15160703456640387
+    }
+  }
+}
 Benchmark results saved as json in results.json.
 ```
 
 The command above would save the results into `results.json` as below.
 
 ```json
-[ {
-  "name" : "KMeansModel-1",
-  "totalTimeMs" : 782.0,
-  "inputRecordNum" : 10000,
-  "inputThroughput" : 12787.723785166241,
-  "outputRecordNum" : 10000,
-  "outputThroughput" : 12787.723785166241
-}, {
-  "name" : "KMeans-1",
-  "totalTimeMs" : 7022.0,
-  "inputRecordNum" : 10000,
-  "inputThroughput" : 1424.0956992309884,
-  "outputRecordNum" : 1,
-  "outputThroughput" : 0.14240956992309883
-} ]
+{
+  "KMeansModel-1" : {
+    "stage" : {
+      "className" : "org.apache.flink.ml.clustering.kmeans.KMeansModel",
+      "paramMap" : {
+        "k" : 2,
+        "distanceMeasure" : "euclidean",
+        "featuresCol" : "features",
+        "predictionCol" : "prediction"
+      }
+    },
+    ...
+    "modelData" : null,

Review Comment:
   Would it be more intuitive to remove this line if the model data is not specified for this stage?
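   
   For instance (a minimal sketch, assuming the serialization goes through Jackson's `ObjectMapper`; the actual serializer used by `BenchmarkUtils` may differ), null-valued fields can be dropped globally:
   
   ```java
   import com.fasterxml.jackson.annotation.JsonInclude;
   import com.fasterxml.jackson.core.JsonProcessingException;
   import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class NullFreeJsonSketch {
       // Include.NON_NULL tells Jackson to skip properties whose value is null,
       // so an unspecified model data field would not appear in the output at all.
       public static String toJson(Object value) throws JsonProcessingException {
           ObjectMapper mapper = new ObjectMapper();
           mapper.setSerializationInclusion(JsonInclude.Include.NON_NULL);
           return mapper.writerWithDefaultPrettyPrinter().writeValueAsString(value);
       }
   }
   ```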



##########
flink-ml-benchmark/README.md:
##########
@@ -194,3 +189,9 @@ stage. The stage is benchmarked against 10000 randomly generated vectors.
   }
 }
 ```
+
+### Benchmark Results Visualization
+
+`benchmark-results-visualize.py` is provided as a helper script to visualize

Review Comment:
   Could we provide commands that users can copy and paste to visualize the `results.json` in this quickstart?
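   
   For example, something like the following could be added to the quickstart (hypothetical invocation; the actual script location and arguments depend on how `benchmark-results-visualize.py` is packaged):
   
   ```bash
   # Hypothetical: assumes the script ships under bin/ and takes the results
   # file as its only argument.
   python3 ./bin/benchmark-results-visualize.py results.json
   ```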



##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/BenchmarkResult.java:
##########
@@ -18,13 +18,25 @@
 
 package org.apache.flink.ml.benchmark;
 
+import org.apache.flink.ml.api.Stage;
+import org.apache.flink.ml.benchmark.datagenerator.DataGenerator;
+import org.apache.flink.ml.benchmark.datagenerator.InputDataGenerator;
 import org.apache.flink.util.Preconditions;
 
 /** The result of executing a benchmark. */
 public class BenchmarkResult {
     /** The benchmark name. */
     public final String name;
 
+    /** The stage to execute benchmark on. */
+    public final Stage<?> stage;

Review Comment:
   Instead of keeping the stage/inputDataGenerator/modelDataGenerator instances in `BenchmarkResult`, would it be simpler to just pass the original `Map<String, ?> params` (which is used to instantiate these instances) to `BenchmarkUtils.getResultsMapAsJson(...)` directly?
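   
   As a rough sketch of that direction (hypothetical signature and names; the serialization details are assumptions), `getResultsMapAsJson` could accept the config map alongside the measured metrics:
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   import com.fasterxml.jackson.core.JsonProcessingException;
   import com.fasterxml.jackson.databind.ObjectMapper;
   
   public class BenchmarkJsonSketch {
       // Hypothetical alternative: serialize the original config map together
       // with the measured metrics, so BenchmarkResult never needs to hold the
       // Stage/DataGenerator instances that were built from it.
       public static String getResultsMapAsJson(Map<String, ?> params, Object metrics)
               throws JsonProcessingException {
           Map<String, Object> output = new LinkedHashMap<>(params); // echo the input config
           output.put("results", metrics); // attach the measured numbers
           return new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(output);
       }
   }
   ```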



##########
flink-ml-benchmark/src/main/java/org/apache/flink/ml/benchmark/Benchmark.java:
##########
@@ -75,33 +75,43 @@ public static void printHelp() {
     public static void executeBenchmarks(CommandLine commandLine) throws Exception {
         String configFile = commandLine.getArgs()[0];
         Map<String, ?> benchmarks = BenchmarkUtils.parseJsonFile(configFile);
-        System.out.println("Found benchmarks " + benchmarks.keySet());
+        System.out.println("Found " + benchmarks.keySet().size() + " benchmarks.");
+        String saveFile = commandLine.getOptionValue(OUTPUT_FILE_OPTION.getLongOpt());
 
         StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
         StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
 
         List<BenchmarkResult> results = new ArrayList<>();
+        String benchmarkResultsJson = "{}";
+        int index = 0;
         for (Map.Entry<String, ?> benchmark : benchmarks.entrySet()) {
-            LOG.info("Running benchmark " + benchmark.getKey() + ".");
+            LOG.info(
+                    String.format(
+                            "Running benchmark %d/%d: %s",
+                            index++, benchmarks.keySet().size(), benchmark.getKey()));
 
             BenchmarkResult result =
                     BenchmarkUtils.runBenchmark(
                             tEnv, benchmark.getKey(), (Map<String, ?>) benchmark.getValue());
 
             results.add(result);
-            LOG.info(BenchmarkUtils.getResultsMapAsJson(result));
-        }
+            LOG.info("\n" + BenchmarkUtils.getResultsMapAsJson(result));
+
+            benchmarkResultsJson =
+                    BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
 
-        String benchmarkResultsJson =
-                BenchmarkUtils.getResultsMapAsJson(results.toArray(new BenchmarkResult[0]));
+            if (commandLine.hasOption(OUTPUT_FILE_OPTION.getLongOpt())) {
+                ReadWriteUtils.saveToFile(saveFile, benchmarkResultsJson, true);

Review Comment:
   Instead of saving all benchmark results in the loop body, would it be more efficient to save the benchmark results once, after the loop is done?
   
   We probably also need to record in `results.json` the fact that a benchmark failed, e.g. with a string indicating the cause of the failure. `benchmark-results-visualize.py` would then need to represent this information in the figure (e.g. with a red `X`).


