Posted to issues@flink.apache.org by "Dong Lin (Jira)" <ji...@apache.org> on 2023/03/01 09:59:00 UTC

[jira] [Resolved] (FLINK-31125) Flink ML benchmark framework should minimize the source operator overhead

     [ https://issues.apache.org/jira/browse/FLINK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dong Lin resolved FLINK-31125.
------------------------------
    Resolution: Fixed

> Flink ML benchmark framework should minimize the source operator overhead
> -------------------------------------------------------------------------
>
>                 Key: FLINK-31125
>                 URL: https://issues.apache.org/jira/browse/FLINK-31125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Dong Lin
>            Assignee: Dong Lin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: ml-2.2.0
>
>
> The Flink ML benchmark framework estimates throughput by having a source operator generate a given number (e.g. 10^7) of input records with random values, letting the given AlgoOperator process these input records, and dividing the number of records by the total execution time.
> The overhead of generating random values for all input records has an observable impact on the estimated throughput. We would like to minimize the overhead of the source operator so that the benchmark result reflects the throughput of the AlgoOperator as much as possible.
> Note that [spark-sql-perf|https://github.com/databricks/spark-sql-perf] generates all input records in advance into memory before running the benchmark. This allows Spark ML benchmark to read records from memory instead of generating values for those records during the benchmark.
> We can generate the record value once and re-use it for all input records. This approach minimizes the source operator overhead and allows us to compare Flink ML benchmark results with Spark ML benchmark results (from spark-sql-perf) fairly.
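The reuse idea described above can be sketched outside of Flink as follows. This is a minimal illustration, not the actual Flink ML benchmark source; the class and method names (`ReusedRecordSource`, `next`) are hypothetical. The point is that the random values are generated exactly once at construction time, so each subsequent emission costs only a reference return rather than RNG work per record:

```java
import java.util.Random;

// Hypothetical sketch of the "generate once, reuse for all records" idea:
// the random record is built a single time in the constructor, and every
// call to next() returns that same pre-generated record.
public class ReusedRecordSource {
    private final double[] cachedRecord;

    public ReusedRecordSource(int dims, long seed) {
        Random random = new Random(seed);
        cachedRecord = new double[dims];
        for (int i = 0; i < dims; i++) {
            cachedRecord[i] = random.nextDouble();
        }
    }

    // Per-record cost is just returning a reference; no random generation
    // happens on the benchmark's hot path.
    public double[] next() {
        return cachedRecord;
    }

    public static void main(String[] args) {
        ReusedRecordSource source = new ReusedRecordSource(4, 42L);
        double[] first = source.next();
        double[] second = source.next();
        // Same array instance is reused for every emitted record.
        System.out.println(first == second); // true
    }
}
```

In the real benchmark the trade-off is that all emitted records share identical values, which is acceptable when the goal is measuring AlgoOperator throughput rather than exercising data variety, and it mirrors what spark-sql-perf achieves by pre-generating records in memory.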



--
This message was sent by Atlassian Jira
(v8.20.10#820010)