Posted to issues@flink.apache.org by "Martijn Visser (Jira)" <ji...@apache.org> on 2022/04/06 07:23:00 UTC

[jira] [Comment Edited] (FLINK-27078) There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.

    [ https://issues.apache.org/jira/browse/FLINK-27078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517898#comment-17517898 ] 

Martijn Visser edited comment on FLINK-27078 at 4/6/22 7:22 AM:
----------------------------------------------------------------

[~wanglijie95] Most likely yes, but it would be good to have a benchmark so we can compare actual data.


was (Author: martijnvisser):
[~wanglijie95] Most likely yes. 

> There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27078
>                 URL: https://issues.apache.org/jira/browse/FLINK-27078
>             Project: Flink
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Lijie Wang
>            Priority: Major
>         Attachments: PerformanceTest.java
>
>
> In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source. We found that after switching to the new source, the TPCDS e2e tests run slower than before: they used to take 20 minutes and now take 30 minutes (see [pr19152|https://github.com/apache/flink/pull/19152] for details).
> We found that this is mainly because the new csv source is slower than the legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in [^PerformanceTest.java] to read a csv file of about 3.8G ({{store_sales.dat}} of TPCDS-10G, which can be generated by {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen] -SCALE 10 -FORCE Y -DIR ...}}), and you will find that the job running times differ greatly: on my computer, the job runs for 50s with the new csv source and 20s with the legacy {{CsvTableSource}}.
>  
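
To put the reported numbers side by side, here is a minimal sketch that turns them into throughput and slowdown figures. It is not the attached PerformanceTest.java; the class name CsvSourceGap and its helper methods are made up for illustration, and it assumes "3.8G" means 3.8 * 10^9 bytes and that the timings are exact.

```java
// Hypothetical helper, not part of Flink or of the attached PerformanceTest.java.
public class CsvSourceGap {

    /** Throughput in MB/s (1 MB = 10^6 bytes) when reading the given bytes in the given seconds. */
    static double throughputMBps(double bytes, double seconds) {
        return bytes / seconds / 1e6;
    }

    /** How many times longer the new run takes compared with the legacy run. */
    static double slowdown(double newDuration, double legacyDuration) {
        return newDuration / legacyDuration;
    }

    public static void main(String[] args) {
        // Assumption: "3.8G" in the report is read as 3.8 * 10^9 bytes.
        double fileBytes = 3.8e9;

        // Timings reported for the single-file experiment, in seconds.
        double newSourceSeconds = 50;
        double legacySourceSeconds = 20;

        // Timings reported for the TPCDS e2e tests, in minutes.
        double e2eNewMinutes = 30;
        double e2eLegacyMinutes = 20;

        System.out.printf("new csv source:        %.1f MB/s%n",
                throughputMBps(fileBytes, newSourceSeconds));
        System.out.printf("legacy CsvTableSource: %.1f MB/s%n",
                throughputMBps(fileBytes, legacySourceSeconds));
        System.out.printf("single-file slowdown:  %.1fx%n",
                slowdown(newSourceSeconds, legacySourceSeconds));
        System.out.printf("e2e test slowdown:     %.1fx%n",
                slowdown(e2eNewMinutes, e2eLegacyMinutes));
    }
}
```

Under these assumptions the single-file read is 2.5 times slower with the new source, while the e2e tests as a whole are 1.5 times slower, which fits the e2e run also spending time on work other than reading csv files.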


