Posted to issues@flink.apache.org by "Lijie Wang (Jira)" <ji...@apache.org> on 2022/04/06 06:54:00 UTC

[jira] [Created] (FLINK-27078) There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.

Lijie Wang created FLINK-27078:
----------------------------------

             Summary: There is a performance gap between the new csv source(file system source + CSV format) and legacy CsvTableSource.
                 Key: FLINK-27078
                 URL: https://issues.apache.org/jira/browse/FLINK-27078
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.15.0
            Reporter: Lijie Wang


In FLINK-26692, we tried to migrate the TPCDS e2e tests to the new csv source. After switching to the new source, the TPCDS e2e tests run slower than before: they previously took 20 minutes and now take 30 minutes.

We found that this is mainly because the new csv source is slower than the legacy {{CsvTableSource}}. We ran an experiment to verify it: run the code in [^PerformanceTest.java] to read a csv file of about 3.8G (store_sales.dat of TPCDS-10G, which can be generated by {{./[dsdgen_linux|https://datacadamia.com/data/type/relation/benchmark/tpcds/dsdgen] -SCALE 10 -FORCE Y -DIR ...}}), and you will see that the job running times differ significantly. On my computer, the job runs for 50s with the new csv source and 20s with the legacy {{CsvTableSource}}.
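
For readers without access to the attachment, the comparison can be summarized with a small timing helper. This is only a sketch and not the contents of PerformanceTest.java: the class name {{SourceTimingSketch}}, the helper methods, and the empty placeholder jobs are all hypothetical, and the two Flink jobs (legacy {{CsvTableSource}} vs. file system source + CSV format) would have to be plugged in where indicated.

```java
public class SourceTimingSketch {

    /** Runs the job once and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable job) {
        long start = System.nanoTime();
        job.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Slowdown factor of the candidate relative to the baseline (candidate / baseline). */
    static double slowdown(long baselineMillis, long candidateMillis) {
        return (double) candidateMillis / baselineMillis;
    }

    public static void main(String[] args) {
        // Hypothetical placeholders: replace with code that submits the job
        // reading store_sales.dat through each source and waits for it to finish.
        Runnable legacyCsvJob = () -> { };
        Runnable newCsvJob = () -> { };

        long legacyMillis = timeMillis(legacyCsvJob);
        long newMillis = timeMillis(newCsvJob);
        System.out.println("legacy: " + legacyMillis + " ms, new: " + newMillis + " ms");

        // Figures reported in this issue: 20s (legacy) vs. 50s (new) for the single file,
        // and 20 minutes vs. 30 minutes for the whole TPCDS e2e run.
        System.out.println("single-file slowdown: " + slowdown(20_000, 50_000));
        System.out.println("e2e slowdown: " + slowdown(20 * 60_000L, 30 * 60_000L));
    }
}
```

With the numbers reported above, the single-file read is 2.5x slower and the full e2e run is 1.5x slower. A single wall-clock measurement like this includes JVM warm-up and job submission overhead, so repeated runs would give a more reliable picture.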
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)