Posted to issues@flink.apache.org by "Robert Metzger (JIRA)" <ji...@apache.org> on 2019/02/28 12:54:00 UTC

[jira] [Updated] (FLINK-10684) Improve the CSV reading process

     [ https://issues.apache.org/jira/browse/FLINK-10684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger updated FLINK-10684:
-----------------------------------
    Component/s:     (was: Core)
                 API / DataSet

> Improve the CSV reading process
> -------------------------------
>
>                 Key: FLINK-10684
>                 URL: https://issues.apache.org/jira/browse/FLINK-10684
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / DataSet
>            Reporter: Xingcan Cui
>            Priority: Major
>
> CSV is one of the most commonly used file formats in data wrangling. To load records from CSV files, Flink provides the basic {{CsvInputFormat}} as well as some variants (e.g., {{RowCsvInputFormat}} and {{PojoCsvInputFormat}}). However, the reading process could be improved. For example, we could add a built-in utility that automatically infers schemas from CSV headers and data samples. Bad-record handling could also be improved by keeping the invalid lines (and ideally the reasons parsing failed), instead of only logging their total count.
> This is an umbrella issue for all the improvements and bug fixes for the CSV reading process.
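
To make the two ideas in the description concrete, here is a rough, standalone sketch of what schema inference from a header plus sample rows, and retention of invalid lines with a reason, might look like. It is not Flink code and not a proposed design: the class {{CsvSchemaSketch}}, its methods, the {{BadRecord}} type, and the three type names (LONG, DOUBLE, STRING) are all hypothetical, and the splitting is a naive comma split that ignores quoting and escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Flink: infers column types from a header
// line plus sample rows, and keeps rejected lines together with the reason.
final class CsvSchemaSketch {

    /** A rejected input line and the reason it was rejected. */
    static final class BadRecord {
        final String line;
        final String reason;

        BadRecord(String line, String reason) {
            this.line = line;
            this.reason = reason;
        }
    }

    /** Narrowest type that fits a single field: LONG, then DOUBLE, else STRING. */
    static String typeOf(String field) {
        String s = field.trim();
        try {
            Long.parseLong(s);
            return "LONG";
        } catch (NumberFormatException ignored) {
            // not an integer, try floating point next
        }
        try {
            Double.parseDouble(s);
            return "DOUBLE";
        } catch (NumberFormatException ignored) {
            return "STRING";
        }
    }

    /** Merges the type seen so far with a new observation, widening if needed. */
    static String widen(String seen, String observed) {
        if (seen == null || seen.equals(observed)) {
            return observed;
        }
        if (seen.equals("STRING") || observed.equals("STRING")) {
            return "STRING";
        }
        return "DOUBLE"; // LONG mixed with DOUBLE
    }

    /**
     * Infers a column-name to type mapping. Lines whose field count differs
     * from the header are appended to {@code bad} instead of being dropped.
     * Assumption: plain comma-separated values without quoting or escaping.
     */
    static Map<String, String> inferSchema(String header, List<String> samples, List<BadRecord> bad) {
        String[] names = header.split(",", -1);
        String[] types = new String[names.length];
        for (String line : samples) {
            String[] fields = line.split(",", -1);
            if (fields.length != names.length) {
                bad.add(new BadRecord(line,
                        "expected " + names.length + " fields but found " + fields.length));
                continue;
            }
            for (int i = 0; i < fields.length; i++) {
                types[i] = widen(types[i], typeOf(fields[i]));
            }
        }
        Map<String, String> schema = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            // Columns with no valid sample fall back to STRING.
            schema.put(names[i].trim(), types[i] == null ? "STRING" : types[i]);
        }
        return schema;
    }

    public static void main(String[] args) {
        List<BadRecord> bad = new ArrayList<>();
        Map<String, String> schema = inferSchema(
                "id,price,name", Arrays.asList("1,2.5,apple", "2,3,pear", "oops"), bad);
        System.out.println(schema);
        for (BadRecord r : bad) {
            System.out.println("rejected '" + r.line + "': " + r.reason);
        }
    }
}
```

In this sketch the caller supplies the list that collects rejected lines, so the invalid records and their reasons remain available after reading rather than being reduced to a count. A real implementation would also need to handle quoted fields, configurable delimiters, and field-level parse failures.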


