Posted to issues@flink.apache.org by "Fabian Hueske (JIRA)" <ji...@apache.org> on 2018/07/16 10:02:00 UTC

[jira] [Commented] (FLINK-9814) CsvTableSource "lack of column" warning

    [ https://issues.apache.org/jira/browse/FLINK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545019#comment-16545019 ] 

Fabian Hueske commented on FLINK-9814:
--------------------------------------

I think it depends on when the warning should be thrown. There are three options:

1) When the program is defined, i.e., when the {{main()}} method is executed. This is the earliest point in time when code is executed and may happen on an external machine or within the cluster (e.g., if the program is submitted through the web UI). The problem here is that we would need a connection to the file system, which might not be available. Even if we have a connection, all files would be checked sequentially, which might cause a significant delay.

2) When the JobManager receives the program and generates InputSplits. At this time, we have a connection to the file system (since we need to look up all files). However, reading the headers of all files sequentially might cause a significant delay.

3) When a TaskManager receives the first InputSplit of a file. Since InputSplits are assigned with locality preferences (i.e., somewhat random), the input split with the file header might be read last, i.e., after most of the IO was already done.

So, IMO, there is no good place to do these checks, at least not by default.

Besides that, what should the check do? Check the column header or types?
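
To make the question concrete, here is a minimal sketch of a check on column names only (not types). It is not Flink code: the class {{CsvHeaderCheck}} and its methods are hypothetical names chosen for illustration. It assumes the header is the first line of the file, that fields are not quoted, and that the caller has already read that line, so it does not settle at which of the three points above the check would run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Flink): compares a CSV header line with the expected field names. */
final class CsvHeaderCheck {

    private CsvHeaderCheck() {}

    /** Splits a header line on a literal delimiter and trims each column name. */
    static List<String> parseHeader(String headerLine, String delimiter) {
        List<String> columns = new ArrayList<>();
        // Pattern.quote treats delimiters such as "|" literally; the -1 limit keeps trailing empty columns.
        for (String column : headerLine.split(Pattern.quote(delimiter), -1)) {
            columns.add(column.trim());
        }
        return columns;
    }

    /** Returns the expected field names that are absent from the header, in the order they were declared. */
    static List<String> missingColumns(String headerLine, String delimiter, List<String> expected) {
        List<String> actual = parseHeader(headerLine, delimiter);
        List<String> missing = new ArrayList<>();
        for (String name : expected) {
            if (!actual.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Strict mode as requested in the issue: throws if any expected column is missing. */
    static void requireColumns(String headerLine, String delimiter, List<String> expected) {
        List<String> missing = missingColumns(headerLine, delimiter, expected);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("CSV header lacks expected columns: " + missing);
        }
    }
}
```

A non-strict caller could log the result of {{missingColumns}} as a warning, while an opt-in strict mode would call {{requireColumns}}. Checking types would additionally require parsing at least one data row, which this sketch does not attempt.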

> CsvTableSource "lack of column" warning
> ---------------------------------------
>
>                 Key: FLINK-9814
>                 URL: https://issues.apache.org/jira/browse/FLINK-9814
>             Project: Flink
>          Issue Type: Wish
>          Components: Table API & SQL
>    Affects Versions: 1.5.0
>            Reporter: François Lacombe
>            Assignee: vinoyang
>            Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The CsvTableSource class is built by defining the columns expected to be found in the corresponding CSV file.
>  
> It would be great to throw an Exception when the CSV file doesn't have the same structure as defined in the source. For the sake of backward compatibility, developers should have to explicitly configure the builder to define columns strictly in order for an Exception to be thrown when the structure differs.
> This can easily be checked against the file header, if one exists.
> Is this possible?


