Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2015/06/22 17:58:00 UTC

[jira] [Commented] (FLINK-2186) Rework CSV import to support very wide files

    [ https://issues.apache.org/jira/browse/FLINK-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596116#comment-14596116 ] 

Stephan Ewen commented on FLINK-2186:
-------------------------------------

Not sure if we can/want to fix this in the generic readCsvFile method. That method returns tuples, and those are of limited size.

When you are reading many-column CSV files, you probably want to return an array of fields anyway.
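As a sketch of the array-of-fields idea, the core workaround is to treat each line as a single string and split it into an Array[String], so the column count is unbounded rather than capped by tuple arity. This is plain Scala, not the readCsvFile API; in Flink it could be applied via something like env.readTextFile(path).map(parseLine), assuming the standard readTextFile method. The names parseLine and fieldDelimiter are illustrative only:

```scala
// Illustrative sketch: parse a wide CSV line into an array of fields,
// avoiding the tuple-arity limit of readCsvFile. Not Flink's actual API.
object WideCsvSketch {
  val fieldDelimiter = ","

  // Split one CSV line into its fields; limit -1 keeps trailing empty fields.
  def parseLine(line: String): Array[String] =
    line.split(fieldDelimiter, -1)

  def main(args: Array[String]): Unit = {
    // One row of the breast-cancer-wisconsin data set: 11 columns.
    val line = "1000025,5,1,1,1,2,1,3,1,1,2"
    val fields = parseLine(line)
    println(fields.length) // prints 11
  }
}
```

In a Flink job the same function would be mapped over the DataSet produced by readTextFile, yielding a DataSet[Array[String]] regardless of how many columns the file has.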

> Rework CSV import to support very wide files
> --------------------------------------------
>
>                 Key: FLINK-2186
>                 URL: https://issues.apache.org/jira/browse/FLINK-2186
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library, Scala API
>            Reporter: Theodore Vasiloudis
>             Fix For: 0.10
>
>
> In the current readCsvFile implementation, importing CSV files with many columns ranges from cumbersome to impossible.
> For example to import an 11 column file we need to write:
> {code}
> val cancer = env.readCsvFile[(String, String, String, String, String, String, String, String, String, String, String)]("/path/to/breast-cancer-wisconsin.data")
> {code}
> For many use cases in Machine Learning we might have CSV files with thousands or millions of columns that we want to import as vectors.
> In that case using the current readCsvFile method becomes impossible.
> We therefore need to rework the current function, or create a new one, so that we can import CSV files with an arbitrary number of columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)