
Posted to issues@spark.apache.org by "Frank Kemmer (JIRA)" <ji...@apache.org> on 2018/09/05 18:06:00 UTC

[jira] [Created] (SPARK-25343) Extend CSV parsing to Dataset[List[String]]

Frank Kemmer created SPARK-25343:
------------------------------------

             Summary: Extend CSV parsing to Dataset[List[String]]
                 Key: SPARK-25343
                 URL: https://issues.apache.org/jira/browse/SPARK-25343
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Frank Kemmer


With the csv() method it is currently possible to create a DataFrame from a Dataset[String], where each string contains comma-separated values. This is really great.
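
For reference, a minimal sketch of that existing usage, assuming a local SparkSession. The sample lines, column names, and application name are made up for illustration; the DROPMALFORMED mode asks the parser to discard lines whose values cannot be cast to the schema.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-from-dataset")
  .getOrCreate()
import spark.implicits._

// Hypothetical input: one CSV line per element; "x" is not a valid integer.
val lines: Dataset[String] = Seq("1,alice", "2,bob", "x,carol").toDS()

// Hypothetical target schema.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// DataFrameReader.csv(Dataset[String]) parses each string as one CSV record
// and casts the values against the supplied schema.
val df: DataFrame = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv(lines)

df.show()
```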

But very often we have to parse files where the values of a line must be split using custom value separators and regular expressions. The result is a Dataset[List[String]]. Each list corresponds to what you would get after splitting the values of a CSV string.
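
As an illustration of how such a Dataset[List[String]] typically comes about, here is a small sketch. The separator pattern and sample lines are hypothetical; real files would need their own expressions.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-lines")
  .getOrCreate()
import spark.implicits._

// Hypothetical input: fields separated by "|" or ";" with optional whitespace.
val raw: Dataset[String] = Seq("1 | alice ; 30", "2;bob|41").toDS()

// Hypothetical separator regex; the limit -1 keeps trailing empty fields.
val separator = """\s*[|;]\s*"""

// Each line becomes the list of its already separated values.
val fields: Dataset[List[String]] = raw.map(_.split(separator, -1).toList)

fields.show(truncate = false)
```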

It would be great if the csv() method also accepted such a Dataset as input, especially when given a target schema. The CSV parser usually casts the separated values against the schema and can sort out lines where the column values do not fit the schema.

This is the functionality I am looking for and I think it is already implemented in the CSV parser.
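
Until something like this exists, the requested behaviour can be approximated by hand. The following is only a rough sketch of a workaround, not how Spark's CSV parser is implemented: the SchemaCast helper, its supported types, and the sample data are all assumptions made for illustration. It casts each pre-split list against a target schema and drops lists that do not fit.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark: casts pre-split values against a schema.
object SchemaCast extends Serializable {
  // Only a handful of types are handled in this sketch.
  def convert(value: String, dataType: DataType): Any = dataType match {
    case StringType  => value
    case IntegerType => value.trim.toInt
    case LongType    => value.trim.toLong
    case DoubleType  => value.trim.toDouble
    case BooleanType => value.trim.toBoolean
    case other       => throw new IllegalArgumentException(s"unsupported type: $other")
  }

  // None if the number of values is wrong or any value fails to convert.
  def toRow(values: List[String], schema: StructType): Option[Row] =
    if (values.length != schema.length) None
    else Try(Row.fromSeq(values.zip(schema.fields).map {
      case (v, f) => convert(v, f.dataType)
    })).toOption
}

// Local session for illustration only (in spark-shell, `spark` already exists).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("list-to-dataframe")
  .getOrCreate()
import spark.implicits._

// Hypothetical target schema and input.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true)
))
val fields: Dataset[List[String]] =
  Seq(List("1", "alice"), List("x", "bob"), List("3")).toDS()

// Lists that do not fit the schema are silently dropped, similar in spirit
// to the DROPMALFORMED mode of the CSV reader.
val rows = fields.rdd.flatMap(values => SchemaCast.toRow(values, schema))
val df: DataFrame = spark.createDataFrame(rows, schema)

df.show()
```

In this sketch only the first list survives: "x" is not a valid integer, and the last list has too few values. A built-in csv() overload would presumably also cover the parser's other modes, date and timestamp formats, and null handling, which this workaround does not.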



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
