Posted to issues@spark.apache.org by "Jayesh lalwani (JIRA)" <ji...@apache.org> on 2017/03/07 16:08:38 UTC

[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]

    [ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899665#comment-15899665 ] 

Jayesh lalwani commented on SPARK-15463:
----------------------------------------

Does it make sense to have to_csv and from_csv functions modeled after to_json and from_json?
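For reference, here is a minimal sketch of the existing to_json/from_json round trip this proposal would mirror, assuming Spark 2.1+ where both functions are available in org.apache.spark.sql.functions (column and field names are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder.appName("json-roundtrip").master("local[*]").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

// DataFrame of JSON strings -> DataFrame with a StructType column
val raw = Seq("""{"name":"alice","age":30}""").toDF("value")
val parsed = raw.select(from_json(col("value"), schema).as("data"))

// StructType column -> DataFrame of JSON strings
val serialized = parsed.select(to_json(col("data")).as("value"))
{code}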

The applications we support need to read from a combination of sources and formats, and write to a combination of sinks and formats. For example, we might need
a) Files with CSV content
b) Files with JSON content
c) Kafka with CSV content
d) Kafka with JSON content
e) Parquet

Also, if the input has a nested structure (JSON/Parquet), sometimes we prefer to keep the data in a StructType column, and sometimes we prefer to flatten the StructType into top-level columns of the DataFrame.
For example, if we are getting JSON from Kafka, massaging it, and writing JSON back to Kafka, we would prefer to transform the StructType column directly rather than flatten it into a DataFrame.
Another example: we get JSON data that needs to be stored in an RDBMS table, which requires flattening the data into a DataFrame before writing it, as sketched below.
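Continuing the sketch above, flattening can already be approximated today with star expansion on a struct column; this is only an illustration of the intended behaviour, not the proposed flatten function itself:

{code:scala}
// Expand the StructType column "data" (from the from_json sketch above)
// into one top-level column per struct field.
val flattened = parsed.select(col("data.*"))

// flattened now has top-level columns (name, age), ready for an RDBMS sink, e.g.:
// flattened.write.jdbc(url, "people", connectionProperties)
{code}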

So, this is what I was thinking. We should have the following functions:
1) from_json - convert a DataFrame of String to a DataFrame with a StructType column
2) to_json - convert a DataFrame with a StructType column to a DataFrame of String
3) from_csv - convert a DataFrame of String to a DataFrame with a StructType column
4) to_csv - convert a DataFrame with a StructType column to a DataFrame of String
5) flatten - convert a DataFrame with a StructType column into a DataFrame that has the same top-level fields as the StructType


Essentially, the request in this Change Request can be satisfied by calling *flatten(from_csv(....))*, as sketched below.
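A hypothetical sketch of that composition, assuming from_csv takes a (Column, StructType) signature mirroring from_json; from_csv does not exist yet, so the name and signature are assumptions:

{code:scala}
// Hypothetical API sketch -- from_csv is the proposed function, not an existing one.
val csvSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))

val csvLines = Seq("alice,30", "bob,25").toDF("value")

// Parse each CSV line into a struct column, then flatten it into top-level columns.
val result = csvLines
  .select(from_csv(col("value"), csvSchema).as("data")) // proposed function
  .select(col("data.*"))                                // the "flatten" step
{code}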

> Support for creating a dataframe from CSV in Dataset[String]
> ------------------------------------------------------------
>
>                 Key: SPARK-15463
>                 URL: https://issues.apache.org/jira/browse/SPARK-15463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: PJ Fanning
>
> I currently use the Databricks spark-csv lib, but some features don't work with Apache Spark 2.0.0-SNAPSHOT. I understand that, with the addition of CSV support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't appear to support the creation of DataFrames based on loading from RDD[String].
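For context, a minimal sketch of the JSON path the reporter refers to, with the requested CSV analogue shown as a hypothetical (it did not exist at the time of this issue):

{code:scala}
// Existing behaviour: JSON can be loaded from pre-processed strings held in an RDD.
val jsonRdd = spark.sparkContext.parallelize(Seq("""{"name":"alice","age":30}"""))
val fromJsonRdd = spark.read.json(jsonRdd)

// Requested analogue for CSV (hypothetical at the time of this issue):
// val csvDs = Seq("alice,30").toDS()
// val fromCsvDs = spark.read.schema(csvSchema).csv(csvDs)
{code}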


