Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/28 04:28:33 UTC

[GitHub] [spark] swapnilushinde opened a new pull request #24724: User friendly dataset, dataframe generation for csv datasources without explicit StructType definitions.

URL: https://github.com/apache/spark/pull/24724
 
 
   ## What changes were proposed in this pull request?
   Many users frequently load structured data from CSV data sources. With the current APIs, it is very common to load a CSV file as a DataFrame whose schema must be defined as a StructType object. Many users then convert the DataFrame to a Dataset of Product objects (case classes).
   Loading CSV files is therefore relatively complex, and it can easily be simplified. This change makes working with CSV files more user friendly.
   
   **Input -** 
   ```    
   csv file with five columns - {id: Int,
    name: String,
    subject: String,
    marks: Int,
    result: Boolean}
   ```
   **Current approach -**
   ```
   import org.apache.spark.sql.types._

   val schema = StructType(Seq(
     StructField("id", IntegerType, false),
     StructField("name", StringType, false),
     StructField("subject", StringType, false),
     StructField("marks", IntegerType, false),
     StructField("result", BooleanType, false)))
   
   val df = spark.read.schema(schema).csv(<file_path>)
   case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean) 
   val ds = df.as[A]
   ```
   
   **Proposed change -**
   ```
   case class A (id: Int, name: String, subject: String, marks: Int, result: Boolean) 
   val df = spark.createDataFrame[A](optionsMap, <file_paths>)
   val ds = spark.createDataset[A](optionsMap, <file_paths>)
   ```
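
   One way the proposed methods could be implemented (a sketch only; `readCsvAs` is a hypothetical helper name, not part of this patch) is to derive the schema from the case class's encoder and pass it to the existing CSV reader:

   ```scala
   import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

   // Hypothetical helper: derive the schema from the Product encoder
   // instead of requiring a hand-written StructType.
   def readCsvAs[T <: Product : Encoder](
       spark: SparkSession,
       options: Map[String, String],
       paths: String*): Dataset[T] = {
     val schema = implicitly[Encoder[T]].schema
     spark.read
       .options(options)
       .schema(schema)   // schema is explicit, so inferSchema is unnecessary
       .csv(paths: _*)
       .as[T]
   }
   ```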
   
   - No explicit StructType schema definition is needed, since the schema can be derived from the Product (case class) fields.
   - Redundant application code that defines verbose StructTypes can be avoided.
   - The proposed APIs mirror the current ones, so they are easy to adopt. All current and future CSV options can be used as-is, with no changes needed. (Exception: inferSchema is disabled internally, as it is redundant and confusing with this API.)
   - Like the existing createDataset/createDataFrame APIs, this makes loading CSV files for debugging more convenient.
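
   The first point above relies on a mechanism Spark already exposes: `Encoders.product` builds an encoder, and hence a StructType, from a case class's fields. A minimal sketch:

   ```scala
   import org.apache.spark.sql.Encoders

   case class A(id: Int, name: String, subject: String, marks: Int, result: Boolean)

   // The StructType is derived from the case class, so no explicit
   // StructField definitions are needed.
   val schema = Encoders.product[A].schema
   // schema.fieldNames contains: id, name, subject, marks, result
   ```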
   
   
   
   ## How was this patch tested?
   This change was manually tested. I didn't see unit tests for the similar createDataset/createDataFrame APIs. Please let me know the best place to add unit tests for this and the existing similar APIs.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org