Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/12/04 03:11:00 UTC
[jira] [Commented] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema

    [ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708134#comment-16708134 ] 

Hyukjin Kwon commented on SPARK-26259:
--------------------------------------

Please avoid setting the fix version; it is usually set after the issue has actually been fixed.

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Although SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289), which allows record separators other than newline, has been fixed in Spark 2.3, this does not work when the schema is not specified, i.e. when the schema is inferred.
>  Let me explain using the data and scenarios below:
> Input data (input_data.csv), shown below, *+where recordSeparator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"    "2012-01-01","0","0","0","0","1","9","9.1","66","0"    "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema defined:* The Spark code below, with a defined *schema*, reads the data correctly:
> {code:java}
> val customSchema = StructType(Array(
>         StructField("dteday", DateType, true),
>         StructField("hr", IntegerType, true),
>         StructField("holiday", IntegerType, true),
>         StructField("weekday", IntegerType, true),
>         StructField("workingday", DateType, true),
>         StructField("weathersit", IntegerType, true),
>         StructField("temp", IntegerType, true),
>         StructField("atemp", DoubleType, true),
>         StructField("hum", IntegerType, true),
>         StructField("windspeed", IntegerType, true)));
> Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
>           .option( "header", true )
>           .schema( customSchema )  // a schema is supplied via schema(), not as an option
>           .option( "sep", "," )
>           .load( "input_data.csv" );
> {code}
> *Case 2: Schema not defined (inferSchema is used):* The data is parsed incorrectly, i.e. the entire data is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
>           .option( "header", true )
>           .option( "inferSchema", true)
>           .option( "sep", "," )
>           .load( "input_data.csv" );
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
