You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2018/12/04 03:10:00 UTC
[jira] [Updated] (SPARK-26259) RecordSeparator other than newline
discovers incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon updated SPARK-26259:
---------------------------------
Fix Version/s: (was: 2.4.1)
> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
> Key: SPARK-26259
> URL: https://issues.apache.org/jira/browse/SPARK-26259
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: PoojaMurarka
> Priority: Major
>
> Though JIRA: https://issues.apache.org/jira/browse/SPARK-21289 has been fixed in SPARK 2.3 which allows record Separators other than new line but this doesn't work when schema is not specified i.e. while inferring the schema
> Let me try to explain this using below data and scenarios:
> Input Data - (input_data.csv) as shown below: *+where recordSeparator is "\t"+*
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed" "2012-01-01","0","0","0","0","1","9","9.1","66","0" "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema Defined *: Below Spark code with defined *schema* reads data correctly:
> {code:java}
> val customSchema = StructType(Array(
> StructField("dteday", DateType, true),
> StructField("hr", IntegerType, true),
> StructField("holiday", IntegerType, true),
> StructField("weekday", IntegerType, true),
> StructField("workingday", DateType, true),
> StructField("weathersit", IntegerType, true),
> StructField("temp", IntegerType, true),
> StructField("atemp", DoubleType, true),
> StructField("hum", IntegerType, true),
> StructField("windspeed", IntegerType, true)));
> Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
> .option( "header", true )
> .option( "schema", customSchema)
> .option( "sep", "," )
> .load( "input_data.csv" );
> {code}
> *Case 2: Schema not defined (inferSchema is used):* Incorrect data parsing is done i.e. entire data is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
> .option( "header", true )
> .option( "inferSchema", true)
> .option( "sep", "," )
> .load( "input_data.csv" );
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org