Posted to issues@spark.apache.org by "PoojaMurarka (JIRA)" <ji...@apache.org> on 2018/12/04 04:15:00 UTC

[jira] [Comment Edited] (SPARK-26259) RecordSeparator other than newline discovers incorrect schema

    [ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708161#comment-16708161 ] 

PoojaMurarka edited comment on SPARK-26259 at 12/4/18 4:14 AM:
---------------------------------------------------------------

Based on the examples, the fix for custom record delimiters appears to be available only when a schema is explicitly specified. Please correct me if I am wrong.
What I am looking for instead is setting a custom record delimiter while discovering the schema, i.e. using only *inferSchema* set to true rather than supplying a schema.
Let me know whether the above issue covers both scenarios.



> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Though JIRA https://issues.apache.org/jira/browse/SPARK-21289 was fixed in Spark 2.3, allowing record separators other than newline, this does not work when the schema is not specified, i.e. while inferring the schema.
>  Let me try to explain this using the data and scenarios below:
> Input data (input_data.csv) as shown below, *+where the record separator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"    "2012-01-01","0","0","0","0","1","9","9.1","66","0"    "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
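As an aside, a minimal, Spark-free Java sketch (class and method names here are hypothetical, not from any Spark API) of how this sample decomposes when split on "\t", the intended record separator, versus "\n", which a newline-based reader would use:

```java
// Illustrative only: shows how the sample file's content decomposes when split
// on "\t" (the intended record separator) versus "\n" (the default newline
// separator). No Spark API is involved; names are made up for this sketch.
public class RecordSplitDemo {
    // Same content as input_data.csv, with tabs as the record separator.
    public static final String SAMPLE =
        "\"dteday\",\"hr\",\"holiday\",\"weekday\",\"workingday\","
      + "\"weathersit\",\"temp\",\"atemp\",\"hum\",\"windspeed\"\t"
      + "\"2012-01-01\",\"0\",\"0\",\"0\",\"0\",\"1\",\"9\",\"9.1\",\"66\",\"0\"\t"
      + "\"2012-01-01\",\"1\",\"0\",\"0\",\"0\",\"1\",\"9\",\"7.2\",\"66\",\"9\"";

    public static String[] split(String data, String sep) {
        // -1 keeps trailing empty strings, matching how a CSV reader
        // would count records rather than silently dropping them.
        return data.split(sep, -1);
    }

    public static void main(String[] args) {
        System.out.println("records by \\t: " + split(SAMPLE, "\t").length); // 3
        System.out.println("records by \\n: " + split(SAMPLE, "\n").length); // 1
    }
}
```

Splitting on the tab yields the expected header plus two data records; splitting on newline yields a single "record" containing the whole file, which is the shape the bug report describes.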
> *Case 1: Schema defined:* The Spark code below with a defined *schema* reads the data correctly:
> {code:java}
> StructType customSchema = new StructType()
>         .add("dteday", DataTypes.DateType, true)
>         .add("hr", DataTypes.IntegerType, true)
>         .add("holiday", DataTypes.IntegerType, true)
>         .add("weekday", DataTypes.IntegerType, true)
>         .add("workingday", DataTypes.DateType, true)
>         .add("weathersit", DataTypes.IntegerType, true)
>         .add("temp", DataTypes.IntegerType, true)
>         .add("atemp", DataTypes.DoubleType, true)
>         .add("hum", DataTypes.IntegerType, true)
>         .add("windspeed", DataTypes.IntegerType, true);
> // The schema must be passed via .schema(...); .option("schema", ...) is
> // silently ignored because reader options only carry string values.
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>           .option("header", true)
>           .schema(customSchema)
>           .option("sep", ",")
>           .load("input_data.csv");
> {code}
> *Case 2: Schema not defined (inferSchema used):* The data is parsed incorrectly, i.e. the entire file content is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>           .option("header", true)
>           .option("inferSchema", true)
>           .option("sep", ",")
>           .load("input_data.csv");
> {code}
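A plausible reading of the symptom (an assumption about the mechanism, not a confirmed trace of Spark's code): during schema inference the input is still split on newlines, and since this file contains no newline at all, the whole file arrives as a single "line"; with header set to true, that one line is then taken as the header row. A Spark-free Java sketch of that symptom, using a naive comma split that ignores quoting (all names hypothetical):

```java
// Illustrative only (no Spark API): mimics what header parsing would see when
// the entire tab-separated file arrives as one newline-delimited "line".
public class InferSchemaSymptomDemo {
    // Same content as input_data.csv, with tabs as the record separator.
    public static final String SAMPLE =
        "\"dteday\",\"hr\",\"holiday\",\"weekday\",\"workingday\","
      + "\"weathersit\",\"temp\",\"atemp\",\"hum\",\"windspeed\"\t"
      + "\"2012-01-01\",\"0\",\"0\",\"0\",\"0\",\"1\",\"9\",\"9.1\",\"66\",\"0\"\t"
      + "\"2012-01-01\",\"1\",\"0\",\"0\",\"0\",\"1\",\"9\",\"7.2\",\"66\",\"9\"";

    public static String[] headerTokens() {
        // The file has no newline, so the "first line" is the whole file.
        String firstLine = SAMPLE.split("\n", -1)[0];
        // Naive comma split (ignores quoting), standing in for header parsing.
        return firstLine.split(",", -1);
    }

    public static void main(String[] args) {
        // Prints 28: every comma-separated token in the file becomes a
        // "column name", instead of the 10 real column names.
        System.out.println(headerTokens().length);
    }
}
```

This matches the reported behavior: with inferSchema, the data rows are swallowed into the header instead of being parsed as records.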



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org