Posted to issues@spark.apache.org by "Jean Georges Perrin (JIRA)" <ji...@apache.org> on 2019/02/22 14:49:00 UTC

[jira] [Updated] (SPARK-26972) Issue with CSV import and inferSchema set to true

     [ https://issues.apache.org/jira/browse/SPARK-26972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean Georges Perrin updated SPARK-26972:
----------------------------------------
    Attachment: issue.txt

> Issue with CSV import and inferSchema set to true
> -------------------------------------------------
>
>                 Key: SPARK-26972
>                 URL: https://issues.apache.org/jira/browse/SPARK-26972
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.1.3, 2.3.3, 2.4.0
>         Environment: Java 8/Scala 2.11/MacOs
>            Reporter: Jean Georges Perrin
>            Priority: Major
>         Attachments: issue.txt
>
>
>  
>  
> Issue with CSV import and inferSchema set to true.
> I found a few discrepancies while working with inferSchema set to true in CSV ingestion.
> Given the following CSV:
> {{id;authorId;title;releaseDate;link}}
> {{1;1;Fantastic Beasts and Where to Find Them: The Original Screenplay;11/18/16;http://amzn.to/2kup94P}}
> {{2;1;*Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)*;10/6/15;http://amzn.to/2l2lSwP}}
> {{3;1;*The Tales of Beedle the Bard, Standard Edition (Harry Potter)*;12/4/08;http://amzn.to/2kYezqr}}
> {{4;1;*Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)*;10/4/16;http://amzn.to/2kYhL5n}}
> {{5;2;*Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; and a Great Database*;4/23/17;http://amzn.to/2i3mthT}}
> {{6;2;*Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1}}
> {{7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav}}
> {{8;3;A Connecticut Yankee in King Arthur's Court;6/17/17;http://amzn.to/2x1NuoD}}
> {{10;4;Jacques le Fataliste;3/1/00;http://amzn.to/2uZj2KA}}
> {{11;4;Diderot Encyclopedia: The Complete Illustrations 1762-1777;;http://amzn.to/2i2zo3I}}
> {{12;;A Woman in Berlin;7/11/06;http://amzn.to/2i472WZ}}
> {{13;6;Spring Boot in Action;1/3/16;http://amzn.to/2hCPktW}}
> {{14;6;Spring in Action: Covers Spring 4;11/28/14;http://amzn.to/2yJLyCk}}
> {{15;7;Soft Skills: The software developer's life manual;12/29/14;http://amzn.to/2zNnSyn}}
> {{16;8;Of Mice and Men;;http://amzn.to/2zJjXoc}}
> {{17;9;*Java 8 in Action: Lambdas; Streams; and functional-style programming*;8/28/14;http://amzn.to/2isdqoL}}
> {{18;12;Hamlet;6/8/12;http://amzn.to/2yRbewY}}
> {{19;13;Pensées;12/31/1670;http://amzn.to/2jweHOG}}
> {{20;14;*Fables choisies; mises en vers par M. de La Fontaine*;9/1/1999;http://amzn.to/2yRH10W}}
> {{21;15;Discourse on Method and Meditations on First Philosophy;6/15/1999;http://amzn.to/2hwB8zc}}
> {{22;12;Twelfth Night;7/1/4;http://amzn.to/2zPYnwo}}
> {{23;12;Macbeth;7/1/3;http://amzn.to/2zPYnwo}}
> And this code:
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("quote", "*")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("inferSchema", true)}}
> {{ .load("data/books.csv");}}
> {{df.show(7);}}
> {{df.printSchema();}}
> h1. Using Apache Spark v2.0.1
> Excerpt of the dataframe content:
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| id|authorId| title|releaseDate| link|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{only showing top 7 rows}}
> Dataframe's schema:
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
> *This is fine and the expected output*.
> h1. Using Apache Spark v2.1.3
> Excerpt of the dataframe content:
> {{+--------------------+--------+--------------------+-----------+--------------------+}}
> {{ | id|authorId| title|releaseDate| link|}}
> {{ +--------------------+--------+--------------------+-----------+--------------------+}}
> {{ | 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{ | 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{ | 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{ | 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{ | 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{ | 6| 2|Development Tools...| null| null|}}
> {{ |An independent st...|12/28/16|http://amzn.to/2v...| null| null|}}
> {{ +--------------------+--------+--------------------+-----------+--------------------+}}
> {{ only showing top 7 rows}}
> Dataframe's schema:
> {{ root}}
> {{ |-- id: string (nullable = true)}}
> {{ |-- authorId: string (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link: string (nullable = true)}}
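> The *multiline* option is *not recognized*: record 6, whose quoted title contains a line break, is split into two rows. As a result the schema is wrong, with {{id}} and {{authorId}} inferred as strings instead of integers.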
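
The v2.1.3 output above looks like what a line-by-line reader would produce. The following is a minimal plain-Java sketch of that difference, not Spark's actual parser: `MultilineCsvSketch`, `parseLineByLine`, `parseQuoteAware`, and `allIntegers` are hypothetical names introduced for illustration, and the integer check is only a simplified stand-in for schema inference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MultilineCsvSketch {

    // Header plus records 6 and 7 from the report; record 6 has a line
    // break inside its quoted (*...*) title.
    static final String SAMPLE =
        "id;authorId;title;releaseDate;link\n"
        + "6;2;*Development Tools in 2006: any Room for a 4GL-style Language? \n"
        + "An independent study by Jean Georges Perrin, IIUG Board Member*;12/28/16;http://amzn.to/2vBxOe1\n"
        + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\n";

    // Naive reader: every physical line is a record, quotes are ignored.
    static List<List<String>> parseLineByLine(String text, char sep) {
        List<List<String>> records = new ArrayList<>();
        String sepRegex = Pattern.quote(String.valueOf(sep));
        for (String line : text.split("\n")) {
            records.add(Arrays.asList(line.split(sepRegex, -1)));
        }
        return records;
    }

    // Quote-aware reader: separators and line breaks inside quotes are data.
    static List<List<String>> parseQuoteAware(String text, char sep, char quote) {
        List<List<String>> records = new ArrayList<>();
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;              // toggle state, drop the quote character
            } else if (c == sep && !inQuotes) {
                fields.add(field.toString());      // end of field
                field.setLength(0);
            } else if (c == '\n' && !inQuotes) {
                fields.add(field.toString());      // end of field and of record
                field.setLength(0);
                records.add(fields);
                fields = new ArrayList<>();
            } else {
                field.append(c);
            }
        }
        if (field.length() > 0 || !fields.isEmpty()) {
            fields.add(field.toString());          // last record without trailing newline
            records.add(fields);
        }
        return records;
    }

    // Simplified stand-in for type inference: true if every non-empty value
    // of the column (header row skipped) parses as an int.
    static boolean allIntegers(List<List<String>> records, int column) {
        for (int r = 1; r < records.size(); r++) {
            List<String> row = records.get(r);
            if (column >= row.size() || row.get(column).isEmpty()) {
                continue;
            }
            try {
                Integer.parseInt(row.get(column).trim());
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> naive = parseLineByLine(SAMPLE, ';');
        List<List<String>> aware = parseQuoteAware(SAMPLE, ';', '*');
        System.out.println("line by line: " + naive.size() + " rows, id all integers: " + allIntegers(naive, 0));
        System.out.println("quote aware:  " + aware.size() + " rows, id all integers: " + allIntegers(aware, 0));
    }
}
```

In the line-by-line case the continuation line of record 6 becomes a row of its own, so its text lands in the `id` column and `12/28/16` in `authorId`. That would be consistent with both columns being reported as strings.
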
> h1. Using Apache Spark v2.2.3
> Excerpt of the dataframe content:
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| id|authorId| title|releaseDate| link}}
> {{|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{| 1| 1|Fantastic Beasts ...| 11/18/16|http://amzn.to/2k...|}}
> {{| 2| 1|Harry Potter and ...| 10/6/15|http://amzn.to/2l...|}}
> {{| 3| 1|The Tales of Beed...| 12/4/08|http://amzn.to/2k...|}}
> {{| 4| 1|Harry Potter and ...| 10/4/16|http://amzn.to/2k...|}}
> {{| 5| 2|Informix 12.10 on...| 4/23/17|http://amzn.to/2i...|}}
> {{| 6| 2|Development Tools...| 12/28/16|http://amzn.to/2v...|}}
> {{| 7| 3|Adventures of Huc...| 5/26/94|http://amzn.to/2w...|}}
> {{+---+--------+--------------------+-----------+--------------------+}}
> {{only showing top 7 rows}}
> Dataframe's schema:
> {{root}}
> {{ |-- id: integer (nullable = true)}}
> {{ |-- authorId: integer (nullable = true)}}
> {{ |-- title: string (nullable = true)}}
> {{ |-- releaseDate: string (nullable = true)}}
> {{ |-- link}}
> {{: string (nullable = true)}}
> The *link* column *has a carriage return* at the end of its name. If I display the rows with a truncation width of 90 characters:
> {{df.show(7, 90);}}
> I get:
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> {{| id|authorId| title|releaseDate| link}}
> {{|}}
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> {{| 1| 1| Fantastic Beasts and Where to Find Them: The Original Screenplay| 11/18/16|http://amzn.to/2kup94P}}
> {{|}}
> {{| 2| 1| Harry Potter and the Sorcerer's Stone: The Illustrated Edition (Harry Potter; Book 1)| 10/6/15|http://amzn.to/2l2lSwP}}
> {{|}}
> {{| 3| 1| The Tales of Beedle the Bard, Standard Edition (Harry Potter)| 12/4/08|http://amzn.to/2kYezqr}}
> {{|}}
> {{| 4| 1| Harry Potter and the Chamber of Secrets: The Illustrated Edition (Harry Potter; Book 2)| 10/4/16|http://amzn.to/2kYhL5n}}
> {{|}}
> {{| 5| 2|Informix 12.10 on Mac 10.12 with a dash of Java 8: The Tale of the Apple; the Coffee; a...| 4/23/17|http://amzn.to/2i3mthT}}
> {{|}}
> {{| 6| 2|Development Tools in 2006: any Room for a 4GL-style Language? }}
> {{An independent study by...| 12/28/16|http://amzn.to/2vBxOe1}}
> {{|}}
> {{| 7| 3| Adventures of Huckleberry Finn| 5/26/94|http://amzn.to/2wOeOav}}
> {{|}}
> {{+---+--------+------------------------------------------------------------------------------------------+-----------+-----------------------+}}
> The *carriage return is added to the last cell* of each row.
> Same behavior in v2.3.3 and v2.4.0.
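
One possible explanation, which the report does not confirm, is that books.csv uses CRLF line endings and that only the LF is consumed as the record terminator. The plain-Java sketch below illustrates that assumption without using Spark; `CarriageReturnSketch` and `lastHeaderField` are hypothetical names.

```java
class CarriageReturnSketch {

    // Assumption: the file uses Windows-style CRLF ("\r\n") line endings.
    static final String CRLF_SAMPLE =
        "id;authorId;title;releaseDate;link\r\n"
        + "7;3;Adventures of Huckleberry Finn;5/26/94;http://amzn.to/2wOeOav\r\n";

    // Returns the last column name of the header line, using the given
    // regular expression as the line terminator.
    static String lastHeaderField(String text, String lineBreakRegex) {
        String headerLine = text.split(lineBreakRegex)[0];
        String[] names = headerLine.split(";", -1);
        return names[names.length - 1];
    }

    public static void main(String[] args) {
        // Splitting on LF only leaves the CR attached to the last field.
        String lfOnly = lastHeaderField(CRLF_SAMPLE, "\n");
        // Treating an optional CR as part of the terminator removes it.
        String crlfAware = lastHeaderField(CRLF_SAMPLE, "\r?\n");
        System.out.println("LF only ends with CR: " + lfOnly.endsWith("\r"));
        System.out.println("CRLF aware is 'link': " + crlfAware.equals("link"));
    }
}
```

If that assumption holds, converting the file to LF-only line endings before loading it would be one way to check whether the stray character disappears.
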
> If I add the schema, like in:
> {{StructType schema = DataTypes.createStructType(new StructField[] {}}
> {{ DataTypes.createStructField(}}
> {{ "id",}}
> {{ DataTypes.IntegerType,}}
> {{ false),}}
> {{ DataTypes.createStructField(}}
> {{ "authordId",}}
> {{ DataTypes.IntegerType,}}
> {{ true),}}
> {{ DataTypes.createStructField(}}
> {{ "bookTitle",}}
> {{ DataTypes.StringType,}}
> {{ false),}}
> {{ DataTypes.createStructField(}}
> {{ "releaseDate",}}
> {{ DataTypes.DateType,}}
> {{ true), // nullable, but this will be ignored}}
> {{ DataTypes.createStructField(}}
> {{ "url",}}
> {{ DataTypes.StringType,}}
> {{ false) });}}
> {{// Reads a CSV file with header, called books.csv, stores it in a dataframe}}
> {{Dataset<Row> df = spark.read().format("csv")}}
> {{ .option("header", "true")}}
> {{ .option("multiline", true)}}
> {{ .option("sep", ";")}}
> {{ .option("dateFormat", "M/d/y")}}
> {{ .option("quote", "*")}}
> {{ .schema(schema)}}
> {{ .load("data/books.csv");}}
> The output matches what is expected in every version *except version 2.1.3, where Spark simply crashes*.
> All the code can be downloaded from GitHub at: [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch07].
>  
>  


