Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/09/16 01:22:00 UTC
[jira] [Resolved] (SPARK-29068) CSV read reports incorrect row count
[ https://issues.apache.org/jira/browse/SPARK-29068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-29068.
----------------------------------
Resolution: Invalid
You should set the {{comment}} option to {{#}} and the {{header}} option to {{true}}.
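To illustrate why these two options would account for a difference of three rows, here is a minimal plain-Java sketch that does not use Spark. It is a simplified model of the counting, not Spark's actual CSV parser. The class name {{CsvRowCountSketch}}, the method {{countDataRows}}, and the sample lines are all hypothetical, and it assumes the three extra rows are two {{#}} comment lines followed by one header line.

```java
import java.util.Arrays;
import java.util.List;

public class CsvRowCountSketch {

    // Counts the lines a reader would treat as data rows (simplified model).
    // commentPrefix: lines starting with this prefix are skipped; null disables skipping.
    // hasHeader: when true, the first remaining line is consumed as column names.
    public static long countDataRows(List<String> lines, String commentPrefix, boolean hasHeader) {
        long count = 0;
        boolean headerPending = hasHeader;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            if (commentPrefix != null && line.startsWith(commentPrefix)) {
                continue; // ignore comment lines
            }
            if (headerPending) {
                headerPending = false; // first non-comment line is the header
                continue;
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical sample; the values are made up and not taken from part_1_data.csv.
        List<String> lines = Arrays.asList(
                "# hypothetical comment line",
                "# another hypothetical comment line",
                "insf,beds,baths,price,year,sqft,prcsqft,elevation",
                "0,2.0,1.0,100,2000,10,10,1",
                "1,1.0,1.0,200,2001,20,10,2");

        // No comment or header handling: every line is counted as a row.
        System.out.println(countDataRows(lines, null, false)); // prints 5

        // Comment prefix "#" and header enabled: only the data rows remain.
        System.out.println(countDataRows(lines, "#", true)); // prints 2
    }
}
```

In the reporter's snippet this would presumably correspond to adding {{.option("comment", "#")}} and enabling the commented-out {{.option("header", true)}} before the call to {{.csv(...)}}.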
> CSV read reports incorrect row count
> ------------------------------------
>
> Key: SPARK-29068
> URL: https://issues.apache.org/jira/browse/SPARK-29068
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: Thomas Diesler
> Priority: Major
>
> Reading the [SFNY example data|https://github.com/jadeyee/r2d3-part-1-data/blob/master/part_1_data.csv] in Java like this ...
> {code:java}
> Path srcdir = Paths.get("src/test/resources");
> Path inpath = srcdir.resolve("part_1_data.csv");
> SparkSession session = getOrCreateSession();
> Dataset<Row> dataset = session.read()
>     //.option("header", true)
>     .option("mode", "DROPMALFORMED")
>     .schema(new StructType()
>         .add("insf", IntegerType, false)
>         .add("beds", DoubleType, false)
>         .add("baths", DoubleType, false)
>         .add("price", IntegerType, false)
>         .add("year", IntegerType, false)
>         .add("sqft", IntegerType, false)
>         .add("prcsqft", IntegerType, false)
>         .add("elevation", IntegerType, false))
>     .csv(inpath.toString());
> {code}
> This incorrectly reports 495 rows instead of 492. It seems to include the three header rows in the count.
> Also, without DROPMALFORMED it creates 495 rows, three of which contain only null values. This also seems incorrect, because the schema explicitly requires non-null values for all fields.
> This code works fine with Spark 2.1.0.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)