Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/11/22 14:51:01 UTC

[jira] [Resolved] (SPARK-22580) Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0

     [ https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-22580.
----------------------------------
    Resolution: Duplicate

> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-22580
>                 URL: https://issues.apache.org/jira/browse/SPARK-22580
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.0
>         Environment: Same behavior on Debian and MS Windows (8.1) system. JRE 1.8
>            Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was put into this special column.
> Filtering on members of the actual schema works fine and yields correct counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
>         StructType schema = new StructType(new StructField[] {
>                 new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
>                 new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
>                 new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
>                 new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});
>         Dataset<Row> csv = sqlContext.read()
>                 .option("header", "false")
>                 .option("parserLib", "univocity")
>                 .option("mode", "PERMISSIVE")
>                 .option("maxCharsPerColumn", 10000000)
>                 .option("ignoreLeadingWhiteSpace", "false")
>                 .option("ignoreTrailingWhiteSpace", "false")
>                 .option("comment", null)
>                 .option("columnNameOfCorruptRecord", "FALLBACK")
>                 .schema(schema)
>                 .csv("path/to/csv/file");
>         long validCount = csv.filter("FALLBACK IS NULL").count();
>         long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected:
> validCount is 3
> invalidCount is 1
> Actual:
> validCount is 4
> invalidCount is 0
> Caching the Dataset after loading works around the problem and yields the correct counts.
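> This behavior is consistent with column pruning: when the query references only the corrupt-record column, the CSV reader may parse just that column, so malformed rows are never detected. The caching workaround mentioned above could look like the following minimal sketch (reusing the same sqlContext, schema, and placeholder path from the snippet):
> {code:java}
> Dataset<Row> csv = sqlContext.read()
>         .option("mode", "PERMISSIVE")
>         .option("columnNameOfCorruptRecord", "FALLBACK")
>         .schema(schema)
>         .csv("path/to/csv/file");
> // Materialize the fully parsed rows before filtering on FALLBACK,
> // so the corrupt-record column is populated during the same parse.
> csv.cache();
> long validCount = csv.filter("FALLBACK IS NULL").count();       // expected: 3
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count(); // expected: 1
> {code}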



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
