Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/11/22 14:51:01 UTC
[jira] [Resolved] (SPARK-22580) Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0
[ https://issues.apache.org/jira/browse/SPARK-22580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-22580.
----------------------------------
Resolution: Duplicate
> Count after filtering uncached CSV for isnull(columnNameOfCorruptRecord) always 0
> ---------------------------------------------------------------------------------
>
> Key: SPARK-22580
> URL: https://issues.apache.org/jira/browse/SPARK-22580
> Project: Spark
> Issue Type: Bug
> Components: Input/Output
> Affects Versions: 2.2.0
> Environment: Same behavior on Debian and MS Windows (8.1) system. JRE 1.8
> Reporter: Florian Kaspar
>
> It seems that filtering on the parser-created columnNameOfCorruptRecord column and counting afterwards does not recognize any invalid row that was put into this special column.
> Filtering on columns of the actual schema works fine and yields correct counts.
> Input CSV example:
> {noformat}
> val1, cat1, 1.337
> val2, cat1, 1.337
> val3, cat2, 42.0
> some, invalid, line
> {noformat}
> Code snippet:
> {code:java}
> StructType schema = new StructType(new StructField[] {
>         new StructField("s1", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("s2", DataTypes.StringType, true, Metadata.empty()),
>         new StructField("d1", DataTypes.DoubleType, true, Metadata.empty()),
>         new StructField("FALLBACK", DataTypes.StringType, true, Metadata.empty())});
>
> Dataset<Row> csv = sqlContext.read()
>         .option("header", "false")
>         .option("parserLib", "univocity")
>         .option("mode", "PERMISSIVE")
>         .option("maxCharsPerColumn", 10000000)
>         .option("ignoreLeadingWhiteSpace", "false")
>         .option("ignoreTrailingWhiteSpace", "false")
>         .option("comment", null)
>         .option("columnNameOfCorruptRecord", "FALLBACK")
>         .schema(schema)
>         .csv("path/to/csv/file");
>
> long validCount = csv.filter("FALLBACK IS NULL").count();
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count();
> {code}
> Expected:
> validCount is 3
> invalidCount is 1
> Actual:
> validCount is 4
> invalidCount is 0
> Caching the CSV after loading works around the problem and yields the correct counts.
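> As a sketch, the workaround can look like the following (assuming the same sqlContext, schema, reader options, and input file as the snippet above; only the added cache() call differs):
> {code:java}
> // Caching materializes the parsed rows, so the FALLBACK corrupt-record
> // column survives for the subsequent filters.
> Dataset<Row> csv = sqlContext.read()
>         .option("mode", "PERMISSIVE")
>         .option("columnNameOfCorruptRecord", "FALLBACK")
>         .schema(schema)
>         .csv("path/to/csv/file")
>         .cache();
>
> long validCount = csv.filter("FALLBACK IS NULL").count();       // 3 for the example input
> long invalidCount = csv.filter("FALLBACK IS NOT NULL").count(); // 1 for the example input
> {code}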
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org