Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2021/02/11 16:15:00 UTC

[jira] [Resolved] (SPARK-34422) CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row

     [ https://issues.apache.org/jira/browse/SPARK-34422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-34422.
----------------------------------
    Resolution: Not A Problem

Hm, no, I am not sure this is a problem in Spark. The semantics are different: the partial result row in Spark does not already include the column for the corrupted record, whereas the spark-xml representation does (hence the bug there).

Closing this because, when I 'fixed' it, the change caused test failures, which convinced me after debugging that it's not the same situation.
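
To make the difference in semantics concrete, here is a minimal sketch in plain Python. It is not the Spark or spark-xml code: the function name fill_result_row, the flag partial_has_corrupt_slot, and the column names are all hypothetical, chosen only for illustration. It assumes the partial row either omits the corrupt-record slot (as described for Spark above) or already contains it (as described for spark-xml), and shows that the copy logic has to match whichever layout the parser produced.

```python
from typing import Any, List, Optional, Sequence


def fill_result_row(
    schema: Sequence[str],
    corrupt_col: str,
    partial: Sequence[Any],
    bad_record: str,
    partial_has_corrupt_slot: bool,
) -> List[Optional[Any]]:
    """Copy partially parsed values into a row laid out like `schema`.

    Illustrative only. `partial_has_corrupt_slot` says whether `partial`
    already reserves a position for the corrupt-record column.
    """
    result: List[Optional[Any]] = [None] * len(schema)
    data_cols = [name for name in schema if name != corrupt_col]
    for i, name in enumerate(data_cols):
        target = schema.index(name)
        # Layout with a slot: read from the same position as the target.
        # Layout without a slot: read from the position among data columns.
        source = target if partial_has_corrupt_slot else i
        result[target] = partial[source]
    result[schema.index(corrupt_col)] = bad_record
    return result


# The corrupt-record column sits in the middle, not last.
schema = ["a", "_corrupt_record", "b"]
raw = "1,2,extra"

compact = [1, 2]       # no slot for the corrupt column
padded = [1, None, 2]  # slot already present

matched_compact = fill_result_row(schema, "_corrupt_record", compact, raw, False)
matched_padded = fill_result_row(schema, "_corrupt_record", padded, raw, True)
# Wrong assumption about the layout: column "b" is read from the wrong slot.
mismatched = fill_result_row(schema, "_corrupt_record", padded, raw, False)
```

With the matching flag, both layouts give the same row. With the mismatched flag, "b" silently picks up the empty slot, and the opposite mismatch (a compact row read as if it had a slot) indexes past the end of the row. In this toy model, applying the spark-xml style fix to the other layout just moves the off-by-one, which is consistent with the test failures mentioned above.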

> CSV(/JSON?) files with corrupt row + Permissive mode can yield wrong partial result row
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-34422
>                 URL: https://issues.apache.org/jira/browse/SPARK-34422
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.7, 3.0.1, 3.1.1
>            Reporter: Sean R. Owen
>            Assignee: Sean R. Owen
>            Priority: Major
>
> (This was actually found and fixed in spark-xml, which copied some Spark code for handling bad records. See https://github.com/databricks/spark-xml/issues/517 )
> When CSV parsing (or, I think, JSON?) encounters a bad record in Permissive mode, it can return a partial result of the values that were successfully parsed, along with the problem input in a new 'corrupt record' column.
> However, the logic in FailureSafeParser that copies the partial results to the resulting Row has an off-by-one error that arises when the catalyst projection puts the 'corrupt record' column anywhere but the last column, which can readily happen. This could mean that the resulting partial results are wrong, or that processing the bad record in permissive mode fails entirely if the resulting elements don't happen to match the schema of the result.
> The partial results are usually not that useful, so being wrong isn't a huge deal, but failing entirely in permissive mode is a problem.


