Posted to issues@spark.apache.org by "Sreenath Chothar (JIRA)" <ji...@apache.org> on 2017/11/06 12:53:00 UTC

[jira] [Created] (SPARK-22455) Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.

Sreenath Chothar created SPARK-22455:
----------------------------------------

             Summary: Provide an option to store the exception records/files and reasons in log files when reading data from a file-based data source.
                 Key: SPARK-22455
                 URL: https://issues.apache.org/jira/browse/SPARK-22455
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.2.0
            Reporter: Sreenath Chothar


Provide an option to store the exception/bad records, along with the reasons they failed, in log files when reading data from a file-based data source into a PySpark DataFrame. Currently only the following three options are available:
1. PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord.
2. DROPMALFORMED: ignores whole corrupted records.
3. FAILFAST: throws an exception when it meets corrupted records.

We could use the first option to accumulate the corrupted records and write them out to a log file, but this option is unavailable when the input schema is inferred automatically. When the number of columns to read is large, supplying the complete schema plus an additional column for storing corrupted data is cumbersome. Instead, reader functions such as "pyspark.sql.DataFrameReader.csv" could provide an option to redirect the bad records, with exception details, to a configured log file path.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
