Posted to issues@spark.apache.org by "DILIP KUMAR MOHAPATRO (Jira)" <ji...@apache.org> on 2020/01/27 12:44:00 UTC

[jira] [Created] (SPARK-30650) The parquet file written by spark often incurs corrupted footer and hence not readable

DILIP KUMAR MOHAPATRO created SPARK-30650:
---------------------------------------------

Summary: The parquet file written by spark often incurs corrupted footer and hence not readable
Key: SPARK-30650
URL: https://issues.apache.org/jira/browse/SPARK-30650
Project: Spark
Issue Type: Bug
Components: Block Manager, Input/Output, Optimizer
Affects Versions: 1.6.1
Reporter: DILIP KUMAR MOHAPATRO

This issue is similar to an archived one:

[https://mail-archives.apache.org/mod_mbox/spark-issues/201501.mbox/%3CJIRA.12767358.1421214067000.78480.1421214094403@Atlassian.JIRA%3E]

Parquet files written by Spark often end up with a corrupted footer and are therefore not readable by Spark.

The issue occurs more consistently as the granularity of a field increases, i.e. when the redundancy of values in the dataset is reduced (more unique values).

Coalesce does not help here either. Spark automatically generated a certain number of Parquet files, each with a definite size controlled by Spark internals, but a few of them were written with a corrupted footer. The write job nevertheless ends with a success status.

Here are a few examples:

These are the files (267.2 M each) that Spark 1.6.x generated. A few of them have a corrupted footer and hence are not readable. This happens more frequently when the input file size exceeds a certain limit, and the level of redundancy in the data also matters: for the same file size, the lower the redundancy, the higher the probability of a corrupted footer.

Hence, in later iterations of the job, when those files have to be read for processing, the job fails with:

    Can not read value 0 in block n in file xxxx
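
Since the write job reports success even when a file is bad, one way to catch such files before a later iteration tries to read them is a structural check of the Parquet trailer. The following is only a minimal sketch, not what Spark itself does. It relies on the Parquet file layout (an unencrypted file starts with the 4-byte magic "PAR1" and ends with a 4-byte little-endian footer length followed by "PAR1"). The function name check_parquet_footer is hypothetical, and the sketch assumes the file has been copied to a local filesystem rather than read directly from HDFS.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def check_parquet_footer(path):
    """Hypothetical helper: return (ok, reason) from a structural trailer check.

    This only inspects the magic bytes and the footer length. It does not
    decode the footer metadata or any data page.
    """
    size = os.path.getsize(path)
    # Smallest layout: 4-byte leading magic + 4-byte footer length + 4-byte trailing magic.
    if size < 12:
        return False, "file too small"
    with open(path, "rb") as f:
        if f.read(4) != PARQUET_MAGIC:
            return False, "missing leading magic"
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if tail[4:] != PARQUET_MAGIC:
        return False, "missing trailing magic"
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer must fit between the leading magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > size - 12:
        return False, "implausible footer length"
    return True, "ok"
```

Running this over every part file in the output directory after a write would flag truncated files or files with a damaged trailer. A file that passes can still be unreadable, for example if the footer metadata or a data page is corrupted in a way that keeps the trailer intact, so a full read remains the definitive test.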

--
This message was sent by Atlassian Jira
(v8.3.4#803005)