Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/02/05 17:40:00 UTC
[jira] [Commented] (PARQUET-1203) Corrupted parquet file from Spark
[ https://issues.apache.org/jira/browse/PARQUET-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352677#comment-16352677 ]
Ryan Blue commented on PARQUET-1203:
------------------------------------
[~djiangxu], I replied on the Spark list. I'm closing this because it is not a bug in Parquet. Corruption like this usually comes from bad hardware, and I recommend tracking down the bad node.
> Corrupted parquet file from Spark
> ---------------------------------
>
> Key: PARQUET-1203
> URL: https://issues.apache.org/jira/browse/PARQUET-1203
> Project: Parquet
> Issue Type: Bug
> Environment: Spark 2.2.1
> Reporter: Dong Jiang
> Priority: Major
>
> Hi,
> We are running Spark 2.2.1, generating parquet files on S3 with the following
> pseudo code:
> df.write.parquet(...)
> We have recently noticed parquet file corruptions when reading the parquet
> in Spark or Presto. I downloaded the corrupted file from S3 and got the following errors in Spark:
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
> It appears that only one column in one of the rows is corrupt; the
> file has 111041 rows.
> My questions are:
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
> Any help is greatly appreciated.
> Thanks,
> Dong
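The thread does not show how to answer question 1, but one generic way to localize a single corrupt row is a binary search over row ranges: repeatedly try to read half of the failing range until the failure narrows to one row. In Spark this would mean re-reading the file with a row-range filter; the sketch below is plain Python with a fake reader standing in for that step (the reader function, the row count 111041, and the failing offset 40870 are taken from the error messages above; everything else is illustrative, not from the thread).

```python
def find_bad_row(read_range, lo, hi):
    """Binary-search for the single unreadable row in [lo, hi).

    read_range(lo, hi) must raise if the range contains the corrupt row
    and return normally otherwise. Assumes exactly one bad row.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            read_range(lo, mid)
        except Exception:
            hi = mid  # failure reproduced: corrupt row is in the lower half
        else:
            lo = mid  # lower half reads fine: corrupt row is in the upper half
    return lo

# Fake reader mirroring the report: reading any range that covers
# row 40870 fails, everything else succeeds.
CORRUPT = 40870

def fake_read(lo, hi):
    if lo <= CORRUPT < hi:
        raise IOError("Can not read value at %d in block 0" % CORRUPT)

print(find_bad_row(fake_read, 0, 111041))
```

Each probe halves the search space, so a 111041-row file needs about 17 range reads to pin down the offending row, after which the row's key columns can be inspected with a reader that skips the corrupt column.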
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)