Posted to dev@parquet.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2021/12/04 05:06:00 UTC
[jira] [Commented] (PARQUET-1203) Corrupted parquet file from Spark
[ https://issues.apache.org/jira/browse/PARQUET-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453288#comment-17453288 ]
Yuming Wang commented on PARQUET-1203:
--------------------------------------
It may be caused by a hardware issue. You can add this line:
{code:scala}
"HostName" -> java.net.InetAddress.getLocalHost.getHostName
{code}
to https://github.com/apache/spark/blob/v3.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L115-L117 to find out which machine generates the corrupted files.
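Once files are written with that extra entry, the host name can be read back from a suspect file. The following is only a sketch, not something taken from the Spark code base: it assumes the added "HostName" entry ends up in the footer's key/value metadata, that the footer of the corrupted file is still readable, and that parquet-hadoop and the Hadoop client are on the classpath (as in spark-shell). The helper name footerMetadata and the file path are placeholders.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import scala.collection.JavaConverters._

// Hypothetical helper: returns the footer key/value metadata of one Parquet file.
def footerMetadata(file: String): Map[String, String] = {
  val input = HadoopInputFile.fromPath(new Path(file), new Configuration())
  val reader = ParquetFileReader.open(input)
  try reader.getFooter.getFileMetaData.getKeyValueMetaData.asScala.toMap
  finally reader.close()
}

// "HostName" is only present if the writer was patched as described above.
// The path is a placeholder for a downloaded copy of the suspect file.
val host = footerMetadata("/path/to/suspect.parquet").get("HostName")
println(host.getOrElse("no HostName entry in footer"))
{code}
If the same host name shows up for every corrupted file, that points to a problem on that machine.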
> Corrupted parquet file from Spark
> ---------------------------------
>
> Key: PARQUET-1203
> URL: https://issues.apache.org/jira/browse/PARQUET-1203
> Project: Parquet
> Issue Type: Bug
> Environment: Spark 2.2.1
> Reporter: Dong Jiang
> Assignee: Ryan Blue
> Priority: Major
>
> Hi,
> We are running Spark 2.2.1 and generating parquet files on S3, as in the following
> pseudocode:
> df.write.parquet(...)
> We have recently noticed parquet file corruption when reading the files
> in Spark or Presto. I downloaded a corrupted file from S3 and got the following errors in Spark:
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
> It appears that only one column in one of the rows in the file is corrupt. The
> file has 111041 rows.
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
> Any help is greatly appreciated.
> Thanks,
> Dong