Posted to dev@parquet.apache.org by "Yuming Wang (Jira)" <ji...@apache.org> on 2021/12/04 05:06:00 UTC

[jira] [Commented] (PARQUET-1203) Corrupted parquet file from Spark

    [ https://issues.apache.org/jira/browse/PARQUET-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453288#comment-17453288 ] 

Yuming Wang commented on PARQUET-1203:
--------------------------------------

It may be caused by a hardware issue. You can add this line:
{code:scala}
"HostName" -> java.net.InetAddress.getLocalHost.getHostName
{code}
to https://github.com/apache/spark/blob/v3.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala#L115-L117 to find out which machine is generating the corrupted files.
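The idea above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual ParquetWriteSupport code: the object name {{HostNameMetadata}} and the two {{org.apache.spark.*}} metadata keys mirror what Spark writes into the Parquet footer, but the method here is a standalone stand-in.

{code:scala}
// Hypothetical sketch: tag each Parquet file's footer key/value metadata with
// the hostname of the machine that wrote it, so a corrupted file can later be
// traced back to the executor host that produced it.
import java.net.InetAddress

object HostNameMetadata {
  // Builds the extra footer metadata map; "HostName" is the added entry
  // suggested above, the other keys mimic Spark's own footer entries.
  def extraMetadata(sparkVersion: String, schemaJson: String): Map[String, String] =
    Map(
      "org.apache.spark.version" -> sparkVersion,
      "org.apache.spark.sql.parquet.row.metadata" -> schemaJson,
      "HostName" -> InetAddress.getLocalHost.getHostName
    )

  def main(args: Array[String]): Unit = {
    val md = extraMetadata("3.2.0", "{}")
    println(md("HostName"))
  }
}
{code}

Once files carry this entry, the footer of a corrupted file can be inspected (for example with {{parquet-tools meta <file>}}) to see which host wrote it; if the corrupted files all come from the same machine, that points to failing hardware rather than a Spark or Parquet bug.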

> Corrupted parquet file from Spark
> ---------------------------------
>
>                 Key: PARQUET-1203
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1203
>             Project: Parquet
>          Issue Type: Bug
>         Environment: Spark 2.2.1
>            Reporter: Dong Jiang
>            Assignee: Ryan Blue
>            Priority: Major
>
> Hi, 
> We are running on Spark 2.2.1, generating parquet files on S3, like the following 
> pseudo code 
> df.write.parquet(...) 
> We have recently noticed parquet file corruption when reading the files 
> in Spark or Presto. I downloaded a corrupted file from S3 and got the following errors in Spark: 
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
> value at 40870 in block 0 in file 
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet 
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read 
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594] 
> in col [incoming_aliases_array, list, element, key_value, value] BINARY 
> It appears that only one column in one of the rows in the file is corrupt; the 
> file has 111041 rows. 
> My questions are: 
> 1) How can I identify the corrupted row? 
> 2) What could cause the corruption? Spark issue or Parquet issue? 
> Any help is greatly appreciated. 
> Thanks, 
> Dong 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)