Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/02/05 17:40:00 UTC
[jira] [Commented] (PARQUET-1203) Corrupted parquet file from Spark
[ https://issues.apache.org/jira/browse/PARQUET-1203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352677#comment-16352677 ]
Ryan Blue commented on PARQUET-1203:
------------------------------------
[~djiangxu], I replied on the Spark list. I'm closing this because it is not a bug in Parquet. Corruption like this usually comes from bad hardware, and I recommend tracking down the bad node.
> Corrupted parquet file from Spark
> ---------------------------------
>
> Key: PARQUET-1203
> URL: https://issues.apache.org/jira/browse/PARQUET-1203
> Project: Parquet
> Issue Type: Bug
> Environment: Spark 2.2.1
> Reporter: Dong Jiang
> Priority: Major
>
> Hi,
> We are running Spark 2.2.1, generating parquet files on S3 with the following
> pseudo code:
> df.write.parquet(...)
> We have recently noticed parquet file corruptions when reading the parquet
> in Spark or Presto. I downloaded the corrupted file from S3 and got the following errors in Spark:
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
> It appears that only one column in one of the rows is corrupt; the
> file has 111041 rows.
> My questions are:
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
> Any help is greatly appreciated.
> Thanks,
> Dong
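The thread does not show how to answer question 1, but one generic way to localize a single corrupt row is a binary search over row ranges: repeatedly try to read half of the failing range until the failure narrows to one row. In Spark this would mean re-reading the file with a row-range filter; the sketch below is plain Python with a fake reader standing in for that step (the reader function, the row count 111041, and the failing offset 40870 are taken from the error messages above; everything else is illustrative, not from the thread).

```python
def find_bad_row(read_range, lo, hi):
    """Binary-search for the single unreadable row in [lo, hi).

    read_range(lo, hi) must raise if the range contains the corrupt row
    and return normally otherwise. Assumes exactly one bad row.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            read_range(lo, mid)
        except Exception:
            hi = mid  # failure reproduced: corrupt row is in the lower half
        else:
            lo = mid  # lower half reads fine: corrupt row is in the upper half
    return lo

# Fake reader mirroring the report: reading any range that covers
# row 40870 fails, everything else succeeds.
CORRUPT = 40870

def fake_read(lo, hi):
    if lo <= CORRUPT < hi:
        raise IOError("Can not read value at %d in block 0" % CORRUPT)

print(find_bad_row(fake_read, 0, 111041))
```

Each probe halves the search space, so a 111041-row file needs about 17 range reads to pin down the offending row, after which the row's key columns can be inspected with a reader that skips the corrupt column.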
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)