Posted to dev@spark.apache.org by Dong Jiang <dj...@dataxu.com> on 2018/02/05 17:01:46 UTC

Corrupt parquet file

Hi, 

We are running on Spark 2.2.1, generating Parquet files with code like the
following pseudo code:
df.write.parquet(...)
We have recently noticed Parquet file corruption when reading the files back
in Spark or Presto, for example:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears that only one column in one of the rows in the file is corrupt; the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Corrupt parquet file

Posted by Steve Loughran <st...@hortonworks.com>.

On 12 Feb 2018, at 20:21, Ryan Blue <rb...@netflix.com> wrote:

I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when memory holding a value is corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption and remove it from our clusters and let Amazon know to remove the instance from the hardware pool. We also structure our ETL so we have some time to reprocess.


I see.

I could remove memory/disk buffering of the blocks as a source of corruption, leaving only working-memory failures which somehow get past ECC, or bus errors of some form.

Filed https://issues.apache.org/jira/browse/HADOOP-15224 to add it to the todo list for Hadoop >= 3.2.




Re: Corrupt parquet file

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I wouldn't say we have a primary failure mode that we deal with. What we
concluded was that all the schemes we came up with to avoid corruption
couldn't cover all cases. For example, what about when memory holding a
value is corrupted just before it is handed off to the writer?

That's why we track down the source of the corruption and remove it from
our clusters and let Amazon know to remove the instance from the hardware
pool. We also structure our ETL so we have some time to reprocess.

rb

On Mon, Feb 12, 2018 at 11:49 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>
>
> On 12 Feb 2018, at 19:35, Dong Jiang <dj...@dataxu.com> wrote:
>
> I got no error messages from EMR. We write directly from dataframe to S3.
> There doesn’t appear to be an issue with S3 file, we can still down the
> parquet file and read most of the columns, just one column is corrupted in
> parquet.
> I suspect we need to write to HDFS first, make sure we can read back the
> entire data set, and then copy from HDFS to S3. Any other thoughts?
>
>
>
> The s3 object store clients mostly buffer to local temp fs before they
> write, at least all the ASF connectors do, so that data can be PUT/POSTed
> in 5+MB blocks, without requiring enough heap to buffer all data written by
> all threads. That's done to file://, not HDFS. Even if you do that copy up
> later from HDFS to S3, there's still going to be that local HDD buffering:
> it's not going to fix the problem —not if this really is corrupted local
> HDD data
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Steve Loughran <st...@hortonworks.com>.

On 12 Feb 2018, at 19:35, Dong Jiang <dj...@dataxu.com> wrote:

I got no error messages from EMR. We write directly from dataframe to S3. There doesn’t appear to be an issue with S3 file, we can still down the parquet file and read most of the columns, just one column is corrupted in parquet.
I suspect we need to write to HDFS first, make sure we can read back the entire data set, and then copy from HDFS to S3. Any other thoughts?


The S3 object store clients mostly buffer to a local temp filesystem before they write (at least all the ASF connectors do), so that data can be PUT/POSTed in 5+ MB blocks without requiring enough heap to buffer all data written by all threads. That buffering is done to file://, not HDFS. Even if you do that copy up to S3 later from HDFS, there's still going to be that local HDD buffering: it's not going to fix the problem, not if this really is corrupted local HDD data.
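
For reference, a minimal sketch of where that buffering is configured, assuming the job writes through the ASF S3A connector (s3a://) on Hadoop 2.8+; EMR's default s3:// EMRFS connector is a separate code path with its own settings, and the paths and bucket below are made up. Switching the block buffer from "disk" to heap memory takes the local temp files out of the write path at the cost of executor memory, though it does nothing if the corruption happens in RAM rather than on the buffer disk:

import org.apache.spark.sql.SparkSession

// Hedged sketch: these are Hadoop 2.8+ S3A option names, not EMRFS ones.
val spark = SparkSession.builder()
  .appName("parquet-to-s3a")
  // Default is "disk": multipart blocks are staged under fs.s3a.buffer.dir on file://
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "bytebuffer") // or "array" for on-heap
  .config("spark.hadoop.fs.s3a.buffer.dir", "/mnt/s3a-tmp")       // only used when buffering to disk
  .getOrCreate()

val df = spark.range(1000).toDF("id")                                // stand-in for the real dataframe
df.write.mode("overwrite").parquet("s3a://some-bucket/some/prefix/") // hypothetical destination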

Re: Corrupt parquet file

Posted by Dong Jiang <dj...@dataxu.com>.
I got no error messages from EMR. We write directly from the dataframe to S3. There doesn't appear to be an issue with the S3 file itself: we can still download the parquet file and read most of the columns; just one column is corrupted.
I suspect we need to write to HDFS first, make sure we can read back the entire data set, and then copy from HDFS to S3. Any other thoughts?
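
As a rough illustration of the read-back idea, a hedged sketch (the staging path and dataframe are placeholders): re-read what was just written and force every column to decode, so a corrupt page fails the writing job rather than a downstream reader. It roughly doubles the read cost and, as noted elsewhere in the thread, cannot catch corruption that happens before the bytes are written.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// Hedged sketch of write-then-verify; the staging location and dataframe are made up.
val spark = SparkSession.builder().appName("write-then-verify").getOrCreate()
val df = spark.range(1000).selectExpr("id", "cast(id as string) as payload")

val out = "hdfs:///tmp/staging/table"   // hypothetical staging location
df.write.mode("overwrite").parquet(out)

val written = spark.read.parquet(out)
// A bare count() may not decode every column; aggregating per column forces the values to be read.
val perColumnCounts = written.select(written.columns.map(c => count(col(c))): _*).first()
println(s"Per-column non-null counts after read-back: $perColumnCounts")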

From: Steve Loughran <st...@hortonworks.com>
Date: Monday, February 12, 2018 at 2:27 PM
To: "rblue@netflix.com" <rb...@netflix.com>
Cc: Dong Jiang <dj...@dataxu.com>, Apache Spark Dev <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

What failure mode is likely here?

As the uploads are signed, the network payload is not corruptible from the moment its written into the HTTPS request, which places it earlier

* RAM corruption which ECC doesn't pick up. It'd be interesting to know what stats & health checks AWS run here, such as, say, low-intensity RAM checks when VM space is idle.
* Any temp files buffering the blocks to HDD are being corrupted, which could happen with faulty physical disk? Is that likely?
* S3 itself is in trouble.

I don't see any checksum verification of disk0buffered block data before it is uploaded to S3: the files are just handed straight off to the AWS SDK. I could certainly force that through the hadoop CRC check sequence, but that complicates retransmission as well as performance.

What could work would be to build the MD5 sum of each block as it is written from spark to buffer, then verify that the returned etag of that POST/PUT matches the original value, That'd to end-to-end error checking from the JVM ram all the way to S3, leaving VM ECC and S3 itself as the failure points.

-Steve

my old work on this: https://www.slideshare.net/steve_l/did-you-reallywantthatdata



On 5 Feb 2018, at 18:41, Ryan Blue <rb...@netflix.com.INVALID>> wrote:

In that case, I'd recommend tracking down the node where the files were created and reporting it to EMR.

On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Thanks for the response, Ryan.
We have transient EMR cluster, and we do rerun the cluster whenever the cluster failed. However, in this particular case, the cluster succeeded, not reporting any errors. I was able to null out the corrupted the column and recover the rest of the 133 columns. I do feel the issue is more than 1-2 occurrences a year. This is the second time, I am aware of the issue within a month, and we certainly don’t run as large data infrastructure compared to Netflix.

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 1:34 PM

To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file


We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we encountered this issue. We have a wide table, with 134 columns in the file. The issue seems only impact one column, and very hard to detect. It seems you have encountered this issue before, what do you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, it will probably start failing the HW checks when the instance hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate actual job success (you could do other tasks after that fail) and it carries no guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on the S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is very dense and corruption in encoded values often looks like different data. When you see a decoding exception like this, we find it is usually that the compressed data was corrupted and is no longer valid. You can look for the page of data based on the value counter, but that's about it.

Even if you could find a single record that was affected, that's not valuable because you don't know whether there is other corruption that is undetectable. There's nothing to reliably recover here. What we do in this case is find and remove the bad node, then reprocess data so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi,

We are running on Spark 2.2.1, generating parquet files, like the following
pseudo code
df.write.parquet(...)
We have recently noticed parquet file corruptions, when reading the parquet
in Spark or Presto, as the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears only one column in one of the rows in the file is corrupt, the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix


Re: Corrupt parquet file

Posted by Steve Loughran <st...@hortonworks.com>.
What failure mode is likely here?

As the uploads are signed, the network payload is not corruptible from the moment it's written into the HTTPS request, which places the corruption earlier in the pipeline:

* RAM corruption which ECC doesn't pick up. It'd be interesting to know what stats & health checks AWS run here, such as, say, low-intensity RAM checks when VM space is idle.
* Temp files buffering the blocks to HDD are being corrupted, which could happen with a faulty physical disk. Is that likely?
* S3 itself is in trouble.

I don't see any checksum verification of disk-buffered block data before it is uploaded to S3: the files are just handed straight off to the AWS SDK. I could certainly force that through the Hadoop CRC check sequence, but that complicates retransmission and adds a performance cost.

What could work would be to build the MD5 sum of each block as it is written from Spark to the buffer, then verify that the returned etag of that POST/PUT matches the original value. That'd give end-to-end error checking from the JVM RAM all the way to S3, leaving VM ECC and S3 itself as the remaining failure points.
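
A rough sketch of that etag check, assuming the AWS SDK v1 and a plain single-part PUT (where S3's ETag is the hex MD5 of the object; multipart uploads would have to be verified per part, and the bucket/key below are made up):

import java.io.File
import java.nio.file.Files
import java.security.MessageDigest
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Hedged sketch: compare the MD5 of the locally buffered block with the ETag S3 returns.
val block = new File("/tmp/block-0001.bin")                 // hypothetical buffered block
val localMd5 = MessageDigest.getInstance("MD5")
  .digest(Files.readAllBytes(block.toPath))
  .map("%02x".format(_)).mkString

val s3 = AmazonS3ClientBuilder.defaultClient()
val result = s3.putObject("some-bucket", "staging/block-0001.bin", block)
val remoteEtag = result.getETag.replace("\"", "")           // strip quotes if the client returns them
if (!remoteEtag.equalsIgnoreCase(localMd5)) {
  // The bytes S3 stored are not the bytes we hashed: fail the task or retry the upload.
  throw new IllegalStateException(s"ETag $remoteEtag != local MD5 $localMd5")
}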

-Steve

my old work on this: https://www.slideshare.net/steve_l/did-you-reallywantthatdata


On 5 Feb 2018, at 18:41, Ryan Blue <rb...@netflix.com.INVALID>> wrote:

In that case, I'd recommend tracking down the node where the files were created and reporting it to EMR.

On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Thanks for the response, Ryan.
We have transient EMR cluster, and we do rerun the cluster whenever the cluster failed. However, in this particular case, the cluster succeeded, not reporting any errors. I was able to null out the corrupted the column and recover the rest of the 133 columns. I do feel the issue is more than 1-2 occurrences a year. This is the second time, I am aware of the issue within a month, and we certainly don’t run as large data infrastructure compared to Netflix.

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 1:34 PM

To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file


We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we encountered this issue. We have a wide table, with 134 columns in the file. The issue seems only impact one column, and very hard to detect. It seems you have encountered this issue before, what do you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, it will probably start failing the HW checks when the instance hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate actual job success (you could do other tasks after that fail) and it carries no guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on the S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is very dense and corruption in encoded values often looks like different data. When you see a decoding exception like this, we find it is usually that the compressed data was corrupted and is no longer valid. You can look for the page of data based on the value counter, but that's about it.

Even if you could find a single record that was affected, that's not valuable because you don't know whether there is other corruption that is undetectable. There's nothing to reliably recover here. What we do in this case is find and remove the bad node, then reprocess data so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi,

We are running on Spark 2.2.1, generating parquet files, like the following
pseudo code
df.write.parquet(...)
We have recently noticed parquet file corruptions, when reading the parquet
in Spark or Presto, as the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears only one column in one of the rows in the file is corrupt, the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix


Re: Corrupt parquet file

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
In that case, I'd recommend tracking down the node where the files were
created and reporting it to EMR.

On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang <dj...@dataxu.com> wrote:

> Thanks for the response, Ryan.
>
> We have transient EMR cluster, and we do rerun the cluster whenever the
> cluster failed. However, in this particular case, the cluster succeeded,
> not reporting any errors. I was able to null out the corrupted the column
> and recover the rest of the 133 columns. I do feel the issue is more than
> 1-2 occurrences a year. This is the second time, I am aware of the issue
> within a month, and we certainly don’t run as large data infrastructure
> compared to Netflix.
>
>
>
> I will keep an eye on this issue.
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 1:34 PM
>
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> We ensure the bad node is removed from our cluster and reprocess to
> replace the data. We only see this once or twice a year, so it isn't a
> significant problem.
>
>
>
> We've discussed options for adding write-side validation, but it is
> expensive and still unreliable if you don't trust the hardware.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi, Ryan,
>
>
> Do you have any suggestions on how we could detect and prevent this issue?
>
> This is the second time we encountered this issue. We have a wide table,
> with 134 columns in the file. The issue seems only impact one column, and
> very hard to detect. It seems you have encountered this issue before, what
> do you do to prevent a recurrence?
>
>
>
> Thanks,
>
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 12:46 PM
>
>
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> If you can still access the logs, then you should be able to find where
> the write task ran. Maybe you can get an instance ID and open a ticket with
> Amazon. Otherwise, it will probably start failing the HW checks when the
> instance hardware is reused, so I wouldn't worry about it.
>
>
>
> The _SUCCESS file convention means that the job ran successfully, at least
> to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
> indicate actual job success (you could do other tasks after that fail) and
> it carries no guarantee about the data that was written.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi, Ryan,
>
>
>
> Many thanks for your quick response.
>
> We ran Spark on transient EMR clusters. Nothing in the log or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on the S3. If we see the _SUCCESS file, does that suggest all data is
> good?
>
> How can we prevent a recurrence? Can you share your experience?
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 12:38 PM
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> Dong,
>
>
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file was
> written and remove that node as soon as possible.
>
>
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense and corruption in encoded values
> often looks like different data. When you see a decoding exception like
> this, we find it is usually that the compressed data was corrupted and is
> no longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
>
>
> Even if you could find a single record that was affected, that's not
> valuable because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess data so we know
> everything is correct from the upstream source.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi,
>
> We are running on Spark 2.2.1, generating parquet files, like the following
> pseudo code
> df.write.parquet(...)
> We have recently noticed parquet file corruptions, when reading the parquet
> in Spark or Presto, as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-
> 4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears only one column in one of the rows in the file is corrupt, the
> file has 111041 rows.
>
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Dong Jiang <dj...@dataxu.com>.
Thanks for the response, Ryan.
We run transient EMR clusters, and we rerun a cluster whenever it fails. However, in this particular case, the cluster succeeded and did not report any errors. I was able to null out the corrupted column and recover the rest of the 133 columns. I do feel the issue occurs more often than once or twice a year: this is the second occurrence I am aware of within a month, and we certainly don't run as large a data infrastructure as Netflix.

I will keep an eye on this issue.

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 1:34 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem.

We've discussed options for adding write-side validation, but it is expensive and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we encountered this issue. We have a wide table, with 134 columns in the file. The issue seems only impact one column, and very hard to detect. It seems you have encountered this issue before, what do you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:46 PM

To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, it will probably start failing the HW checks when the instance hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate actual job success (you could do other tasks after that fail) and it carries no guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on the S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is very dense and corruption in encoded values often looks like different data. When you see a decoding exception like this, we find it is usually that the compressed data was corrupted and is no longer valid. You can look for the page of data based on the value counter, but that's about it.

Even if you could find a single record that was affected, that's not valuable because you don't know whether there is other corruption that is undetectable. There's nothing to reliably recover here. What we do in this case is find and remove the bad node, then reprocess data so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi,

We are running on Spark 2.2.1, generating parquet files, like the following
pseudo code
df.write.parquet(...)
We have recently noticed parquet file corruptions, when reading the parquet
in Spark or Presto, as the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears only one column in one of the rows in the file is corrupt, the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
We ensure the bad node is removed from our cluster and reprocess to replace
the data. We only see this once or twice a year, so it isn't a significant
problem.

We've discussed options for adding write-side validation, but it is
expensive and still unreliable if you don't trust the hardware.

rb

On Mon, Feb 5, 2018 at 10:28 AM, Dong Jiang <dj...@dataxu.com> wrote:

> Hi, Ryan,
>
>
> Do you have any suggestions on how we could detect and prevent this issue?
>
> This is the second time we encountered this issue. We have a wide table,
> with 134 columns in the file. The issue seems only impact one column, and
> very hard to detect. It seems you have encountered this issue before, what
> do you do to prevent a recurrence?
>
>
>
> Thanks,
>
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 12:46 PM
>
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> If you can still access the logs, then you should be able to find where
> the write task ran. Maybe you can get an instance ID and open a ticket with
> Amazon. Otherwise, it will probably start failing the HW checks when the
> instance hardware is reused, so I wouldn't worry about it.
>
>
>
> The _SUCCESS file convention means that the job ran successfully, at least
> to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
> indicate actual job success (you could do other tasks after that fail) and
> it carries no guarantee about the data that was written.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi, Ryan,
>
>
>
> Many thanks for your quick response.
>
> We ran Spark on transient EMR clusters. Nothing in the log or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on the S3. If we see the _SUCCESS file, does that suggest all data is
> good?
>
> How can we prevent a recurrence? Can you share your experience?
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 12:38 PM
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> Dong,
>
>
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file was
> written and remove that node as soon as possible.
>
>
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense and corruption in encoded values
> often looks like different data. When you see a decoding exception like
> this, we find it is usually that the compressed data was corrupted and is
> no longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
>
>
> Even if you could find a single record that was affected, that's not
> valuable because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess data so we know
> everything is correct from the upstream source.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi,
>
> We are running on Spark 2.2.1, generating parquet files, like the following
> pseudo code
> df.write.parquet(...)
> We have recently noticed parquet file corruptions, when reading the parquet
> in Spark or Presto, as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-
> 4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears only one column in one of the rows in the file is corrupt, the
> file has 111041 rows.
>
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Dong Jiang <dj...@dataxu.com>.
Hi, Ryan,

Do you have any suggestions on how we could detect and prevent this issue?
This is the second time we have encountered this issue. We have a wide table, with 134 columns in the file. The issue seems to impact only one column and is very hard to detect. It seems you have encountered this issue before; what do you do to prevent a recurrence?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:46 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, it will probably start failing the HW checks when the instance hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to indicate actual job success (you could do other tasks after that fail) and it carries no guarantee about the data that was written.

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on the S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is very dense and corruption in encoded values often looks like different data. When you see a decoding exception like this, we find it is usually that the compressed data was corrupted and is no longer valid. You can look for the page of data based on the value counter, but that's about it.

Even if you could find a single record that was affected, that's not valuable because you don't know whether there is other corruption that is undetectable. There's nothing to reliably recover here. What we do in this case is find and remove the bad node, then reprocess data so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi,

We are running on Spark 2.2.1, generating parquet files, like the following
pseudo code
df.write.parquet(...)
We have recently noticed parquet file corruptions, when reading the parquet
in Spark or Presto, as the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears only one column in one of the rows in the file is corrupt, the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



--
Ryan Blue
Software Engineer
Netflix



--
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
If you can still access the logs, then you should be able to find where the
write task ran. Maybe you can get an instance ID and open a ticket with
Amazon. Otherwise, it will probably start failing the HW checks when the
instance hardware is reused, so I wouldn't worry about it.

The _SUCCESS file convention means that the job ran successfully, at least
to the point where _SUCCESS is created. I wouldn't rely on _SUCCESS to
indicate actual job success (you could do other tasks after that fail) and
it carries no guarantee about the data that was written.
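
A minimal sketch of checking for the marker anyway (the output path is a placeholder), treating it as "the commit phase ran" and nothing more:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hedged sketch: _SUCCESS says the job's commit phase completed, not that the bytes are good.
val outDir = new Path("s3a://some-bucket/some/prefix")   // hypothetical output directory
val fs = outDir.getFileSystem(new Configuration())
val committed = fs.exists(new Path(outDir, "_SUCCESS"))
println(s"Commit marker present: $committed")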

rb

On Mon, Feb 5, 2018 at 9:41 AM, Dong Jiang <dj...@dataxu.com> wrote:

> Hi, Ryan,
>
>
>
> Many thanks for your quick response.
>
> We ran Spark on transient EMR clusters. Nothing in the log or EMR events
> suggests any issues with the cluster or the nodes. We also see the _SUCCESS
> file on the S3. If we see the _SUCCESS file, does that suggest all data is
> good?
>
> How can we prevent a recurrence? Can you share your experience?
>
>
>
> Thanks,
>
>
> Dong
>
>
>
> *From: *Ryan Blue <rb...@netflix.com>
> *Reply-To: *"rblue@netflix.com" <rb...@netflix.com>
> *Date: *Monday, February 5, 2018 at 12:38 PM
> *To: *Dong Jiang <dj...@dataxu.com>
> *Cc: *Spark Dev List <de...@spark.apache.org>
> *Subject: *Re: Corrupt parquet file
>
>
>
> Dong,
>
>
>
> We see this from time to time as well. In my experience, it is almost
> always caused by a bad node. You should try to find out where the file was
> written and remove that node as soon as possible.
>
>
>
> As far as finding out what is wrong with the file, that's a difficult
> task. Parquet's encoding is very dense and corruption in encoded values
> often looks like different data. When you see a decoding exception like
> this, we find it is usually that the compressed data was corrupted and is
> no longer valid. You can look for the page of data based on the value
> counter, but that's about it.
>
>
>
> Even if you could find a single record that was affected, that's not
> valuable because you don't know whether there is other corruption that is
> undetectable. There's nothing to reliably recover here. What we do in this
> case is find and remove the bad node, then reprocess data so we know
> everything is correct from the upstream source.
>
>
>
> rb
>
>
>
> On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com> wrote:
>
> Hi,
>
> We are running on Spark 2.2.1, generating parquet files, like the following
> pseudo code
> df.write.parquet(...)
> We have recently noticed parquet file corruptions, when reading the parquet
> in Spark or Presto, as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-
> 4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears only one column in one of the rows in the file is corrupt, the
> file has 111041 rows.
>
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>
>
>
>
> --
>
> Ryan Blue
>
> Software Engineer
>
> Netflix
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Dong Jiang <dj...@dataxu.com>.
Hi, Ryan,

Many thanks for your quick response.
We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on the S3. If we see the _SUCCESS file, does that suggest all data is good?
How can we prevent a recurrence? Can you share your experience?

Thanks,

Dong

From: Ryan Blue <rb...@netflix.com>
Reply-To: "rblue@netflix.com" <rb...@netflix.com>
Date: Monday, February 5, 2018 at 12:38 PM
To: Dong Jiang <dj...@dataxu.com>
Cc: Spark Dev List <de...@spark.apache.org>
Subject: Re: Corrupt parquet file

Dong,

We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding is very dense and corruption in encoded values often looks like different data. When you see a decoding exception like this, we find it is usually that the compressed data was corrupted and is no longer valid. You can look for the page of data based on the value counter, but that's about it.

Even if you could find a single record that was affected, that's not valuable because you don't know whether there is other corruption that is undetectable. There's nothing to reliably recover here. What we do in this case is find and remove the bad node, then reprocess data so we know everything is correct from the upstream source.

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com>> wrote:
Hi,

We are running on Spark 2.2.1, generating parquet files, like the following
pseudo code
df.write.parquet(...)
We have recently noticed parquet file corruptions, when reading the parquet
in Spark or Presto, as the following:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
value at 40870 in block 0 in file
file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-4af35426f434.c000.snappy.parquet

Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
in col [incoming_aliases_array, list, element, key_value, value] BINARY

It appears only one column in one of the rows in the file is corrupt, the
file has 111041 rows.

My questions are
1) How can I identify the corrupted row?
2) What could cause the corruption? Spark issue or Parquet issue?

Any help is greatly appreciated.

Thanks,

Dong



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org



--
Ryan Blue
Software Engineer
Netflix

Re: Corrupt parquet file

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Dong,

We see this from time to time as well. In my experience, it is almost
always caused by a bad node. You should try to find out where the file was
written and remove that node as soon as possible.

As far as finding out what is wrong with the file, that's a difficult task.
Parquet's encoding is very dense and corruption in encoded values often
looks like different data. When you see a decoding exception like this, we
find it is usually that the compressed data was corrupted and is no longer
valid. You can look for the page of data based on the value counter, but
that's about it.

Even if you could find a single record that was affected, that's not
valuable because you don't know whether there is other corruption that is
undetectable. There's nothing to reliably recover here. What we do in this
case is find and remove the bad node, then reprocess data so we know
everything is correct from the upstream source.
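
For what it's worth, a hedged sketch of acting on that value counter (Scala, using parquet-mr's example Group reader; the local path is a placeholder): read the suspect file record by record and report how many rows decode before the exception. It only narrows things down to the first record whose page fails to decode, and per the paragraph above it doesn't make the data any more recoverable.

import org.apache.hadoop.fs.Path
import org.apache.parquet.example.data.Group
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.example.GroupReadSupport

// Hedged sketch: count how many records decode cleanly before the ParquetDecodingException fires.
val file = new Path("/tmp/part-00122-example.snappy.parquet")   // placeholder path
val reader: ParquetReader[Group] = ParquetReader.builder(new GroupReadSupport(), file).build()
var goodRows = 0L
try {
  var row: Group = reader.read()
  while (row != null) {
    goodRows += 1
    row = reader.read()
  }
  println(s"All $goodRows rows decoded without error")
} catch {
  case e: Exception => println(s"Decoding failed after $goodRows readable rows: ${e.getMessage}")
} finally {
  reader.close()
}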

rb

On Mon, Feb 5, 2018 at 9:01 AM, Dong Jiang <dj...@dataxu.com> wrote:

> Hi,
>
> We are running on Spark 2.2.1, generating parquet files, like the following
> pseudo code
> df.write.parquet(...)
> We have recently noticed parquet file corruptions, when reading the parquet
> in Spark or Presto, as the following:
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read
> value at 40870 in block 0 in file
> file:/Users/djiang/part-00122-80f4886a-75ce-42fa-b78f-
> 4af35426f434.c000.snappy.parquet
>
> Caused by: org.apache.parquet.io.ParquetDecodingException: could not read
> page Page [bytes.size=1048594, valueCount=43663, uncompressedSize=1048594]
> in col [incoming_aliases_array, list, element, key_value, value] BINARY
>
> It appears only one column in one of the rows in the file is corrupt, the
> file has 111041 rows.
>
> My questions are
> 1) How can I identify the corrupted row?
> 2) What could cause the corruption? Spark issue or Parquet issue?
>
> Any help is greatly appreciated.
>
> Thanks,
>
> Dong
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 
Ryan Blue
Software Engineer
Netflix