Posted to dev@parquet.apache.org by "Marios Meimaris (Jira)" <ji...@apache.org> on 2021/06/24 12:22:00 UTC
[jira] [Updated] (PARQUET-2060) Parquet corruption can cause infinite loop with Snappy
[ https://issues.apache.org/jira/browse/PARQUET-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Marios Meimaris updated PARQUET-2060:
-------------------------------------
Description:
I am attaching a valid and a corrupt Parquet file (DataPageV2) that differ in a single byte.
We hit an infinite loop when reading the corrupt file at [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java#L698], specifically in the `page.getData().toInputStream()` call.
Stack trace of infinite loop:
java.io.DataInputStream.readFully(DataInputStream.java:195)
java.io.DataInputStream.readFully(DataInputStream.java:169)
org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:287)
org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:698)
org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:57)
org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:628)
org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
Under the hood, the call to `readFully` goes through `NonBlockedDecompressorStream`, which always hits this path: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressorStream.java#L45]. As a result, `setInput` is never called on the decompressor, and every subsequent call to `decompress` hits this condition: [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyDecompressor.java#L54]. The `read` method therefore returns 0, which causes an infinite loop in [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/DataInputStream.java#L198].
This originates from the corruption: it causes the data page's input stream to have size 0, which makes `getCompressedData` always return -1.
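To illustrate the hang without reference to the Parquet classes, here is a minimal sketch (class and method names are mine, not parquet-mr's) of why a `read` that keeps returning 0 stalls a `readFully`-style loop. The real `DataInputStream.readFully` has no iteration bound, so it spins forever; the bound here exists only so the demo terminates:

```java
import java.io.IOException;
import java.io.InputStream;

public class ZeroReadLoopDemo {
    // Mimics the corrupt-file case: the stream never delivers bytes,
    // but also never signals end-of-stream (-1), so callers that loop
    // "until n bytes are read" make no progress.
    static class ZeroReturningStream extends InputStream {
        @Override
        public int read() {
            return 0; // unused by the array variant below
        }
        @Override
        public int read(byte[] b, int off, int len) {
            return 0; // no bytes, but not EOF either
        }
    }

    // Simplified version of DataInputStream.readFully's loop, with a
    // bounded iteration count so the demo terminates instead of hanging.
    static int boundedReadFully(InputStream in, byte[] buf, int maxIters)
            throws IOException {
        int total = 0;
        for (int i = 0; i < maxIters && total < buf.length; i++) {
            int n = in.read(buf, total, buf.length - total);
            if (n < 0) {
                throw new IOException("unexpected EOF");
            }
            total += n; // n == 0 means no progress; the real loop spins forever
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        int got = boundedReadFully(new ZeroReturningStream(), new byte[6], 1000);
        System.out.println(got); // prints 0: a thousand reads, zero progress
    }
}
```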
I am wondering whether this can be caught earlier so that the read fails fast on such corruption.
Since this happens in `BytesInput.toInputStream`, I don't think it's only relevant to DataPageV2.
In [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111], if we call `bytes.toByteArray` and log its length, it is 0 for the corrupt file and 6 for the valid file.
A potential fix is to check the array size there and fail early, but I am not sure whether a zero-length byte array can ever legitimately occur for valid files.
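One possible shape for that early check, sketched standalone (the method and message are hypothetical; the real fix would first need to confirm that a zero-length compressed buffer is never legal when a non-zero uncompressed size is expected):

```java
import java.io.IOException;

public class DecompressGuardSketch {
    // Hypothetical guard for the decompress path: a non-empty
    // uncompressed result cannot come from zero compressed bytes,
    // so throw instead of handing the buffer to the decompressor
    // stream and looping.
    static void checkCompressedInput(byte[] input, int uncompressedSize)
            throws IOException {
        if (input.length == 0 && uncompressedSize > 0) {
            throw new IOException("Corrupt page: expected " + uncompressedSize
                    + " uncompressed bytes from an empty compressed buffer");
        }
    }

    public static void main(String[] args) {
        try {
            checkCompressedInput(new byte[0], 6); // the corrupt-file case
            System.out.println("no error");
        } catch (IOException e) {
            System.out.println("failed early: " + e.getMessage());
        }
    }
}
```

With the corrupt file's values (0 compressed bytes, 6 expected), this throws immediately instead of entering the read loop.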
Attached:
Valid file: `datapage_v2_snappy.parquet`
Corrupt file: `datapage_v2_snappy.parquet1383`
> Parquet corruption can cause infinite loop with Snappy
> ------------------------------------------------------
>
> Key: PARQUET-2060
> URL: https://issues.apache.org/jira/browse/PARQUET-2060
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Marios Meimaris
> Priority: Major
> Attachments: datapage_v2.snappy.parquet, datapage_v2.snappy.parquet1383
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)