Posted to dev@parquet.apache.org by "James Turton (Jira)" <ji...@apache.org> on 2022/02/17 12:12:00 UTC
[jira] [Created] (PARQUET-2126) Thread safety bug in CodecFactory
James Turton created PARQUET-2126:
-------------------------------------
Summary: Thread safety bug in CodecFactory
Key: PARQUET-2126
URL: https://issues.apache.org/jira/browse/PARQUET-2126
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Affects Versions: 1.12.2
Reporter: James Turton
The code for returning Compressor objects to the caller goes to some lengths to achieve thread safety, including keeping Codec objects in an Apache Commons pool that has thread-safe borrow semantics. This is all undone by the BytesCompressor and BytesDecompressor Maps in org.apache.parquet.hadoop.CodecFactory which end up caching single compressor and decompressor instances due to code in CodecFactory@getCompressor and CodecFactory@getDecompressor. When the caller runs multiple threads, those threads end up sharing compressor and decompressor instances.
For compressors based on Xerial Snappy this bug has no effect because that library is itself thread safe. But when BuiltInGzipCompressor from Hadoop is selected for the CompressionCodecName.GZIP case, serious problems ensue. That class is not thread safe and sharing one instance of it between threads produces both silent data corruption and JVM crashes.
To fix this situation, parquet-mr should stop caching single compressor and decompressor instances.
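The sharing pattern described above can be illustrated with a minimal, self-contained sketch. It does not use the real parquet-mr or Hadoop classes: FakeCompressor, CachingFactory and FreshFactory are hypothetical stand-ins, and the map-based cache only approximates the behaviour attributed to CodecFactory in this report. It shows that a per-codec cache hands the same instance to two threads, while creating an instance per request does not.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class SharedInstanceSketch {

    /** Hypothetical stand-in for a stateful compressor that is not thread safe. */
    static final class FakeCompressor { }

    /** Caches one instance per codec name, so every caller on any thread gets the same object. */
    static final class CachingFactory {
        private final Map<String, FakeCompressor> cache = new ConcurrentHashMap<>();

        FakeCompressor getCompressor(String codecName) {
            return cache.computeIfAbsent(codecName, name -> new FakeCompressor());
        }
    }

    /** Creates a new instance per request, so no instance is shared between callers. */
    static final class FreshFactory {
        FakeCompressor getCompressor(String codecName) {
            return new FakeCompressor();
        }
    }

    /** Asks for a compressor from two threads and reports whether both got the same object. */
    static boolean sameInstanceAcrossThreads(Supplier<FakeCompressor> source) {
        FakeCompressor[] seen = new FakeCompressor[2];
        Thread first = new Thread(() -> seen[0] = source.get());
        Thread second = new Thread(() -> seen[1] = source.get());
        first.start();
        second.start();
        try {
            // join() makes the writes to 'seen' visible to this thread.
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker threads", e);
        }
        return seen[0] == seen[1];
    }

    public static void main(String[] args) {
        CachingFactory caching = new CachingFactory();
        FreshFactory fresh = new FreshFactory();
        System.out.println("cached factory shares instance: "
                + sameInstanceAcrossThreads(() -> caching.getCompressor("GZIP")));
        System.out.println("fresh factory shares instance: "
                + sameInstanceAcrossThreads(() -> fresh.getCompressor("GZIP")));
    }
}
```

With a thread-safe compressor the shared instance is harmless; with a stateful one, the two threads would be mutating the same internal buffers concurrently. Whether the real fix should allocate per call or borrow from a pool is left open here.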