Posted to dev@parquet.apache.org by "Samarth Jain (Jira)" <ji...@apache.org> on 2019/08/20 21:22:00 UTC

[jira] [Updated] (PARQUET-1641) Parquet pages for different columns cannot be read in parallel

     [ https://issues.apache.org/jira/browse/PARQUET-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Samarth Jain updated PARQUET-1641:
----------------------------------
    Description: 
All ColumnChunkPageReader instances use the same decompressor. 

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]
{code:java}
// the factory returns its cached decompressor for this codec, so every
// ColumnChunkPageReader built here shares the same instance
BytesInputDecompressor decompressor = options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
{code}
The CodecFactory caches a decompressor for every codec type, returning the same instance on every getDecompressor(codecName) call. See the caching here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]
{code:java}
@Override
public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
  BytesDecompressor decomp = decompressors.get(codecName);
  if (decomp == null) {
    decomp = createDecompressor(codecName);
    decompressors.put(codecName, decomp);
  }
  return decomp;
}
{code}
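
Because every call returns the same cached object, all page readers for a given codec end up sharing that object's internal state. The following stand-alone sketch is hypothetical (SharedDecompressor is an invented stand-in, not a parquet-mr class) and only illustrates why letting two threads use one stateful decompressor can corrupt results:
{code:java}
import java.util.concurrent.*;

// Hypothetical stand-in for a stateful, non-thread-safe decompressor:
// it stages input in a field before producing output.
class SharedDecompressor {
    private byte[] staged;                 // shared mutable state

    byte[] decompress(byte[] input) {
        staged = input;                    // thread A stores its input...
        Thread.yield();                    // ...thread B may overwrite it here...
        return staged.clone();             // ...so A can return B's data
    }
}

public class SharedDecompressorDemo {
    public static void main(String[] args) throws Exception {
        SharedDecompressor shared = new SharedDecompressor(); // one cached instance
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Callable<byte[]> columnA = () -> shared.decompress(new byte[]{1, 1, 1});
        Callable<byte[]> columnB = () -> shared.decompress(new byte[]{2, 2, 2});

        // Run the two "column reads" repeatedly; with a single shared instance
        // a result occasionally comes back holding the other column's bytes.
        for (int i = 0; i < 10_000; i++) {
            Future<byte[]> a = pool.submit(columnA);
            Future<byte[]> b = pool.submit(columnB);
            if (a.get()[0] != 1 || b.get()[0] != 2) {
                System.out.println("corrupted read at iteration " + i);
                break;
            }
        }
        pool.shutdown();
    }
}
{code}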
 

If multiple threads try to read pages belonging to different columns, they run into thread-safety issues because they all share the same cached decompressor. This prevents applications from increasing the throughput at which they read Parquet data by parallelizing page reads.
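
One possible direction, sketched here only as an illustration (it assumes that creating a fresh decompressor per column chunk is acceptable, and it is not an actual parquet-mr change), is for the factory to stop handing out the cached instance:
{code:java}
// Hypothetical, non-caching variant of CodecFactory.getDecompressor above;
// assumes the same BytesDecompressor type and createDecompressor(...) helper.
// Each ColumnChunkPageReader would then own a private decompressor, so pages
// of different columns could be decompressed in parallel without shared state.
@Override
public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
  return createDecompressor(codecName);
}
{code}
Keeping the cache but making each cached decompressor safe for concurrent use would be another option; either way, per-column page readers need to stop sharing mutable decompressor state.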

  was:
All ColumnChunkPageReader instances use the same decompressor. 

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]

<code>
BytesInputDecompressor decompressor = options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
</code>

 

The CodecFactory caches a decompressor for every codec type, returning the same instance on every getDecompressor(codecName) call. See the caching here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]

<code>
@Override
public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
  BytesDecompressor decomp = decompressors.get(codecName);
  if (decomp == null) {
    decomp = createDecompressor(codecName);
    decompressors.put(codecName, decomp);
  }
  return decomp;
}
</code>

 

If multiple threads try to read pages belonging to different columns, they run into thread-safety issues because they all share the same cached decompressor. This prevents applications from increasing the throughput at which they read Parquet data by parallelizing page reads.


> Parquet pages for different columns cannot be read in parallel 
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1641
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1641
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Samarth Jain
>            Priority: Major
>
> All ColumnChunkPageReader instances use the same decompressor. 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]
> {code:java}
> BytesInputDecompressor decompressor = options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
> return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
> {code}
> The CodecFactory caches a decompressor for every codec type, returning the same instance on every getDecompressor(codecName) call. See the caching here:
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]
> {code:java}
> @Override
> public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
>   BytesDecompressor decomp = decompressors.get(codecName);
>   if (decomp == null) {
>     decomp = createDecompressor(codecName);
>     decompressors.put(codecName, decomp);
>   }
>   return decomp;
> }
> {code}
>  
> If multiple threads try to read pages belonging to different columns, they run into thread-safety issues because they all share the same cached decompressor. This prevents applications from increasing the throughput at which they read Parquet data by parallelizing page reads.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)