Posted to dev@parquet.apache.org by "Florian Scheibner (JIRA)" <ji...@apache.org> on 2016/10/07 18:35:20 UTC

[jira] [Updated] (PARQUET-739) Rle-decoding uses static buffer that is shared across threads

     [ https://issues.apache.org/jira/browse/PARQUET-739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Florian Scheibner updated PARQUET-739:
--------------------------------------
    Description: 
Reading two Parquet files in parallel led to memory corruption that caused a crash. The columns are RLE dictionary-encoded strings in an uncompressed page, created with parquet-mr.

Initial debugging showed that the dictionary indices returned by the RLE decoder are garbage, so the data page appears corrupted in memory. Reading the files in a single thread works.

I have a ColumnReader for each column and read one element from each column to get a complete row.
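
For context, the read loop looks roughly like the sketch below. It is simplified and uses my own placeholder names; only TypedColumnReader, ByteArray and the ReadBatch signature are taken from the sanitizer trace further down, the rest is illustrative.
{code}
// Rough sketch of the per-row read pattern (placeholder names, not real API
// beyond TypedColumnReader/ByteArray/ReadBatch from the trace below).
#include <cstdint>
#include <memory>
#include <vector>

#include <parquet/column/reader.h>  // parquet-cpp column reader; header path may vary by version

void ReadRows(
    std::vector<std::shared_ptr<parquet::TypedColumnReader<parquet::ByteArrayType>>>& readers,
    int64_t num_rows) {
  for (int64_t row = 0; row < num_rows; ++row) {
    for (auto& reader : readers) {
      int16_t def_level = 0;
      int16_t rep_level = 0;
      parquet::ByteArray value;
      int64_t values_read = 0;
      // Read one value from this column; one value per column forms a row.
      reader->ReadBatch(1, &def_level, &rep_level, &value, &values_read);
      // ... append 'value' to the current row ...
    }
  }
}
{code}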

The indices are decoded into a single global static buffer, so multiple threads all use the same buffer and overwrite each other's indices.
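
To make the failure mode concrete, here is a minimal standalone illustration. This is not the parquet-cpp code, just a sketch of why a static scratch buffer breaks under concurrent decoding and how a per-decoder buffer avoids it:
{code}
// Minimal illustration only -- not the actual parquet-cpp implementation.
#include <vector>

// Problem: a function-local static buffer is one object for the whole
// process, so every decoder on every thread writes its indices into the
// same storage and they overwrite each other.
int* SharedIndexBuffer(int size) {
  static std::vector<int> buffer;  // shared by all threads
  buffer.resize(size);             // not thread-safe, and contents are shared
  return buffer.data();
}

// Fix sketch: keep the scratch buffer per decoder instance (or thread_local),
// so concurrent readers never touch each other's indices.
class IndexDecoder {
 public:
  int* ScratchBuffer(int size) {
    buffer_.resize(size);
    return buffer_.data();  // private to this decoder
  }

 private:
  std::vector<int> buffer_;
};
{code}
With a per-instance buffer the two file readers no longer share any decode state, which matches the observation that single-threaded reading works.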

  was:
Reading two Parquet files in parallel led to memory corruption that caused a crash. The columns are RLE dictionary-encoded strings in an uncompressed page, created with parquet-mr. AddressSanitizer (-fsanitize=address) tracked the issue to a use-after-free:
{code}
=================================================================
==81678==ERROR: AddressSanitizer: heap-use-after-free on address 0x6060001088c0 at pc 0x000003dbd42b bp 0x7fffe30fbe00 sp 0x7fffe30fbdf8
READ of size 16 at 0x6060001088c0 thread T8
   #0 0x3dbd42a in int parquet::RleDecoder::GetBatchWithDict<parquet::ByteArray>(parquet::Vector<parquet::ByteArray> const&, parquet::ByteArray*, int) (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3dbd42a)
   #1 0x3db8efa in parquet::DictionaryDecoder<parquet::DataType<(parquet::Type::type)6> >::Decode(parquet::ByteArray*, int) (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3db8efa)
   #2 0x3d84767 in parquet::TypedColumnReader<parquet::DataType<(parquet::Type::type)6> >::ReadValues(long, parquet::ByteArray*) (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3d84767)
   #3 0x3d83497 in parquet::TypedColumnReader<parquet::DataType<(parquet::Type::type)6> >::ReadBatch(int, short*, short*, parquet::ByteArray*, long*) (/home/fscheibner/Snowflake/ExecPlatform/bin/snowflake+0x3d83497)
{code}

Initial debugging showed that the dictionary indices returned by the RLE decoder are garbage, so the data page appears corrupted in memory. Reading the files in a single thread works.

I have a ColumnReader for each column and read one element from each column to get a complete row.
My guess is that some data buffer is freed and then later still used for reading. I couldn't track down the source yet. Any ideas [~wesmckinn]?

        Summary: Rle-decoding uses static buffer that is shared across threads  (was: Read after free with uncompressed page)

> Rle-decoding uses static buffer that is shared across threads
> --------------------------------------------------------------
>
>                 Key: PARQUET-739
>                 URL: https://issues.apache.org/jira/browse/PARQUET-739
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: Florian Scheibner
>            Assignee: Florian Scheibner
>
> Reading two Parquet files in parallel led to memory corruption that caused a crash. The columns are RLE dictionary-encoded strings in an uncompressed page, created with parquet-mr.
> Initial debugging showed that the dictionary indices returned by the RLE decoder are garbage, so the data page appears corrupted in memory. Reading the files in a single thread works.
> I have a ColumnReader for each column and read one element from each column to get a complete row.
> The indices are decoded into a single global static buffer, so multiple threads all use the same buffer and overwrite each other's indices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)