Posted to dev@parquet.apache.org by Brian Bowman <Br...@sas.com> on 2019/10/12 10:02:35 UTC

Re: Dictionary Decoding for BYTE_ARRAY types

Thanks Wes,

I'm getting the per-Row Group MAX/MIN BYTE_ARRAY values back.  Is the maximum value length for each BYTE_ARRAY column also stored?

For example, consider the following BYTE_ARRAY column with three "Canadian Province" values:

MIN = "Alberta"
      "British Columbia"
MAX = "Saskatchewan"

"British Columbia" is the longest value (16 bytes), though it is neither the MIN nor the MAX.  Is this per-column maximum value length (16 in this example) stored in any Parquet column-scoped metadata?

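In case it helps, here is roughly how I am pulling those per-Row Group statistics through the C++ metadata API.  This is only a sketch: the function and file names are illustrative, error handling is omitted, and the cast assumes the column's physical type is BYTE_ARRAY.

    #include <iostream>
    #include <memory>
    #include <string>
    #include <parquet/api/reader.h>
    #include <parquet/statistics.h>

    // Walk the Row Groups of one BYTE_ARRAY column and report its MIN/MAX statistics.
    void DumpByteArrayMinMax(const std::string& path, int column_index) {
      std::unique_ptr<parquet::ParquetFileReader> reader =
          parquet::ParquetFileReader::OpenFile(path);
      std::shared_ptr<parquet::FileMetaData> file_md = reader->metadata();
      for (int rg = 0; rg < file_md->num_row_groups(); ++rg) {
        std::unique_ptr<parquet::RowGroupMetaData> rg_md = file_md->RowGroup(rg);
        std::unique_ptr<parquet::ColumnChunkMetaData> col_md =
            rg_md->ColumnChunk(column_index);
        if (!col_md->is_stats_set()) continue;  // no statistics written for this chunk
        auto stats = std::static_pointer_cast<parquet::ByteArrayStatistics>(
            col_md->statistics());
        if (!stats->HasMinMax()) continue;
        parquet::ByteArray min_val = stats->min();  // per-Row Group MIN value
        parquet::ByteArray max_val = stats->max();  // per-Row Group MAX value
        std::cout << "row group " << rg << ": MIN len=" << min_val.len
                  << " MAX len=" << max_val.len << std::endl;
      }
    }

That gives me the MIN/MAX values (and their individual lengths), but I do not see anything about the longest value in the column chunk, hence the question.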
Thanks,


Brian

On 9/12/19, 6:10 PM, "Wes McKinney" <we...@gmail.com> wrote:

    
    See https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120
    
    On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman <Br...@sas.com> wrote:
    >
    > Thanks Wes,
    >
    > With that in mind, I'm searching for a public API that returns the maximum value length for ByteArray columns.  Can you point me to an example?
    >
    > -Brian
    >
    > On 9/12/19, 5:34 PM, "Wes McKinney" <we...@gmail.com> wrote:
    >
    >
    >     The memory references returned by ReadBatch are not guaranteed to
    >     persist from one function call to the next. So you need to copy the
    >     ByteArray data into your own data structures before calling ReadBatch
    >     again.
    >
    >     Column readers for different columns are independent from each other.
    >     So function calls for column 7 should not affect anything having to do
    >     with column 4.
    >
    >     On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman <Br...@sas.com> wrote:
    >     >
    >     > All,
    >     >
    >     > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
    >     >
    >     > Four of the columns (ordinals 3, 4, 7, and 9) are of type BYTE_ARRAY.
    >     >
    >     > In the following ReadBatch() call, rowsToRead is already set to the number of rows in the Row Group; the quantity is verified by the values_read return value.
    >     >
    >     >       byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
    >     >
    >     > Column 4 is dictionary encoded.  Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary buffer, although not at the original dictionary values in the .parquet file being read.
    >     >
    >     > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.
    >     >
    >     > Is this expected behavior or a bug?  If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again.  Is this the case?
    >     >
    >     > Thanks for clarifying,
    >     >
    >     >
    >     > -Brian
    >     >
    >     >
    >     >
    >     >
    >
    >
    


Re: Dictionary Decoding for BYTE_ARRAY types

Posted by Wes McKinney <we...@gmail.com>.
On Sat, Oct 12, 2019 at 5:10 AM Brian Bowman <Br...@sas.com> wrote:
>
> Thanks Wes,
>
> I'm getting the per-Row Group MAX/MIN BYTE_ARRAY values back.  Is the maximum value length for each BYTE_ARRAY column also stored?

No, they are not.
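The statistics only carry the min/max values themselves (plus null and distinct counts), so if you need the longest value length you would have to compute it while reading the data.  One way, sketched below with arbitrary names and batch size, is to track it while copying the ByteArray values out of ReadBatch, which you need to do anyway since the returned pointers are not guaranteed to persist across calls:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <vector>
    #include <parquet/api/reader.h>

    // Copy every value of a BYTE_ARRAY column into owned strings and return
    // the longest value length seen.  (Sketch only.)
    uint32_t CopyColumnAndFindMaxLen(parquet::ByteArrayReader* reader,
                                     std::vector<std::string>* out) {
      constexpr int64_t kBatchSize = 1024;  // arbitrary
      std::vector<parquet::ByteArray> batch(kBatchSize);
      uint32_t max_len = 0;
      int64_t values_read = 0;
      while (reader->HasNext()) {
        // nullptr def/rep levels, as in your call, assumes a required, non-repeated column.
        reader->ReadBatch(kBatchSize, nullptr, nullptr, batch.data(), &values_read);
        for (int64_t i = 0; i < values_read; ++i) {
          // Copy now: the ptr/len pairs may point into buffers (e.g. a decoded
          // dictionary) that are reused by later ReadBatch() calls.
          out->emplace_back(reinterpret_cast<const char*>(batch[i].ptr), batch[i].len);
          max_len = std::max(max_len, batch[i].len);
        }
      }
      return max_len;
    }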

> For example, consider the following BYTE_ARRAY column with three "Canadian Province" values:
>
> MIN = "Alberta"
>       "British Columbia"
> MAX = "Saskatchewan"
>
> "British Columbia" is the longest value (16 bytes), though it is neither the MIN nor the MAX.  Is this per-column maximum value length (16 in this example) stored in any Parquet column-scoped metadata?
>
> Thanks,
>
>
> Brian
>
> On 9/12/19, 6:10 PM, "Wes McKinney" <we...@gmail.com> wrote:
>
>
>     See https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120
>
>     On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman <Br...@sas.com> wrote:
>     >
>     > Thanks Wes,
>     >
>     > With that in mind, I'm searching for a public API that returns the maximum value length for ByteArray columns.  Can you point me to an example?
>     >
>     > -Brian
>     >
>     > On 9/12/19, 5:34 PM, "Wes McKinney" <we...@gmail.com> wrote:
>     >
>     >
>     >     The memory references returned by ReadBatch are not guaranteed to
>     >     persist from one function call to the next. So you need to copy the
>     >     ByteArray data into your own data structures before calling ReadBatch
>     >     again.
>     >
>     >     Column readers for different columns are independent from each other.
>     >     So function calls for column 7 should not affect anything having to do
>     >     with column 4.
>     >
>     >     On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman <Br...@sas.com> wrote:
>     >     >
>     >     > All,
>     >     >
>     >     > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
>     >     >
>     >     > Four of the columns (ordinals 3, 4, 7, and 9) are of type BYTE_ARRAY.
>     >     >
>     >     > In the following ReadBatch() call, rowsToRead is already set to the number of rows in the Row Group; the quantity is verified by the values_read return value.
>     >     >
>     >     >       byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
>     >     >
>     >     > Column 4 is dictionary encoded.  Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary buffer, although not at the original dictionary values in the .parquet file being read.
>     >     >
>     >     > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.
>     >     >
>     >     > Is this expected behavior or a bug?  If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again.  Is this the case?
>     >     >
>     >     > Thanks for clarifying,
>     >     >
>     >     >
>     >     > -Brian
>     >     >
>     >     >
>     >     >
>     >     >
>     >
>     >
>
>