Posted to dev@parquet.apache.org by Brian Bowman <Br...@sas.com> on 2019/10/12 10:02:35 UTC
Re: Dictionary Decoding for BYTE_ARRAY types
Thanks Wes,
I'm getting the per-Row Group MIN/MAX BYTE_ARRAY values back. Is the maximum value length for each BYTE_ARRAY column also stored?
For example, the following BYTE_ARRAY column with three "Canadian Province" values:
MIN = "Alberta"
"British Columbia"
MAX = "Saskatchewan"
"British Columbia" is the longest value (16 bytes), though it's not a MIN/MAX value. Is the maximum value length (16 in this example) of each BYTE_ARRAY column stored in any Parquet column-scoped metadata?
Thanks,
Brian
On 9/12/19, 6:10 PM, "Wes McKinney" <we...@gmail.com> wrote:
EXTERNAL
See https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120
On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman <Br...@sas.com> wrote:
>
> Thanks Wes,
>
> With that in mind, I’m searching for a public API that returns the maximum value length for ByteArray columns. Can you point me to an example?
>
> -Brian
>
> On 9/12/19, 5:34 PM, "Wes McKinney" <we...@gmail.com> wrote:
>
> EXTERNAL
>
> The memory references returned by ReadBatch are not guaranteed to
> persist from one function call to the next. So you need to copy the
> ByteArray data into your own data structures before calling ReadBatch
> again.
>
> Column readers for different columns are independent from each other.
> So function calls for column 7 should not affect anything having to do
> with column 4.
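The copy Wes describes can be sketched as follows. This is a minimal illustration, assuming a (len, ptr) descriptor layout like parquet-cpp's `parquet::ByteArray`; the `ByteArray` struct and `CopyByteArrays` helper here are hypothetical names defined only for the example, not library API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for parquet-cpp's ByteArray descriptor: a
// (len, ptr) pair whose ptr references reader-owned memory that the
// next ReadBatch() call may invalidate.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Copy each value's bytes into caller-owned std::strings so the data
// survives subsequent ReadBatch() calls on any column reader.
std::vector<std::string> CopyByteArrays(const ByteArray* values,
                                        int64_t values_read) {
  std::vector<std::string> owned;
  owned.reserve(static_cast<size_t>(values_read));
  for (int64_t i = 0; i < values_read; ++i) {
    owned.emplace_back(reinterpret_cast<const char*>(values[i].ptr),
                       values[i].len);
  }
  return owned;
}
```

The copy would be made immediately after each ReadBatch() returns, before issuing the next ReadBatch() on any column.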
>
> On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman <Br...@sas.com> wrote:
> >
> > All,
> >
> > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
> >
> > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.
> >
> > In the following ReadBatch(), rowsToRead is already set to all rows in the Row Group. The quantity is verified by the return value in values_read.
> >
> > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
> >
> > Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string – although not from the original dictionary values in the .parquet file being read.
> >
> > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.
> >
> > Is this expected behavior or a bug? If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again. Is this the case?
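The copy-and-back-patch approach Brian describes could look like this: concatenate all value bytes into one caller-owned arena, then repoint each descriptor at the copy. `ByteArray` and `BackPatchByteArrays` are hypothetical names for illustration, again assuming the (len, ptr) descriptor layout.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical (len, ptr) descriptor, as in the message above.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Copy all value bytes into one caller-owned arena and back-patch each
// descriptor's ptr to point at the copy. The returned arena must
// outlive the descriptors.
std::vector<uint8_t> BackPatchByteArrays(ByteArray* values, int64_t n) {
  size_t total = 0;
  for (int64_t i = 0; i < n; ++i) total += values[i].len;

  std::vector<uint8_t> arena;
  arena.reserve(total);  // fixed capacity: no reallocation below
  for (int64_t i = 0; i < n; ++i) {
    const uint8_t* dst = arena.data() + arena.size();
    arena.insert(arena.end(), values[i].ptr, values[i].ptr + values[i].len);
    values[i].ptr = dst;  // back-patch the descriptor
  }
  return arena;
}
```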
> >
> > Thanks for clarifying,
> >
> >
> > -Brian
> >
> >
> >
> >
>
>
Re: Dictionary Decoding for BYTE_ARRAY types
Posted by Wes McKinney <we...@gmail.com>.
On Sat, Oct 12, 2019 at 5:10 AM Brian Bowman <Br...@sas.com> wrote:
>
> Thanks Wes,
>
> I'm getting the per-Row Group MIN/MAX BYTE_ARRAY values back. Is the maximum value length for each BYTE_ARRAY column also stored?
No, they are not.
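Since the maximum length is not recorded in the metadata, one option is to derive it while scanning the values a ReadBatch() call returns. A minimal sketch, with the same hypothetical `ByteArray` descriptor as above:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical (len, ptr) descriptor matching Parquet's BYTE_ARRAY layout.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Parquet column metadata records MIN/MAX *values* but not lengths, so
// the longest value must be found by scanning the batch itself.
uint32_t MaxByteArrayLength(const ByteArray* values, int64_t values_read) {
  uint32_t max_len = 0;
  for (int64_t i = 0; i < values_read; ++i) {
    max_len = std::max(max_len, values[i].len);
  }
  return max_len;
}
```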
> For example, the following BYTE_ARRAY column with three "Canadian Province" values:
>
> MIN = "Alberta"
>
> "British Columbia"
>
> MAX = "Saskatchewan"
>
> "British Columbia" is the longest value (16 bytes), though it's not a MIN/MAX value. Is the maximum value length (16 in this example) of each BYTE_ARRAY column stored in any Parquet column-scoped metadata?
>
> Thanks,
>
>
> Brian
>
> On 9/12/19, 6:10 PM, "Wes McKinney" <we...@gmail.com> wrote:
>
> EXTERNAL
>
> See https://github.com/apache/arrow/blob/master/cpp/src/parquet/metadata.h#L120
>
> On Thu, Sep 12, 2019 at 4:59 PM Brian Bowman <Br...@sas.com> wrote:
> >
> > Thanks Wes,
> >
> > With that in mind, I’m searching for a public API that returns the maximum value length for ByteArray columns. Can you point me to an example?
> >
> > -Brian
> >
> > On 9/12/19, 5:34 PM, "Wes McKinney" <we...@gmail.com> wrote:
> >
> > EXTERNAL
> >
> > The memory references returned by ReadBatch are not guaranteed to
> > persist from one function call to the next. So you need to copy the
> > ByteArray data into your own data structures before calling ReadBatch
> > again.
> >
> > Column readers for different columns are independent from each other.
> > So function calls for column 7 should not affect anything having to do
> > with column 4.
> >
> > On Thu, Sep 12, 2019 at 4:29 PM Brian Bowman <Br...@sas.com> wrote:
> > >
> > > All,
> > >
> > > I’m debugging a low-level API Parquet reader case where the table has DOUBLE, BYTE_ARRAY, and FIXED_LENGTH_BYTE_ARRAY types.
> > >
> > > Four of the columns (ordinally 3, 4, 7, 9) are of type BYTE_ARRAY.
> > >
> > > In the following ReadBatch(), rowsToRead is already set to all rows in the Row Group. The quantity is verified by the return value in values_read.
> > >
> > > byte_array_reader->ReadBatch(rowsToRead, nullptr, nullptr, rowColPtr, &values_read);
> > >
> > > Column 4 is dictionary encoded. Upon return from its ReadBatch() call, the result vector of BYTE_ARRAY descriptors (rowColPtr) has correct len/ptr pairs pointing into a decoded dictionary string – although not from the original dictionary values in the .parquet file being read.
> > >
> > > As soon as the ReadBatch() call is made for the next BYTE_ARRAY column (#7), a new DICTIONARY_PAGE is read and the BYTE_ARRAY descriptor values for column 4 are trashed.
> > >
> > > Is this expected behavior or a bug? If expected, then it seems the dictionary values for Column 4 (… or any BYTE_ARRAY column that is dictionary-compressed) should be copied and the descriptor vector addresses back-patched, BEFORE invoking ReadBatch() again. Is this the case?
> > >
> > > Thanks for clarifying,
> > >
> > >
> > > -Brian
> > >
> > >
> > >
> > >
> >
> >
>
>