Posted to dev@parquet.apache.org by Brian Bowman <Br...@sas.com> on 2019/05/13 15:42:51 UTC

Definition Levels and Null

All,

I’m working to integrate the historic usage of SAS missing values for IEEE doubles into our SAS Viya Parquet integration.  SAS writes a NaN to represent floating-point doubles that are “missing,” i.e. NULL in more general data management terms.

Of course SAS’ goal is to create .parquet files that are universally readable.  Therefore, it appears that the SAS Parquet writer(s) will NOT be able to write the usual NaN to represent “missing,” because doing so will cause floating-point exceptions in other readers.

Based on the Parquet documentation at https://parquet.apache.org/documentation/latest/ and by examining code, I understand that Parquet NULL values are indicated by setting 0 at the definition-level vector offset corresponding to each NULL column value.

Conversely, it appears that the per-column, per-page definition level data is never written when required is not specified for the column schema.

Is my understanding and Parquet terminology correct here?

Thanks,

Brian

Re: Definition Levels and Null

Posted by Brian Bowman <Br...@sas.com>.
Thanks Wes,

We are using the Parquet C++ low-level APIs. 

Our Parquet "adapter" code will translate the SAS "missing" NaN representation to the correct position in the int16_t def level vector passed to the Parquet low-level writer.   Similarly, this adapter will reconstitute the NaN "missing" representation from the def level vector returned from LevelDecoder() at https://github.com/apache/parquet-cpp/blob/master/src/parquet/column_reader.cc#L77 up through ReadBatch() and ultimately back to SAS.

-Brian

On 5/13/19, 2:48 PM, "Wes McKinney" <we...@gmail.com> wrote:

    EXTERNAL
    
    To comment from the Parquet C++ side, we expose two writer APIs
    
    * High level, using Apache Arrow -- use Arrow's bitmap-based
    null/valid representation for null values, NaN is NaN
    * Low level, where you produce your own repetition/definition levels
    
    So if you're using the low level API, and you have values like
    
    [1, 2, 3, NULL = NaN, 5]
    
    then you could represent this as
    
    def_levels = [1, 1, 1, 0, 1]
    rep_levels = nullptr
    values = [1, 2, 3, 5]
    
    If you don't use the definition level encoding of nulls then other
    readers will presume the values to be non-null.
    
    On Mon, May 13, 2019 at 1:06 PM Tim Armstrong
    <ta...@cloudera.com.invalid> wrote:
    >
    > > I see that OPTIONAL or REPEATED must be specified as the Repetition type
    > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
    > SchemaDescriptor::BuildTree method at
    > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
    > shows how this causes max_def_level to increment.
    > That seems right. If your data doesn't have any complex types in it,
    > max_def_level will always be 0 or 1 depending on whether the column is
    > REQUIRED/OPTIONAL. One option, depending on your data model, is to always
    > just mark the field as OPTIONAL and provide the def levels. If they're all
    > 1 they will compress extremely well. Impala actually does this because
    > most columns end up being potentially nullable in the Impala/Hive data model.
    >
    > > We are using standard Parquet APIs via C++/libparquet.so and therefore
    > not doing our own Parquet file-format writer/reader.
    > Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a quick
    > look and I guess it does expose the concept of rep/def levels.
    >
    > > NaNs representing missing values occur frequently in a myriad of SAS use
    > cases.  Other data types may be NULL as well, so I'm wondering if using def
    > level to indicate NULLs is safer (with consideration to other readers) and
    > also consumes less memory/storage across the spectrum of Parquet-supported
    > data types?
    > If I were in your situation, this is what I'd probably do. We've seen a lot
    > more inconsistency in the handling of NaN between readers.
    >
    > On Mon, May 13, 2019 at 10:49 AM Brian Bowman <Br...@sas.com> wrote:
    >
    > > Tim,
    > >
    > > Thanks for your detailed reply and especially for pointing out the RLE
    > > encoding for the def levels!
    > >
    > > Your comment:
    > >
    > >     <<- If the field is required, the max def level is 0, therefore all
    > > values
    > >        are 0, therefore the def levels can be "decoded" from nothing and
    > > the def
    > >        levels can be omitted for the page.>>
    > >
    > > I see that OPTIONAL or REPEATED must be specified as the Repetition type
    > > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
    > > SchemaDescriptor::BuildTree method at
    > > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
    > > shows how this causes max_def_level to increment.
    > >
    > > We are using standard Parquet APIs via C++/libparquet.so and therefore
    > > not doing our own Parquet file-format writer/reader.
    > >
    > > NaNs representing missing values occur frequently in a myriad of SAS use
    > > cases.  Other data types may be NULL as well, so I'm wondering if using def
    > > level to indicate NULLs is safer (with consideration to other readers) and
    > > also consumes less memory/storage across the spectrum of Parquet-supported
    > > data types?
    > >
    > > Best,
    > >
    > > Brian
    > >
    > >
    > > On 5/13/19, 1:03 PM, "Tim Armstrong" <ta...@cloudera.com.INVALID>
    > > wrote:
    > >
    > >
    > >     Parquet float/double values can hold any IEEE floating point value -
    > >
    > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
    > > .
    > >     So there's no reason you can't write NaN to the files. If a reader
    > > isn't
    > >     handling NaN values correctly, that seems like an issue with that
    > > reader,
    > >     although I think you're correct in that you're more likely to hit
    > > reader
    > >     bugs with NaN than NULL. (I may be telling you something you already
    > > know,
    > >     but thought I'd start with that).
    > >
    > >     I don't think the Parquet format is opinionated about what NULL vs NaN
    > >     means, although I'd assume that NULL means that the data simply wasn't
    > >     present, and NaN means that it was the result of a floating point
    > >     calculation that resulted in NaN.
    > >
    > >     The rep/definition level encoding is fairly complex because of the
    > > handling
    > >     of nested types and the various ways of encoding the sequence of
    > > levels.
    > >     The way I'd think about it is:
    > >
    > >        - If you don't have any complex/nested types, rep levels aren't
    > > needed
    > >        and the logical def levels degenerate into 1=not null, 0 = null.
    > >        - The RLE encoding has a bit-width implied by the max def level
    > > value -
    > >        if the max-level is 1, 1 bit is needed per value. If it is 0, 0
    > > bits are
    > >        needed per value.
    > >        - If the field is required, the max def level is 0, therefore all
    > > values
    > >        are 0, therefore the def levels can be "decoded" from nothing and
    > > the def
    > >        levels can be omitted for the page.
    > >        - If the field is nullable, the bit width is 1, therefore each def
    > > level
    > >        is logically a bit. However, RLE encoding is applied to the
    > > sequence of 1/0
    > >        levels -
    > >        https://github.com/apache/parquet-format/blob/master/Encodings.md
    > >
    > >     The last point is where I think your understanding might diverge from
    > > the
    > >     implementation - the encoded def levels are not simply a bit vector,
    > > it's a
    > >     more complex hybrid RLE/bit-packed encoding.
    > >
    > >     If you use one of the existing Parquet libraries it will handle all
    > > this
    > >     for you - it's a headache to get it all right from scratch.
    > >     - Tim
    > >
    > >
    > >     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <Br...@sas.com>
    > > wrote:
    > >
    > >     > All,
    > >     >
    > >     > I’m working to integrate the historic usage of SAS missing values
    > > for IEEE
    > >     > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
    > >     > represent floating-point doubles that are “missing,” i.e. NULL in
    > > more
    > >     > general data management terms.
    > >     >
    > >     > Of course SAS’ goal is to create .parquet files that are universally
    > >     > readable.  Therefore, it appears that the SAS Parquet writer(s) will
    > > NOT be
    > >     > able to write the usual NAN to represent “missing,” because doing so
    > > will
    > >     > cause a floating point exception for other readers.
    > >     >
    > >     > Based on the Parquet doc at:
    > >     > https://parquet.apache.org/documentation/latest/ and by examining
    > > code, I
    > >     > understand that Parquet NULL values are indicated by setting 0x000
    > > at the
    > >     > definition level vector offset corresponding to each NULL column
    > > offset
    > >     > value.
    > >     >
    > >     > Conversely, it appears that the per-column, per-page definition
    > > level data
    > >     > is never written when required is not specified for the column
    > > schema.
    > >     >
    > >     > Is my understanding and Parquet terminology correct here?
    > >     >
    > >     > Thanks,
    > >     >
    > >     > Brian
    > >     >
    > >
    > >
    > >
    


Re: Definition Levels and Null

Posted by Wes McKinney <we...@gmail.com>.
To comment from the Parquet C++ side, we expose two writer APIs

* High level, using Apache Arrow -- use Arrow's bitmap-based
null/valid representation for null values, NaN is NaN
* Low level, where you produce your own repetition/definition levels

So if you're using the low level API, and you have values like

[1, 2, 3, NULL = NaN, 5]

then you could represent this as

def_levels = [1, 1, 1, 0, 1]
rep_levels = nullptr
values = [1, 2, 3, 5]

If you don't use the definition level encoding of nulls then other
readers will presume the values to be non-null.

On Mon, May 13, 2019 at 1:06 PM Tim Armstrong
<ta...@cloudera.com.invalid> wrote:
>
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
> That seems right. If your data doesn't have any complex types in it,
> max_def_level will always be 0 or 1 depending on whether the column is
> REQUIRED/OPTIONAL. One option, depending on your data model, is to always
> just mark the field as OPTIONAL and provide the def levels. If they're all
> 1 they will compress extremely well. Impala actually does this because
> most columns end up being potentially nullable in the Impala/Hive data model.
>
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
> Ok, great! I'm not so familiar with the parquet-cpp APIs but I took a quick
> look and I guess it does expose the concept of rep/def levels.
>
> > NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
> If I were in your situation, this is what I'd probably do. We've seen a lot
> more inconsistency in the handling of NaN between readers.
>
> On Mon, May 13, 2019 at 10:49 AM Brian Bowman <Br...@sas.com> wrote:
>
> > Tim,
> >
> > Thanks for your detailed reply and especially for pointing out the RLE
> > encoding for the def levels!
> >
> > Your comment:
> >
> >     <<- If the field is required, the max def level is 0, therefore all
> > values
> >        are 0, therefore the def levels can be "decoded" from nothing and
> > the def
> >        levels can be omitted for the page.>>
> >
> > I see that OPTIONAL or REPEATED must be specified as the Repetition type
> > for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> > SchemaDescriptor::BuildTree method at
> > https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> > shows how this causes max_def_level to increment.
> >
> > We are using standard Parquet APIs via C++/libparquet.so and therefore
> > not doing our own Parquet file-format writer/reader.
> >
> > NaNs representing missing values occur frequently in a myriad of SAS use
> > cases.  Other data types may be NULL as well, so I'm wondering if using def
> > level to indicate NULLs is safer (with consideration to other readers) and
> > also consumes less memory/storage across the spectrum of Parquet-supported
> > data types?
> >
> > Best,
> >
> > Brian
> >
> >
> > On 5/13/19, 1:03 PM, "Tim Armstrong" <ta...@cloudera.com.INVALID>
> > wrote:
> >
> >
> >     Parquet float/double values can hold any IEEE floating point value -
> >
> > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
> > .
> >     So there's no reason you can't write NaN to the files. If a reader
> > isn't
> >     handling NaN values correctly, that seems like an issue with that
> > reader,
> >     although I think you're correct in that you're more likely to hit
> > reader
> >     bugs with NaN than NULL. (I may be telling you something you already
> > know,
> >     but thought I'd start with that).
> >
> >     I don't think the Parquet format is opinionated about what NULL vs NaN
> >     means, although I'd assume that NULL means that the data simply wasn't
> >     present, and NaN means that it was the result of a floating point
> >     calculation that resulted in NaN.
> >
> >     The rep/definition level encoding is fairly complex because of the
> > handling
> >     of nested types and the various ways of encoding the sequence of
> > levels.
> >     The way I'd think about it is:
> >
> >        - If you don't have any complex/nested types, rep levels aren't
> > needed
> >        and the logical def levels degenerate into 1=not null, 0 = null.
> >        - The RLE encoding has a bit-width implied by the max def level
> > value -
> >        if the max-level is 1, 1 bit is needed per value. If it is 0, 0
> > bits are
> >        needed per value.
> >        - If the field is required, the max def level is 0, therefore all
> > values
> >        are 0, therefore the def levels can be "decoded" from nothing and
> > the def
> >        levels can be omitted for the page.
> >        - If the field is nullable, the bit width is 1, therefore each def
> > level
> >        is logically a bit. However, RLE encoding is applied to the
> > sequence of 1/0
> >        levels -
> >        https://github.com/apache/parquet-format/blob/master/Encodings.md
> >
> >     The last point is where I think your understanding might diverge from
> > the
> >     implementation - the encoded def levels are not simply a bit vector,
> > it's a
> >     more complex hybrid RLE/bit-packed encoding.
> >
> >     If you use one of the existing Parquet libraries it will handle all
> > this
> >     for you - it's a headache to get it all right from scratch.
> >     - Tim
> >
> >
> >     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <Br...@sas.com>
> > wrote:
> >
> >     > All,
> >     >
> >     > I’m working to integrate the historic usage of SAS missing values
> > for IEEE
> >     > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
> >     > represent floating-point doubles that are “missing,” i.e. NULL in
> > more
> >     > general data management terms.
> >     >
> >     > Of course SAS’ goal is to create .parquet files that are universally
> >     > readable.  Therefore, it appears that the SAS Parquet writer(s) will
> > NOT be
> >     > able to write the usual NAN to represent “missing,” because doing so
> > will
> >     > cause a floating point exception for other readers.
> >     >
> >     > Based on the Parquet doc at:
> >     > https://parquet.apache.org/documentation/latest/ and by examining
> > code, I
> >     > understand that Parquet NULL values are indicated by setting 0x000
> > at the
> >     > definition level vector offset corresponding to each NULL column
> > offset
> >     > value.
> >     >
> >     > Conversely, it appears that the per-column, per-page definition
> > level data
> >     > is never written when required is not specified for the column
> > schema.
> >     >
> >     > Is my understanding and Parquet terminology correct here?
> >     >
> >     > Thanks,
> >     >
> >     > Brian
> >     >
> >
> >
> >

Re: Definition Levels and Null

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
> I see that OPTIONAL or REPEATED must be specified as the Repetition type
for columns where def level of 0 indicates NULL and 1 means not NULL.  The
SchemaDescriptor::BuildTree method at
https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
shows how this causes max_def_level to increment.
That seems right. If your data doesn't have any complex types in it,
max_def_level will always be 0 or 1 depending on whether the column is
REQUIRED/OPTIONAL. One option, depending on your data model, is to always
just mark the field as OPTIONAL and provide the def levels. If they're all
1 they will compress extremely well. Impala actually does this because
most columns end up being potentially nullable in the Impala/Hive data model.

> We are using standard Parquet APIs via C++/libparquet.so and therefore
not doing our own Parquet file-format writer/reader.
OK, great! I'm not so familiar with the parquet-cpp APIs, but I took a quick
look and it does expose the concept of rep/def levels.

> NaNs representing missing values occur frequently in a myriad of SAS use
cases.  Other data types may be NULL as well, so I'm wondering if using def
level to indicate NULLs is safer (with consideration to other readers) and
also consumes less memory/storage across the spectrum of Parquet-supported
data types?
If I were in your situation, this is what I'd probably do. We've seen a lot
more inconsistency in the handling of NaN between readers.

On Mon, May 13, 2019 at 10:49 AM Brian Bowman <Br...@sas.com> wrote:

> Tim,
>
> Thanks for your detailed reply and especially for pointing out the RLE
> encoding for the def levels!
>
> Your comment:
>
>     <<- If the field is required, the max def level is 0, therefore all
> values
>        are 0, therefore the def levels can be "decoded" from nothing and
> the def
>        levels can be omitted for the page.>>
>
> I see that OPTIONAL or REPEATED must be specified as the Repetition type
> for columns where def level of 0 indicates NULL and 1 means not NULL.  The
> SchemaDescriptor::BuildTree method at
> https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
> shows how this causes max_def_level to increment.
>
> We are using standard Parquet APIs via C++/libparquet.so and therefore
> not doing our own Parquet file-format writer/reader.
>
> NaNs representing missing values occur frequently in a myriad of SAS use
> cases.  Other data types may be NULL as well, so I'm wondering if using def
> level to indicate NULLs is safer (with consideration to other readers) and
> also consumes less memory/storage across the spectrum of Parquet-supported
> data types?
>
> Best,
>
> Brian
>
>
> On 5/13/19, 1:03 PM, "Tim Armstrong" <ta...@cloudera.com.INVALID>
> wrote:
>
>
>     Parquet float/double values can hold any IEEE floating point value -
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413
> .
>     So there's no reason you can't write NaN to the files. If a reader
> isn't
>     handling NaN values correctly, that seems like an issue with that
> reader,
>     although I think you're correct in that you're more likely to hit
> reader
>     bugs with NaN than NULL. (I may be telling you something you already
> know,
>     but thought I'd start with that).
>
>     I don't think the Parquet format is opinionated about what NULL vs NaN
>     means, although I'd assume that NULL means that the data simply wasn't
>     present, and NaN means that it was the result of a floating point
>     calculation that resulted in NaN.
>
>     The rep/definition level encoding is fairly complex because of the
> handling
>     of nested types and the various ways of encoding the sequence of
> levels.
>     The way I'd think about it is:
>
>        - If you don't have any complex/nested types, rep levels aren't
> needed
>        and the logical def levels degenerate into 1=not null, 0 = null.
>        - The RLE encoding has a bit-width implied by the max def level
> value -
>        if the max-level is 1, 1 bit is needed per value. If it is 0, 0
> bits are
>        needed per value.
>        - If the field is required, the max def level is 0, therefore all
> values
>        are 0, therefore the def levels can be "decoded" from nothing and
> the def
>        levels can be omitted for the page.
>        - If the field is nullable, the bit width is 1, therefore each def
> level
>        is logically a bit. However, RLE encoding is applied to the
> sequence of 1/0
>        levels -
>        https://github.com/apache/parquet-format/blob/master/Encodings.md
>
>     The last point is where I think your understanding might diverge from
> the
>     implementation - the encoded def levels are not simply a bit vector,
> it's a
>     more complex hybrid RLE/bit-packed encoding.
>
>     If you use one of the existing Parquet libraries it will handle all
> this
>     for you - it's a headache to get it all right from scratch.
>     - Tim
>
>
>     On Mon, May 13, 2019 at 8:43 AM Brian Bowman <Br...@sas.com>
> wrote:
>
>     > All,
>     >
>     > I’m working to integrate the historic usage of SAS missing values
> for IEEE
>     > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
>     > represent floating-point doubles that are “missing,” i.e. NULL in
> more
>     > general data management terms.
>     >
>     > Of course SAS’ goal is to create .parquet files that are universally
>     > readable.  Therefore, it appears that the SAS Parquet writer(s) will
> NOT be
>     > able to write the usual NAN to represent “missing,” because doing so
> will
>     > cause a floating point exception for other readers.
>     >
>     > Based on the Parquet doc at:
>     > https://parquet.apache.org/documentation/latest/ and by examining
> code, I
>     > understand that Parquet NULL values are indicated by setting 0x000
> at the
>     > definition level vector offset corresponding to each NULL column
> offset
>     > value.
>     >
>     > Conversely, it appears that the per-column, per-page definition
> level data
>     > is never written when required is not specified for the column
> schema.
>     >
>     > Is my understanding and Parquet terminology correct here?
>     >
>     > Thanks,
>     >
>     > Brian
>     >
>
>
>

Re: Definition Levels and Null

Posted by Brian Bowman <Br...@sas.com>.
Tim,

Thanks for your detailed reply and especially for pointing out the RLE encoding for the def levels!

Your comment:         

    <<- If the field is required, the max def level is 0, therefore all values
       are 0, therefore the def levels can be "decoded" from nothing and the def
       levels can be omitted for the page.>>

I see that OPTIONAL or REPEATED must be specified as the Repetition type for columns where a def level of 0 indicates NULL and 1 means not NULL.  The SchemaDescriptor::BuildTree method at https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema.cc#L661
shows how this causes max_def_level to increment.

We are using standard Parquet APIs via C++/libparquet.so and therefore not writing our own Parquet file-format writer/reader.
 
NaNs representing missing values occur frequently in a myriad of SAS use cases.  Other data types may be NULL as well, so I'm wondering if using def level to indicate NULLs is safer (with consideration to other readers) and also consumes less memory/storage across the spectrum of Parquet-supported data types?

Best,

Brian


On 5/13/19, 1:03 PM, "Tim Armstrong" <ta...@cloudera.com.INVALID> wrote:

    
    Parquet float/double values can hold any IEEE floating point value -
    https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413.
    So there's no reason you can't write NaN to the files. If a reader isn't
    handling NaN values correctly, that seems like an issue with that reader,
    although I think you're correct in that you're more likely to hit reader
    bugs with NaN than NULL. (I may be telling you something you already know,
    but thought I'd start with that).
    
    I don't think the Parquet format is opinionated about what NULL vs NaN
    means, although I'd assume that NULL means that the data simply wasn't
    present, and NaN means that it was the result of a floating point
    calculation that resulted in NaN.
    
    The rep/definition level encoding is fairly complex because of the handling
    of nested types and the various ways of encoding the sequence of levels.
    The way I'd think about it is:
    
       - If you don't have any complex/nested types, rep levels aren't needed
       and the logical def levels degenerate into 1=not null, 0 = null.
       - The RLE encoding has a bit-width implied by the max def level value -
       if the max-level is 1, 1 bit is needed per value. If it is 0, 0 bits are
       needed per value.
       - If the field is required, the max def level is 0, therefore all values
       are 0, therefore the def levels can be "decoded" from nothing and the def
       levels can be omitted for the page.
       - If the field is nullable, the bit width is 1, therefore each def level
       is logically a bit. However, RLE encoding is applied to the sequence of 1/0
       levels -
       https://github.com/apache/parquet-format/blob/master/Encodings.md
    
    The last point is where I think your understanding might diverge from the
    implementation - the encoded def levels are not simply a bit vector, it's a
    more complex hybrid RLE/bit-packed encoding.
    
    If you use one of the existing Parquet libraries it will handle all this
    for you - it's a headache to get it all right from scratch.
    - Tim
    
    
    On Mon, May 13, 2019 at 8:43 AM Brian Bowman <Br...@sas.com> wrote:
    
    > All,
    >
    > I’m working to integrate the historic usage of SAS missing values for IEEE
    > doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
    > represent floating-point doubles that are “missing,” i.e. NULL in more
    > general data management terms.
    >
    > Of course SAS’ goal is to create .parquet files that are universally
    > readable.  Therefore, it appears that the SAS Parquet writer(s) will NOT be
    > able to write the usual NAN to represent “missing,” because doing so will
    > cause a floating point exception for other readers.
    >
    > Based on the Parquet doc at:
    > https://parquet.apache.org/documentation/latest/ and by examining code, I
    > understand that Parquet NULL values are indicated by setting 0x000 at the
    > definition level vector offset corresponding to each NULL column offset
    > value.
    >
    > Conversely, it appears that the per-column, per-page definition level data
    > is never written when required is not specified for the column schema.
    >
    > Is my understanding and Parquet terminology correct here?
    >
    > Thanks,
    >
    > Brian
    >
    


Re: Definition Levels and Null

Posted by Tim Armstrong <ta...@cloudera.com.INVALID>.
Parquet float/double values can hold any IEEE floating point value -
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L413.
So there's no reason you can't write NaN to the files. If a reader isn't
handling NaN values correctly, that seems like an issue with that reader,
although I think you're correct in that you're more likely to hit reader
bugs with NaN than NULL. (I may be telling you something you already know,
but thought I'd start with that).

I don't think the Parquet format is opinionated about what NULL vs NaN
means, although I'd assume that NULL means that the data simply wasn't
present, and NaN means that it was the result of a floating point
calculation that resulted in NaN.

The rep/definition level encoding is fairly complex because of the handling
of nested types and the various ways of encoding the sequence of levels.
The way I'd think about it is:

   - If you don't have any complex/nested types, rep levels aren't needed
   and the logical def levels degenerate into 1=not null, 0 = null.
   - The RLE encoding has a bit-width implied by the max def level value -
   if the max-level is 1, 1 bit is needed per value. If it is 0, 0 bits are
   needed per value.
   - If the field is required, the max def level is 0, therefore all values
   are 0, therefore the def levels can be "decoded" from nothing and the def
   levels can be omitted for the page.
   - If the field is nullable, the bit width is 1, therefore each def level
   is logically a bit. However, RLE encoding is applied to the sequence of 1/0
   levels -
   https://github.com/apache/parquet-format/blob/master/Encodings.md

The last point is where I think your understanding might diverge from the
implementation - the encoded def levels are not simply a bit vector, it's a
more complex hybrid RLE/bit-packed encoding.

If you use one of the existing Parquet libraries it will handle all this
for you - it's a headache to get it all right from scratch.
- Tim


On Mon, May 13, 2019 at 8:43 AM Brian Bowman <Br...@sas.com> wrote:

> All,
>
> I’m working to integrate the historic usage of SAS missing values for IEEE
> doubles into our SAS Viya Parquet integration.  SAS writes a NAN to
> represent floating-point doubles that are “missing,” i.e. NULL in more
> general data management terms.
>
> Of course SAS’ goal is to create .parquet files that are universally
> readable.  Therefore, it appears that the SAS Parquet writer(s) will NOT be
> able to write the usual NAN to represent “missing,” because doing so will
> cause a floating point exception for other readers.
>
> Based on the Parquet doc at:
> https://parquet.apache.org/documentation/latest/ and by examining code, I
> understand that Parquet NULL values are indicated by setting 0x000 at the
> definition level vector offset corresponding to each NULL column offset
> value.
>
> Conversely, it appears that the per-column, per-page definition level data
> is never written when required is not specified for the column schema.
>
> Is my understanding and Parquet terminology correct here?
>
> Thanks,
>
> Brian
>