Posted to dev@parquet.apache.org by Gabor Szadovszky <ga...@cloudera.com> on 2017/11/08 14:41:57 UTC

PARQUET-1025: Support new min-max statistics in parquet-mr

Hi,

I started working on the jira PARQUET-1025 <https://issues.apache.org/jira/browse/PARQUET-1025>. It is about implementing the new min-max statistics specified in PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>.

After looking deeper into the code, I think the spec of the new min-max stats contradicts the actual design of parquet-mr:
The concept of parquet-mr is to use the raw types (representing the primitive types in the spec) and let the client of the API convert them to rich objects (representing the logical types). For example, parquet-mr has Binary for both UTF8 and DECIMAL (at least for one specific representation) instead of returning String or BigDecimal anywhere.
The new min-max stats require separate comparison mechanisms for the same primitives depending on the logical types. For example, UTF8 requires unsigned (lexicographical) comparison while DECIMAL requires signed comparison for the same Binary class.
The problem is that we are specifying sorting orders based on logical types while we are not providing specific Java types for them. This means the client can unintentionally use a different comparison logic than parquet-mr uses in the min-max stats, which can lead to discarding relevant values during filtering.
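
To make the signed-vs-unsigned difference concrete, here is a minimal sketch in plain Java (no parquet-mr APIs; the class name is made up for illustration). The same two UTF-8 byte sequences order differently depending on which comparison is used:

  public class SignedVsUnsignedDemo {
    public static void main(String[] args) {
      byte a = 0x41;          // first byte of "A"
      byte e = (byte) 0xC3;   // first byte of "é" (0xC3 0xA9), negative as a signed Java byte

      // Signed byte comparison, as a naive lexicographical compare would do it:
      System.out.println(Byte.compare(e, a) < 0);     // true: "é" sorts before "A"

      // Unsigned byte comparison, as the UTF8 sort order requires:
      System.out.println((e & 0xFF) > (a & 0xFF));    // true: "A" sorts before "é"
    }
  }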

I can see two possible solutions; however, neither of them seems good enough to me.

1. Implement specific comparison logic based on the logical types in the related Statistics object
The problem here is that we are still using the raw types, so client code in filters might implement different comparison logic than what is specified and implemented in Statistics. For example, in the case of UINT_32 the min-max comparison in Statistics will use the proper unsigned comparison logic, while the client code checking the elements (in the column or in the dictionary) might implement something simpler, e.g. value > 7, which may lead to different results. See UserDefinedPredicate.keep(T).
It is highly confusing for the client that it has to re-implement, for the raw types, the same comparison logic that is already implemented for the statistics.
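
As an illustration of that divergence (plain Java, not the actual UserDefinedPredicate signature; the names are made up), a naive keep-style check written with signed comparison disagrees with the unsigned semantics UINT_32 statistics would use:

  public class Uint32PredicateMismatch {
    public static void main(String[] args) {
      int stored = 0xFFFFFFFF;  // the unsigned value 4294967295, stored in a Java int as -1

      boolean signedKeep = stored > 7;                               // false: the signed view sees -1
      boolean unsignedKeep = Integer.compareUnsigned(stored, 7) > 0; // true: the correct UINT_32 answer

      System.out.println(signedKeep + " vs " + unsignedKeep);
    }
  }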

2. Return specific “rich” objects for the logical types instead of the raw types
This solution would solve the problems of the previous one but would introduce bigger ones:
- Breaks the actual design of parquet-mr
- Backward incompatible: existing client code would not work properly for several logical types

(3.) Revert the spec of the new min-max statistics
Let’s keep working on the raw types without any additional comparison logic. Specify the existing comparison logic (e.g. signed for all primitive types, (signed) lexicographical for binary) and expect the client to use these if it wants sorting implemented for the values for better performance.


What do you think, guys? Any comments on one of my solutions, or suggestions for a new one I did not think of?

Thanks a lot,
Gabor

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Binary is not dependent on type and I think it should stay that way. Does
it currently implement `Comparable<Binary>`?

Data models can materialize data as user-friendly values that implement
`Comparable`. Internal interfaces can and should use primitive types.
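
For what it's worth, the Comparator direction could look roughly like the sketch below (a hypothetical helper, not an existing parquet-mr API in this thread's timeframe): the unsigned lexicographic order the spec prescribes for UTF8 is requested explicitly rather than taken from the value object's natural ordering.

  import java.util.Comparator;

  final class UnsignedLexicographic {
    // Unsigned, lexicographical comparison over raw bytes; a shorter prefix sorts first.
    static final Comparator<byte[]> UTF8_ORDER = (x, y) -> {
      int n = Math.min(x.length, y.length);
      for (int i = 0; i < n; i++) {
        int cmp = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);  // compare bytes as 0..255
        if (cmp != 0) return cmp;
      }
      return Integer.compare(x.length, y.length);
    };
  }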

rb

On Tue, Nov 14, 2017 at 9:16 AM, Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> I agree that we cannot do anything practical to use UINT in java unless we
> return proper unsigned objects instead of primitives but it would break the
> whole API.
> While my new approach does not solve the comparison problem of UINTs it
> would have some benefits over the outside comparator logic:
> The natural ordering would be correct; easier/simpler for the users
> I think, the whole implementation would be simpler. For example we do not
> need to change the actual statistics implementations (maybe the toString
> used for debugging/logging).
> The users do not need to change their code to get the benefits of the
> proper ordering including the new min-max statistics
> The only drawback is that it would not be backward compatible which I
> don’t think really matters here as the current comparison logic implemented
> in Binary is incorrect for any case.
>
> From the UINT point of view we still can use related min-max statistics if
> the min-max values are in the non-negative range of the related signed
> value (e.g. UINT_32 from 0 to 2^31 -1).
>
> I propose extending Binary with specific implementations of
> compareTo/toString for the different logical types instead of the separate
> Comparator idea which I think is more complex and confusing to the user.
> What’s your opinion?
>
> Gabor
>
> > On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Applications may shoot themselves in the foot by using the wrong
> > comparison. There’s not much we can do if the application uses UINT32 and
> > doesn’t use unsigned comparison. Maybe I’m missing something new about
> this
> > line of reasoning?
> >
> > The row group filters are currently passed column descriptors, which can
> be
> > updated to include the full Type instead of PrimitiveTypeName and used to
> > get the correct Comparator. I don’t think it matters that the Java
> > representation is the primitive type. If users implement their own
> > UserDefinedPredicate classes, then they should understand how data will
> be
> > passed. We have the same requirement for implementing other low-level
> > interfaces, like data models.
> >
> > rb
> > ​
> >
> > On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> > gabor.szadovszky@cloudera.com <ma...@cloudera.com>>
> wrote:
> >
> >> Thanks a lot, Zoltan for making it clear.
> >>
> >> Meanwhile I’ve discovered that the problem I’ve mentioned before with
> the
> >> actual ordering of Binary and the former statistics is already solved:
> >> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <
> https://issues.apache.org/jira/browse/PARQUET-686>>
> >> What still stands against my proposal (except that it does not support
> the
> >> proper comparison for UINT values) is that it breaks the backward
> >> compatibility: Binary.compareTo(Binary) works in different way.
> >>
> >> Gabor
> >>
> >>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Let me rephrase the problem with unsigned ints in a way that I
> personally
> >>> find easier to understand.
> >>>
> >>> Let's take a typical application that uses native Java types
> internally.
> >>> For parquet-mr, these native types have to be converted to Parquet
> >>> primitive types. The parquet-mr library supports low-level filtering of
> >>> rows, pages or row groups by allowing the application to implement
> >>> callbacks or build complex conditions using a set of predicates
> provided
> >> by
> >>> parquet-mr. For this low-level filtering, conditions must be specified
> in
> >>> terms of Parquet's primitive/logical types and not in terms of the
> native
> >>> Java types that the application internally uses.
> >>>
> >>> The "primitive/logical" part of the last sentence is the actual change
> in
> >>> question. This used to be "primitive" and now for the new statistics
> >> fields
> >>> it should become "logical". Most logical types annotate binaries, which
> >> is
> >>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> >>> more-or-less easily accomplished by providing a correct compareTo()
> >>> implementation.
> >>>
> >>> The UINT logical types are an exception, however. For these,
> parquet-mr's
> >>> internal primitive types are the regular native Java types, so there is
> >> no
> >>> way to change their comparison behaviour. Even if we provide a correct
> >>> Comparator, the application can simply specify filtering conditions
> like
> >>> "value < 42", which will result in a signed comparison on unsigned
> >> fields.
> >>> It makes it way too easy for developers to shoot themselves in the
> foot.
> >>>
> >>> Zoltan
> >>>
> >>>
> >>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> >>> gabor.szadovszky@cloudera.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> During the development of this feature I’ve found out some scenarios
> >> which
> >>>> would be really confusing for the clients.
> >>>> For example: We have the new min-max statistics in the file and also
> >> have
> >>>> dictionary encoding for binary elements. The client have filtering
> >>>> implemented for the related column involving Gt. Which comparator
> shall
> >> we
> >>>> use in the DictionaryFilter? If we use the new one based on the
> >> statistics
> >>>> it might contradict to the UserDefinedPredicate implemented by the
> >> client
> >>>> and might cause false negatives. If we use the current natural order
> of
> >>>> Binary it would contradict to the one in statistics. Further more, we
> >> are
> >>>> about the make the new comparators available to the client so the
> >>>> implemented UserDefinedPredicate may not match to the DictionaryFilter
> >> or
> >>>> the statistics anyway.
> >>>>
> >>>> I think, the following new proposal would solve the confusion issues
> and
> >>>> even make the existing client code work properly with the new
> >> statistics. I
> >>>> am not sure yet, but think that the implementation would be cleaner as
> >> well.
> >>>> As far as I know the unsigned integer types (UINT_8 etc.) are not used
> >>>> widely. I would skip creating min-max statistics for them or keep
> having
> >>>> signed comparison. (BTW, don’t we want to deprecate them?)
> >>>> By leaving the unsigned integers out of the picture only the Binary
> >> class
> >>>> is left to support the different comparison logics. So, let’s refactor
> >>>> Binary and create different implementations for the different logical
> >>>> types. This way the natural ordering of Binary will always reflect the
> >> one
> >>>> specified for the logical types and the Statistics implementations do
> >> not
> >>>> need to be changed.
> >>>> We do have a problem though and it is the current natural ordering of
> >>>> Binary. It is implemented in a way that seems to be lexicographical
> but
> >> the
> >>>> byte comparison is signed. I don’t think it is correct so I would drop
> >> this
> >>>> implementation but it makes a bit hard to implement the handing of the
> >>>> former min-max statistics. If I would like to be correct, I would not
> >> write
> >>>> the former min-max statistics for Binary at all and would not use them
> >> at
> >>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
> >> this
> >>>> issue was not identified because clients are rarely using characters
> >> where
> >>>> unsigned/signed comparison matters.
> >>>> What do you think?
> >>>>
> >>>> Regards,
> >>>> Gabor
> >>>>
> >>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> >>>>>
> >>>>> Gabor,
> >>>>>
> >>>>> You're right that working with logical type data isn't really
> something
> >>>>> that Parquet has done much in the past. We have the distinction
> between
> >>>>> logical types and physical types to simplify what we need to support
> --
> >>>> we
> >>>>> only have encodings for physical types -- but in this case, we can't
> do
> >>>>> that, and will need to use comparators based on the logical types.
> For
> >>>> now,
> >>>>> we don't need to expose much to the layer above because all of the
> >> types
> >>>>> have a defined sort order. In the future, we should be able to
> support
> >>>>> other orders for UTF8 data, but we don't need to focus on that now.
> >>>>>
> >>>>> Zoltan,
> >>>>>
> >>>>> While it isn't ideal to have no control over the sort order, the only
> >>>> thing
> >>>>> we can do at the Parquet level is to handle data correctly when it
> does
> >>>>> come in sorted. The concerns you're raising are good to think about,
> >> but
> >>>> we
> >>>>> need to solve them elsewhere. I think that the table format should
> >> have a
> >>>>> way to specify the sort order it wants for incoming data and
> >> communicate
> >>>>> that to the engine writing Parquet files. That's what we're working
> on
> >>>>> adding to the data source v2 API in Spark. Tables should be able to
> >>>> specify
> >>>>> the expected clustering (for partitioning) and sort order for rows,
> >> then
> >>>>> the query plan is automatically rewritten to make that happen.
> >>>>>
> >>>>> rb
> >>>>>
> >>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com>
> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I don't know the solution, just adding my thoughts.
> >>>>>>
> >>>>>> In my opinion the underlying problem is that min-max ranges have to
> be
> >>>>>> small for the best filtering results. In order to achieve this, the
> >> data
> >>>>>> has to be sorted, but calculating min-max statistics is the
> >>>> responsibility
> >>>>>> of the library, while sorting the data is a responsibility of the
> >>>>>> application above the library. Without logical-type-based sorting
> >> rules,
> >>>>>> the application may sort data using a different ordering than the
> one
> >>>> used
> >>>>>> for calculating/filtering on the min-max values. This results in too
> >>>> broad
> >>>>>> min-max ranges, that in theory could still function correctly, but
> are
> >>>> not
> >>>>>> optimal for filtering. (As an extra twist, even if we have
> >>>>>> logical-type-based sorting rules, the application can still use a
> >>>> different
> >>>>>> ordering for sorting and even for comparisons. The results can range
> >>>> from
> >>>>>> overly broad min-max ranges to incorrectly discarded values.)
> >>>>>>
> >>>>>> From this point of view, dictionary entries and bloom filters are
> much
> >>>> less
> >>>>>> problematic, because sorting by *any* order will optimize these
> >>>> structures,
> >>>>>> as the only important thing is that equal values should end up next
> to
> >>>> each
> >>>>>> other to increase the chance of having a low number of values in a
> >>>> single
> >>>>>> page/row group.
> >>>>>>
> >>>>>> I think that trying to optimize min-max stats by sorting crosses the
> >>>> layer
> >>>>>> boundary between the library and the application and as such is much
> >>>> better
> >>>>>> suited to a full-stack implementation like Impala than the
> parquet-mr
> >> +
> >>>>>> separate application stack. Neither relying on the application to
> >>>> calculate
> >>>>>> standards-compliant statistics nor forcing the application to use
> the
> >>>>>> library's data types for sorting and comparison seems like a good
> >>>> solution
> >>>>>> to me.
> >>>>>>
> >>>>>> Please note that this problem inherently applies to columns
> statistics
> >>>> as
> >>>>>> well.
> >>>>>>
> >>>>>> Zoltan
> >>>>>>
> >>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >>>>>> gabor.szadovszky@cloudera.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I started working on the jira PARQUET-1025 <
> >> https://issues.apache.org/
> >>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
> >>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
> >>>>>>> jira/browse/PARQUET-686>.
> >>>>>>>
> >>>>>>> After looking in the code deeper I think the spec of the new
> min-max
> >>>>>> stats
> >>>>>>> contradicts to the actual design of parquet-mr:
> >>>>>>> The concept of parquet-mr is to use the raw types (representing the
> >>>>>>> primitive types in the spec) and let the client of the API have
> them
> >>>>>>> converted to rich objects (representing the logical types). For
> >> example
> >>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
> >>>>>> specific
> >>>>>>> representation) instead of returning String or BigDecimal anywhere.
> >>>>>>> The new min-max stats requires to have separate comparison
> mechanisms
> >>>> for
> >>>>>>> the same primitives depending on the logical types. For example
> UTF8
> >>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
> >>>>>>> comparisons for the same Binary class.
> >>>>>>> The problem is that we are specifying sorting orders based on
> logical
> >>>>>>> types while we are not providing specific java types for them. It
> >> means
> >>>>>> the
> >>>>>>> client can unintentionally use a different comparison logic than
> >>>>>> parquet-mr
> >>>>>>> in the min-max stats which can lead to discarding relevant values
> >>>> during
> >>>>>>> filtering.
> >>>>>>>
> >>>>>>> I can see two possible solutions however none of them seems to be
> >> good
> >>>>>>> enough for me.
> >>>>>>>
> >>>>>>> 1. Implement specific comparison logics based on the logical types
> in
> >>>> the
> >>>>>>> related Statistics object
> >>>>>>> The problem here is that we are still using the raw types therefore
> >>>>>> client
> >>>>>>> code in filters might implement different comparison logic than
> >>>> specified
> >>>>>>> and implemented in Statistics. For example in case of having
> UINT_32
> >>>> the
> >>>>>>> min-max comparison in Statistics will use the proper unsigned
> >>>> comparison
> >>>>>>> logic while the client code in case of checking the elements (in
> the
> >>>>>> column
> >>>>>>> or in the dictionary) might implement somewhat simpler e.g. value
> > 7
> >>>>>> which
> >>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>>>>>> It is highly confusing for the client that it has to re-implement
> the
> >>>>>> same
> >>>>>>> comparison logic in the client code for the raw types as it is
> >>>>>> implemented
> >>>>>>> for the statistics.
> >>>>>>>
> >>>>>>> 2. Return specific “rich” objects for the logical types instead of
> >> the
> >>>>>> raw
> >>>>>>> types
> >>>>>>> This solution would solve the problems what the previous one has
> but
> >>>>>> would
> >>>>>>> introduce bigger problems.
> >>>>>>> Breaks the actual design of parquet-mr
> >>>>>>> Backward incompatible: existing client code would not work properly
> >> for
> >>>>>>> several logical types
> >>>>>>>
> >>>>>>> (3.) Revert the specs of the new min-max statistics
> >>>>>>> Let’s keep working on the raw types without having any additional
> >>>>>>> comparison logic. Specify the existing comparison logic (e.g.
> signed
> >>>> for
> >>>>>>> all primitive types, (signed) lexicographical for binary) and
> expect
> >>>> the
> >>>>>>> client to use these if it want to have sorting implemented for the
> >>>> values
> >>>>>>> for better performance.
> >>>>>>>
> >>>>>>>
> >>>>>>> What do you think, guys? Any suggestions on one of my solutions or
> >> for
> >>>> a
> >>>>>>> new one I did not recognise?
> >>>>>>>
> >>>>>>> Thanks a lot,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ryan Blue
> >>>>> Software Engineer
> >>>>> Netflix
> >>>>
> >>>>
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com>.
> Binary is not dependent on type

We already have factory methods like Binary.fromCharSequence and Binary.fromString, and separate Binary classes behind them. So I don’t understand the objection to having logical-type-specific Binary implementations; in effect, we already have them.

> used in many places where we don't necessarily know the data type

I’ve tried to check the code parts where Binary is created and I think we can solve it. It is also not easy to have the proper Type available everywhere the Comparator is needed (for example, IncrementallyUpdatedFilterPredicateBuilder).

> I don’t think it is worth the effort to make compareTo work for all Binary cases when we know we can't fix this for the other primitives

It would require quite a lot of effort to have the proper Binary created, but after that all of the already existing implementation would work just fine, including the statistics, and we would not have to do all the other changes required to have the Type instance available when the statistics are created. So, I would say both ideas require a similar amount of effort.
The “other primitives” that require special ordering are only the UINTs, which you said are not really used.

Gabor

> On 14 Nov 2017, at 19:27, Ryan Blue <rb...@netflix.com> wrote:
> 
> > The only problems are the UINTs where we cannot override the related primitives to compare correctly.
> 
> We don't need to because this is in a low-level API.
> 
> > Implementing the correct compareTo would not be an API break and as it was implemented incorrectly.
> 
> Agreed, but Binary is not dependent on type, and shouldn't be dependent on type. It holds bytes and is used in many places where we don't necessarily know the data type. I don't think it is worth the effort to make compareTo work for all Binary cases when we know we can't fix this for the other primitives, which must use Comparators anyway. That's why I think the best solution is to deprecate the use of Comparable#compareTo and remove it.
> 
> On Tue, Nov 14, 2017 at 10:19 AM, Gabor Szadovszky <gabor.szadovszky@cloudera.com <ma...@cloudera.com>> wrote:
> Binary implements Comparable<Binary>. The statistics and the filters depend on the generics T extends Comparable<T> and that’s why the current min-max statistics are incorrect for all Binaries. We also falsely suggest to the API users that the classes implement correct comparisons.
> I agree, that user-friendly values shall be part of the models and parquet-mr shall be kept low level by using the primitive types. But we have to give this up because the min-max specification depends on the logical types for comparing primitives. If we mandate logical type specific comparison in the specification we should have those implemented in the Comparables themselves instead of providing separate Comparators. The only problems are the UINTs where we cannot override the related primitives to compare correctly.
> Removing Comparable from Binary would be a breaking change that’s why you suggesting adding to to 2.0. Implementing the correct compareTo would not be an API break and as it was implemented incorrectly, it would be more like a fix than a breaking change. So, it could be added to an 1.x release and would not require a major one.
> 
> Gabor
> 
> 
> > On 14 Nov 2017, at 19:07, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> >> The other design proposed by Gabor (correct compareTo() in the
> > logical-type-specific Binary implementations themselves) prevents this from
> > happening
> >
> > Not implementing `Comparable` at all prevents this from happening; we
> > should deprecate `Binary` implementing it for 2.0. Then you must use a
> > `Comparator`. As you said, this doesn't solve the problem for types where
> > you're likely to shoot yourself in the foot, the ones that are represented
> > by Java primitives. I don't find that argument compelling for `Binary`.
> >
> > For unsigned ints, there's little we can do. But, I know of no instances of
> > people actually using unsigned ints, nor do I think it is a big risk when
> > people choose to implement low-level interfaces.
> >
> > rb
> >
> > On Tue, Nov 14, 2017 at 9:54 AM, Zoltan Ivanfi <zi@cloudera.com <ma...@cloudera.com>> wrote:
> >
> >> Hi,
> >>
> >> The reason why I brought up that API callers can easily shoot themselves in
> >> the foot is that it does not necessarily have to be this way.
> >>
> >> One of the design alternatives (separate Comparators) makes it very easy to
> >> shoot yourself in the foot for any logical type because you can
> >> unintentionally bypass it.
> >>
> >> The other design proposed by Gabor (correct compareTo() in the
> >> logical-type-specific Binary implementations themselves) prevents this from
> >> happening, but does not have a proper answer to the UINT problem. I find
> >> this design more appealing, because it leads to a better API in my opinion.
> >>
> >> Unlike other logical types that require custom comparisons, the primitive
> >> value behind a UINT is not an object. UINTs are not used as much as other
> >> logical types either. I would prefer a simpler solution that makes using
> >> those more important types "just work" and requires extra care for UINTs
> >> than one that makes all types work in the same way but requires the
> >> programmer to explicitly request a Comparator for all types. I would even
> >> consider deprecating the current way of getting UINTs from parquet-mr
> >> because I think that using regular Java integers for UINTs is a weak point
> >> of the API.
> >>
> >> Zoltan
> >>
> >> On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky <
> >> gabor.szadovszky@cloudera.com <ma...@cloudera.com>> wrote:
> >>
> >>> I agree that we cannot do anything practical to use UINT in java unless
> >> we
> >>> return proper unsigned objects instead of primitives but it would break
> >> the
> >>> whole API.
> >>> While my new approach does not solve the comparison problem of UINTs it
> >>> would have some benefits over the outside comparator logic:
> >>> The natural ordering would be correct; easier/simpler for the users
> >>> I think, the whole implementation would be simpler. For example we do not
> >>> need to change the actual statistics implementations (maybe the toString
> >>> used for debugging/logging).
> >>> The users do not need to change their code to get the benefits of the
> >>> proper ordering including the new min-max statistics
> >>> The only drawback is that it would not be backward compatible which I
> >>> don’t think really matters here as the current comparison logic
> >> implemented
> >>> in Binary is incorrect for any case.
> >>>
> >>> From the UINT point of view we still can use related min-max statistics
> >> if
> >>> the min-max values are in the non-negative range of the related signed
> >>> value (e.g. UINT_32 from 0 to 2^31 -1).
> >>>
> >>> I propose extending Binary with specific implementations of
> >>> compareTo/toString for the different logical types instead of the
> >> separate
> >>> Comparator idea which I think is more complex and confusing to the user.
> >>> What’s your opinion?
> >>>
> >>> Gabor
> >>>
> >>>> On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >>>>
> >>>> Applications may shoot themselves in the foot by using the wrong
> >>>> comparison. There’s not much we can do if the application uses UINT32
> >> and
> >>>> doesn’t use unsigned comparison. Maybe I’m missing something new about
> >>> this
> >>>> line of reasoning?
> >>>>
> >>>> The row group filters are currently passed column descriptors, which
> >> can
> >>> be
> >>>> updated to include the full Type instead of PrimitiveTypeName and used
> >> to
> >>>> get the correct Comparator. I don’t think it matters that the Java
> >>>> representation is the primitive type. If users implement their own
> >>>> UserDefinedPredicate classes, then they should understand how data will
> >>> be
> >>>> passed. We have the same requirement for implementing other low-level
> >>>> interfaces, like data models.
> >>>>
> >>>> rb
> >>>> ​
> >>>>
> >>>> On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> >>>> gabor.szadovszky@cloudera.com <ma...@cloudera.com> <mailto:gabor.szadovszky@cloudera.com <ma...@cloudera.com>>>
> >>> wrote:
> >>>>
> >>>>> Thanks a lot, Zoltan for making it clear.
> >>>>>
> >>>>> Meanwhile I’ve discovered that the problem I’ve mentioned before with
> >>> the
> >>>>> actual ordering of Binary and the former statistics is already solved:
> >>>>> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686> <
> >>> https://issues.apache.org/jira/browse/PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>>>
> >>>>> What still stands against my proposal (except that it does not support
> >>> the
> >>>>> proper comparison for UINT values) is that it breaks the backward
> >>>>> compatibility: Binary.compareTo(Binary) works in different way.
> >>>>>
> >>>>> Gabor
> >>>>>
> >>>>>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi@cloudera.com <ma...@cloudera.com>> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Let me rephrase the problem with unsigned ints in a way that I
> >>> personally
> >>>>>> find easier to understand.
> >>>>>>
> >>>>>> Let's take a typical application that uses native Java types
> >>> internally.
> >>>>>> For parquet-mr, these native types have to be converted to Parquet
> >>>>>> primitive types. The parquet-mr library supports low-level filtering
> >> of
> >>>>>> rows, pages or row groups by allowing the application to implement
> >>>>>> callbacks or build complex conditions using a set of predicates
> >>> provided
> >>>>> by
> >>>>>> parquet-mr. For this low-level filtering, conditions must be
> >> specified
> >>> in
> >>>>>> terms of Parquet's primitive/logical types and not in terms of the
> >>> native
> >>>>>> Java types that the application internally uses.
> >>>>>>
> >>>>>> The "primitive/logical" part of the last sentence is the actual
> >> change
> >>> in
> >>>>>> question. This used to be "primitive" and now for the new statistics
> >>>>> fields
> >>>>>> it should become "logical". Most logical types annotate binaries,
> >> which
> >>>>> is
> >>>>>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> >>>>>> more-or-less easily accomplished by providing a correct compareTo()
> >>>>>> implementation.
> >>>>>>
> >>>>>> The UINT logical types are an exception, however. For these,
> >>> parquet-mr's
> >>>>>> internal primitive types are the regular native Java types, so there
> >> is
> >>>>> no
> >>>>>> way to change their comparison behaviour. Even if we provide a
> >> correct
> >>>>>> Comparator, the application can simply specify filtering conditions
> >>> like
> >>>>>> "value < 42", which will result in a signed comparison on unsigned
> >>>>> fields.
> >>>>>> It makes it way too easy for developers to shoot themselves in the
> >>> foot.
> >>>>>>
> >>>>>> Zoltan
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> >>>>>> gabor.szadovszky@cloudera.com <ma...@cloudera.com>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> During the development of this feature I’ve found out some scenarios
> >>>>> which
> >>>>>>> would be really confusing for the clients.
> >>>>>>> For example: We have the new min-max statistics in the file and also
> >>>>> have
> >>>>>>> dictionary encoding for binary elements. The client have filtering
> >>>>>>> implemented for the related column involving Gt. Which comparator
> >>> shall
> >>>>> we
> >>>>>>> use in the DictionaryFilter? If we use the new one based on the
> >>>>> statistics
> >>>>>>> it might contradict to the UserDefinedPredicate implemented by the
> >>>>> client
> >>>>>>> and might cause false negatives. If we use the current natural order
> >>> of
> >>>>>>> Binary it would contradict to the one in statistics. Further more,
> >> we
> >>>>> are
> >>>>>>> about the make the new comparators available to the client so the
> >>>>>>> implemented UserDefinedPredicate may not match to the
> >> DictionaryFilter
> >>>>> or
> >>>>>>> the statistics anyway.
> >>>>>>>
> >>>>>>> I think, the following new proposal would solve the confusion issues
> >>> and
> >>>>>>> even make the existing client code work properly with the new
> >>>>> statistics. I
> >>>>>>> am not sure yet, but think that the implementation would be cleaner
> >> as
> >>>>> well.
> >>>>>>> As far as I know the unsigned integer types (UINT_8 etc.) are not
> >> used
> >>>>>>> widely. I would skip creating min-max statistics for them or keep
> >>> having
> >>>>>>> signed comparison. (BTW, don’t we want to deprecate them?)
> >>>>>>> By leaving the unsigned integers out of the picture only the Binary
> >>>>> class
> >>>>>>> is left to support the different comparison logics. So, let’s
> >> refactor
> >>>>>>> Binary and create different implementations for the different
> >> logical
> >>>>>>> types. This way the natural ordering of Binary will always reflect
> >> the
> >>>>> one
> >>>>>>> specified for the logical types and the Statistics implementations
> >> do
> >>>>> not
> >>>>>>> need to be changed.
> >>>>>>> We do have a problem though and it is the current natural ordering
> >> of
> >>>>>>> Binary. It is implemented in a way that seems to be lexicographical
> >>> but
> >>>>> the
> >>>>>>> byte comparison is signed. I don’t think it is correct so I would
> >> drop
> >>>>> this
> >>>>>>> implementation but it makes a bit hard to implement the handing of
> >> the
> >>>>>>> former min-max statistics. If I would like to be correct, I would
> >> not
> >>>>> write
> >>>>>>> the former min-max statistics for Binary at all and would not use
> >> them
> >>>>> at
> >>>>>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
> >>>>> this
> >>>>>>> issue was not identified because clients are rarely using characters
> >>>>> where
> >>>>>>> unsigned/signed comparison matters.
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
> >>> wrote:
> >>>>>>>>
> >>>>>>>> Gabor,
> >>>>>>>>
> >>>>>>>> You're right that working with logical type data isn't really
> >>> something
> >>>>>>>> that Parquet has done much in the past. We have the distinction
> >>> between
> >>>>>>>> logical types and physical types to simplify what we need to
> >> support
> >>> --
> >>>>>>> we
> >>>>>>>> only have encodings for physical types -- but in this case, we
> >> can't
> >>> do
> >>>>>>>> that, and will need to use comparators based on the logical types.
> >>> For
> >>>>>>> now,
> >>>>>>>> we don't need to expose much to the layer above because all of the
> >>>>> types
> >>>>>>>> have a defined sort order. In the future, we should be able to
> >>> support
> >>>>>>>> other orders for UTF8 data, but we don't need to focus on that now.
> >>>>>>>>
> >>>>>>>> Zoltan,
> >>>>>>>>
> >>>>>>>> While it isn't ideal to have no control over the sort order, the
> >> only
> >>>>>>> thing
> >>>>>>>> we can do at the Parquet level is to handle data correctly when it
> >>> does
> >>>>>>>> come in sorted. The concerns you're raising are good to think
> >> about,
> >>>>> but
> >>>>>>> we
> >>>>>>>> need to solve them elsewhere. I think that the table format should
> >>>>> have a
> >>>>>>>> way to specify the sort order it wants for incoming data and
> >>>>> communicate
> >>>>>>>> that to the engine writing Parquet files. That's what we're working
> >>> on
> >>>>>>>> adding to the data source v2 API in Spark. Tables should be able to
> >>>>>>> specify
> >>>>>>>> the expected clustering (for partitioning) and sort order for rows,
> >>>>> then
> >>>>>>>> the query plan is automatically rewritten to make that happen.
> >>>>>>>>
> >>>>>>>> rb
> >>>>>>>>
> >>>>>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi@cloudera.com <ma...@cloudera.com>>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I don't know the solution, just adding my thoughts.
> >>>>>>>>>
> >>>>>>>>> In my opinion the underlying problem is that min-max ranges have
> >> to
> >>> be
> >>>>>>>>> small for the best filtering results. In order to achieve this,
> >> the
> >>>>> data
> >>>>>>>>> has to be sorted, but calculating min-max statistics is the
> >>>>>>> responsibility
> >>>>>>>>> of the library, while sorting the data is a responsibility of the
> >>>>>>>>> application above the library. Without logical-type-based sorting
> >>>>> rules,
> >>>>>>>>> the application may sort data using a different ordering than the
> >>> one
> >>>>>>> used
> >>>>>>>>> for calculating/filtering on the min-max values. This results in
> >> too
> >>>>>>> broad
> >>>>>>>>> min-max ranges, that in theory could still function correctly, but
> >>> are
> >>>>>>> not
> >>>>>>>>> optimal for filtering. (As an extra twist, even if we have
> >>>>>>>>> logical-type-based sorting rules, the application can still use a
> >>>>>>> different
> >>>>>>>>> ordering for sorting and even for comparisons. The results can
> >> range
> >>>>>>> from
> >>>>>>>>> overly broad min-max ranges to incorrectly discarded values.)
> >>>>>>>>>
> >>>>>>>>> From this point of view, dictionary entries and bloom filters are
> >>> much
> >>>>>>> less
> >>>>>>>>> problematic, because sorting by *any* order will optimize these
> >>>>>>> structures,
> >>>>>>>>> as the only important thing is that equal values should end up
> >> next
> >>> to
> >>>>>>> each
> >>>>>>>>> other to increase the chance of having a low number of values in a
> >>>>>>> single
> >>>>>>>>> page/row group.
> >>>>>>>>>
> >>>>>>>>> I think that trying to optimize min-max stats by sorting crosses
> >> the
> >>>>>>> layer
> >>>>>>>>> boundary between the library and the application and as such is
> >> much
> >>>>>>> better
> >>>>>>>>> suited to a full-stack implementation like Impala than the
> >>> parquet-mr
> >>>>> +
> >>>>>>>>> separate application stack. Neither relying on the application to
> >>>>>>> calculate
> >>>>>>>>> standards-compliant statistics nor forcing the application to use
> >>> the
> >>>>>>>>> library's data types for sorting and comparison seems like a good
> >>>>>>> solution
> >>>>>>>>> to me.
> >>>>>>>>>
> >>>>>>>>> Please note that this problem inherently applies to columns
> >>> statistics
> >>>>>>> as
> >>>>>>>>> well.
> >>>>>>>>>
> >>>>>>>>> Zoltan
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >>>>>>>>> gabor.szadovszky@cloudera.com <ma...@cloudera.com>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I started working on the jira PARQUET-1025 <
> >>>>> https://issues.apache.org/ <https://issues.apache.org/>
> >>>>>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new
> >> min-max
> >>>>>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/ <https://issues.apache.org/>
> >>>>>>>>>> jira/browse/PARQUET-686>.
> >>>>>>>>>>
> >>>>>>>>>> After looking in the code deeper I think the spec of the new
> >>> min-max
> >>>>>>>>> stats
> >>>>>>>>>> contradicts to the actual design of parquet-mr:
> >>>>>>>>>> The concept of parquet-mr is to use the raw types (representing
> >> the
> >>>>>>>>>> primitive types in the spec) and let the client of the API have
> >>> them
> >>>>>>>>>> converted to rich objects (representing the logical types). For
> >>>>> example
> >>>>>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for
> >> one
> >>>>>>>>> specific
> >>>>>>>>>> representation) instead of returning String or BigDecimal
> >> anywhere.
> >>>>>>>>>> The new min-max stats requires to have separate comparison
> >>> mechanisms
> >>>>>>> for
> >>>>>>>>>> the same primitives depending on the logical types. For example
> >>> UTF8
> >>>>>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
> >>>>>>>>>> comparisons for the same Binary class.
> >>>>>>>>>> The problem is that we are specifying sorting orders based on
> >>> logical
> >>>>>>>>>> types while we are not providing specific java types for them. It
> >>>>> means
> >>>>>>>>> the
> >>>>>>>>>> client can unintentionally use a different comparison logic than
> >>>>>>>>> parquet-mr
> >>>>>>>>>> in the min-max stats which can lead to discarding relevant values
> >>>>>>> during
> >>>>>>>>>> filtering.
> >>>>>>>>>>
> >>>>>>>>>> I can see two possible solutions however none of them seems to be
> >>>>> good
> >>>>>>>>>> enough for me.
> >>>>>>>>>>
> >>>>>>>>>> 1. Implement specific comparison logics based on the logical
> >> types
> >>> in
> >>>>>>> the
> >>>>>>>>>> related Statistics object
> >>>>>>>>>> The problem here is that we are still using the raw types
> >> therefore
> >>>>>>>>> client
> >>>>>>>>>> code in filters might implement different comparison logic than
> >>>>>>> specified
> >>>>>>>>>> and implemented in Statistics. For example in case of having
> >>> UINT_32
> >>>>>>> the
> >>>>>>>>>> min-max comparison in Statistics will use the proper unsigned
> >>>>>>> comparison
> >>>>>>>>>> logic while the client code in case of checking the elements (in
> >>> the
> >>>>>>>>> column
> >>>>>>>>>> or in the dictionary) might implement somewhat simpler e.g. value
> >>>> 7
> >>>>>>>>> which
> >>>>>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>>>>>>>>> It is highly confusing for the client that it has to re-implement
> >>> the
> >>>>>>>>> same
> >>>>>>>>>> comparison logic in the client code for the raw types as it is
> >>>>>>>>> implemented
> >>>>>>>>>> for the statistics.
> >>>>>>>>>>
> >>>>>>>>>> 2. Return specific “rich” objects for the logical types instead
> >> of
> >>>>> the
> >>>>>>>>> raw
> >>>>>>>>>> types
> >>>>>>>>>> This solution would solve the problems what the previous one has
> >>> but
> >>>>>>>>> would
> >>>>>>>>>> introduce bigger problems.
> >>>>>>>>>> Breaks the actual design of parquet-mr
> >>>>>>>>>> Backward incompatible: existing client code would not work
> >> properly
> >>>>> for
> >>>>>>>>>> several logical types
> >>>>>>>>>>
> >>>>>>>>>> (3.) Revert the specs of the new min-max statistics
> >>>>>>>>>> Let’s keep working on the raw types without having any additional
> >>>>>>>>>> comparison logic. Specify the existing comparison logic (e.g.
> >>> signed
> >>>>>>> for
> >>>>>>>>>> all primitive types, (signed) lexicographical for binary) and
> >>> expect
> >>>>>>> the
> >>>>>>>>>> client to use these if it want to have sorting implemented for
> >> the
> >>>>>>> values
> >>>>>>>>>> for better performance.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> What do you think, guys? Any suggestions on one of my solutions
> >> or
> >>>>> for
> >>>>>>> a
> >>>>>>>>>> new one I did not recognise?
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot,
> >>>>>>>>>> Gabor
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Ryan Blue
> >>>>>>>> Software Engineer
> >>>>>>>> Netflix
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> The only problems are the UINTs where we cannot override the related
primitives to compare correctly.

We don't need to because this is in a low-level API.

> Implementing the correct compareTo would not be an API break and as it
was implemented incorrectly.

Agreed, but Binary is not dependent on type, and shouldn't be dependent on
type. It holds bytes and is used in many places where we don't necessarily
know the data type. I don't think it is worth the effort to make compareTo
work for all Binary cases when we know we can't fix this for the other
primitives, which must use Comparators anyway. That's why I think the best
solution is to deprecate the use of Comparable#compareTo and remove it.
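
In other words, something shaped like the following (purely illustrative interfaces, not the parquet-mr API): the caller asks the column's full type for a Comparator instead of relying on Binary's natural ordering.

  import java.util.Comparator;

  // Illustrative only: a column exposes its type-derived comparator along with min/max.
  interface TypeAwareColumn<T> {
    Comparator<T> comparator();  // chosen from the logical type (UTF8, DECIMAL, ...)
    T min();
    T max();
  }

  final class RangeCheck {
    // A "greater than bound" predicate can drop the whole row group only if max <= bound,
    // decided with the type-provided comparator rather than the value's natural ordering.
    static <T> boolean canDropGreaterThan(TypeAwareColumn<T> column, T bound) {
      return column.comparator().compare(column.max(), bound) <= 0;
    }
  }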

On Tue, Nov 14, 2017 at 10:19 AM, Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> Binary implements Comparable<Binary>. The statistics and the filters
> depend on the generics T extends Comparable<T> and that’s why the current
> min-max statistics are incorrect for all Binaries. We also falsely suggest
> to the API users that the classes implement correct comparisons.
> I agree, that user-friendly values shall be part of the models and
> parquet-mr shall be kept low level by using the primitive types. But we
> have to give this up because the min-max specification depends on the
> logical types for comparing primitives. If we mandate logical type specific
> comparison in the specification we should have those implemented in the
> Comparables themselves instead of providing separate Comparators. The only
> problems are the UINTs where we cannot override the related primitives to
> compare correctly.
> Removing Comparable from Binary would be a breaking change that’s why you
> suggesting adding to to 2.0. Implementing the correct compareTo would not
> be an API break and as it was implemented incorrectly, it would be more
> like a fix than a breaking change. So, it could be added to an 1.x release
> and would not require a major one.
>
> Gabor
>
>
> > On 14 Nov 2017, at 19:07, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> >> The other design proposed by Gabor (correct compareTo() in the
> > logical-type-specific Binary implementations themselves) prevents this
> from
> > happening
> >
> > Not implementing `Comparable` at all prevents this from happening; we
> > should deprecate `Binary` implementing it for 2.0. Then you must use a
> > `Comparator`. As you said, this doesn't solve the problem for types where
> > you're likely to shoot yourself in the foot, the ones that are
> represented
> > by Java primitives. I don't find that argument compelling for `Binary`.
> >
> > For unsigned ints, there's little we can do. But, I know of no instances
> of
> > people actually using unsigned ints, nor do I think it is a big risk when
> > people choose to implement low-level interfaces.
> >
> > rb
> >
> > On Tue, Nov 14, 2017 at 9:54 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> The reason why I brought up that API callers can easily shoot
> themselves in
> >> the foot is that it does not necessarily have to be this way.
> >>
> >> One of the design alternatives (separate Comparators) makes it very
> easy to
> >> shoot yourself in the foot for any logical type because you can
> >> unintentionally bypass it.
> >>
> >> The other design proposed by Gabor (correct compareTo() in the
> >> logical-type-specific Binary implementations themselves) prevents this
> from
> >> happening, but does not have a proper answer to the UINT problem. I find
> >> this design more appealing, because it leads to a better API in my
> opinion.
> >>
> >> Unlike other logical types that require custom comparisons, the
> primitive
> >> value behind a UINT is not an object. UINTs are not used as much as
> other
> >> logical types either. I would prefer a simpler solution that makes using
> >> those more important types "just work" and requires extra care for UINTs
> >> than one that makes all types work in the same way but requires the
> >> programmer to explicitly request a Comparator for all types. I would
> even
> >> consider deprecating the current way of getting UINTs from parquet-mr
> >> because I think that using regular Java integers for UINTs is a weak
> point
> >> of the API.
> >>
> >> Zoltan
> >>
> >> On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky <
> >> gabor.szadovszky@cloudera.com> wrote:
> >>
> >>> I agree that we cannot do anything practical to use UINT in java unless
> >> we
> >>> return proper unsigned objects instead of primitives but it would break
> >> the
> >>> whole API.
> >>> While my new approach does not solve the comparison problem of UINTs it
> >>> would have some benefits over the outside comparator logic:
> >>> The natural ordering would be correct; easier/simpler for the users
> >>> I think, the whole implementation would be simpler. For example we do
> not
> >>> need to change the actual statistics implementations (maybe the
> toString
> >>> used for debugging/logging).
> >>> The users do not need to change their code to get the benefits of the
> >>> proper ordering including the new min-max statistics
> >>> The only drawback is that it would not be backward compatible which I
> >>> don’t think really matters here as the current comparison logic
> >> implemented
> >>> in Binary is incorrect for any case.
> >>>
> >>> From the UINT point of view we still can use related min-max statistics
> >> if
> >>> the min-max values are in the non-negative range of the related signed
> >>> value (e.g. UINT_32 from 0 to 2^31 -1).
> >>>
> >>> I propose extending Binary with specific implementations of
> >>> compareTo/toString for the different logical types instead of the
> >> separate
> >>> Comparator idea which I think is more complex and confusing to the
> user.
> >>> What’s your opinion?
> >>>
> >>> Gabor
> >>>
> >>>> On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> >>>>
> >>>> Applications may shoot themselves in the foot by using the wrong
> >>>> comparison. There’s not much we can do if the application uses UINT32
> >> and
> >>>> doesn’t use unsigned comparison. Maybe I’m missing something new about
> >>> this
> >>>> line of reasoning?
> >>>>
> >>>> The row group filters are currently passed column descriptors, which
> >> can
> >>> be
> >>>> updated to include the full Type instead of PrimitiveTypeName and used
> >> to
> >>>> get the correct Comparator. I don’t think it matters that the Java
> >>>> representation is the primitive type. If users implement their own
> >>>> UserDefinedPredicate classes, then they should understand how data
> will
> >>> be
> >>>> passed. We have the same requirement for implementing other low-level
> >>>> interfaces, like data models.
> >>>>
> >>>> rb
> >>>> ​
> >>>>
> >>>> On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> >>>> gabor.szadovszky@cloudera.com <ma...@cloudera.com>>
> >>> wrote:
> >>>>
> >>>>> Thanks a lot, Zoltan for making it clear.
> >>>>>
> >>>>> Meanwhile I’ve discovered that the problem I’ve mentioned before with
> >>> the
> >>>>> actual ordering of Binary and the former statistics is already
> solved:
> >>>>> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <
> >>> https://issues.apache.org/jira/browse/PARQUET-686>>
> >>>>> What still stands against my proposal (except that it does not
> support
> >>> the
> >>>>> proper comparison for UINT values) is that it breaks the backward
> >>>>> compatibility: Binary.compareTo(Binary) works in different way.
> >>>>>
> >>>>> Gabor
> >>>>>
> >>>>>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Let me rephrase the problem with unsigned ints in a way that I
> >>> personally
> >>>>>> find easier to understand.
> >>>>>>
> >>>>>> Let's take a typical application that uses native Java types
> >>> internally.
> >>>>>> For parquet-mr, these native types have to be converted to Parquet
> >>>>>> primitive types. The parquet-mr library supports low-level filtering
> >> of
> >>>>>> rows, pages or row groups by allowing the application to implement
> >>>>>> callbacks or build complex conditions using a set of predicates
> >>> provided
> >>>>> by
> >>>>>> parquet-mr. For this low-level filtering, conditions must be
> >> specified
> >>> in
> >>>>>> terms of Parquet's primitive/logical types and not in terms of the
> >>> native
> >>>>>> Java types that the application internally uses.
> >>>>>>
> >>>>>> The "primitive/logical" part of the last sentence is the actual
> >> change
> >>> in
> >>>>>> question. This used to be "primitive" and now for the new statistics
> >>>>> fields
> >>>>>> it should become "logical". Most logical types annotate binaries,
> >> which
> >>>>> is
> >>>>>> a class (hierarchy) in parquet-mr. Making these logical-type-aware
> is
> >>>>>> more-or-less easily accomplished by providing a correct compareTo()
> >>>>>> implementation.
> >>>>>>
> >>>>>> The UINT logical types are an exception, however. For these,
> >>> parquet-mr's
> >>>>>> internal primitive types are the regular native Java types, so there
> >> is
> >>>>> no
> >>>>>> way to change their comparison behaviour. Even if we provide a
> >> correct
> >>>>>> Comparator, the application can simply specify filtering conditions
> >>> like
> >>>>>> "value < 42", which will result in a signed comparison on unsigned
> >>>>> fields.
> >>>>>> It makes it way too easy for developers to shoot themselves in the
> >>> foot.
> >>>>>>
> >>>>>> Zoltan
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> >>>>>> gabor.szadovszky@cloudera.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> During the development of this feature I’ve found out some
> scenarios
> >>>>> which
> >>>>>>> would be really confusing for the clients.
> >>>>>>> For example: We have the new min-max statistics in the file and
> also
> >>>>> have
> >>>>>>> dictionary encoding for binary elements. The client have filtering
> >>>>>>> implemented for the related column involving Gt. Which comparator
> >>> shall
> >>>>> we
> >>>>>>> use in the DictionaryFilter? If we use the new one based on the
> >>>>> statistics
> >>>>>>> it might contradict to the UserDefinedPredicate implemented by the
> >>>>> client
> >>>>>>> and might cause false negatives. If we use the current natural
> order
> >>> of
> >>>>>>> Binary it would contradict to the one in statistics. Further more,
> >> we
> >>>>> are
> >>>>>>> about the make the new comparators available to the client so the
> >>>>>>> implemented UserDefinedPredicate may not match to the
> >> DictionaryFilter
> >>>>> or
> >>>>>>> the statistics anyway.
> >>>>>>>
> >>>>>>> I think, the following new proposal would solve the confusion
> issues
> >>> and
> >>>>>>> even make the existing client code work properly with the new
> >>>>> statistics. I
> >>>>>>> am not sure yet, but think that the implementation would be cleaner
> >> as
> >>>>> well.
> >>>>>>> As far as I know the unsigned integer types (UINT_8 etc.) are not
> >> used
> >>>>>>> widely. I would skip creating min-max statistics for them or keep
> >>> having
> >>>>>>> signed comparison. (BTW, don’t we want to deprecate them?)
> >>>>>>> By leaving the unsigned integers out of the picture only the Binary
> >>>>> class
> >>>>>>> is left to support the different comparison logics. So, let’s
> >> refactor
> >>>>>>> Binary and create different implementations for the different
> >> logical
> >>>>>>> types. This way the natural ordering of Binary will always reflect
> >> the
> >>>>> one
> >>>>>>> specified for the logical types and the Statistics implementations
> >> do
> >>>>> not
> >>>>>>> need to be changed.
> >>>>>>> We do have a problem though and it is the current natural ordering
> >> of
> >>>>>>> Binary. It is implemented in a way that seems to be lexicographical
> >>> but
> >>>>> the
> >>>>>>> byte comparison is signed. I don’t think it is correct so I would
> >> drop
> >>>>> this
> >>>>>>> implementation but it makes a bit hard to implement the handing of
> >> the
> >>>>>>> former min-max statistics. If I would like to be correct, I would
> >> not
> >>>>> write
> >>>>>>> the former min-max statistics for Binary at all and would not use
> >> them
> >>>>> at
> >>>>>>> read. (Or only use it if it was not written by parquet-mr.) I
> guess,
> >>>>> this
> >>>>>>> issue was not identified because clients are rarely using
> characters
> >>>>> where
> >>>>>>> unsigned/signed comparison matters.
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
> >>> wrote:
> >>>>>>>>
> >>>>>>>> Gabor,
> >>>>>>>>
> >>>>>>>> You're right that working with logical type data isn't really
> >>> something
> >>>>>>>> that Parquet has done much in the past. We have the distinction
> >>> between
> >>>>>>>> logical types and physical types to simplify what we need to
> >> support
> >>> --
> >>>>>>> we
> >>>>>>>> only have encodings for physical types -- but in this case, we
> >> can't
> >>> do
> >>>>>>>> that, and will need to use comparators based on the logical types.
> >>> For
> >>>>>>> now,
> >>>>>>>> we don't need to expose much to the layer above because all of the
> >>>>> types
> >>>>>>>> have a defined sort order. In the future, we should be able to
> >>> support
> >>>>>>>> other orders for UTF8 data, but we don't need to focus on that
> now.
> >>>>>>>>
> >>>>>>>> Zoltan,
> >>>>>>>>
> >>>>>>>> While it isn't ideal to have no control over the sort order, the
> >> only
> >>>>>>> thing
> >>>>>>>> we can do at the Parquet level is to handle data correctly when it
> >>> does
> >>>>>>>> come in sorted. The concerns you're raising are good to think
> >> about,
> >>>>> but
> >>>>>>> we
> >>>>>>>> need to solve them elsewhere. I think that the table format should
> >>>>> have a
> >>>>>>>> way to specify the sort order it wants for incoming data and
> >>>>> communicate
> >>>>>>>> that to the engine writing Parquet files. That's what we're
> working
> >>> on
> >>>>>>>> adding to the data source v2 API in Spark. Tables should be able
> to
> >>>>>>> specify
> >>>>>>>> the expected clustering (for partitioning) and sort order for
> rows,
> >>>>> then
> >>>>>>>> the query plan is automatically rewritten to make that happen.
> >>>>>>>>
> >>>>>>>> rb
> >>>>>>>>
> >>>>>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I don't know the solution, just adding my thoughts.
> >>>>>>>>>
> >>>>>>>>> In my opinion the underlying problem is that min-max ranges have
> >> to
> >>> be
> >>>>>>>>> small for the best filtering results. In order to achieve this,
> >> the
> >>>>> data
> >>>>>>>>> has to be sorted, but calculating min-max statistics is the
> >>>>>>> responsibility
> >>>>>>>>> of the library, while sorting the data is a responsibility of the
> >>>>>>>>> application above the library. Without logical-type-based sorting
> >>>>> rules,
> >>>>>>>>> the application may sort data using a different ordering than the
> >>> one
> >>>>>>> used
> >>>>>>>>> for calculating/filtering on the min-max values. This results in
> >> too
> >>>>>>> broad
> >>>>>>>>> min-max ranges, that in theory could still function correctly,
> but
> >>> are
> >>>>>>> not
> >>>>>>>>> optimal for filtering. (As an extra twist, even if we have
> >>>>>>>>> logical-type-based sorting rules, the application can still use a
> >>>>>>> different
> >>>>>>>>> ordering for sorting and even for comparisons. The results can
> >> range
> >>>>>>> from
> >>>>>>>>> overly broad min-max ranges to incorrectly discarded values.)
> >>>>>>>>>
> >>>>>>>>> From this point of view, dictionary entries and bloom filters are
> >>> much
> >>>>>>> less
> >>>>>>>>> problematic, because sorting by *any* order will optimize these
> >>>>>>> structures,
> >>>>>>>>> as the only important thing is that equal values should end up
> >> next
> >>> to
> >>>>>>> each
> >>>>>>>>> other to increase the chance of having a low number of values in
> a
> >>>>>>> single
> >>>>>>>>> page/row group.
> >>>>>>>>>
> >>>>>>>>> I think that trying to optimize min-max stats by sorting crosses
> >> the
> >>>>>>> layer
> >>>>>>>>> boundary between the library and the application and as such is
> >> much
> >>>>>>> better
> >>>>>>>>> suited to a full-stack implementation like Impala than the
> >>> parquet-mr
> >>>>> +
> >>>>>>>>> separate application stack. Neither relying on the application to
> >>>>>>> calculate
> >>>>>>>>> standards-compliant statistics nor forcing the application to use
> >>> the
> >>>>>>>>> library's data types for sorting and comparison seems like a good
> >>>>>>> solution
> >>>>>>>>> to me.
> >>>>>>>>>
> >>>>>>>>> Please note that this problem inherently applies to columns
> >>> statistics
> >>>>>>> as
> >>>>>>>>> well.
> >>>>>>>>>
> >>>>>>>>> Zoltan
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >>>>>>>>> gabor.szadovszky@cloudera.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I started working on the jira PARQUET-1025 <
> >>>>> https://issues.apache.org/
> >>>>>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new
> >> min-max
> >>>>>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
> >>>>>>>>>> jira/browse/PARQUET-686>.
> >>>>>>>>>>
> >>>>>>>>>> After looking in the code deeper I think the spec of the new
> >>> min-max
> >>>>>>>>> stats
> >>>>>>>>>> contradicts to the actual design of parquet-mr:
> >>>>>>>>>> The concept of parquet-mr is to use the raw types (representing
> >> the
> >>>>>>>>>> primitive types in the spec) and let the client of the API have
> >>> them
> >>>>>>>>>> converted to rich objects (representing the logical types). For
> >>>>> example
> >>>>>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for
> >> one
> >>>>>>>>> specific
> >>>>>>>>>> representation) instead of returning String or BigDecimal
> >> anywhere.
> >>>>>>>>>> The new min-max stats requires to have separate comparison
> >>> mechanisms
> >>>>>>> for
> >>>>>>>>>> the same primitives depending on the logical types. For example
> >>> UTF8
> >>>>>>>>>> requires unsigned (lexicographical) while DECIMAL requires
> signed
> >>>>>>>>>> comparisons for the same Binary class.
> >>>>>>>>>> The problem is that we are specifying sorting orders based on
> >>> logical
> >>>>>>>>>> types while we are not providing specific java types for them.
> It
> >>>>> means
> >>>>>>>>> the
> >>>>>>>>>> client can unintentionally use a different comparison logic than
> >>>>>>>>> parquet-mr
> >>>>>>>>>> in the min-max stats which can lead to discarding relevant
> values
> >>>>>>> during
> >>>>>>>>>> filtering.
> >>>>>>>>>>
> >>>>>>>>>> I can see two possible solutions however none of them seems to
> be
> >>>>> good
> >>>>>>>>>> enough for me.
> >>>>>>>>>>
> >>>>>>>>>> 1. Implement specific comparison logics based on the logical
> >> types
> >>> in
> >>>>>>> the
> >>>>>>>>>> related Statistics object
> >>>>>>>>>> The problem here is that we are still using the raw types
> >> therefore
> >>>>>>>>> client
> >>>>>>>>>> code in filters might implement different comparison logic than
> >>>>>>> specified
> >>>>>>>>>> and implemented in Statistics. For example in case of having
> >>> UINT_32
> >>>>>>> the
> >>>>>>>>>> min-max comparison in Statistics will use the proper unsigned
> >>>>>>> comparison
> >>>>>>>>>> logic while the client code in case of checking the elements (in
> >>> the
> >>>>>>>>> column
> >>>>>>>>>> or in the dictionary) might implement somewhat simpler e.g.
> value
> >>>> 7
> >>>>>>>>> which
> >>>>>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>>>>>>>>> It is highly confusing for the client that it has to
> re-implement
> >>> the
> >>>>>>>>> same
> >>>>>>>>>> comparison logic in the client code for the raw types as it is
> >>>>>>>>> implemented
> >>>>>>>>>> for the statistics.
> >>>>>>>>>>
> >>>>>>>>>> 2. Return specific “rich” objects for the logical types instead
> >> of
> >>>>> the
> >>>>>>>>> raw
> >>>>>>>>>> types
> >>>>>>>>>> This solution would solve the problems what the previous one has
> >>> but
> >>>>>>>>> would
> >>>>>>>>>> introduce bigger problems.
> >>>>>>>>>> Breaks the actual design of parquet-mr
> >>>>>>>>>> Backward incompatible: existing client code would not work
> >> properly
> >>>>> for
> >>>>>>>>>> several logical types
> >>>>>>>>>>
> >>>>>>>>>> (3.) Revert the specs of the new min-max statistics
> >>>>>>>>>> Let’s keep working on the raw types without having any
> additional
> >>>>>>>>>> comparison logic. Specify the existing comparison logic (e.g.
> >>> signed
> >>>>>>> for
> >>>>>>>>>> all primitive types, (signed) lexicographical for binary) and
> >>> expect
> >>>>>>> the
> >>>>>>>>>> client to use these if it want to have sorting implemented for
> >> the
> >>>>>>> values
> >>>>>>>>>> for better performance.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> What do you think, guys? Any suggestions on one of my solutions
> >> or
> >>>>> for
> >>>>>>> a
> >>>>>>>>>> new one I did not recognise?
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot,
> >>>>>>>>>> Gabor
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Ryan Blue
> >>>>>>>> Software Engineer
> >>>>>>>> Netflix
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Ryan Blue
> >>>> Software Engineer
> >>>> Netflix
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com>.
Binary implements Comparable<Binary>. The statistics and the filters depend on the generic type T extends Comparable<T>, and that's why the current min-max statistics are incorrect for all Binary values. We also falsely suggest to the API users that these classes implement correct comparisons.
I agree that user-friendly values should be part of the data models and that parquet-mr should be kept low level by using the primitive types. But we have to give this up because the min-max specification depends on the logical types for comparing primitives. If we mandate logical-type-specific comparison in the specification, we should have it implemented in the Comparables themselves instead of providing separate Comparators. The only problem is the UINT types, where we cannot override the related Java primitives to compare correctly.
Removing Comparable from Binary would be a breaking change, which is why you suggest doing it in 2.0. Implementing the correct compareTo, on the other hand, would not be an API break: since the current implementation is incorrect, it is more a fix than a breaking change. So it could be added in a 1.x release and would not require a major one.
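
To illustrate, here is a minimal sketch of what such a compareTo could look like for a UTF8-annotated Binary, i.e. unsigned lexicographic byte comparison. The Utf8Binary class name is only illustrative, not an existing parquet-mr class:

class Utf8Binary implements Comparable<Utf8Binary> {
  private final byte[] bytes;

  Utf8Binary(byte[] bytes) {
    this.bytes = bytes;
  }

  @Override
  public int compareTo(Utf8Binary other) {
    int len = Math.min(bytes.length, other.bytes.length);
    for (int i = 0; i < len; i++) {
      // mask with 0xFF so the bytes are compared as unsigned values
      int cmp = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // a byte sequence that is a prefix of the other sorts first
    return bytes.length - other.bytes.length;
  }
}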

Gabor


> On 14 Nov 2017, at 19:07, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
>> The other design proposed by Gabor (correct compareTo() in the
> logical-type-specific Binary implementations themselves) prevents this from
> happening
> 
> Not implementing `Comparable` at all prevents this from happening; we
> should deprecate `Binary` implementing it for 2.0. Then you must use a
> `Comparator`. As you said, this doesn't solve the problem for types where
> you're likely to shoot yourself in the foot, the ones that are represented
> by Java primitives. I don't find that argument compelling for `Binary`.
> 
> For unsigned ints, there's little we can do. But, I know of no instances of
> people actually using unsigned ints, nor do I think it is a big risk when
> people choose to implement low-level interfaces.
> 
> rb
> 
> On Tue, Nov 14, 2017 at 9:54 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> 
>> Hi,
>> 
>> The reason why I brought up that API callers can easily shoot themselves in
>> the foot is that it does not necessarily have to be this way.
>> 
>> One of the design alternatives (separate Comparators) makes it very easy to
>> shoot yourself in the foot for any logical type because you can
>> unintentionally bypass it.
>> 
>> The other design proposed by Gabor (correct compareTo() in the
>> logical-type-specific Binary implementations themselves) prevents this from
>> happening, but does not have a proper answer to the UINT problem. I find
>> this design more appealing, because it leads to a better API in my opinion.
>> 
>> Unlike other logical types that require custom comparisons, the primitive
>> value behind a UINT is not an object. UINTs are not used as much as other
>> logical types either. I would prefer a simpler solution that makes using
>> those more important types "just work" and requires extra care for UINTs
>> than one that makes all types work in the same way but requires the
>> programmer to explicitly request a Comparator for all types. I would even
>> consider deprecating the current way of getting UINTs from parquet-mr
>> because I think that using regular Java integers for UINTs is a weak point
>> of the API.
>> 
>> Zoltan
>> 
>> On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky <
>> gabor.szadovszky@cloudera.com> wrote:
>> 
>>> I agree that we cannot do anything practical to use UINT in java unless
>> we
>>> return proper unsigned objects instead of primitives but it would break
>> the
>>> whole API.
>>> While my new approach does not solve the comparison problem of UINTs it
>>> would have some benefits over the outside comparator logic:
>>> The natural ordering would be correct; easier/simpler for the users
>>> I think, the whole implementation would be simpler. For example we do not
>>> need to change the actual statistics implementations (maybe the toString
>>> used for debugging/logging).
>>> The users do not need to change their code to get the benefits of the
>>> proper ordering including the new min-max statistics
>>> The only drawback is that it would not be backward compatible which I
>>> don’t think really matters here as the current comparison logic
>> implemented
>>> in Binary is incorrect for any case.
>>> 
>>> From the UINT point of view we still can use related min-max statistics
>> if
>>> the min-max values are in the non-negative range of the related signed
>>> value (e.g. UINT_32 from 0 to 2^31 -1).
>>> 
>>> I propose extending Binary with specific implementations of
>>> compareTo/toString for the different logical types instead of the
>> separate
>>> Comparator idea which I think is more complex and confusing to the user.
>>> What’s your opinion?
>>> 
>>> Gabor
>>> 
>>>> On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>> 
>>>> Applications may shoot themselves in the foot by using the wrong
>>>> comparison. There’s not much we can do if the application uses UINT32
>> and
>>>> doesn’t use unsigned comparison. Maybe I’m missing something new about
>>> this
>>>> line of reasoning?
>>>> 
>>>> The row group filters are currently passed column descriptors, which
>> can
>>> be
>>>> updated to include the full Type instead of PrimitiveTypeName and used
>> to
>>>> get the correct Comparator. I don’t think it matters that the Java
>>>> representation is the primitive type. If users implement their own
>>>> UserDefinedPredicate classes, then they should understand how data will
>>> be
>>>> passed. We have the same requirement for implementing other low-level
>>>> interfaces, like data models.
>>>> 
>>>> rb
>>>> ​
>>>> 
>>>> On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
>>>> gabor.szadovszky@cloudera.com <ma...@cloudera.com>>
>>> wrote:
>>>> 
>>>>> Thanks a lot, Zoltan for making it clear.
>>>>> 
>>>>> Meanwhile I’ve discovered that the problem I’ve mentioned before with
>>> the
>>>>> actual ordering of Binary and the former statistics is already solved:
>>>>> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <
>>> https://issues.apache.org/jira/browse/PARQUET-686>>
>>>>> What still stands against my proposal (except that it does not support
>>> the
>>>>> proper comparison for UINT values) is that it breaks the backward
>>>>> compatibility: Binary.compareTo(Binary) works in different way.
>>>>> 
>>>>> Gabor
>>>>> 
>>>>>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Let me rephrase the problem with unsigned ints in a way that I
>>> personally
>>>>>> find easier to understand.
>>>>>> 
>>>>>> Let's take a typical application that uses native Java types
>>> internally.
>>>>>> For parquet-mr, these native types have to be converted to Parquet
>>>>>> primitive types. The parquet-mr library supports low-level filtering
>> of
>>>>>> rows, pages or row groups by allowing the application to implement
>>>>>> callbacks or build complex conditions using a set of predicates
>>> provided
>>>>> by
>>>>>> parquet-mr. For this low-level filtering, conditions must be
>> specified
>>> in
>>>>>> terms of Parquet's primitive/logical types and not in terms of the
>>> native
>>>>>> Java types that the application internally uses.
>>>>>> 
>>>>>> The "primitive/logical" part of the last sentence is the actual
>> change
>>> in
>>>>>> question. This used to be "primitive" and now for the new statistics
>>>>> fields
>>>>>> it should become "logical". Most logical types annotate binaries,
>> which
>>>>> is
>>>>>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
>>>>>> more-or-less easily accomplished by providing a correct compareTo()
>>>>>> implementation.
>>>>>> 
>>>>>> The UINT logical types are an exception, however. For these,
>>> parquet-mr's
>>>>>> internal primitive types are the regular native Java types, so there
>> is
>>>>> no
>>>>>> way to change their comparison behaviour. Even if we provide a
>> correct
>>>>>> Comparator, the application can simply specify filtering conditions
>>> like
>>>>>> "value < 42", which will result in a signed comparison on unsigned
>>>>> fields.
>>>>>> It makes it way too easy for developers to shoot themselves in the
>>> foot.
>>>>>> 
>>>>>> Zoltan
>>>>>> 
>>>>>> 
>>>>>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
>>>>>> gabor.szadovszky@cloudera.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> During the development of this feature I’ve found out some scenarios
>>>>> which
>>>>>>> would be really confusing for the clients.
>>>>>>> For example: We have the new min-max statistics in the file and also
>>>>> have
>>>>>>> dictionary encoding for binary elements. The client have filtering
>>>>>>> implemented for the related column involving Gt. Which comparator
>>> shall
>>>>> we
>>>>>>> use in the DictionaryFilter? If we use the new one based on the
>>>>> statistics
>>>>>>> it might contradict to the UserDefinedPredicate implemented by the
>>>>> client
>>>>>>> and might cause false negatives. If we use the current natural order
>>> of
>>>>>>> Binary it would contradict to the one in statistics. Further more,
>> we
>>>>> are
>>>>>>> about the make the new comparators available to the client so the
>>>>>>> implemented UserDefinedPredicate may not match to the
>> DictionaryFilter
>>>>> or
>>>>>>> the statistics anyway.
>>>>>>> 
>>>>>>> I think, the following new proposal would solve the confusion issues
>>> and
>>>>>>> even make the existing client code work properly with the new
>>>>> statistics. I
>>>>>>> am not sure yet, but think that the implementation would be cleaner
>> as
>>>>> well.
>>>>>>> As far as I know the unsigned integer types (UINT_8 etc.) are not
>> used
>>>>>>> widely. I would skip creating min-max statistics for them or keep
>>> having
>>>>>>> signed comparison. (BTW, don’t we want to deprecate them?)
>>>>>>> By leaving the unsigned integers out of the picture only the Binary
>>>>> class
>>>>>>> is left to support the different comparison logics. So, let’s
>> refactor
>>>>>>> Binary and create different implementations for the different
>> logical
>>>>>>> types. This way the natural ordering of Binary will always reflect
>> the
>>>>> one
>>>>>>> specified for the logical types and the Statistics implementations
>> do
>>>>> not
>>>>>>> need to be changed.
>>>>>>> We do have a problem though and it is the current natural ordering
>> of
>>>>>>> Binary. It is implemented in a way that seems to be lexicographical
>>> but
>>>>> the
>>>>>>> byte comparison is signed. I don’t think it is correct so I would
>> drop
>>>>> this
>>>>>>> implementation but it makes a bit hard to implement the handing of
>> the
>>>>>>> former min-max statistics. If I would like to be correct, I would
>> not
>>>>> write
>>>>>>> the former min-max statistics for Binary at all and would not use
>> them
>>>>> at
>>>>>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
>>>>> this
>>>>>>> issue was not identified because clients are rarely using characters
>>>>> where
>>>>>>> unsigned/signed comparison matters.
>>>>>>> What do you think?
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gabor
>>>>>>> 
>>>>>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
>>> wrote:
>>>>>>>> 
>>>>>>>> Gabor,
>>>>>>>> 
>>>>>>>> You're right that working with logical type data isn't really
>>> something
>>>>>>>> that Parquet has done much in the past. We have the distinction
>>> between
>>>>>>>> logical types and physical types to simplify what we need to
>> support
>>> --
>>>>>>> we
>>>>>>>> only have encodings for physical types -- but in this case, we
>> can't
>>> do
>>>>>>>> that, and will need to use comparators based on the logical types.
>>> For
>>>>>>> now,
>>>>>>>> we don't need to expose much to the layer above because all of the
>>>>> types
>>>>>>>> have a defined sort order. In the future, we should be able to
>>> support
>>>>>>>> other orders for UTF8 data, but we don't need to focus on that now.
>>>>>>>> 
>>>>>>>> Zoltan,
>>>>>>>> 
>>>>>>>> While it isn't ideal to have no control over the sort order, the
>> only
>>>>>>> thing
>>>>>>>> we can do at the Parquet level is to handle data correctly when it
>>> does
>>>>>>>> come in sorted. The concerns you're raising are good to think
>> about,
>>>>> but
>>>>>>> we
>>>>>>>> need to solve them elsewhere. I think that the table format should
>>>>> have a
>>>>>>>> way to specify the sort order it wants for incoming data and
>>>>> communicate
>>>>>>>> that to the engine writing Parquet files. That's what we're working
>>> on
>>>>>>>> adding to the data source v2 API in Spark. Tables should be able to
>>>>>>> specify
>>>>>>>> the expected clustering (for partitioning) and sort order for rows,
>>>>> then
>>>>>>>> the query plan is automatically rewritten to make that happen.
>>>>>>>> 
>>>>>>>> rb
>>>>>>>> 
>>>>>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com>
>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I don't know the solution, just adding my thoughts.
>>>>>>>>> 
>>>>>>>>> In my opinion the underlying problem is that min-max ranges have
>> to
>>> be
>>>>>>>>> small for the best filtering results. In order to achieve this,
>> the
>>>>> data
>>>>>>>>> has to be sorted, but calculating min-max statistics is the
>>>>>>> responsibility
>>>>>>>>> of the library, while sorting the data is a responsibility of the
>>>>>>>>> application above the library. Without logical-type-based sorting
>>>>> rules,
>>>>>>>>> the application may sort data using a different ordering than the
>>> one
>>>>>>> used
>>>>>>>>> for calculating/filtering on the min-max values. This results in
>> too
>>>>>>> broad
>>>>>>>>> min-max ranges, that in theory could still function correctly, but
>>> are
>>>>>>> not
>>>>>>>>> optimal for filtering. (As an extra twist, even if we have
>>>>>>>>> logical-type-based sorting rules, the application can still use a
>>>>>>> different
>>>>>>>>> ordering for sorting and even for comparisons. The results can
>> range
>>>>>>> from
>>>>>>>>> overly broad min-max ranges to incorrectly discarded values.)
>>>>>>>>> 
>>>>>>>>> From this point of view, dictionary entries and bloom filters are
>>> much
>>>>>>> less
>>>>>>>>> problematic, because sorting by *any* order will optimize these
>>>>>>> structures,
>>>>>>>>> as the only important thing is that equal values should end up
>> next
>>> to
>>>>>>> each
>>>>>>>>> other to increase the chance of having a low number of values in a
>>>>>>> single
>>>>>>>>> page/row group.
>>>>>>>>> 
>>>>>>>>> I think that trying to optimize min-max stats by sorting crosses
>> the
>>>>>>> layer
>>>>>>>>> boundary between the library and the application and as such is
>> much
>>>>>>> better
>>>>>>>>> suited to a full-stack implementation like Impala than the
>>> parquet-mr
>>>>> +
>>>>>>>>> separate application stack. Neither relying on the application to
>>>>>>> calculate
>>>>>>>>> standards-compliant statistics nor forcing the application to use
>>> the
>>>>>>>>> library's data types for sorting and comparison seems like a good
>>>>>>> solution
>>>>>>>>> to me.
>>>>>>>>> 
>>>>>>>>> Please note that this problem inherently applies to columns
>>> statistics
>>>>>>> as
>>>>>>>>> well.
>>>>>>>>> 
>>>>>>>>> Zoltan
>>>>>>>>> 
>>>>>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
>>>>>>>>> gabor.szadovszky@cloudera.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I started working on the jira PARQUET-1025 <
>>>>> https://issues.apache.org/
>>>>>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new
>> min-max
>>>>>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
>>>>>>>>>> jira/browse/PARQUET-686>.
>>>>>>>>>> 
>>>>>>>>>> After looking in the code deeper I think the spec of the new
>>> min-max
>>>>>>>>> stats
>>>>>>>>>> contradicts to the actual design of parquet-mr:
>>>>>>>>>> The concept of parquet-mr is to use the raw types (representing
>> the
>>>>>>>>>> primitive types in the spec) and let the client of the API have
>>> them
>>>>>>>>>> converted to rich objects (representing the logical types). For
>>>>> example
>>>>>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for
>> one
>>>>>>>>> specific
>>>>>>>>>> representation) instead of returning String or BigDecimal
>> anywhere.
>>>>>>>>>> The new min-max stats requires to have separate comparison
>>> mechanisms
>>>>>>> for
>>>>>>>>>> the same primitives depending on the logical types. For example
>>> UTF8
>>>>>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
>>>>>>>>>> comparisons for the same Binary class.
>>>>>>>>>> The problem is that we are specifying sorting orders based on
>>> logical
>>>>>>>>>> types while we are not providing specific java types for them. It
>>>>> means
>>>>>>>>> the
>>>>>>>>>> client can unintentionally use a different comparison logic than
>>>>>>>>> parquet-mr
>>>>>>>>>> in the min-max stats which can lead to discarding relevant values
>>>>>>> during
>>>>>>>>>> filtering.
>>>>>>>>>> 
>>>>>>>>>> I can see two possible solutions however none of them seems to be
>>>>> good
>>>>>>>>>> enough for me.
>>>>>>>>>> 
>>>>>>>>>> 1. Implement specific comparison logics based on the logical
>> types
>>> in
>>>>>>> the
>>>>>>>>>> related Statistics object
>>>>>>>>>> The problem here is that we are still using the raw types
>> therefore
>>>>>>>>> client
>>>>>>>>>> code in filters might implement different comparison logic than
>>>>>>> specified
>>>>>>>>>> and implemented in Statistics. For example in case of having
>>> UINT_32
>>>>>>> the
>>>>>>>>>> min-max comparison in Statistics will use the proper unsigned
>>>>>>> comparison
>>>>>>>>>> logic while the client code in case of checking the elements (in
>>> the
>>>>>>>>> column
>>>>>>>>>> or in the dictionary) might implement somewhat simpler e.g. value
>>>> 7
>>>>>>>>> which
>>>>>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
>>>>>>>>>> It is highly confusing for the client that it has to re-implement
>>> the
>>>>>>>>> same
>>>>>>>>>> comparison logic in the client code for the raw types as it is
>>>>>>>>> implemented
>>>>>>>>>> for the statistics.
>>>>>>>>>> 
>>>>>>>>>> 2. Return specific “rich” objects for the logical types instead
>> of
>>>>> the
>>>>>>>>> raw
>>>>>>>>>> types
>>>>>>>>>> This solution would solve the problems what the previous one has
>>> but
>>>>>>>>> would
>>>>>>>>>> introduce bigger problems.
>>>>>>>>>> Breaks the actual design of parquet-mr
>>>>>>>>>> Backward incompatible: existing client code would not work
>> properly
>>>>> for
>>>>>>>>>> several logical types
>>>>>>>>>> 
>>>>>>>>>> (3.) Revert the specs of the new min-max statistics
>>>>>>>>>> Let’s keep working on the raw types without having any additional
>>>>>>>>>> comparison logic. Specify the existing comparison logic (e.g.
>>> signed
>>>>>>> for
>>>>>>>>>> all primitive types, (signed) lexicographical for binary) and
>>> expect
>>>>>>> the
>>>>>>>>>> client to use these if it want to have sorting implemented for
>> the
>>>>>>> values
>>>>>>>>>> for better performance.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> What do you think, guys? Any suggestions on one of my solutions
>> or
>>>>> for
>>>>>>> a
>>>>>>>>>> new one I did not recognise?
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> Gabor
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
> The other design proposed by Gabor (correct compareTo() in the
logical-type-specific Binary implementations themselves) prevents this from
happening

Not implementing `Comparable` at all prevents this from happening; we
should deprecate `Binary` implementing it for 2.0. Then you must use a
`Comparator`. As you said, this doesn't solve the problem for types where
you're likely to shoot yourself in the foot, the ones that are represented
by Java primitives. I don't find that argument compelling for `Binary`.
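
As a rough sketch (not the actual parquet-mr API) of what the filter side
could do once it gets a logical-type-specific `Comparator`: the comparator
below implements the unsigned lexicographic order the spec requires for
UTF8, and canDropForGreaterThan is a hypothetical helper.

import java.util.Comparator;

final class RowGroupFilterSketch {

  // Unsigned lexicographic order, as the new min-max spec requires for UTF8.
  static final Comparator<byte[]> UTF8_ORDER = (a, b) -> {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF); // unsigned byte compare
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  };

  // For a predicate "column > value": the row group can only be dropped if
  // its max is not greater than the value, using the same ordering that was
  // used to compute the statistics.
  static boolean canDropForGreaterThan(byte[] max, byte[] value,
                                       Comparator<byte[]> order) {
    return order.compare(max, value) <= 0;
  }
}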

For unsigned ints, there's little we can do. But, I know of no instances of
people actually using unsigned ints, nor do I think it is a big risk when
people choose to implement low-level interfaces.

rb

On Tue, Nov 14, 2017 at 9:54 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> The reason why I brought up that API callers can easily shoot themselves in
> the foot is that it does not necessarily have to be this way.
>
> One of the design alternatives (separate Comparators) makes it very easy to
> shoot yourself in the foot for any logical type because you can
> unintentionally bypass it.
>
> The other design proposed by Gabor (correct compareTo() in the
> logical-type-specific Binary implementations themselves) prevents this from
> happening, but does not have a proper answer to the UINT problem. I find
> this design more appealing, because it leads to a better API in my opinion.
>
> Unlike other logical types that require custom comparisons, the primitive
> value behind a UINT is not an object. UINTs are not used as much as other
> logical types either. I would prefer a simpler solution that makes using
> those more important types "just work" and requires extra care for UINTs
> than one that makes all types work in the same way but requires the
> programmer to explicitly request a Comparator for all types. I would even
> consider deprecating the current way of getting UINTs from parquet-mr
> because I think that using regular Java integers for UINTs is a weak point
> of the API.
>
> Zoltan
>
> On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky <
> gabor.szadovszky@cloudera.com> wrote:
>
> > I agree that we cannot do anything practical to use UINT in java unless
> we
> > return proper unsigned objects instead of primitives but it would break
> the
> > whole API.
> > While my new approach does not solve the comparison problem of UINTs it
> > would have some benefits over the outside comparator logic:
> > The natural ordering would be correct; easier/simpler for the users
> > I think, the whole implementation would be simpler. For example we do not
> > need to change the actual statistics implementations (maybe the toString
> > used for debugging/logging).
> > The users do not need to change their code to get the benefits of the
> > proper ordering including the new min-max statistics
> > The only drawback is that it would not be backward compatible which I
> > don’t think really matters here as the current comparison logic
> implemented
> > in Binary is incorrect for any case.
> >
> > From the UINT point of view we still can use related min-max statistics
> if
> > the min-max values are in the non-negative range of the related signed
> > value (e.g. UINT_32 from 0 to 2^31 -1).
> >
> > I propose extending Binary with specific implementations of
> > compareTo/toString for the different logical types instead of the
> separate
> > Comparator idea which I think is more complex and confusing to the user.
> > What’s your opinion?
> >
> > Gabor
> >
> > > On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > >
> > > Applications may shoot themselves in the foot by using the wrong
> > > comparison. There’s not much we can do if the application uses UINT32
> and
> > > doesn’t use unsigned comparison. Maybe I’m missing something new about
> > this
> > > line of reasoning?
> > >
> > > The row group filters are currently passed column descriptors, which
> can
> > be
> > > updated to include the full Type instead of PrimitiveTypeName and used
> to
> > > get the correct Comparator. I don’t think it matters that the Java
> > > representation is the primitive type. If users implement their own
> > > UserDefinedPredicate classes, then they should understand how data will
> > be
> > > passed. We have the same requirement for implementing other low-level
> > > interfaces, like data models.
> > >
> > > rb
> > > ​
> > >
> > > On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> > > gabor.szadovszky@cloudera.com <ma...@cloudera.com>>
> > wrote:
> > >
> > >> Thanks a lot, Zoltan for making it clear.
> > >>
> > >> Meanwhile I’ve discovered that the problem I’ve mentioned before with
> > the
> > >> actual ordering of Binary and the former statistics is already solved:
> > >> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <
> > https://issues.apache.org/jira/browse/PARQUET-686>>
> > >> What still stands against my proposal (except that it does not support
> > the
> > >> proper comparison for UINT values) is that it breaks the backward
> > >> compatibility: Binary.compareTo(Binary) works in different way.
> > >>
> > >> Gabor
> > >>
> > >>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Let me rephrase the problem with unsigned ints in a way that I
> > personally
> > >>> find easier to understand.
> > >>>
> > >>> Let's take a typical application that uses native Java types
> > internally.
> > >>> For parquet-mr, these native types have to be converted to Parquet
> > >>> primitive types. The parquet-mr library supports low-level filtering
> of
> > >>> rows, pages or row groups by allowing the application to implement
> > >>> callbacks or build complex conditions using a set of predicates
> > provided
> > >> by
> > >>> parquet-mr. For this low-level filtering, conditions must be
> specified
> > in
> > >>> terms of Parquet's primitive/logical types and not in terms of the
> > native
> > >>> Java types that the application internally uses.
> > >>>
> > >>> The "primitive/logical" part of the last sentence is the actual
> change
> > in
> > >>> question. This used to be "primitive" and now for the new statistics
> > >> fields
> > >>> it should become "logical". Most logical types annotate binaries,
> which
> > >> is
> > >>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> > >>> more-or-less easily accomplished by providing a correct compareTo()
> > >>> implementation.
> > >>>
> > >>> The UINT logical types are an exception, however. For these,
> > parquet-mr's
> > >>> internal primitive types are the regular native Java types, so there
> is
> > >> no
> > >>> way to change their comparison behaviour. Even if we provide a
> correct
> > >>> Comparator, the application can simply specify filtering conditions
> > like
> > >>> "value < 42", which will result in a signed comparison on unsigned
> > >> fields.
> > >>> It makes it way too easy for developers to shoot themselves in the
> > foot.
> > >>>
> > >>> Zoltan
> > >>>
> > >>>
> > >>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> > >>> gabor.szadovszky@cloudera.com> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> During the development of this feature I’ve found out some scenarios
> > >> which
> > >>>> would be really confusing for the clients.
> > >>>> For example: We have the new min-max statistics in the file and also
> > >> have
> > >>>> dictionary encoding for binary elements. The client have filtering
> > >>>> implemented for the related column involving Gt. Which comparator
> > shall
> > >> we
> > >>>> use in the DictionaryFilter? If we use the new one based on the
> > >> statistics
> > >>>> it might contradict to the UserDefinedPredicate implemented by the
> > >> client
> > >>>> and might cause false negatives. If we use the current natural order
> > of
> > >>>> Binary it would contradict to the one in statistics. Further more,
> we
> > >> are
> > >>>> about the make the new comparators available to the client so the
> > >>>> implemented UserDefinedPredicate may not match to the
> DictionaryFilter
> > >> or
> > >>>> the statistics anyway.
> > >>>>
> > >>>> I think, the following new proposal would solve the confusion issues
> > and
> > >>>> even make the existing client code work properly with the new
> > >> statistics. I
> > >>>> am not sure yet, but think that the implementation would be cleaner
> as
> > >> well.
> > >>>> As far as I know the unsigned integer types (UINT_8 etc.) are not
> used
> > >>>> widely. I would skip creating min-max statistics for them or keep
> > having
> > >>>> signed comparison. (BTW, don’t we want to deprecate them?)
> > >>>> By leaving the unsigned integers out of the picture only the Binary
> > >> class
> > >>>> is left to support the different comparison logics. So, let’s
> refactor
> > >>>> Binary and create different implementations for the different
> logical
> > >>>> types. This way the natural ordering of Binary will always reflect
> the
> > >> one
> > >>>> specified for the logical types and the Statistics implementations
> do
> > >> not
> > >>>> need to be changed.
> > >>>> We do have a problem though and it is the current natural ordering
> of
> > >>>> Binary. It is implemented in a way that seems to be lexicographical
> > but
> > >> the
> > >>>> byte comparison is signed. I don’t think it is correct so I would
> drop
> > >> this
> > >>>> implementation but it makes a bit hard to implement the handing of
> the
> > >>>> former min-max statistics. If I would like to be correct, I would
> not
> > >> write
> > >>>> the former min-max statistics for Binary at all and would not use
> them
> > >> at
> > >>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
> > >> this
> > >>>> issue was not identified because clients are rarely using characters
> > >> where
> > >>>> unsigned/signed comparison matters.
> > >>>> What do you think?
> > >>>>
> > >>>> Regards,
> > >>>> Gabor
> > >>>>
> > >>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
> > wrote:
> > >>>>>
> > >>>>> Gabor,
> > >>>>>
> > >>>>> You're right that working with logical type data isn't really
> > something
> > >>>>> that Parquet has done much in the past. We have the distinction
> > between
> > >>>>> logical types and physical types to simplify what we need to
> support
> > --
> > >>>> we
> > >>>>> only have encodings for physical types -- but in this case, we
> can't
> > do
> > >>>>> that, and will need to use comparators based on the logical types.
> > For
> > >>>> now,
> > >>>>> we don't need to expose much to the layer above because all of the
> > >> types
> > >>>>> have a defined sort order. In the future, we should be able to
> > support
> > >>>>> other orders for UTF8 data, but we don't need to focus on that now.
> > >>>>>
> > >>>>> Zoltan,
> > >>>>>
> > >>>>> While it isn't ideal to have no control over the sort order, the
> only
> > >>>> thing
> > >>>>> we can do at the Parquet level is to handle data correctly when it
> > does
> > >>>>> come in sorted. The concerns you're raising are good to think
> about,
> > >> but
> > >>>> we
> > >>>>> need to solve them elsewhere. I think that the table format should
> > >> have a
> > >>>>> way to specify the sort order it wants for incoming data and
> > >> communicate
> > >>>>> that to the engine writing Parquet files. That's what we're working
> > on
> > >>>>> adding to the data source v2 API in Spark. Tables should be able to
> > >>>> specify
> > >>>>> the expected clustering (for partitioning) and sort order for rows,
> > >> then
> > >>>>> the query plan is automatically rewritten to make that happen.
> > >>>>>
> > >>>>> rb
> > >>>>>
> > >>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com>
> > wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> I don't know the solution, just adding my thoughts.
> > >>>>>>
> > >>>>>> In my opinion the underlying problem is that min-max ranges have
> to
> > be
> > >>>>>> small for the best filtering results. In order to achieve this,
> the
> > >> data
> > >>>>>> has to be sorted, but calculating min-max statistics is the
> > >>>> responsibility
> > >>>>>> of the library, while sorting the data is a responsibility of the
> > >>>>>> application above the library. Without logical-type-based sorting
> > >> rules,
> > >>>>>> the application may sort data using a different ordering than the
> > one
> > >>>> used
> > >>>>>> for calculating/filtering on the min-max values. This results in
> too
> > >>>> broad
> > >>>>>> min-max ranges, that in theory could still function correctly, but
> > are
> > >>>> not
> > >>>>>> optimal for filtering. (As an extra twist, even if we have
> > >>>>>> logical-type-based sorting rules, the application can still use a
> > >>>> different
> > >>>>>> ordering for sorting and even for comparisons. The results can
> range
> > >>>> from
> > >>>>>> overly broad min-max ranges to incorrectly discarded values.)
> > >>>>>>
> > >>>>>> From this point of view, dictionary entries and bloom filters are
> > much
> > >>>> less
> > >>>>>> problematic, because sorting by *any* order will optimize these
> > >>>> structures,
> > >>>>>> as the only important thing is that equal values should end up
> next
> > to
> > >>>> each
> > >>>>>> other to increase the chance of having a low number of values in a
> > >>>> single
> > >>>>>> page/row group.
> > >>>>>>
> > >>>>>> I think that trying to optimize min-max stats by sorting crosses
> the
> > >>>> layer
> > >>>>>> boundary between the library and the application and as such is
> much
> > >>>> better
> > >>>>>> suited to a full-stack implementation like Impala than the
> > parquet-mr
> > >> +
> > >>>>>> separate application stack. Neither relying on the application to
> > >>>> calculate
> > >>>>>> standards-compliant statistics nor forcing the application to use
> > the
> > >>>>>> library's data types for sorting and comparison seems like a good
> > >>>> solution
> > >>>>>> to me.
> > >>>>>>
> > >>>>>> Please note that this problem inherently applies to columns
> > statistics
> > >>>> as
> > >>>>>> well.
> > >>>>>>
> > >>>>>> Zoltan
> > >>>>>>
> > >>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> > >>>>>> gabor.szadovszky@cloudera.com> wrote:
> > >>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> I started working on the jira PARQUET-1025 <
> > >> https://issues.apache.org/
> > >>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new
> min-max
> > >>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
> > >>>>>>> jira/browse/PARQUET-686>.
> > >>>>>>>
> > >>>>>>> After looking in the code deeper I think the spec of the new
> > min-max
> > >>>>>> stats
> > >>>>>>> contradicts to the actual design of parquet-mr:
> > >>>>>>> The concept of parquet-mr is to use the raw types (representing
> the
> > >>>>>>> primitive types in the spec) and let the client of the API have
> > them
> > >>>>>>> converted to rich objects (representing the logical types). For
> > >> example
> > >>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for
> one
> > >>>>>> specific
> > >>>>>>> representation) instead of returning String or BigDecimal
> anywhere.
> > >>>>>>> The new min-max stats requires to have separate comparison
> > mechanisms
> > >>>> for
> > >>>>>>> the same primitives depending on the logical types. For example
> > UTF8
> > >>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
> > >>>>>>> comparisons for the same Binary class.
> > >>>>>>> The problem is that we are specifying sorting orders based on
> > logical
> > >>>>>>> types while we are not providing specific java types for them. It
> > >> means
> > >>>>>> the
> > >>>>>>> client can unintentionally use a different comparison logic than
> > >>>>>> parquet-mr
> > >>>>>>> in the min-max stats which can lead to discarding relevant values
> > >>>> during
> > >>>>>>> filtering.
> > >>>>>>>
> > >>>>>>> I can see two possible solutions however none of them seems to be
> > >> good
> > >>>>>>> enough for me.
> > >>>>>>>
> > >>>>>>> 1. Implement specific comparison logics based on the logical
> types
> > in
> > >>>> the
> > >>>>>>> related Statistics object
> > >>>>>>> The problem here is that we are still using the raw types
> therefore
> > >>>>>> client
> > >>>>>>> code in filters might implement different comparison logic than
> > >>>> specified
> > >>>>>>> and implemented in Statistics. For example in case of having
> > UINT_32
> > >>>> the
> > >>>>>>> min-max comparison in Statistics will use the proper unsigned
> > >>>> comparison
> > >>>>>>> logic while the client code in case of checking the elements (in
> > the
> > >>>>>> column
> > >>>>>>> or in the dictionary) might implement somewhat simpler e.g. value
> > > 7
> > >>>>>> which
> > >>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> > >>>>>>> It is highly confusing for the client that it has to re-implement
> > the
> > >>>>>> same
> > >>>>>>> comparison logic in the client code for the raw types as it is
> > >>>>>> implemented
> > >>>>>>> for the statistics.
> > >>>>>>>
> > >>>>>>> 2. Return specific “rich” objects for the logical types instead
> of
> > >> the
> > >>>>>> raw
> > >>>>>>> types
> > >>>>>>> This solution would solve the problems what the previous one has
> > but
> > >>>>>> would
> > >>>>>>> introduce bigger problems.
> > >>>>>>> Breaks the actual design of parquet-mr
> > >>>>>>> Backward incompatible: existing client code would not work
> properly
> > >> for
> > >>>>>>> several logical types
> > >>>>>>>
> > >>>>>>> (3.) Revert the specs of the new min-max statistics
> > >>>>>>> Let’s keep working on the raw types without having any additional
> > >>>>>>> comparison logic. Specify the existing comparison logic (e.g.
> > signed
> > >>>> for
> > >>>>>>> all primitive types, (signed) lexicographical for binary) and
> > expect
> > >>>> the
> > >>>>>>> client to use these if it want to have sorting implemented for
> the
> > >>>> values
> > >>>>>>> for better performance.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> What do you think, guys? Any suggestions on one of my solutions
> or
> > >> for
> > >>>> a
> > >>>>>>> new one I did not recognise?
> > >>>>>>>
> > >>>>>>> Thanks a lot,
> > >>>>>>> Gabor
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Ryan Blue
> > >>>>> Software Engineer
> > >>>>> Netflix
> > >>>>
> > >>>>
> > >>
> > >>
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Zoltan Ivanfi <zi...@cloudera.com>.
Hi,

The reason why I brought up that API callers can easily shoot themselves in
the foot is that it does not necessarily have to be this way.

One of the design alternatives (separate Comparators) makes it very easy to
shoot yourself in the foot for any logical type because you can
unintentionally bypass it.
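
A tiny plain-Java illustration (hypothetical values) of why the bypass
matters: the signed byte comparison used by the current natural ordering and
the unsigned comparison required for UTF8 already disagree for any byte at or
above 0x80, so bypassing the comparator silently changes the result.

import java.nio.charset.StandardCharsets;

public class ComparatorBypass {
  public static void main(String[] args) {
    byte[] nonAscii = "é".getBytes(StandardCharsets.UTF_8); // first byte is 0xC3
    byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);    // first byte is 0x7A

    // signed comparison of the first bytes (what the bypassed ordering does)
    System.out.println(Byte.compare(nonAscii[0], ascii[0]) > 0);                  // false
    // unsigned comparison (what the UTF8 sort order requires)
    System.out.println(Integer.compare(nonAscii[0] & 0xFF, ascii[0] & 0xFF) > 0); // true
  }
}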

The other design proposed by Gabor (correct compareTo() in the
logical-type-specific Binary implementations themselves) prevents this from
happening, but does not have a proper answer to the UINT problem. I find
this design more appealing, because it leads to a better API in my opinion.

Unlike other logical types that require custom comparisons, the primitive
value behind a UINT is not an object. UINTs are not used as much as other
logical types either. I would prefer a simpler solution that makes those
more important types "just work" and requires extra care only for UINTs over
one that makes all types work in the same way but requires the programmer to
explicitly request a Comparator for every type. I would even
consider deprecating the current way of getting UINTs from parquet-mr
because I think that using regular Java integers for UINTs is a weak point
of the API.
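
To make the footgun concrete, here is a small plain-Java sketch (not
parquet-mr code): the same stored UINT_32 value compares differently
depending on whether the signed or the unsigned interpretation is used.

public class UnsignedIntFootgun {
  public static void main(String[] args) {
    // The UINT_32 value 3,000,000,000 does not fit in a signed int,
    // so the application receives it as a negative Java int.
    int stored = (int) 3_000_000_000L;
    int threshold = 42;

    // What a client-side predicate like "value < 42" does: signed comparison.
    System.out.println(stored < threshold);                             // true (wrong for UINT_32)

    // What the UINT_32 sort order actually requires: unsigned comparison.
    System.out.println(Integer.compareUnsigned(stored, threshold) < 0); // false (correct)
  }
}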

Zoltan

On Tue, Nov 14, 2017 at 6:16 PM Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> I agree that we cannot do anything practical to use UINT in java unless we
> return proper unsigned objects instead of primitives but it would break the
> whole API.
> While my new approach does not solve the comparison problem of UINTs it
> would have some benefits over the outside comparator logic:
> The natural ordering would be correct; easier/simpler for the users
> I think, the whole implementation would be simpler. For example we do not
> need to change the actual statistics implementations (maybe the toString
> used for debugging/logging).
> The users do not need to change their code to get the benefits of the
> proper ordering including the new min-max statistics
> The only drawback is that it would not be backward compatible which I
> don’t think really matters here as the current comparison logic implemented
> in Binary is incorrect for any case.
>
> From the UINT point of view we still can use related min-max statistics if
> the min-max values are in the non-negative range of the related signed
> value (e.g. UINT_32 from 0 to 2^31 -1).
>
> I propose extending Binary with specific implementations of
> compareTo/toString for the different logical types instead of the separate
> Comparator idea which I think is more complex and confusing to the user.
> What’s your opinion?
>
> Gabor
>
> > On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Applications may shoot themselves in the foot by using the wrong
> > comparison. There’s not much we can do if the application uses UINT32 and
> > doesn’t use unsigned comparison. Maybe I’m missing something new about
> this
> > line of reasoning?
> >
> > The row group filters are currently passed column descriptors, which can
> be
> > updated to include the full Type instead of PrimitiveTypeName and used to
> > get the correct Comparator. I don’t think it matters that the Java
> > representation is the primitive type. If users implement their own
> > UserDefinedPredicate classes, then they should understand how data will
> be
> > passed. We have the same requirement for implementing other low-level
> > interfaces, like data models.
> >
> > rb
> > ​
> >
> > On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> > gabor.szadovszky@cloudera.com <ma...@cloudera.com>>
> wrote:
> >
> >> Thanks a lot, Zoltan for making it clear.
> >>
> >> Meanwhile I’ve discovered that the problem I’ve mentioned before with
> the
> >> actual ordering of Binary and the former statistics is already solved:
> >> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <
> https://issues.apache.org/jira/browse/PARQUET-686>>
> >> What still stands against my proposal (except that it does not support
> the
> >> proper comparison for UINT values) is that it breaks the backward
> >> compatibility: Binary.compareTo(Binary) works in different way.
> >>
> >> Gabor
> >>
> >>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Let me rephrase the problem with unsigned ints in a way that I
> personally
> >>> find easier to understand.
> >>>
> >>> Let's take a typical application that uses native Java types
> internally.
> >>> For parquet-mr, these native types have to be converted to Parquet
> >>> primitive types. The parquet-mr library supports low-level filtering of
> >>> rows, pages or row groups by allowing the application to implement
> >>> callbacks or build complex conditions using a set of predicates
> provided
> >> by
> >>> parquet-mr. For this low-level filtering, conditions must be specified
> in
> >>> terms of Parquet's primitive/logical types and not in terms of the
> native
> >>> Java types that the application internally uses.
> >>>
> >>> The "primitive/logical" part of the last sentence is the actual change
> in
> >>> question. This used to be "primitive" and now for the new statistics
> >> fields
> >>> it should become "logical". Most logical types annotate binaries, which
> >> is
> >>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> >>> more-or-less easily accomplished by providing a correct compareTo()
> >>> implementation.
> >>>
> >>> The UINT logical types are an exception, however. For these,
> parquet-mr's
> >>> internal primitive types are the regular native Java types, so there is
> >> no
> >>> way to change their comparison behaviour. Even if we provide a correct
> >>> Comparator, the application can simply specify filtering conditions
> like
> >>> "value < 42", which will result in a signed comparison on unsigned
> >> fields.
> >>> It makes it way too easy for developers to shoot themselves in the
> foot.
> >>>
> >>> Zoltan
> >>>
> >>>
> >>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> >>> gabor.szadovszky@cloudera.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> During the development of this feature I’ve found out some scenarios
> >> which
> >>>> would be really confusing for the clients.
> >>>> For example: We have the new min-max statistics in the file and also
> >> have
> >>>> dictionary encoding for binary elements. The client have filtering
> >>>> implemented for the related column involving Gt. Which comparator
> shall
> >> we
> >>>> use in the DictionaryFilter? If we use the new one based on the
> >> statistics
> >>>> it might contradict to the UserDefinedPredicate implemented by the
> >> client
> >>>> and might cause false negatives. If we use the current natural order
> of
> >>>> Binary it would contradict to the one in statistics. Further more, we
> >> are
> >>>> about the make the new comparators available to the client so the
> >>>> implemented UserDefinedPredicate may not match to the DictionaryFilter
> >> or
> >>>> the statistics anyway.
> >>>>
> >>>> I think, the following new proposal would solve the confusion issues
> and
> >>>> even make the existing client code work properly with the new
> >> statistics. I
> >>>> am not sure yet, but think that the implementation would be cleaner as
> >> well.
> >>>> As far as I know the unsigned integer types (UINT_8 etc.) are not used
> >>>> widely. I would skip creating min-max statistics for them or keep
> having
> >>>> signed comparison. (BTW, don’t we want to deprecate them?)
> >>>> By leaving the unsigned integers out of the picture only the Binary
> >> class
> >>>> is left to support the different comparison logics. So, let’s refactor
> >>>> Binary and create different implementations for the different logical
> >>>> types. This way the natural ordering of Binary will always reflect the
> >> one
> >>>> specified for the logical types and the Statistics implementations do
> >> not
> >>>> need to be changed.
> >>>> We do have a problem though and it is the current natural ordering of
> >>>> Binary. It is implemented in a way that seems to be lexicographical
> but
> >> the
> >>>> byte comparison is signed. I don’t think it is correct so I would drop
> >> this
> >>>> implementation but it makes a bit hard to implement the handing of the
> >>>> former min-max statistics. If I would like to be correct, I would not
> >> write
> >>>> the former min-max statistics for Binary at all and would not use them
> >> at
> >>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
> >> this
> >>>> issue was not identified because clients are rarely using characters
> >> where
> >>>> unsigned/signed comparison matters.
> >>>> What do you think?
> >>>>
> >>>> Regards,
> >>>> Gabor
> >>>>
> >>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID>
> wrote:
> >>>>>
> >>>>> Gabor,
> >>>>>
> >>>>> You're right that working with logical type data isn't really
> something
> >>>>> that Parquet has done much in the past. We have the distinction
> between
> >>>>> logical types and physical types to simplify what we need to support
> --
> >>>> we
> >>>>> only have encodings for physical types -- but in this case, we can't
> do
> >>>>> that, and will need to use comparators based on the logical types.
> For
> >>>> now,
> >>>>> we don't need to expose much to the layer above because all of the
> >> types
> >>>>> have a defined sort order. In the future, we should be able to
> support
> >>>>> other orders for UTF8 data, but we don't need to focus on that now.
> >>>>>
> >>>>> Zoltan,
> >>>>>
> >>>>> While it isn't ideal to have no control over the sort order, the only
> >>>> thing
> >>>>> we can do at the Parquet level is to handle data correctly when it
> does
> >>>>> come in sorted. The concerns you're raising are good to think about,
> >> but
> >>>> we
> >>>>> need to solve them elsewhere. I think that the table format should
> >> have a
> >>>>> way to specify the sort order it wants for incoming data and
> >> communicate
> >>>>> that to the engine writing Parquet files. That's what we're working
> on
> >>>>> adding to the data source v2 API in Spark. Tables should be able to
> >>>> specify
> >>>>> the expected clustering (for partitioning) and sort order for rows,
> >> then
> >>>>> the query plan is automatically rewritten to make that happen.
> >>>>>
> >>>>> rb
> >>>>>
> >>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com>
> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I don't know the solution, just adding my thoughts.
> >>>>>>
> >>>>>> In my opinion the underlying problem is that min-max ranges have to
> be
> >>>>>> small for the best filtering results. In order to achieve this, the
> >> data
> >>>>>> has to be sorted, but calculating min-max statistics is the
> >>>> responsibility
> >>>>>> of the library, while sorting the data is a responsibility of the
> >>>>>> application above the library. Without logical-type-based sorting
> >> rules,
> >>>>>> the application may sort data using a different ordering than the
> one
> >>>> used
> >>>>>> for calculating/filtering on the min-max values. This results in too
> >>>> broad
> >>>>>> min-max ranges, that in theory could still function correctly, but
> are
> >>>> not
> >>>>>> optimal for filtering. (As an extra twist, even if we have
> >>>>>> logical-type-based sorting rules, the application can still use a
> >>>> different
> >>>>>> ordering for sorting and even for comparisons. The results can range
> >>>> from
> >>>>>> overly broad min-max ranges to incorrectly discarded values.)
> >>>>>>
> >>>>>> From this point of view, dictionary entries and bloom filters are
> much
> >>>> less
> >>>>>> problematic, because sorting by *any* order will optimize these
> >>>> structures,
> >>>>>> as the only important thing is that equal values should end up next
> to
> >>>> each
> >>>>>> other to increase the chance of having a low number of values in a
> >>>> single
> >>>>>> page/row group.
> >>>>>>
> >>>>>> I think that trying to optimize min-max stats by sorting crosses the
> >>>> layer
> >>>>>> boundary between the library and the application and as such is much
> >>>> better
> >>>>>> suited to a full-stack implementation like Impala than the
> parquet-mr
> >> +
> >>>>>> separate application stack. Neither relying on the application to
> >>>> calculate
> >>>>>> standards-compliant statistics nor forcing the application to use
> the
> >>>>>> library's data types for sorting and comparison seems like a good
> >>>> solution
> >>>>>> to me.
> >>>>>>
> >>>>>> Please note that this problem inherently applies to columns
> statistics
> >>>> as
> >>>>>> well.
> >>>>>>
> >>>>>> Zoltan
> >>>>>>
> >>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >>>>>> gabor.szadovszky@cloudera.com> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I started working on the jira PARQUET-1025 <
> >> https://issues.apache.org/
> >>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
> >>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
> >>>>>>> jira/browse/PARQUET-686>.
> >>>>>>>
> >>>>>>> After looking in the code deeper I think the spec of the new
> min-max
> >>>>>> stats
> >>>>>>> contradicts to the actual design of parquet-mr:
> >>>>>>> The concept of parquet-mr is to use the raw types (representing the
> >>>>>>> primitive types in the spec) and let the client of the API have
> them
> >>>>>>> converted to rich objects (representing the logical types). For
> >> example
> >>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
> >>>>>> specific
> >>>>>>> representation) instead of returning String or BigDecimal anywhere.
> >>>>>>> The new min-max stats requires to have separate comparison
> mechanisms
> >>>> for
> >>>>>>> the same primitives depending on the logical types. For example
> UTF8
> >>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
> >>>>>>> comparisons for the same Binary class.
> >>>>>>> The problem is that we are specifying sorting orders based on
> logical
> >>>>>>> types while we are not providing specific java types for them. It
> >> means
> >>>>>> the
> >>>>>>> client can unintentionally use a different comparison logic than
> >>>>>> parquet-mr
> >>>>>>> in the min-max stats which can lead to discarding relevant values
> >>>> during
> >>>>>>> filtering.
> >>>>>>>
> >>>>>>> I can see two possible solutions however none of them seems to be
> >> good
> >>>>>>> enough for me.
> >>>>>>>
> >>>>>>> 1. Implement specific comparison logics based on the logical types
> in
> >>>> the
> >>>>>>> related Statistics object
> >>>>>>> The problem here is that we are still using the raw types therefore
> >>>>>> client
> >>>>>>> code in filters might implement different comparison logic than
> >>>> specified
> >>>>>>> and implemented in Statistics. For example in case of having
> UINT_32
> >>>> the
> >>>>>>> min-max comparison in Statistics will use the proper unsigned
> >>>> comparison
> >>>>>>> logic while the client code in case of checking the elements (in
> the
> >>>>>> column
> >>>>>>> or in the dictionary) might implement somewhat simpler e.g. value
> > 7
> >>>>>> which
> >>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>>>>>> It is highly confusing for the client that it has to re-implement
> the
> >>>>>> same
> >>>>>>> comparison logic in the client code for the raw types as it is
> >>>>>> implemented
> >>>>>>> for the statistics.
> >>>>>>>
> >>>>>>> 2. Return specific “rich” objects for the logical types instead of
> >> the
> >>>>>> raw
> >>>>>>> types
> >>>>>>> This solution would solve the problems what the previous one has
> but
> >>>>>> would
> >>>>>>> introduce bigger problems.
> >>>>>>> Breaks the actual design of parquet-mr
> >>>>>>> Backward incompatible: existing client code would not work properly
> >> for
> >>>>>>> several logical types
> >>>>>>>
> >>>>>>> (3.) Revert the specs of the new min-max statistics
> >>>>>>> Let’s keep working on the raw types without having any additional
> >>>>>>> comparison logic. Specify the existing comparison logic (e.g.
> signed
> >>>> for
> >>>>>>> all primitive types, (signed) lexicographical for binary) and
> expect
> >>>> the
> >>>>>>> client to use these if it want to have sorting implemented for the
> >>>> values
> >>>>>>> for better performance.
> >>>>>>>
> >>>>>>>
> >>>>>>> What do you think, guys? Any suggestions on one of my solutions or
> >> for
> >>>> a
> >>>>>>> new one I did not recognise?
> >>>>>>>
> >>>>>>> Thanks a lot,
> >>>>>>> Gabor
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Ryan Blue
> >>>>> Software Engineer
> >>>>> Netflix
> >>>>
> >>>>
> >>
> >>
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com>.
I agree that we cannot do anything practical for the UINT types in Java unless we return proper unsigned objects instead of primitives, but that would break the whole API.
While my new approach does not solve the comparison problem of UINTs, it would have some benefits over the outside comparator logic:
- The natural ordering would be correct, which is easier/simpler for the users.
- I think the whole implementation would be simpler. For example, we do not need to change the actual Statistics implementations (except maybe the toString used for debugging/logging).
- The users do not need to change their code to get the benefits of the proper ordering, including the new min-max statistics.
The only drawback is that it would not be backward compatible, which I don’t think really matters here, as the current comparison logic implemented in Binary is incorrect in any case.

From the UINT point of view, we can still use the related min-max statistics if the min and max values are in the non-negative range of the corresponding signed type (e.g. UINT_32 from 0 to 2^31 - 1).
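
Just to illustrate (a minimal sketch, not existing parquet-mr code), the check could be as simple as:

  // Sketch: signed min/max written for a UINT_32 column are safe to use for
  // filtering only when both ends are non-negative, because the signed and the
  // unsigned orderings agree on the range [0, 2^31 - 1].
  final class Uint32Stats {
    static boolean minMaxUsable(int signedMin, int signedMax) {
      return signedMin >= 0 && signedMax >= 0;
    }
  }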

I propose extending Binary with specific implementations of compareTo/toString for the different logical types instead of the separate Comparator idea, which I think is more complex and more confusing for the user.
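
For illustration, a minimal sketch of the unsigned, lexicographical comparison such a UTF8-specific Binary implementation could use (class and method names are made up):

  // Sketch: the ordering a UTF8/ENUM-annotated Binary could use in compareTo.
  final class UnsignedLexicographic {
    static int compare(byte[] a, byte[] b) {
      int len = Math.min(a.length, b.length);
      for (int i = 0; i < len; i++) {
        // mask to 0..255 so bytes with the high bit set sort after 0x7F
        int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
        if (cmp != 0) {
          return cmp;
        }
      }
      // a shorter input that is a prefix of the longer one sorts first
      return a.length - b.length;
    }
  }
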
What’s your opinion?

Gabor

> On 14 Nov 2017, at 17:22, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Applications may shoot themselves in the foot by using the wrong
> comparison. There’s not much we can do if the application uses UINT32 and
> doesn’t use unsigned comparison. Maybe I’m missing something new about this
> line of reasoning?
> 
> The row group filters are currently passed column descriptors, which can be
> updated to include the full Type instead of PrimitiveTypeName and used to
> get the correct Comparator. I don’t think it matters that the Java
> representation is the primitive type. If users implement their own
> UserDefinedPredicate classes, then they should understand how data will be
> passed. We have the same requirement for implementing other low-level
> interfaces, like data models.
> 
> rb
> ​
> 
> On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
> gabor.szadovszky@cloudera.com <ma...@cloudera.com>> wrote:
> 
>> Thanks a lot, Zoltan for making it clear.
>> 
>> Meanwhile I’ve discovered that the problem I’ve mentioned before with the
>> actual ordering of Binary and the former statistics is already solved:
>> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>>
>> What still stands against my proposal (except that it does not support the
>> proper comparison for UINT values) is that it breaks the backward
>> compatibility: Binary.compareTo(Binary) works in different way.
>> 
>> Gabor
>> 
>>> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Let me rephrase the problem with unsigned ints in a way that I personally
>>> find easier to understand.
>>> 
>>> Let's take a typical application that uses native Java types internally.
>>> For parquet-mr, these native types have to be converted to Parquet
>>> primitive types. The parquet-mr library supports low-level filtering of
>>> rows, pages or row groups by allowing the application to implement
>>> callbacks or build complex conditions using a set of predicates provided
>> by
>>> parquet-mr. For this low-level filtering, conditions must be specified in
>>> terms of Parquet's primitive/logical types and not in terms of the native
>>> Java types that the application internally uses.
>>> 
>>> The "primitive/logical" part of the last sentence is the actual change in
>>> question. This used to be "primitive" and now for the new statistics
>> fields
>>> it should become "logical". Most logical types annotate binaries, which
>> is
>>> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
>>> more-or-less easily accomplished by providing a correct compareTo()
>>> implementation.
>>> 
>>> The UINT logical types are an exception, however. For these, parquet-mr's
>>> internal primitive types are the regular native Java types, so there is
>> no
>>> way to change their comparison behaviour. Even if we provide a correct
>>> Comparator, the application can simply specify filtering conditions like
>>> "value < 42", which will result in a signed comparison on unsigned
>> fields.
>>> It makes it way too easy for developers to shoot themselves in the foot.
>>> 
>>> Zoltan
>>> 
>>> 
>>> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
>>> gabor.szadovszky@cloudera.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> During the development of this feature I’ve found out some scenarios
>> which
>>>> would be really confusing for the clients.
>>>> For example: We have the new min-max statistics in the file and also
>> have
>>>> dictionary encoding for binary elements. The client have filtering
>>>> implemented for the related column involving Gt. Which comparator shall
>> we
>>>> use in the DictionaryFilter? If we use the new one based on the
>> statistics
>>>> it might contradict to the UserDefinedPredicate implemented by the
>> client
>>>> and might cause false negatives. If we use the current natural order of
>>>> Binary it would contradict to the one in statistics. Further more, we
>> are
>>>> about the make the new comparators available to the client so the
>>>> implemented UserDefinedPredicate may not match to the DictionaryFilter
>> or
>>>> the statistics anyway.
>>>> 
>>>> I think, the following new proposal would solve the confusion issues and
>>>> even make the existing client code work properly with the new
>> statistics. I
>>>> am not sure yet, but think that the implementation would be cleaner as
>> well.
>>>> As far as I know the unsigned integer types (UINT_8 etc.) are not used
>>>> widely. I would skip creating min-max statistics for them or keep having
>>>> signed comparison. (BTW, don’t we want to deprecate them?)
>>>> By leaving the unsigned integers out of the picture only the Binary
>> class
>>>> is left to support the different comparison logics. So, let’s refactor
>>>> Binary and create different implementations for the different logical
>>>> types. This way the natural ordering of Binary will always reflect the
>> one
>>>> specified for the logical types and the Statistics implementations do
>> not
>>>> need to be changed.
>>>> We do have a problem though and it is the current natural ordering of
>>>> Binary. It is implemented in a way that seems to be lexicographical but
>> the
>>>> byte comparison is signed. I don’t think it is correct so I would drop
>> this
>>>> implementation but it makes a bit hard to implement the handing of the
>>>> former min-max statistics. If I would like to be correct, I would not
>> write
>>>> the former min-max statistics for Binary at all and would not use them
>> at
>>>> read. (Or only use it if it was not written by parquet-mr.) I guess,
>> this
>>>> issue was not identified because clients are rarely using characters
>> where
>>>> unsigned/signed comparison matters.
>>>> What do you think?
>>>> 
>>>> Regards,
>>>> Gabor
>>>> 
>>>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>>>> 
>>>>> Gabor,
>>>>> 
>>>>> You're right that working with logical type data isn't really something
>>>>> that Parquet has done much in the past. We have the distinction between
>>>>> logical types and physical types to simplify what we need to support --
>>>> we
>>>>> only have encodings for physical types -- but in this case, we can't do
>>>>> that, and will need to use comparators based on the logical types. For
>>>> now,
>>>>> we don't need to expose much to the layer above because all of the
>> types
>>>>> have a defined sort order. In the future, we should be able to support
>>>>> other orders for UTF8 data, but we don't need to focus on that now.
>>>>> 
>>>>> Zoltan,
>>>>> 
>>>>> While it isn't ideal to have no control over the sort order, the only
>>>> thing
>>>>> we can do at the Parquet level is to handle data correctly when it does
>>>>> come in sorted. The concerns you're raising are good to think about,
>> but
>>>> we
>>>>> need to solve them elsewhere. I think that the table format should
>> have a
>>>>> way to specify the sort order it wants for incoming data and
>> communicate
>>>>> that to the engine writing Parquet files. That's what we're working on
>>>>> adding to the data source v2 API in Spark. Tables should be able to
>>>> specify
>>>>> the expected clustering (for partitioning) and sort order for rows,
>> then
>>>>> the query plan is automatically rewritten to make that happen.
>>>>> 
>>>>> rb
>>>>> 
>>>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I don't know the solution, just adding my thoughts.
>>>>>> 
>>>>>> In my opinion the underlying problem is that min-max ranges have to be
>>>>>> small for the best filtering results. In order to achieve this, the
>> data
>>>>>> has to be sorted, but calculating min-max statistics is the
>>>> responsibility
>>>>>> of the library, while sorting the data is a responsibility of the
>>>>>> application above the library. Without logical-type-based sorting
>> rules,
>>>>>> the application may sort data using a different ordering than the one
>>>> used
>>>>>> for calculating/filtering on the min-max values. This results in too
>>>> broad
>>>>>> min-max ranges, that in theory could still function correctly, but are
>>>> not
>>>>>> optimal for filtering. (As an extra twist, even if we have
>>>>>> logical-type-based sorting rules, the application can still use a
>>>> different
>>>>>> ordering for sorting and even for comparisons. The results can range
>>>> from
>>>>>> overly broad min-max ranges to incorrectly discarded values.)
>>>>>> 
>>>>>> From this point of view, dictionary entries and bloom filters are much
>>>> less
>>>>>> problematic, because sorting by *any* order will optimize these
>>>> structures,
>>>>>> as the only important thing is that equal values should end up next to
>>>> each
>>>>>> other to increase the chance of having a low number of values in a
>>>> single
>>>>>> page/row group.
>>>>>> 
>>>>>> I think that trying to optimize min-max stats by sorting crosses the
>>>> layer
>>>>>> boundary between the library and the application and as such is much
>>>> better
>>>>>> suited to a full-stack implementation like Impala than the parquet-mr
>> +
>>>>>> separate application stack. Neither relying on the application to
>>>> calculate
>>>>>> standards-compliant statistics nor forcing the application to use the
>>>>>> library's data types for sorting and comparison seems like a good
>>>> solution
>>>>>> to me.
>>>>>> 
>>>>>> Please note that this problem inherently applies to columns statistics
>>>> as
>>>>>> well.
>>>>>> 
>>>>>> Zoltan
>>>>>> 
>>>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
>>>>>> gabor.szadovszky@cloudera.com> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> I started working on the jira PARQUET-1025 <
>> https://issues.apache.org/
>>>>>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
>>>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
>>>>>>> jira/browse/PARQUET-686>.
>>>>>>> 
>>>>>>> After looking in the code deeper I think the spec of the new min-max
>>>>>> stats
>>>>>>> contradicts to the actual design of parquet-mr:
>>>>>>> The concept of parquet-mr is to use the raw types (representing the
>>>>>>> primitive types in the spec) and let the client of the API have them
>>>>>>> converted to rich objects (representing the logical types). For
>> example
>>>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
>>>>>> specific
>>>>>>> representation) instead of returning String or BigDecimal anywhere.
>>>>>>> The new min-max stats requires to have separate comparison mechanisms
>>>> for
>>>>>>> the same primitives depending on the logical types. For example UTF8
>>>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
>>>>>>> comparisons for the same Binary class.
>>>>>>> The problem is that we are specifying sorting orders based on logical
>>>>>>> types while we are not providing specific java types for them. It
>> means
>>>>>> the
>>>>>>> client can unintentionally use a different comparison logic than
>>>>>> parquet-mr
>>>>>>> in the min-max stats which can lead to discarding relevant values
>>>> during
>>>>>>> filtering.
>>>>>>> 
>>>>>>> I can see two possible solutions however none of them seems to be
>> good
>>>>>>> enough for me.
>>>>>>> 
>>>>>>> 1. Implement specific comparison logics based on the logical types in
>>>> the
>>>>>>> related Statistics object
>>>>>>> The problem here is that we are still using the raw types therefore
>>>>>> client
>>>>>>> code in filters might implement different comparison logic than
>>>> specified
>>>>>>> and implemented in Statistics. For example in case of having UINT_32
>>>> the
>>>>>>> min-max comparison in Statistics will use the proper unsigned
>>>> comparison
>>>>>>> logic while the client code in case of checking the elements (in the
>>>>>> column
>>>>>>> or in the dictionary) might implement somewhat simpler e.g. value > 7
>>>>>> which
>>>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
>>>>>>> It is highly confusing for the client that it has to re-implement the
>>>>>> same
>>>>>>> comparison logic in the client code for the raw types as it is
>>>>>> implemented
>>>>>>> for the statistics.
>>>>>>> 
>>>>>>> 2. Return specific “rich” objects for the logical types instead of
>> the
>>>>>> raw
>>>>>>> types
>>>>>>> This solution would solve the problems what the previous one has but
>>>>>> would
>>>>>>> introduce bigger problems.
>>>>>>> Breaks the actual design of parquet-mr
>>>>>>> Backward incompatible: existing client code would not work properly
>> for
>>>>>>> several logical types
>>>>>>> 
>>>>>>> (3.) Revert the specs of the new min-max statistics
>>>>>>> Let’s keep working on the raw types without having any additional
>>>>>>> comparison logic. Specify the existing comparison logic (e.g. signed
>>>> for
>>>>>>> all primitive types, (signed) lexicographical for binary) and expect
>>>> the
>>>>>>> client to use these if it want to have sorting implemented for the
>>>> values
>>>>>>> for better performance.
>>>>>>> 
>>>>>>> 
>>>>>>> What do you think, guys? Any suggestions on one of my solutions or
>> for
>>>> a
>>>>>>> new one I did not recognise?
>>>>>>> 
>>>>>>> Thanks a lot,
>>>>>>> Gabor
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>> 
>>>> 
>> 
>> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Applications may shoot themselves in the foot by using the wrong
comparison. There’s not much we can do if the application uses UINT32 and
doesn’t use unsigned comparison. Maybe I’m missing something new about this
line of reasoning?

The row group filters are currently passed column descriptors, which can be
updated to include the full Type instead of PrimitiveTypeName and used to
get the correct Comparator. I don’t think it matters that the Java
representation is the primitive type. If users implement their own
UserDefinedPredicate classes, then they should understand how data will be
passed. We have the same requirement for implementing other low-level
interfaces, like data models.
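
To make that concrete, a rough sketch of such a lookup (the class and method are hypothetical, not an existing parquet-mr API; Arrays.compareUnsigned needs Java 9+ and is only used to keep the sketch short):

  import java.util.Arrays;
  import java.util.Comparator;

  import org.apache.parquet.io.api.Binary;
  import org.apache.parquet.schema.OriginalType;
  import org.apache.parquet.schema.PrimitiveType;

  class LogicalTypeComparators {
    // Picks the comparator mandated by the logical type annotation of a BINARY column.
    static Comparator<Binary> forBinaryColumn(PrimitiveType type) {
      OriginalType annotation = type.getOriginalType();
      if (annotation == OriginalType.UTF8 || annotation == OriginalType.ENUM) {
        // unsigned, lexicographical byte order
        return (a, b) -> Arrays.compareUnsigned(a.getBytes(), b.getBytes());
      }
      // DECIMAL backed by BINARY would need a signed comparison of the
      // two's-complement value (left out of this sketch); everything else
      // keeps Binary's existing natural order.
      return Comparator.naturalOrder();
    }
  }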

rb
​

On Tue, Nov 14, 2017 at 5:27 AM, Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> Thanks a lot, Zoltan for making it clear.
>
> Meanwhile I’ve discovered that the problem I’ve mentioned before with the
> actual ordering of Binary and the former statistics is already solved:
> PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>
> What still stands against my proposal (except that it does not support the
> proper comparison for UINT values) is that it breaks the backward
> compatibility: Binary.compareTo(Binary) works in different way.
>
> Gabor
>
> > On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> > Hi,
> >
> > Let me rephrase the problem with unsigned ints in a way that I personally
> > find easier to understand.
> >
> > Let's take a typical application that uses native Java types internally.
> > For parquet-mr, these native types have to be converted to Parquet
> > primitive types. The parquet-mr library supports low-level filtering of
> > rows, pages or row groups by allowing the application to implement
> > callbacks or build complex conditions using a set of predicates provided
> by
> > parquet-mr. For this low-level filtering, conditions must be specified in
> > terms of Parquet's primitive/logical types and not in terms of the native
> > Java types that the application internally uses.
> >
> > The "primitive/logical" part of the last sentence is the actual change in
> > question. This used to be "primitive" and now for the new statistics
> fields
> > it should become "logical". Most logical types annotate binaries, which
> is
> > a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> > more-or-less easily accomplished by providing a correct compareTo()
> > implementation.
> >
> > The UINT logical types are an exception, however. For these, parquet-mr's
> > internal primitive types are the regular native Java types, so there is
> no
> > way to change their comparison behaviour. Even if we provide a correct
> > Comparator, the application can simply specify filtering conditions like
> > "value < 42", which will result in a signed comparison on unsigned
> fields.
> > It makes it way too easy for developers to shoot themselves in the foot.
> >
> > Zoltan
> >
> >
> > On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> > gabor.szadovszky@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> During the development of this feature I’ve found out some scenarios
> which
> >> would be really confusing for the clients.
> >> For example: We have the new min-max statistics in the file and also
> have
> >> dictionary encoding for binary elements. The client have filtering
> >> implemented for the related column involving Gt. Which comparator shall
> we
> >> use in the DictionaryFilter? If we use the new one based on the
> statistics
> >> it might contradict to the UserDefinedPredicate implemented by the
> client
> >> and might cause false negatives. If we use the current natural order of
> >> Binary it would contradict to the one in statistics. Further more, we
> are
> >> about the make the new comparators available to the client so the
> >> implemented UserDefinedPredicate may not match to the DictionaryFilter
> or
> >> the statistics anyway.
> >>
> >> I think, the following new proposal would solve the confusion issues and
> >> even make the existing client code work properly with the new
> statistics. I
> >> am not sure yet, but think that the implementation would be cleaner as
> well.
> >> As far as I know the unsigned integer types (UINT_8 etc.) are not used
> >> widely. I would skip creating min-max statistics for them or keep having
> >> signed comparison. (BTW, don’t we want to deprecate them?)
> >> By leaving the unsigned integers out of the picture only the Binary
> class
> >> is left to support the different comparison logics. So, let’s refactor
> >> Binary and create different implementations for the different logical
> >> types. This way the natural ordering of Binary will always reflect the
> one
> >> specified for the logical types and the Statistics implementations do
> not
> >> need to be changed.
> >> We do have a problem though and it is the current natural ordering of
> >> Binary. It is implemented in a way that seems to be lexicographical but
> the
> >> byte comparison is signed. I don’t think it is correct so I would drop
> this
> >> implementation but it makes a bit hard to implement the handing of the
> >> former min-max statistics. If I would like to be correct, I would not
> write
> >> the former min-max statistics for Binary at all and would not use them
> at
> >> read. (Or only use it if it was not written by parquet-mr.) I guess,
> this
> >> issue was not identified because clients are rarely using characters
> where
> >> unsigned/signed comparison matters.
> >> What do you think?
> >>
> >> Regards,
> >> Gabor
> >>
> >>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >>>
> >>> Gabor,
> >>>
> >>> You're right that working with logical type data isn't really something
> >>> that Parquet has done much in the past. We have the distinction between
> >>> logical types and physical types to simplify what we need to support --
> >> we
> >>> only have encodings for physical types -- but in this case, we can't do
> >>> that, and will need to use comparators based on the logical types. For
> >> now,
> >>> we don't need to expose much to the layer above because all of the
> types
> >>> have a defined sort order. In the future, we should be able to support
> >>> other orders for UTF8 data, but we don't need to focus on that now.
> >>>
> >>> Zoltan,
> >>>
> >>> While it isn't ideal to have no control over the sort order, the only
> >> thing
> >>> we can do at the Parquet level is to handle data correctly when it does
> >>> come in sorted. The concerns you're raising are good to think about,
> but
> >> we
> >>> need to solve them elsewhere. I think that the table format should
> have a
> >>> way to specify the sort order it wants for incoming data and
> communicate
> >>> that to the engine writing Parquet files. That's what we're working on
> >>> adding to the data source v2 API in Spark. Tables should be able to
> >> specify
> >>> the expected clustering (for partitioning) and sort order for rows,
> then
> >>> the query plan is automatically rewritten to make that happen.
> >>>
> >>> rb
> >>>
> >>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I don't know the solution, just adding my thoughts.
> >>>>
> >>>> In my opinion the underlying problem is that min-max ranges have to be
> >>>> small for the best filtering results. In order to achieve this, the
> data
> >>>> has to be sorted, but calculating min-max statistics is the
> >> responsibility
> >>>> of the library, while sorting the data is a responsibility of the
> >>>> application above the library. Without logical-type-based sorting
> rules,
> >>>> the application may sort data using a different ordering than the one
> >> used
> >>>> for calculating/filtering on the min-max values. This results in too
> >> broad
> >>>> min-max ranges, that in theory could still function correctly, but are
> >> not
> >>>> optimal for filtering. (As an extra twist, even if we have
> >>>> logical-type-based sorting rules, the application can still use a
> >> different
> >>>> ordering for sorting and even for comparisons. The results can range
> >> from
> >>>> overly broad min-max ranges to incorrectly discarded values.)
> >>>>
> >>>> From this point of view, dictionary entries and bloom filters are much
> >> less
> >>>> problematic, because sorting by *any* order will optimize these
> >> structures,
> >>>> as the only important thing is that equal values should end up next to
> >> each
> >>>> other to increase the chance of having a low number of values in a
> >> single
> >>>> page/row group.
> >>>>
> >>>> I think that trying to optimize min-max stats by sorting crosses the
> >> layer
> >>>> boundary between the library and the application and as such is much
> >> better
> >>>> suited to a full-stack implementation like Impala than the parquet-mr
> +
> >>>> separate application stack. Neither relying on the application to
> >> calculate
> >>>> standards-compliant statistics nor forcing the application to use the
> >>>> library's data types for sorting and comparison seems like a good
> >> solution
> >>>> to me.
> >>>>
> >>>> Please note that this problem inherently applies to columns statistics
> >> as
> >>>> well.
> >>>>
> >>>> Zoltan
> >>>>
> >>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >>>> gabor.szadovszky@cloudera.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I started working on the jira PARQUET-1025 <
> https://issues.apache.org/
> >>>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
> >>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
> >>>>> jira/browse/PARQUET-686>.
> >>>>>
> >>>>> After looking in the code deeper I think the spec of the new min-max
> >>>> stats
> >>>>> contradicts to the actual design of parquet-mr:
> >>>>> The concept of parquet-mr is to use the raw types (representing the
> >>>>> primitive types in the spec) and let the client of the API have them
> >>>>> converted to rich objects (representing the logical types). For
> example
> >>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
> >>>> specific
> >>>>> representation) instead of returning String or BigDecimal anywhere.
> >>>>> The new min-max stats requires to have separate comparison mechanisms
> >> for
> >>>>> the same primitives depending on the logical types. For example UTF8
> >>>>> requires unsigned (lexicographical) while DECIMAL requires signed
> >>>>> comparisons for the same Binary class.
> >>>>> The problem is that we are specifying sorting orders based on logical
> >>>>> types while we are not providing specific java types for them. It
> means
> >>>> the
> >>>>> client can unintentionally use a different comparison logic than
> >>>> parquet-mr
> >>>>> in the min-max stats which can lead to discarding relevant values
> >> during
> >>>>> filtering.
> >>>>>
> >>>>> I can see two possible solutions however none of them seems to be
> good
> >>>>> enough for me.
> >>>>>
> >>>>> 1. Implement specific comparison logics based on the logical types in
> >> the
> >>>>> related Statistics object
> >>>>> The problem here is that we are still using the raw types therefore
> >>>> client
> >>>>> code in filters might implement different comparison logic than
> >> specified
> >>>>> and implemented in Statistics. For example in case of having UINT_32
> >> the
> >>>>> min-max comparison in Statistics will use the proper unsigned
> >> comparison
> >>>>> logic while the client code in case of checking the elements (in the
> >>>> column
> >>>>> or in the dictionary) might implement somewhat simpler e.g. value > 7
> >>>> which
> >>>>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>>>> It is highly confusing for the client that it has to re-implement the
> >>>> same
> >>>>> comparison logic in the client code for the raw types as it is
> >>>> implemented
> >>>>> for the statistics.
> >>>>>
> >>>>> 2. Return specific “rich” objects for the logical types instead of
> the
> >>>> raw
> >>>>> types
> >>>>> This solution would solve the problems what the previous one has but
> >>>> would
> >>>>> introduce bigger problems.
> >>>>> Breaks the actual design of parquet-mr
> >>>>> Backward incompatible: existing client code would not work properly
> for
> >>>>> several logical types
> >>>>>
> >>>>> (3.) Revert the specs of the new min-max statistics
> >>>>> Let’s keep working on the raw types without having any additional
> >>>>> comparison logic. Specify the existing comparison logic (e.g. signed
> >> for
> >>>>> all primitive types, (signed) lexicographical for binary) and expect
> >> the
> >>>>> client to use these if it want to have sorting implemented for the
> >> values
> >>>>> for better performance.
> >>>>>
> >>>>>
> >>>>> What do you think, guys? Any suggestions on one of my solutions or
> for
> >> a
> >>>>> new one I did not recognise?
> >>>>>
> >>>>> Thanks a lot,
> >>>>> Gabor
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Ryan Blue
> >>> Software Engineer
> >>> Netflix
> >>
> >>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com>.
Thanks a lot, Zoltan, for making it clear.

Meanwhile I’ve discovered that the problem I mentioned before with the current ordering of Binary and the former statistics has already been addressed: PARQUET-686 <https://issues.apache.org/jira/browse/PARQUET-686>
What still stands against my proposal (besides that it does not support the proper comparison for UINT values) is that it breaks backward compatibility: Binary.compareTo(Binary) would work in a different way.

Gabor

> On 14 Nov 2017, at 14:08, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> 
> Hi,
> 
> Let me rephrase the problem with unsigned ints in a way that I personally
> find easier to understand.
> 
> Let's take a typical application that uses native Java types internally.
> For parquet-mr, these native types have to be converted to Parquet
> primitive types. The parquet-mr library supports low-level filtering of
> rows, pages or row groups by allowing the application to implement
> callbacks or build complex conditions using a set of predicates provided by
> parquet-mr. For this low-level filtering, conditions must be specified in
> terms of Parquet's primitive/logical types and not in terms of the native
> Java types that the application internally uses.
> 
> The "primitive/logical" part of the last sentence is the actual change in
> question. This used to be "primitive" and now for the new statistics fields
> it should become "logical". Most logical types annotate binaries, which is
> a class (hierarchy) in parquet-mr. Making these logical-type-aware is
> more-or-less easily accomplished by providing a correct compareTo()
> implementation.
> 
> The UINT logical types are an exception, however. For these, parquet-mr's
> internal primitive types are the regular native Java types, so there is no
> way to change their comparison behaviour. Even if we provide a correct
> Comparator, the application can simply specify filtering conditions like
> "value < 42", which will result in a signed comparison on unsigned fields.
> It makes it way too easy for developers to shoot themselves in the foot.
> 
> Zoltan
> 
> 
> On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
> gabor.szadovszky@cloudera.com> wrote:
> 
>> Hi,
>> 
>> During the development of this feature I’ve found out some scenarios which
>> would be really confusing for the clients.
>> For example: We have the new min-max statistics in the file and also have
>> dictionary encoding for binary elements. The client have filtering
>> implemented for the related column involving Gt. Which comparator shall we
>> use in the DictionaryFilter? If we use the new one based on the statistics
>> it might contradict to the UserDefinedPredicate implemented by the client
>> and might cause false negatives. If we use the current natural order of
>> Binary it would contradict to the one in statistics. Further more, we are
>> about the make the new comparators available to the client so the
>> implemented UserDefinedPredicate may not match to the DictionaryFilter or
>> the statistics anyway.
>> 
>> I think, the following new proposal would solve the confusion issues and
>> even make the existing client code work properly with the new statistics. I
>> am not sure yet, but think that the implementation would be cleaner as well.
>> As far as I know the unsigned integer types (UINT_8 etc.) are not used
>> widely. I would skip creating min-max statistics for them or keep having
>> signed comparison. (BTW, don’t we want to deprecate them?)
>> By leaving the unsigned integers out of the picture only the Binary class
>> is left to support the different comparison logics. So, let’s refactor
>> Binary and create different implementations for the different logical
>> types. This way the natural ordering of Binary will always reflect the one
>> specified for the logical types and the Statistics implementations do not
>> need to be changed.
>> We do have a problem though and it is the current natural ordering of
>> Binary. It is implemented in a way that seems to be lexicographical but the
>> byte comparison is signed. I don’t think it is correct so I would drop this
>> implementation but it makes a bit hard to implement the handing of the
>> former min-max statistics. If I would like to be correct, I would not write
>> the former min-max statistics for Binary at all and would not use them at
>> read. (Or only use it if it was not written by parquet-mr.) I guess, this
>> issue was not identified because clients are rarely using characters where
>> unsigned/signed comparison matters.
>> What do you think?
>> 
>> Regards,
>> Gabor
>> 
>>> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>>> 
>>> Gabor,
>>> 
>>> You're right that working with logical type data isn't really something
>>> that Parquet has done much in the past. We have the distinction between
>>> logical types and physical types to simplify what we need to support --
>> we
>>> only have encodings for physical types -- but in this case, we can't do
>>> that, and will need to use comparators based on the logical types. For
>> now,
>>> we don't need to expose much to the layer above because all of the types
>>> have a defined sort order. In the future, we should be able to support
>>> other orders for UTF8 data, but we don't need to focus on that now.
>>> 
>>> Zoltan,
>>> 
>>> While it isn't ideal to have no control over the sort order, the only
>> thing
>>> we can do at the Parquet level is to handle data correctly when it does
>>> come in sorted. The concerns you're raising are good to think about, but
>> we
>>> need to solve them elsewhere. I think that the table format should have a
>>> way to specify the sort order it wants for incoming data and communicate
>>> that to the engine writing Parquet files. That's what we're working on
>>> adding to the data source v2 API in Spark. Tables should be able to
>> specify
>>> the expected clustering (for partitioning) and sort order for rows, then
>>> the query plan is automatically rewritten to make that happen.
>>> 
>>> rb
>>> 
>>> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I don't know the solution, just adding my thoughts.
>>>> 
>>>> In my opinion the underlying problem is that min-max ranges have to be
>>>> small for the best filtering results. In order to achieve this, the data
>>>> has to be sorted, but calculating min-max statistics is the
>> responsibility
>>>> of the library, while sorting the data is a responsibility of the
>>>> application above the library. Without logical-type-based sorting rules,
>>>> the application may sort data using a different ordering than the one
>> used
>>>> for calculating/filtering on the min-max values. This results in too
>> broad
>>>> min-max ranges, that in theory could still function correctly, but are
>> not
>>>> optimal for filtering. (As an extra twist, even if we have
>>>> logical-type-based sorting rules, the application can still use a
>> different
>>>> ordering for sorting and even for comparisons. The results can range
>> from
>>>> overly broad min-max ranges to incorrectly discarded values.)
>>>> 
>>>> From this point of view, dictionary entries and bloom filters are much
>> less
>>>> problematic, because sorting by *any* order will optimize these
>> structures,
>>>> as the only important thing is that equal values should end up next to
>> each
>>>> other to increase the chance of having a low number of values in a
>> single
>>>> page/row group.
>>>> 
>>>> I think that trying to optimize min-max stats by sorting crosses the
>> layer
>>>> boundary between the library and the application and as such is much
>> better
>>>> suited to a full-stack implementation like Impala than the parquet-mr +
>>>> separate application stack. Neither relying on the application to
>> calculate
>>>> standards-compliant statistics nor forcing the application to use the
>>>> library's data types for sorting and comparison seems like a good
>> solution
>>>> to me.
>>>> 
>>>> Please note that this problem inherently applies to columns statistics
>> as
>>>> well.
>>>> 
>>>> Zoltan
>>>> 
>>>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
>>>> gabor.szadovszky@cloudera.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I started working on the jira PARQUET-1025 <https://issues.apache.org/
>>>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
>>>>> statistics specified in PARQUET-686 <https://issues.apache.org/
>>>>> jira/browse/PARQUET-686>.
>>>>> 
>>>>> After looking in the code deeper I think the spec of the new min-max
>>>> stats
>>>>> contradicts to the actual design of parquet-mr:
>>>>> The concept of parquet-mr is to use the raw types (representing the
>>>>> primitive types in the spec) and let the client of the API have them
>>>>> converted to rich objects (representing the logical types). For example
>>>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
>>>> specific
>>>>> representation) instead of returning String or BigDecimal anywhere.
>>>>> The new min-max stats requires to have separate comparison mechanisms
>> for
>>>>> the same primitives depending on the logical types. For example UTF8
>>>>> requires unsigned (lexicographical) while DECIMAL requires signed
>>>>> comparisons for the same Binary class.
>>>>> The problem is that we are specifying sorting orders based on logical
>>>>> types while we are not providing specific java types for them. It means
>>>> the
>>>>> client can unintentionally use a different comparison logic than
>>>> parquet-mr
>>>>> in the min-max stats which can lead to discarding relevant values
>> during
>>>>> filtering.
>>>>> 
>>>>> I can see two possible solutions however none of them seems to be good
>>>>> enough for me.
>>>>> 
>>>>> 1. Implement specific comparison logics based on the logical types in
>> the
>>>>> related Statistics object
>>>>> The problem here is that we are still using the raw types therefore
>>>> client
>>>>> code in filters might implement different comparison logic than
>> specified
>>>>> and implemented in Statistics. For example in case of having UINT_32
>> the
>>>>> min-max comparison in Statistics will use the proper unsigned
>> comparison
>>>>> logic while the client code in case of checking the elements (in the
>>>> column
>>>>> or in the dictionary) might implement somewhat simpler e.g. value > 7
>>>> which
>>>>> may lead to different results. See UserDefinedPredicate.keep(T)
>>>>> It is highly confusing for the client that it has to re-implement the
>>>> same
>>>>> comparison logic in the client code for the raw types as it is
>>>> implemented
>>>>> for the statistics.
>>>>> 
>>>>> 2. Return specific “rich” objects for the logical types instead of the
>>>> raw
>>>>> types
>>>>> This solution would solve the problems what the previous one has but
>>>> would
>>>>> introduce bigger problems.
>>>>> Breaks the actual design of parquet-mr
>>>>> Backward incompatible: existing client code would not work properly for
>>>>> several logical types
>>>>> 
>>>>> (3.) Revert the specs of the new min-max statistics
>>>>> Let’s keep working on the raw types without having any additional
>>>>> comparison logic. Specify the existing comparison logic (e.g. signed
>> for
>>>>> all primitive types, (signed) lexicographical for binary) and expect
>> the
>>>>> client to use these if it want to have sorting implemented for the
>> values
>>>>> for better performance.
>>>>> 
>>>>> 
>>>>> What do you think, guys? Any suggestions on one of my solutions or for
>> a
>>>>> new one I did not recognise?
>>>>> 
>>>>> Thanks a lot,
>>>>> Gabor
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 


Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Zoltan Ivanfi <zi...@cloudera.com>.
Hi,

Let me rephrase the problem with unsigned ints in a way that I personally
find easier to understand.

Let's take a typical application that uses native Java types internally.
For parquet-mr, these native types have to be converted to Parquet
primitive types. The parquet-mr library supports low-level filtering of
rows, pages or row groups by allowing the application to implement
callbacks or build complex conditions using a set of predicates provided by
parquet-mr. For this low-level filtering, conditions must be specified in
terms of Parquet's primitive/logical types and not in terms of the native
Java types that the application internally uses.

The "primitive/logical" part of the last sentence is the actual change in
question. This used to be "primitive" and now for the new statistics fields
it should become "logical". Most logical types annotate binaries, for which
parquet-mr has a class (hierarchy), Binary. Making these logical-type-aware is
more-or-less easily accomplished by providing a correct compareTo()
implementation.

The UINT logical types are an exception, however. For these, parquet-mr's
internal primitive types are the regular native Java types, so there is no
way to change their comparison behaviour. Even if we provide a correct
Comparator, the application can simply specify filtering conditions like
"value < 42", which will result in a signed comparison on unsigned fields.
It makes it way too easy for developers to shoot themselves in the foot.
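
A tiny self-contained illustration of that foot-gun (the value is arbitrary):

  public class Uint32Pitfall {
    public static void main(String[] args) {
      // The UINT_32 value 3,000,000,000 arrives in Java as a negative int.
      int value = (int) 3_000_000_000L;

      // The naive filter condition from above ("value < 42") is a signed comparison:
      System.out.println(value < 42);                             // true  -- wrong for UINT_32
      // The comparison the statistics are supposed to use (Java 8+):
      System.out.println(Integer.compareUnsigned(value, 42) < 0); // false -- correct
    }
  }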

Zoltan


On Tue, Nov 14, 2017 at 12:29 PM Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> Hi,
>
> During the development of this feature I’ve found out some scenarios which
> would be really confusing for the clients.
> For example: We have the new min-max statistics in the file and also have
> dictionary encoding for binary elements. The client have filtering
> implemented for the related column involving Gt. Which comparator shall we
> use in the DictionaryFilter? If we use the new one based on the statistics
> it might contradict to the UserDefinedPredicate implemented by the client
> and might cause false negatives. If we use the current natural order of
> Binary it would contradict to the one in statistics. Further more, we are
> about the make the new comparators available to the client so the
> implemented UserDefinedPredicate may not match to the DictionaryFilter or
> the statistics anyway.
>
> I think, the following new proposal would solve the confusion issues and
> even make the existing client code work properly with the new statistics. I
> am not sure yet, but think that the implementation would be cleaner as well.
> As far as I know the unsigned integer types (UINT_8 etc.) are not used
> widely. I would skip creating min-max statistics for them or keep having
> signed comparison. (BTW, don’t we want to deprecate them?)
> By leaving the unsigned integers out of the picture only the Binary class
> is left to support the different comparison logics. So, let’s refactor
> Binary and create different implementations for the different logical
> types. This way the natural ordering of Binary will always reflect the one
> specified for the logical types and the Statistics implementations do not
> need to be changed.
> We do have a problem though and it is the current natural ordering of
> Binary. It is implemented in a way that seems to be lexicographical but the
> byte comparison is signed. I don’t think it is correct so I would drop this
> implementation but it makes a bit hard to implement the handing of the
> former min-max statistics. If I would like to be correct, I would not write
> the former min-max statistics for Binary at all and would not use them at
> read. (Or only use it if it was not written by parquet-mr.) I guess, this
> issue was not identified because clients are rarely using characters where
> unsigned/signed comparison matters.
> What do you think?
>
> Regards,
> Gabor
>
> > On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >
> > Gabor,
> >
> > You're right that working with logical type data isn't really something
> > that Parquet has done much in the past. We have the distinction between
> > logical types and physical types to simplify what we need to support --
> we
> > only have encodings for physical types -- but in this case, we can't do
> > that, and will need to use comparators based on the logical types. For
> now,
> > we don't need to expose much to the layer above because all of the types
> > have a defined sort order. In the future, we should be able to support
> > other orders for UTF8 data, but we don't need to focus on that now.
> >
> > Zoltan,
> >
> > While it isn't ideal to have no control over the sort order, the only
> thing
> > we can do at the Parquet level is to handle data correctly when it does
> > come in sorted. The concerns you're raising are good to think about, but
> we
> > need to solve them elsewhere. I think that the table format should have a
> > way to specify the sort order it wants for incoming data and communicate
> > that to the engine writing Parquet files. That's what we're working on
> > adding to the data source v2 API in Spark. Tables should be able to
> specify
> > the expected clustering (for partitioning) and sort order for rows, then
> > the query plan is automatically rewritten to make that happen.
> >
> > rb
> >
> > On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> >
> >> Hi,
> >>
> >> I don't know the solution, just adding my thoughts.
> >>
> >> In my opinion the underlying problem is that min-max ranges have to be
> >> small for the best filtering results. In order to achieve this, the data
> >> has to be sorted, but calculating min-max statistics is the
> responsibility
> >> of the library, while sorting the data is a responsibility of the
> >> application above the library. Without logical-type-based sorting rules,
> >> the application may sort data using a different ordering than the one
> used
> >> for calculating/filtering on the min-max values. This results in too
> broad
> >> min-max ranges, that in theory could still function correctly, but are
> not
> >> optimal for filtering. (As an extra twist, even if we have
> >> logical-type-based sorting rules, the application can still use a
> different
> >> ordering for sorting and even for comparisons. The results can range
> from
> >> overly broad min-max ranges to incorrectly discarded values.)
> >>
> >> From this point of view, dictionary entries and bloom filters are much
> less
> >> problematic, because sorting by *any* order will optimize these
> structures,
> >> as the only important thing is that equal values should end up next to
> each
> >> other to increase the chance of having a low number of values in a
> single
> >> page/row group.
> >>
> >> I think that trying to optimize min-max stats by sorting crosses the
> layer
> >> boundary between the library and the application and as such is much
> better
> >> suited to a full-stack implementation like Impala than the parquet-mr +
> >> separate application stack. Neither relying on the application to
> calculate
> >> standards-compliant statistics nor forcing the application to use the
> >> library's data types for sorting and comparison seems like a good
> solution
> >> to me.
> >>
> >> Please note that this problem inherently applies to columns statistics
> as
> >> well.
> >>
> >> Zoltan
> >>
> >> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> >> gabor.szadovszky@cloudera.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> I started working on the jira PARQUET-1025 <https://issues.apache.org/
> >>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
> >>> statistics specified in PARQUET-686 <https://issues.apache.org/
> >>> jira/browse/PARQUET-686>.
> >>>
> >>> After looking in the code deeper I think the spec of the new min-max
> >> stats
> >>> contradicts to the actual design of parquet-mr:
> >>> The concept of parquet-mr is to use the raw types (representing the
> >>> primitive types in the spec) and let the client of the API have them
> >>> converted to rich objects (representing the logical types). For example
> >>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
> >> specific
> >>> representation) instead of returning String or BigDecimal anywhere.
> >>> The new min-max stats requires to have separate comparison mechanisms
> for
> >>> the same primitives depending on the logical types. For example UTF8
> >>> requires unsigned (lexicographical) while DECIMAL requires signed
> >>> comparisons for the same Binary class.
> >>> The problem is that we are specifying sorting orders based on logical
> >>> types while we are not providing specific java types for them. It means
> >> the
> >>> client can unintentionally use a different comparison logic than
> >> parquet-mr
> >>> in the min-max stats which can lead to discarding relevant values
> during
> >>> filtering.
> >>>
> >>> I can see two possible solutions however none of them seems to be good
> >>> enough for me.
> >>>
> >>> 1. Implement specific comparison logics based on the logical types in
> the
> >>> related Statistics object
> >>> The problem here is that we are still using the raw types therefore
> >> client
> >>> code in filters might implement different comparison logic than
> specified
> >>> and implemented in Statistics. For example in case of having UINT_32
> the
> >>> min-max comparison in Statistics will use the proper unsigned
> comparison
> >>> logic while the client code in case of checking the elements (in the
> >> column
> >>> or in the dictionary) might implement somewhat simpler e.g. value > 7
> >> which
> >>> may lead to different results. See UserDefinedPredicate.keep(T)
> >>> It is highly confusing for the client that it has to re-implement the
> >> same
> >>> comparison logic in the client code for the raw types as it is
> >> implemented
> >>> for the statistics.
> >>>
> >>> 2. Return specific “rich” objects for the logical types instead of the
> >> raw
> >>> types
> >>> This solution would solve the problems what the previous one has but
> >> would
> >>> introduce bigger problems.
> >>> Breaks the actual design of parquet-mr
> >>> Backward incompatible: existing client code would not work properly for
> >>> several logical types
> >>>
> >>> (3.) Revert the specs of the new min-max statistics
> >>> Let’s keep working on the raw types without having any additional
> >>> comparison logic. Specify the existing comparison logic (e.g. signed
> for
> >>> all primitive types, (signed) lexicographical for binary) and expect
> the
> >>> client to use these if it want to have sorting implemented for the
> values
> >>> for better performance.
> >>>
> >>>
> >>> What do you think, guys? Any suggestions on one of my solutions or for
> a
> >>> new one I did not recognise?
> >>>
> >>> Thanks a lot,
> >>> Gabor
> >>>
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Gabor Szadovszky <ga...@cloudera.com>.
Hi,

During the development of this feature I’ve found some scenarios which would be really confusing for the clients.
For example: we have the new min-max statistics in the file and also have dictionary encoding for binary elements. The client has filtering implemented for the related column involving Gt. Which comparator shall we use in the DictionaryFilter? If we use the new one based on the statistics, it might contradict the UserDefinedPredicate implemented by the client and might cause false negatives. If we use the current natural order of Binary, it would contradict the one in the statistics. Furthermore, we are about to make the new comparators available to the client, so the implemented UserDefinedPredicate may not match the DictionaryFilter or the statistics anyway. (A sketch of such a predicate follows below.)
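
To make the scenario concrete, a rough sketch of such a user-defined predicate (illustrative only; the class name and the literal are made up):

  import org.apache.parquet.filter2.predicate.Statistics;
  import org.apache.parquet.filter2.predicate.UserDefinedPredicate;
  import org.apache.parquet.io.api.Binary;

  // A user-defined "greater than" on a BINARY (UTF8) column. keep() and canDrop()
  // both rely on Binary.compareTo here; if the min/max in the file and the
  // dictionary filtering are evaluated with a different, logical-type based order,
  // canDrop() may discard row groups that actually contain matching values.
  class GreaterThanLiteral extends UserDefinedPredicate<Binary> {
    private final Binary literal = Binary.fromString("m");

    @Override
    public boolean keep(Binary value) {
      return value.compareTo(literal) > 0;
    }

    @Override
    public boolean canDrop(Statistics<Binary> statistics) {
      return statistics.getMax().compareTo(literal) <= 0;
    }

    @Override
    public boolean inverseCanDrop(Statistics<Binary> statistics) {
      return statistics.getMin().compareTo(literal) > 0;
    }
  }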

I think the following new proposal would solve the confusion issues and even make the existing client code work properly with the new statistics. I am not sure yet, but I think the implementation would be cleaner as well.
As far as I know, the unsigned integer types (UINT_8 etc.) are not widely used. I would either skip creating min-max statistics for them or keep using signed comparison. (BTW, don’t we want to deprecate them?)
By leaving the unsigned integers out of the picture, only the Binary class is left to support different comparison logic. So, let’s refactor Binary and create different implementations for the different logical types. This way the natural ordering of Binary will always reflect the one specified for its logical type, and the Statistics implementations do not need to be changed.
We do have a problem though, and it is the current natural ordering of Binary. It is implemented in a way that looks lexicographical, but the byte comparison is signed. I don’t think it is correct, so I would drop this implementation, but that makes it a bit hard to handle the former min-max statistics. To be correct, I would not write the former min-max statistics for Binary at all and would not use them at read time. (Or only use them if they were not written by parquet-mr.) I guess this issue was not identified before because clients rarely use characters where unsigned/signed comparison matters.
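
To illustrate what I mean by signed vs. lexicographical, below is a small standalone sketch (an assumption for discussion, not existing parquet-mr code). The two orderings agree on pure ASCII, but diverge as soon as a byte of 0x80 or above shows up, e.g. in multi-byte UTF-8:

  import java.nio.charset.StandardCharsets;

  public class ByteOrderSketch {
    // Byte-wise comparison using Java's signed bytes (roughly what the current
    // natural ordering of Binary does).
    static int compareSigned(byte[] a, byte[] b) {
      int len = Math.min(a.length, b.length);
      for (int i = 0; i < len; i++) {
        if (a[i] != b[i]) {
          return a[i] < b[i] ? -1 : 1;
        }
      }
      return a.length - b.length;
    }

    // Byte-wise comparison treating each byte as unsigned (proper lexicographical
    // order for UTF8 data).
    static int compareUnsigned(byte[] a, byte[] b) {
      int len = Math.min(a.length, b.length);
      for (int i = 0; i < len; i++) {
        int ua = a[i] & 0xFF;
        int ub = b[i] & 0xFF;
        if (ua != ub) {
          return ua < ub ? -1 : 1;
        }
      }
      return a.length - b.length;
    }

    public static void main(String[] args) {
      byte[] ascii = "z".getBytes(StandardCharsets.UTF_8);      // 0x7A
      byte[] accented = "é".getBytes(StandardCharsets.UTF_8);   // 0xC3 0xA9
      System.out.println(compareSigned(ascii, accented));       // > 0: "z" sorts after "é"
      System.out.println(compareUnsigned(ascii, accented));     // < 0: "z" sorts before "é"
    }
  }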
What do you think?

Regards,
Gabor

> On 8 Nov 2017, at 18:02, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> Gabor,
> 
> You're right that working with logical type data isn't really something
> that Parquet has done much in the past. We have the distinction between
> logical types and physical types to simplify what we need to support -- we
> only have encodings for physical types -- but in this case, we can't do
> that, and will need to use comparators based on the logical types. For now,
> we don't need to expose much to the layer above because all of the types
> have a defined sort order. In the future, we should be able to support
> other orders for UTF8 data, but we don't need to focus on that now.
> 
> Zoltan,
> 
> While it isn't ideal to have no control over the sort order, the only thing
> we can do at the Parquet level is to handle data correctly when it does
> come in sorted. The concerns you're raising are good to think about, but we
> need to solve them elsewhere. I think that the table format should have a
> way to specify the sort order it wants for incoming data and communicate
> that to the engine writing Parquet files. That's what we're working on
> adding to the data source v2 API in Spark. Tables should be able to specify
> the expected clustering (for partitioning) and sort order for rows, then
> the query plan is automatically rewritten to make that happen.
> 
> rb
> 
> On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:
> 
>> Hi,
>> 
>> I don't know the solution, just adding my thoughts.
>> 
>> In my opinion the underlying problem is that min-max ranges have to be
>> small for the best filtering results. In order to achieve this, the data
>> has to be sorted, but calculating min-max statistics is the responsibility
>> of the library, while sorting the data is a responsibility of the
>> application above the library. Without logical-type-based sorting rules,
>> the application may sort data using a different ordering than the one used
>> for calculating/filtering on the min-max values. This results in too broad
>> min-max ranges, that in theory could still function correctly, but are not
>> optimal for filtering. (As an extra twist, even if we have
>> logical-type-based sorting rules, the application can still use a different
>> ordering for sorting and even for comparisons. The results can range from
>> overly broad min-max ranges to incorrectly discarded values.)
>> 
>> From this point of view, dictionary entries and bloom filters are much less
>> problematic, because sorting by *any* order will optimize these structures,
>> as the only important thing is that equal values should end up next to each
>> other to increase the chance of having a low number of values in a single
>> page/row group.
>> 
>> I think that trying to optimize min-max stats by sorting crosses the layer
>> boundary between the library and the application and as such is much better
>> suited to a full-stack implementation like Impala than the parquet-mr +
>> separate application stack. Neither relying on the application to calculate
>> standards-compliant statistics nor forcing the application to use the
>> library's data types for sorting and comparison seems like a good solution
>> to me.
>> 
>> Please note that this problem inherently applies to column statistics as
>> well.
>> 
>> Zoltan
>> 
>> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
>> gabor.szadovszky@cloudera.com> wrote:
>> 
>>> Hi,
>>> 
>>> I started working on the jira PARQUET-1025 <https://issues.apache.org/
>>> jira/browse/PARQUET-1025>. It is about implementing the new min-max
>>> statistics specified in PARQUET-686 <https://issues.apache.org/
>>> jira/browse/PARQUET-686>.
>>> 
>>> After looking in the code deeper I think the spec of the new min-max
>> stats
>>> contradicts to the actual design of parquet-mr:
>>> The concept of parquet-mr is to use the raw types (representing the
>>> primitive types in the spec) and let the client of the API have them
>>> converted to rich objects (representing the logical types). For example
>>> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
>> specific
>>> representation) instead of returning String or BigDecimal anywhere.
>>> The new min-max stats requires to have separate comparison mechanisms for
>>> the same primitives depending on the logical types. For example UTF8
>>> requires unsigned (lexicographical) while DECIMAL requires signed
>>> comparisons for the same Binary class.
>>> The problem is that we are specifying sorting orders based on logical
>>> types while we are not providing specific java types for them. It means
>> the
>>> client can unintentionally use a different comparison logic than
>> parquet-mr
>>> in the min-max stats which can lead to discarding relevant values during
>>> filtering.
>>> 
>>> I can see two possible solutions however none of them seems to be good
>>> enough for me.
>>> 
>>> 1. Implement specific comparison logics based on the logical types in the
>>> related Statistics object
>>> The problem here is that we are still using the raw types therefore
>> client
>>> code in filters might implement different comparison logic than specified
>>> and implemented in Statistics. For example in case of having UINT_32 the
>>> min-max comparison in Statistics will use the proper unsigned comparison
>>> logic while the client code in case of checking the elements (in the
>> column
>>> or in the dictionary) might implement somewhat simpler e.g. value > 7
>> which
>>> may lead to different results. See UserDefinedPredicate.keep(T)
>>> It is highly confusing for the client that it has to re-implement the
>> same
>>> comparison logic in the client code for the raw types as it is
>> implemented
>>> for the statistics.
>>> 
>>> 2. Return specific “rich” objects for the logical types instead of the
>> raw
>>> types
>>> This solution would solve the problems what the previous one has but
>> would
>>> introduce bigger problems.
>>> Breaks the actual design of parquet-mr
>>> Backward incompatible: existing client code would not work properly for
>>> several logical types
>>> 
>>> (3.) Revert the specs of the new min-max statistics
>>> Let’s keep working on the raw types without having any additional
>>> comparison logic. Specify the existing comparison logic (e.g. signed for
>>> all primitive types, (signed) lexicographical for binary) and expect the
>>> client to use these if it want to have sorting implemented for the values
>>> for better performance.
>>> 
>>> 
>>> What do you think, guys? Any suggestions on one of my solutions or for a
>>> new one I did not recognise?
>>> 
>>> Thanks a lot,
>>> Gabor
>>> 
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gabor,

You're right that working with logical type data isn't really something
that Parquet has done much in the past. We have the distinction between
logical types and physical types to simplify what we need to support -- we
only have encodings for physical types -- but in this case, we can't do
that, and will need to use comparators based on the logical types. For now,
we don't need to expose much to the layer above because all of the types
have a defined sort order. In the future, we should be able to support
other orders for UTF8 data, but we don't need to focus on that now.
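
To make that concrete, a comparator chosen from the logical type could look roughly like the hypothetical sketch below (the class and method names are made up, not an agreed API); the physical value stays Binary, only the ordering changes with the annotation:

  import java.util.Comparator;
  import org.apache.parquet.io.api.Binary;
  import org.apache.parquet.schema.OriginalType;

  final class BinaryComparators {
    // Unsigned, byte-wise lexicographic order, as required for UTF8.
    static final Comparator<Binary> UNSIGNED_LEXICOGRAPHIC = new Comparator<Binary>() {
      @Override
      public int compare(Binary left, Binary right) {
        byte[] a = left.getBytes();
        byte[] b = right.getBytes();
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
          int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
          if (cmp != 0) {
            return cmp;
          }
        }
        return Integer.compare(a.length, b.length);
      }
    };

    static Comparator<Binary> forLogicalType(OriginalType originalType) {
      if (originalType == OriginalType.UTF8 || originalType == OriginalType.ENUM) {
        return UNSIGNED_LEXICOGRAPHIC;
      }
      // DECIMAL and the remaining annotations would each get their own ordering;
      // they are not sketched here.
      throw new UnsupportedOperationException("No comparator sketched for " + originalType);
    }
  }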

Zoltan,

While it isn't ideal to have no control over the sort order, the only thing
we can do at the Parquet level is to handle data correctly when it does
come in sorted. The concerns you're raising are good to think about, but we
need to solve them elsewhere. I think that the table format should have a
way to specify the sort order it wants for incoming data and communicate
that to the engine writing Parquet files. That's what we're working on
adding to the data source v2 API in Spark. Tables should be able to specify
the expected clustering (for partitioning) and sort order for rows, then
the query plan is automatically rewritten to make that happen.

rb

On Wed, Nov 8, 2017 at 7:47 AM, Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> I don't know the solution, just adding my thoughts.
>
> In my opinion the underlying problem is that min-max ranges have to be
> small for the best filtering results. In order to achieve this, the data
> has to be sorted, but calculating min-max statistics is the responsibility
> of the library, while sorting the data is a responsibility of the
> application above the library. Without logical-type-based sorting rules,
> the application may sort data using a different ordering than the one used
> for calculating/filtering on the min-max values. This results in too broad
> min-max ranges, that in theory could still function correctly, but are not
> optimal for filtering. (As an extra twist, even if we have
> logical-type-based sorting rules, the application can still use a different
> ordering for sorting and even for comparisons. The results can range from
> overly broad min-max ranges to incorrectly discarded values.)
>
> From this point of view, dictionary entries and bloom filters are much less
> problematic, because sorting by *any* order will optimize these structures,
> as the only important thing is that equal values should end up next to each
> other to increase the chance of having a low number of values in a single
> page/row group.
>
> I think that trying to optimize min-max stats by sorting crosses the layer
> boundary between the library and the application and as such is much better
> suited to a full-stack implementation like Impala than the parquet-mr +
> separate application stack. Neither relying on the application to calculate
> standards-compliant statistics nor forcing the application to use the
> library's data types for sorting and comparison seems like a good solution
> to me.
>
> Please note that this problem inherently applies to column statistics as
> well.
>
> Zoltan
>
> On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
> gabor.szadovszky@cloudera.com> wrote:
>
> > Hi,
> >
> > I started working on the jira PARQUET-1025 <https://issues.apache.org/
> > jira/browse/PARQUET-1025>. It is about implementing the new min-max
> > statistics specified in PARQUET-686 <https://issues.apache.org/
> > jira/browse/PARQUET-686>.
> >
> > After looking in the code deeper I think the spec of the new min-max
> stats
> > contradicts to the actual design of parquet-mr:
> > The concept of parquet-mr is to use the raw types (representing the
> > primitive types in the spec) and let the client of the API have them
> > converted to rich objects (representing the logical types). For example
> > parquet-mr have Binary for both UTF8 and DECIMAL (at least for one
> specific
> > representation) instead of returning String or BigDecimal anywhere.
> > The new min-max stats requires to have separate comparison mechanisms for
> > the same primitives depending on the logical types. For example UTF8
> > requires unsigned (lexicographical) while DECIMAL requires signed
> > comparisons for the same Binary class.
> > The problem is that we are specifying sorting orders based on logical
> > types while we are not providing specific java types for them. It means
> the
> > client can unintentionally use a different comparison logic than
> parquet-mr
> > in the min-max stats which can lead to discarding relevant values during
> > filtering.
> >
> > I can see two possible solutions however none of them seems to be good
> > enough for me.
> >
> > 1. Implement specific comparison logics based on the logical types in the
> > related Statistics object
> > The problem here is that we are still using the raw types therefore
> client
> > code in filters might implement different comparison logic than specified
> > and implemented in Statistics. For example in case of having UINT_32 the
> > min-max comparison in Statistics will use the proper unsigned comparison
> > logic while the client code in case of checking the elements (in the
> column
> > or in the dictionary) might implement somewhat simpler e.g. value > 7
> which
> > may lead to different results. See UserDefinedPredicate.keep(T)
> > It is highly confusing for the client that it has to re-implement the
> same
> > comparison logic in the client code for the raw types as it is
> implemented
> > for the statistics.
> >
> > 2. Return specific “rich” objects for the logical types instead of the
> raw
> > types
> > This solution would solve the problems what the previous one has but
> would
> > introduce bigger problems.
> > Breaks the actual design of parquet-mr
> > Backward incompatible: existing client code would not work properly for
> > several logical types
> >
> > (3.) Revert the specs of the new min-max statistics
> > Let’s keep working on the raw types without having any additional
> > comparison logic. Specify the existing comparison logic (e.g. signed for
> > all primitive types, (signed) lexicographical for binary) and expect the
> > client to use these if it want to have sorting implemented for the values
> > for better performance.
> >
> >
> > What do you think, guys? Any suggestions on one of my solutions or for a
> > new one I did not recognise?
> >
> > Thanks a lot,
> > Gabor
> >
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: PARQUET-1025: Support new min-max statistics in parquet-mr

Posted by Zoltan Ivanfi <zi...@cloudera.com>.
Hi,

I don't know the solution, just adding my thoughts.

In my opinion the underlying problem is that min-max ranges have to be
small for the best filtering results. In order to achieve this, the data
has to be sorted, but calculating min-max statistics is the responsibility
of the library, while sorting the data is a responsibility of the
application above the library. Without logical-type-based sorting rules,
the application may sort data using a different ordering than the one used
for calculating/filtering on the min-max values. This results in too broad
min-max ranges, that in theory could still function correctly, but are not
optimal for filtering. (As an extra twist, even if we have
logical-type-based sorting rules, the application can still use a different
ordering for sorting and even for comparisons. The results can range from
overly broad min-max ranges to incorrectly discarded values.)

From this point of view, dictionary entries and bloom filters are much less
problematic, because sorting by *any* order will optimize these structures,
as the only important thing is that equal values should end up next to each
other to increase the chance of having a low number of values in a single
page/row group.

I think that trying to optimize min-max stats by sorting crosses the layer
boundary between the library and the application and as such is much better
suited to a full-stack implementation like Impala than the parquet-mr +
separate application stack. Neither relying on the application to calculate
standards-compliant statistics nor forcing the application to use the
library's data types for sorting and comparison seems like a good solution
to me.

Please note that this problem inherently applies to column statistics as
well.

Zoltan

On Wed, Nov 8, 2017 at 3:41 PM, Gabor Szadovszky <
gabor.szadovszky@cloudera.com> wrote:

> Hi,
>
> I started working on the jira PARQUET-1025 <https://issues.apache.org/
> jira/browse/PARQUET-1025>. It is about implementing the new min-max
> statistics specified in PARQUET-686 <https://issues.apache.org/
> jira/browse/PARQUET-686>.
>
> After looking in the code deeper I think the spec of the new min-max stats
> contradicts to the actual design of parquet-mr:
> The concept of parquet-mr is to use the raw types (representing the
> primitive types in the spec) and let the client of the API have them
> converted to rich objects (representing the logical types). For example
> parquet-mr have Binary for both UTF8 and DECIMAL (at least for one specific
> representation) instead of returning String or BigDecimal anywhere.
> The new min-max stats requires to have separate comparison mechanisms for
> the same primitives depending on the logical types. For example UTF8
> requires unsigned (lexicographical) while DECIMAL requires signed
> comparisons for the same Binary class.
> The problem is that we are specifying sorting orders based on logical
> types while we are not providing specific java types for them. It means the
> client can unintentionally use a different comparison logic than parquet-mr
> in the min-max stats which can lead to discarding relevant values during
> filtering.
>
> I can see two possible solutions however none of them seems to be good
> enough for me.
>
> 1. Implement specific comparison logics based on the logical types in the
> related Statistics object
> The problem here is that we are still using the raw types therefore client
> code in filters might implement different comparison logic than specified
> and implemented in Statistics. For example in case of having UINT_32 the
> min-max comparison in Statistics will use the proper unsigned comparison
> logic while the client code in case of checking the elements (in the column
> or in the dictionary) might implement somewhat simpler e.g. value > 7 which
> may lead to different results. See UserDefinedPredicate.keep(T)
> It is highly confusing for the client that it has to re-implement the same
> comparison logic in the client code for the raw types as it is implemented
> for the statistics.
>
> 2. Return specific “rich” objects for the logical types instead of the raw
> types
> This solution would solve the problems what the previous one has but would
> introduce bigger problems.
> Breaks the actual design of parquet-mr
> Backward incompatible: existing client code would not work properly for
> several logical types
>
> (3.) Revert the specs of the new min-max statistics
> Let’s keep working on the raw types without having any additional
> comparison logic. Specify the existing comparison logic (e.g. signed for
> all primitive types, (signed) lexicographical for binary) and expect the
> client to use these if it want to have sorting implemented for the values
> for better performance.
>
>
> What do you think, guys? Any suggestions on one of my solutions or for a
> new one I did not recognise?
>
> Thanks a lot,
> Gabor
>