You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by Jacques Nadeau <ja...@apache.org> on 2016/10/10 05:15:05 UTC

Re: Binary sort order and RC

Just following up here. It seems like we stalled out due to this issue. Per
my previous comments, if we don't have a resolution for this, and it isn't
regression, I propose moving forward without a change.

On Mon, Sep 12, 2016 at 12:57 PM, Piyush Narang <pnarang@twitter.com.invalid
> wrote:

> Ok, think that sounds good to me. Shall take a look at your PR as well.
>
> On Mon, Sep 12, 2016 at 10:49 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
> > Thanks, Piyush. I agree and would like to get the release out before we
> > completely finish the changes for the sort order bug. But, I think it is
> > important to do something to address the problem that stats pages for
> > binary/UTF8 don't match Java's String sort order or UTF-8. That's why I
> > posted a PR with minimal changes to disable returning stats using the
> > signed order, unless a property is set to opt into using them.
> >
> > Alex is arguing that the current signed min/max shouldn't be considered a
> > bug so we shouldn't suppress them in the 1.9.0 release. I think we should
> > agree on whether this is a correctness issue and, if it is, fix it before
> > the release. We can fix the sort order in 1.9.1.
> >
> > rb
> >
> > On Mon, Sep 12, 2016 at 10:43 AM, Piyush Narang
> > <pnarang@twitter.com.invalid
> > > wrote:
> >
> > > Are there other projects / folks waiting on the 1.9.0 release? Given
> that
> > > it has been a while since our last release, it might make sense for us
> to
> > > first put that out (if I understand correctly, everything else has been
> > > resolved). Seems like there's a bit of discussion on Andrew's PR and we
> > > could follow up with another release once that is resolved. I'm worried
> > > about the release goal post constantly moving and us adding more and
> more
> > > to the 1.9.0 release.
> > >
> > > On Sat, Sep 10, 2016 at 11:42 AM, Ryan Blue <rblue@netflix.com.invalid
> >
> > > wrote:
> > >
> > > > To keep the list up-to-date, there's more discussion on this topic on
> > > > Andrew's original pull request [1] about whether we should consider
> > this
> > > a
> > > > bug. I don't think we should build a release candidate until we agree
> > > that
> > > > it's a bug or not, and if it is a bug add a work-around to stop
> > returning
> > > > the bad min and max. If you're interested, please join the discussion
> > on
> > > > the PR.
> > > >
> > > > Thanks!
> > > >
> > > > rb
> > > >
> > > >
> > > > [1]: https://github.com/apache/parquet-mr/pull/362
> > > >
> > > > On Fri, Sep 9, 2016 at 2:50 PM, Ryan Blue <rb...@netflix.com> wrote:
> > > >
> > > > > I think what Spark needs to do depends on how we decide to fix
> this.
> > I
> > > > > think the right thing to do for now is to treat these stats as
> > invalid
> > > > and
> > > > > not return them, like we did for binary in PARQUET-251. That's an
> > easy
> > > > fix
> > > > > because we've done it before and would just need to add another
> check
> > > > what
> > > > > we did the last time. That way, Spark is automatically fixed when
> it
> > > > > updates to 1.9.0. We can work in 1.9.1 and later to fix how stats
> are
> > > > > written and agree on an update to the format.
> > > > >
> > > > > I just posted a pull request that implements this. It doesn't
> > require a
> > > > > parquet-format release and will fix the problem for downstream
> > readers
> > > by
> > > > > not returning incorrect stats, based on the column type. In the
> > patch,
> > > > > stats are suppressed by default for unsigned integers, strings,
> > enums,
> > > > and
> > > > > decimals. Readers can override this for string types by setting
> > > > > parquet.strings.use-signed-order.
> > > > >
> > > > > Does that seem like a reasonable fix to get this release out?
> > > > >
> > > > > rb
> > > > >
> > > > > On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <ad...@palantir.com>
> > > > wrote:
> > > > >
> > > > >> So I’m cool with making necessary changes to get this in sooner
> > rather
> > > > >> than later, I’ve mostly been blocking on code reviews. If there’s
> a
> > > > >> commitment made to releasing 1.9.1 as soon as the binary sort
> order
> > > > change
> > > > >> goes in then I’m fine with not blocking on this for 1.9, but if
> not
> > > and
> > > > we
> > > > >> can expect the current velocity of releases then I’d rather see if
> > we
> > > > can
> > > > >> get it in now. Especially because Spark and others will need to
> wait
> > > > until
> > > > >> next release (>= 1 year from now?) to fix use of statistics for
> > binary
> > > > >> data. As Rob said, this is broken in master of Spark so without a
> > > > >> fix+release of Parquet it’ll likely require disabling statistics
> > > > pushdown
> > > > >> for binary columns on their end.
> > > > >>
> > > > >> -Andrew
> > > > >>
> > > > >> On 9/8/16, 11:37 PM, "Jacques Nadeau" <ja...@apache.org> wrote:
> > > > >>
> > > > >> >Absolutely agree the fix would be good to get in.
> > > > >> >
> > > > >> >My comments come from the fact that we tried to start the 1.9
> > release
> > > > >> (late
> > > > >> >last year I think) after the direct memory feature got in and it
> > > never
> > > > >> >completed (as there is always a good reason to hold the release).
> > My
> > > > main
> > > > >> >fear is that if we wait for something and it turns into another
> > > > >> substantial
> > > > >> >delay before a release. Ryan, what do you think about proposing a
> > > hard
> > > > >> >deadline?
> > > > >> >
> > > > >> >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski <
> > > > robertk@palantir.com>
> > > > >> >wrote:
> > > > >> >
> > > > >> >> When would the fix be ready by to consider it for release in
> > 1.9.0?
> > > > >> This
> > > > >> >> affects correctness on current master of spark and would be
> good
> > to
> > > > >> get a
> > > > >> >> release with it fixed so that people aren’t using potentially
> bad
> > > > >> versions.
> > > > >> >>
> > > > >> >> -          Robert
> > > > >> >>
> > > > >> >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <ja...@apache.org>
> > wrote:
> > > > >> >>
> > > > >> >>     A non-binding +1 from me on releasing sooner/more often.
> > > > >> >>
> > > > >> >>     On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue
> > > > >> <rb...@netflix.com.invalid>
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >>     > Hey everyone,
> > > > >> >>     >
> > > > >> >>     > I'd like to put together a release candidate for 1.9.0.
> The
> > > > other
> > > > >> >> issues
> > > > >> >>     > are done, but the sort order min/max issue, PARQUET-686
> is
> > > > still
> > > > >> >> open.
> > > > >> >>     >
> > > > >> >>     > I'm okay releasing 1.9.0 without fixing that issue since
> > it's
> > > > >> been
> > > > >> >> so long
> > > > >> >>     > since our last release. It would also be nice to do
> > releases
> > > as
> > > > >> >> necessary,
> > > > >> >>     > so we can always do a release to fix PARQUET-686 when the
> > > patch
> > > > >> for
> > > > >> >> it is
> > > > >> >>     > ready. Is there anyone that thinks we should definitely
> get
> > > > this
> > > > >> >> into the
> > > > >> >>     > 1.9.0 release?
> > > > >> >>     >
> > > > >> >>     > Thanks,
> > > > >> >>     >
> > > > >> >>     > rb
> > > > >> >>     >
> > > > >> >>     > --
> > > > >> >>     > Ryan Blue
> > > > >> >>     > Software Engineer
> > > > >> >>     > Netflix
> > > > >> >>     >
> > > > >> >>
> > > > >> >>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Ryan Blue
> > > > > Software Engineer
> > > > > Netflix
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Ryan Blue
> > > > Software Engineer
> > > > Netflix
> > > >
> > >
> > >
> > >
> > > --
> > > - Piyush
> > >
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
>
>
>
> --
> - Piyush
>

Re: Binary sort order and RC

Posted by Robert Kruszewski <ro...@palantir.com>.
This technically is a regression over 1.8.0 since before statistics were treated as corrupted and not used. I think it would make sense to get workaround Ryan proposed in for release and merge the fix later.

-          Robert

On 10/10/16, 1:15 AM, "Jacques Nadeau" <ja...@apache.org> wrote:

    Just following up here. It seems like we stalled out due to this issue. Per
    my previous comments, if we don't have a resolution for this, and it isn't
    regression, I propose moving forward without a change.
    
    On Mon, Sep 12, 2016 at 12:57 PM, Piyush Narang <pnarang@twitter.com.invalid
    > wrote:
    
    > Ok, think that sounds good to me. Shall take a look at your PR as well.
    >
    > On Mon, Sep 12, 2016 at 10:49 AM, Ryan Blue <rb...@netflix.com.invalid>
    > wrote:
    >
    > > Thanks, Piyush. I agree and would like to get the release out before we
    > > completely finish the changes for the sort order bug. But, I think it is
    > > important to do something to address the problem that stats pages for
    > > binary/UTF8 don't match Java's String sort order or UTF-8. That's why I
    > > posted a PR with minimal changes to disable returning stats using the
    > > signed order, unless a property is set to opt into using them.
    > >
    > > Alex is arguing that the current signed min/max shouldn't be considered a
    > > bug so we shouldn't suppress them in the 1.9.0 release. I think we should
    > > agree on whether this is a correctness issue and, if it is, fix it before
    > > the release. We can fix the sort order in 1.9.1.
    > >
    > > rb
    > >
    > > On Mon, Sep 12, 2016 at 10:43 AM, Piyush Narang
    > > <pnarang@twitter.com.invalid
    > > > wrote:
    > >
    > > > Are there other projects / folks waiting on the 1.9.0 release? Given
    > that
    > > > it has been a while since our last release, it might make sense for us
    > to
    > > > first put that out (if I understand correctly, everything else has been
    > > > resolved). Seems like there's a bit of discussion on Andrew's PR and we
    > > > could follow up with another release once that is resolved. I'm worried
    > > > about the release goal post constantly moving and us adding more and
    > more
    > > > to the 1.9.0 release.
    > > >
    > > > On Sat, Sep 10, 2016 at 11:42 AM, Ryan Blue <rblue@netflix.com.invalid
    > >
    > > > wrote:
    > > >
    > > > > To keep the list up-to-date, there's more discussion on this topic on
    > > > > Andrew's original pull request [1] about whether we should consider
    > > this
    > > > a
    > > > > bug. I don't think we should build a release candidate until we agree
    > > > that
    > > > > it's a bug or not, and if it is a bug add a work-around to stop
    > > returning
    > > > > the bad min and max. If you're interested, please join the discussion
    > > on
    > > > > the PR.
    > > > >
    > > > > Thanks!
    > > > >
    > > > > rb
    > > > >
    > > > >
    > > > > [1]: https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_parquet-2Dmr_pull_362&d=DQIFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=Gukiqwaa9M7VDsJzd0J3W7mh_DfC1XlLxRRhg4t2Xyc&m=vztq-HOWpA5TCgl-PyMVWznvZmicIuenLMWs8_tR5U0&s=6E18BGXKrtD3v7AEpfvzIYS2VZRYEjn08fYEKthHlTE&e= 
    > > > >
    > > > > On Fri, Sep 9, 2016 at 2:50 PM, Ryan Blue <rb...@netflix.com> wrote:
    > > > >
    > > > > > I think what Spark needs to do depends on how we decide to fix
    > this.
    > > I
    > > > > > think the right thing to do for now is to treat these stats as
    > > invalid
    > > > > and
    > > > > > not return them, like we did for binary in PARQUET-251. That's an
    > > easy
    > > > > fix
    > > > > > because we've done it before and would just need to add another
    > check
    > > > > what
    > > > > > we did the last time. That way, Spark is automatically fixed when
    > it
    > > > > > updates to 1.9.0. We can work in 1.9.1 and later to fix how stats
    > are
    > > > > > written and agree on an update to the format.
    > > > > >
    > > > > > I just posted a pull request that implements this. It doesn't
    > > require a
    > > > > > parquet-format release and will fix the problem for downstream
    > > readers
    > > > by
    > > > > > not returning incorrect stats, based on the column type. In the
    > > patch,
    > > > > > stats are suppressed by default for unsigned integers, strings,
    > > enums,
    > > > > and
    > > > > > decimals. Readers can override this for string types by setting
    > > > > > parquet.strings.use-signed-order.
    > > > > >
    > > > > > Does that seem like a reasonable fix to get this release out?
    > > > > >
    > > > > > rb
    > > > > >
    > > > > > On Fri, Sep 9, 2016 at 6:13 AM, Andrew Duffy <ad...@palantir.com>
    > > > > wrote:
    > > > > >
    > > > > >> So I’m cool with making necessary changes to get this in sooner
    > > rather
    > > > > >> than later, I’ve mostly been blocking on code reviews. If there’s
    > a
    > > > > >> commitment made to releasing 1.9.1 as soon as the binary sort
    > order
    > > > > change
    > > > > >> goes in then I’m fine with not blocking on this for 1.9, but if
    > not
    > > > and
    > > > > we
    > > > > >> can expect the current velocity of releases then I’d rather see if
    > > we
    > > > > can
    > > > > >> get it in now. Especially because Spark and others will need to
    > wait
    > > > > until
    > > > > >> next release (>= 1 year from now?) to fix use of statistics for
    > > binary
    > > > > >> data. As Rob said, this is broken in master of Spark so without a
    > > > > >> fix+release of Parquet it’ll likely require disabling statistics
    > > > > pushdown
    > > > > >> for binary columns on their end.
    > > > > >>
    > > > > >> -Andrew
    > > > > >>
    > > > > >> On 9/8/16, 11:37 PM, "Jacques Nadeau" <ja...@apache.org> wrote:
    > > > > >>
    > > > > >> >Absolutely agree the fix would be good to get in.
    > > > > >> >
    > > > > >> >My comments come from the fact that we tried to start the 1.9
    > > release
    > > > > >> (late
    > > > > >> >last year I think) after the direct memory feature got in and it
    > > > never
    > > > > >> >completed (as there is always a good reason to hold the release).
    > > My
    > > > > main
    > > > > >> >fear is that if we wait for something and it turns into another
    > > > > >> substantial
    > > > > >> >delay before a release. Ryan, what do you think about proposing a
    > > > hard
    > > > > >> >deadline?
    > > > > >> >
    > > > > >> >On Thu, Sep 8, 2016 at 7:57 PM, Robert Kruszewski <
    > > > > robertk@palantir.com>
    > > > > >> >wrote:
    > > > > >> >
    > > > > >> >> When would the fix be ready by to consider it for release in
    > > 1.9.0?
    > > > > >> This
    > > > > >> >> affects correctness on current master of spark and would be
    > good
    > > to
    > > > > >> get a
    > > > > >> >> release with it fixed so that people aren’t using potentially
    > bad
    > > > > >> versions.
    > > > > >> >>
    > > > > >> >> -          Robert
    > > > > >> >>
    > > > > >> >> On 9/8/16, 10:44 PM, "Jacques Nadeau" <ja...@apache.org>
    > > wrote:
    > > > > >> >>
    > > > > >> >>     A non-binding +1 from me on releasing sooner/more often.
    > > > > >> >>
    > > > > >> >>     On Thu, Sep 8, 2016 at 5:44 PM, Ryan Blue
    > > > > >> <rb...@netflix.com.invalid>
    > > > > >> >> wrote:
    > > > > >> >>
    > > > > >> >>     > Hey everyone,
    > > > > >> >>     >
    > > > > >> >>     > I'd like to put together a release candidate for 1.9.0.
    > The
    > > > > other
    > > > > >> >> issues
    > > > > >> >>     > are done, but the sort order min/max issue, PARQUET-686
    > is
    > > > > still
    > > > > >> >> open.
    > > > > >> >>     >
    > > > > >> >>     > I'm okay releasing 1.9.0 without fixing that issue since
    > > it's
    > > > > >> been
    > > > > >> >> so long
    > > > > >> >>     > since our last release. It would also be nice to do
    > > releases
    > > > as
    > > > > >> >> necessary,
    > > > > >> >>     > so we can always do a release to fix PARQUET-686 when the
    > > > patch
    > > > > >> for
    > > > > >> >> it is
    > > > > >> >>     > ready. Is there anyone that thinks we should definitely
    > get
    > > > > this
    > > > > >> >> into the
    > > > > >> >>     > 1.9.0 release?
    > > > > >> >>     >
    > > > > >> >>     > Thanks,
    > > > > >> >>     >
    > > > > >> >>     > rb
    > > > > >> >>     >
    > > > > >> >>     > --
    > > > > >> >>     > Ryan Blue
    > > > > >> >>     > Software Engineer
    > > > > >> >>     > Netflix
    > > > > >> >>     >
    > > > > >> >>
    > > > > >> >>
    > > > > >>
    > > > > >
    > > > > >
    > > > > >
    > > > > > --
    > > > > > Ryan Blue
    > > > > > Software Engineer
    > > > > > Netflix
    > > > > >
    > > > >
    > > > >
    > > > >
    > > > > --
    > > > > Ryan Blue
    > > > > Software Engineer
    > > > > Netflix
    > > > >
    > > >
    > > >
    > > >
    > > > --
    > > > - Piyush
    > > >
    > >
    > >
    > >
    > > --
    > > Ryan Blue
    > > Software Engineer
    > > Netflix
    > >
    >
    >
    >
    > --
    > - Piyush
    >