You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by 俊杰陈 <cj...@gmail.com> on 2019/07/02 02:23:37 UTC

[VOTE] Parquet Bloom filter spec sign-off

Dear Parquet developers

Parquet Bloom filter has been developed for a while, per the discussion on
the mail list, it's time to call a vote for spec to move forward. The
current spec can be found at
https://github.com/apache/parquet-format/blob/master/BloomFilter.md. There
are some different options about the internal hash choice of Bloom filter
and the PR <https://github.com/apache/parquet-format/pull/139> is for that
concern.

So I 'd like to propose to vote the spec + hash option, for example:

+1 to spec and xxHash
+1 to spec and murmur3
...

Please help to vote, any feedback is also welcome in the discussion thread.

Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Jim Apple <jb...@apache.org>.

On 2019/09/18 20:52:30, Wes McKinney <we...@gmail.com> wrote: 
> We have just worked to move almost all hashing in Apache Arrow to xxh3
> -- I may have lost it in the mix, but are we dropping murmur3?

Yep: https://github.com/apache/parquet-format/commit/8f1783ec0b273e89c884b46c0f527d0a48321826#diff-d96aef0e8954afde569c8b40b8748081

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Wes McKinney <we...@gmail.com>.
We have just worked to move almost all hashing in Apache Arrow to xxh3
-- I may have lost it in the mix, but are we dropping murmur3? Per
discussion in ARROW-3298 that would definitely be our preference.

On Mon, Sep 9, 2019 at 2:17 PM Wes McKinney <we...@gmail.com> wrote:
>
> +1 from me
>
> On Fri, Sep 6, 2019 at 4:00 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
> >
> > +1 on the current spec. Is everyone else still +1?
> >
> > Sorry for the delay, I didn't realize that everything had been addressed
> > and I didn't see the email from Jim in my inbox.
> >
> > On Wed, Aug 28, 2019 at 10:13 AM Jim Apple <jb...@apache.org> wrote:
> >
> > > We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF
> > > patches that were written in response to your feedback on this list. Are
> > > you in a position to vote +1 now, or do you have further concerns we could
> > > address?
> > >
> > > On 2019/07/31 02:17:15, 俊杰陈 <cj...@gmail.com> wrote:
> > > > Dear Parquet developers
> > > >
> > > > We still need your vote!
> > > >
> > > >
> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Hi @Ryan Blue  @Wes McKinney
> > > > >
> > > > > We need your valuable vote, any feedback is welcome as well.
> > > > >
> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Call for voting again.
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Dear Parquet developers
> > > > > > >
> > > > > > > We need more votes, please help to vote on this.
> > > > > > >
> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > >
> > > > > > > > After getting in PARQUET-1625 I vote again for having bloom
> > > filter spec and
> > > > > > > > the thrift file update as is in parquet-format master.
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks Gabor, It's never too late to make it better. We don't
> > > have to
> > > > > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > > > > >
> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> > > spec
> > > > > > > > > defined, the bloom filter data is stored near the footer which
> > > means
> > > > > > > > > we don't have to handle it like the page. Therefore, I just
> > > opened a
> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> > > structure, while
> > > > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
> > > Since the
> > > > > > > > > spec and the thrift should be aligned with each other
> > > eventually, so
> > > > > > > > > the vote target is both of them.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Junjie,
> > > > > > > > > >
> > > > > > > > > > Sorry for bringing up this a bit late but I have some
> > > problems with the
> > > > > > > > > > format update. The parquet.thrift file is updated to have
> > > the bloom
> > > > > > > > > filters
> > > > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
> > > the spec
> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
> > > the footer.
> > > > > > > > > So,
> > > > > > > > > > if the bloom filter is not part of the row-groups (like
> > > column indexes) I
> > > > > > > > > > would not add it as a page. See the struct ColumnIndex in
> > > the thrift
> > > > > > > > > file.
> > > > > > > > > > This struct is not referenced anywhere in it only declared.
> > > It was done
> > > > > > > > > > this way because we don't parse it in the same way as we
> > > parse the pages.
> > > > > > > > > >
> > > > > > > > > > Currently, I am not 100% sure about the target of this vote.
> > > If it is a
> > > > > > > > > > vote about adding bloom filters in general then it is a +1
> > > (binding). If
> > > > > > > > > it
> > > > > > > > > > is about adding the bloom filters to parquet-format as is
> > > then, it is a
> > > > > > > > > -1
> > > > > > > > > > (binding) until we fix the issue above.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Gabor
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > > gg5070@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > > <zi@cloudera.com.invalid
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to resume this vote, you can start to vote
> > > now. Thanks for
> > > > > > > > > > > your
> > > > > > > > > > > > time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > > cjjnjust@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Since there are ongoing improvements addressing
> > > review
> > > > > > > > > comments, I
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > hold off with the vote for a few more days until
> > > the
> > > > > > > > > specification
> > > > > > > > > > > > settles.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Br,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, there are some public benchmark results,
> > > such as the
> > > > > > > > > > > > official
> > > > > > > > > > > > > > > > > benchmark from xxhash site (
> > > http://www.xxhash.com/) and
> > > > > > > > > > > > published
> > > > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Do you have any benchmark data to support
> > > the choice of
> > > > > > > > > hash
> > > > > > > > > > > > function?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> > > update voting
> > > > > > > > > content
> > > > > > > > > > > > to the
> > > > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> > > reply with +1
> > > > > > > > > or -1.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
> > > for a while,
> > > > > > > > > per
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > discussion on the mail list, it's time to call a
> > > vote for
> > > > > > > > > spec to
> > > > > > > > > > > > move
> > > > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > > > There are some different options about the
> > > internal hash
> > > > > > > > > choice
> > > > > > > > > > > of
> > > > > > > > > > > > Bloom
> > > > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
> > > + hash
> > > > > > > > > option,
> > > > > > > > > > > for
> > > > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> > > also welcome in
> > > > > > > > > the
> > > > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Wes McKinney <we...@gmail.com>.
+1 from me

On Fri, Sep 6, 2019 at 4:00 PM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> +1 on the current spec. Is everyone else still +1?
>
> Sorry for the delay, I didn't realize that everything had been addressed
> and I didn't see the email from Jim in my inbox.
>
> On Wed, Aug 28, 2019 at 10:13 AM Jim Apple <jb...@apache.org> wrote:
>
> > We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF
> > patches that were written in response to your feedback on this list. Are
> > you in a position to vote +1 now, or do you have further concerns we could
> > address?
> >
> > On 2019/07/31 02:17:15, 俊杰陈 <cj...@gmail.com> wrote:
> > > Dear Parquet developers
> > >
> > > We still need your vote!
> > >
> > >
> > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Hi @Ryan Blue  @Wes McKinney
> > > >
> > > > We need your valuable vote, any feedback is welcome as well.
> > > >
> > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Call for voting again.
> > > > >
> > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > We need more votes, please help to vote on this.
> > > > > >
> > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > > After getting in PARQUET-1625 I vote again for having bloom
> > filter spec and
> > > > > > > the thrift file update as is in parquet-format master.
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Gabor, It's never too late to make it better. We don't
> > have to
> > > > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > > > >
> > > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> > spec
> > > > > > > > defined, the bloom filter data is stored near the footer which
> > means
> > > > > > > > we don't have to handle it like the page. Therefore, I just
> > opened a
> > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> > structure, while
> > > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
> > Since the
> > > > > > > > spec and the thrift should be aligned with each other
> > eventually, so
> > > > > > > > the vote target is both of them.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > > >
> > > > > > > > > Hi Junjie,
> > > > > > > > >
> > > > > > > > > Sorry for bringing up this a bit late but I have some
> > problems with the
> > > > > > > > > format update. The parquet.thrift file is updated to have
> > the bloom
> > > > > > > > filters
> > > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
> > the spec
> > > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
> > the footer.
> > > > > > > > So,
> > > > > > > > > if the bloom filter is not part of the row-groups (like
> > column indexes) I
> > > > > > > > > would not add it as a page. See the struct ColumnIndex in
> > the thrift
> > > > > > > > file.
> > > > > > > > > This struct is not referenced anywhere in it only declared.
> > It was done
> > > > > > > > > this way because we don't parse it in the same way as we
> > parse the pages.
> > > > > > > > >
> > > > > > > > > Currently, I am not 100% sure about the target of this vote.
> > If it is a
> > > > > > > > > vote about adding bloom filters in general then it is a +1
> > (binding). If
> > > > > > > > it
> > > > > > > > > is about adding the bloom filters to parquet-format as is
> > then, it is a
> > > > > > > > -1
> > > > > > > > > (binding) until we fix the issue above.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Gabor
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > gg5070@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (non-binding)
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > <zi@cloudera.com.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (binding)
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > >
> > > > > > > > > > > > I'd like to resume this vote, you can start to vote
> > now. Thanks for
> > > > > > > > > > your
> > > > > > > > > > > time.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > cjjnjust@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Since there are ongoing improvements addressing
> > review
> > > > > > > > comments, I
> > > > > > > > > > > would
> > > > > > > > > > > > > > hold off with the vote for a few more days until
> > the
> > > > > > > > specification
> > > > > > > > > > > settles.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Br,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, there are some public benchmark results,
> > such as the
> > > > > > > > > > > official
> > > > > > > > > > > > > > > > benchmark from xxhash site (
> > http://www.xxhash.com/) and
> > > > > > > > > > > published
> > > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Do you have any benchmark data to support
> > the choice of
> > > > > > > > hash
> > > > > > > > > > > function?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> > update voting
> > > > > > > > content
> > > > > > > > > > > to the
> > > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> > reply with +1
> > > > > > > > or -1.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
> > for a while,
> > > > > > > > per
> > > > > > > > > > > the
> > > > > > > > > > > > > > > discussion on the mail list, it's time to call a
> > vote for
> > > > > > > > spec to
> > > > > > > > > > > move
> > > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > > >
> > > > > > > > > > >
> > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > > There are some different options about the
> > internal hash
> > > > > > > > choice
> > > > > > > > > > of
> > > > > > > > > > > Bloom
> > > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
> > + hash
> > > > > > > > option,
> > > > > > > > > > for
> > > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> > also welcome in
> > > > > > > > the
> > > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
+1 on the current spec. Is everyone else still +1?

Sorry for the delay, I didn't realize that everything had been addressed
and I didn't see the email from Jim in my inbox.

On Wed, Aug 28, 2019 at 10:13 AM Jim Apple <jb...@apache.org> wrote:

> We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF
> patches that were written in response to your feedback on this list. Are
> you in a position to vote +1 now, or do you have further concerns we could
> address?
>
> On 2019/07/31 02:17:15, 俊杰陈 <cj...@gmail.com> wrote:
> > Dear Parquet developers
> >
> > We still need your vote!
> >
> >
> > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Hi @Ryan Blue  @Wes McKinney
> > >
> > > We need your valuable vote, any feedback is welcome as well.
> > >
> > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Call for voting again.
> > > >
> > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > We need more votes, please help to vote on this.
> > > > >
> > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > >
> > > > > > After getting in PARQUET-1625 I vote again for having bloom
> filter spec and
> > > > > > the thrift file update as is in parquet-format master.
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Gabor, It's never too late to make it better. We don't
> have to
> > > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > > >
> > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> spec
> > > > > > > defined, the bloom filter data is stored near the footer which
> means
> > > > > > > we don't have to handle it like the page. Therefore, I just
> opened a
> > > > > > > jira to remove bloom_filter_page_header in PageHeader
> structure, while
> > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
> Since the
> > > > > > > spec and the thrift should be aligned with each other
> eventually, so
> > > > > > > the vote target is both of them.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > >
> > > > > > > > Hi Junjie,
> > > > > > > >
> > > > > > > > Sorry for bringing up this a bit late but I have some
> problems with the
> > > > > > > > format update. The parquet.thrift file is updated to have
> the bloom
> > > > > > > filters
> > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
> the spec
> > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
> the footer.
> > > > > > > So,
> > > > > > > > if the bloom filter is not part of the row-groups (like
> column indexes) I
> > > > > > > > would not add it as a page. See the struct ColumnIndex in
> the thrift
> > > > > > > file.
> > > > > > > > This struct is not referenced anywhere in it only declared.
> It was done
> > > > > > > > this way because we don't parse it in the same way as we
> parse the pages.
> > > > > > > >
> > > > > > > > Currently, I am not 100% sure about the target of this vote.
> If it is a
> > > > > > > > vote about adding bloom filters in general then it is a +1
> (binding). If
> > > > > > > it
> > > > > > > > is about adding the bloom filters to parquet-format as is
> then, it is a
> > > > > > > -1
> > > > > > > > (binding) until we fix the issue above.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> gg5070@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> <zi@cloudera.com.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (binding)
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > >
> > > > > > > > > > > I'd like to resume this vote, you can start to vote
> now. Thanks for
> > > > > > > > > your
> > > > > > > > > > time.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> cjjnjust@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Since there are ongoing improvements addressing
> review
> > > > > > > comments, I
> > > > > > > > > > would
> > > > > > > > > > > > > hold off with the vote for a few more days until
> the
> > > > > > > specification
> > > > > > > > > > settles.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Br,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, there are some public benchmark results,
> such as the
> > > > > > > > > > official
> > > > > > > > > > > > > > > benchmark from xxhash site (
> http://www.xxhash.com/) and
> > > > > > > > > > published
> > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Do you have any benchmark data to support
> the choice of
> > > > > > > hash
> > > > > > > > > > function?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> update voting
> > > > > > > content
> > > > > > > > > > to the
> > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> reply with +1
> > > > > > > or -1.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
> for a while,
> > > > > > > per
> > > > > > > > > > the
> > > > > > > > > > > > > > discussion on the mail list, it's time to call a
> vote for
> > > > > > > spec to
> > > > > > > > > > move
> > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > >
> > > > > > > > > >
> https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > There are some different options about the
> internal hash
> > > > > > > choice
> > > > > > > > > of
> > > > > > > > > > Bloom
> > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
> + hash
> > > > > > > option,
> > > > > > > > > for
> > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> also welcome in
> > > > > > > the
> > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Jim Apple <jb...@apache.org>.
We've got +1's from Zoltan and Gabor. Ryan, you've committed a few BF patches that were written in response to your feedback on this list. Are you in a position to vote +1 now, or do you have further concerns we could address?

On 2019/07/31 02:17:15, 俊杰陈 <cj...@gmail.com> wrote: 
> Dear Parquet developers
> 
> We still need your vote!
> 
> 
> On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Hi @Ryan Blue  @Wes McKinney
> >
> > We need your valuable vote, any feedback is welcome as well.
> >
> > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Call for voting again.
> > >
> > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > We need more votes, please help to vote on this.
> > > >
> > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > <ga...@cloudera.com.invalid> wrote:
> > > > >
> > > > > After getting in PARQUET-1625 I vote again for having bloom filter spec and
> > > > > the thrift file update as is in parquet-format master.
> > > > > +1 (binding)
> > > > >
> > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Gabor, It's never too late to make it better. We don't have to
> > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > >
> > > > > > The thrift file is indeed a bit lag behind the spec. As the spec
> > > > > > defined, the bloom filter data is stored near the footer which means
> > > > > > we don't have to handle it like the page. Therefore, I just opened a
> > > > > > jira to remove bloom_filter_page_header in PageHeader structure, while
> > > > > > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > > > > > spec and the thrift should be aligned with each other eventually, so
> > > > > > the vote target is both of them.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > > Hi Junjie,
> > > > > > >
> > > > > > > Sorry for bringing up this a bit late but I have some problems with the
> > > > > > > format update. The parquet.thrift file is updated to have the bloom
> > > > > > filters
> > > > > > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > > > > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > > > > > So,
> > > > > > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > > > > > would not add it as a page. See the struct ColumnIndex in the thrift
> > > > > > file.
> > > > > > > This struct is not referenced anywhere in it only declared. It was done
> > > > > > > this way because we don't parse it in the same way as we parse the pages.
> > > > > > >
> > > > > > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > > > > > vote about adding bloom filters in general then it is a +1 (binding). If
> > > > > > it
> > > > > > > is about adding the bloom filters to parquet-format as is then, it is a
> > > > > > -1
> > > > > > > (binding) until we fix the issue above.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Gabor
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (binding)
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Parquet developers
> > > > > > > > > >
> > > > > > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > > > > > your
> > > > > > > > > time.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > >
> > > > > > > > > > > > Since there are ongoing improvements addressing review
> > > > > > comments, I
> > > > > > > > > would
> > > > > > > > > > > > hold off with the vote for a few more days until the
> > > > > > specification
> > > > > > > > > settles.
> > > > > > > > > > > >
> > > > > > > > > > > > Br,
> > > > > > > > > > > >
> > > > > > > > > > > > Zoltan
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > > > > official
> > > > > > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > > > > > published
> > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > > > > > hash
> > > > > > > > > function?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > > > > > content
> > > > > > > > > to the
> > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > > > > > or -1.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > > > > > per
> > > > > > > > > the
> > > > > > > > > > > > > discussion on the mail list, it's time to call a vote for
> > > > > > spec to
> > > > > > > > > move
> > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > >
> > > > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > There are some different options about the internal hash
> > > > > > choice
> > > > > > > > of
> > > > > > > > > Bloom
> > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > > > > > option,
> > > > > > > > for
> > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > > > > > the
> > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> 
> 
> 
> --
> Thanks & Best Regards
> 

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Thanks Gidon to correct me.

On Sun, Aug 4, 2019 at 10:37 PM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> The main reason for bloom filter header encryption is tamper-proofing it
> with AES GCM.
> If not done, an attacker can alter the hash/algorithm fields, making the
> readers lose relevant data on the one hand, and load/filter irrelevant data
> on the other hand.
> Bloom filters are encrypted differently from pages (GCM only, no page
> ordinal, etc).
>
> On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Many thanks for these useful comments!
> >
> > I will create a PR to update format according to these comments.
> > Reply some of them inline:
> >
> > The spec should give more detail on how to choose the number of blocks
> > and on false positive rates. The sentence with “11.54 bits for each
> > distinct value inserted into the filter” is vague: is this the
> > multi-block filter? Why is a 1% false-positive rate “recommended”? I
> > think it is okay to use 0.5% as each block’s false-positive rate, but
> > then this should state how to achieve an overall false-positive rate
> > as a function of the number of distinct values.
> > [junjie]: IIUC, 0.5% is overall FPR.  The reason to recommend 1% FPR
> > is that it can use about 1.2MB bitset to store one million distinct
> > values which should satisfy most cases, but I cannot guarantee that
> > fit extreme cases such as only one column in a table, so I also
> > recommend to use a configuration to set the FPR. In the implementation
> > in the bloom-filter branch, I also set default max bloom filter size
> > to 1MB.
> >
> > Should the bloom filters support compression?
> > [junjie], The compression should help as you mentioned. I didn't add
> > it because I think the size of a bloom filter should be small for a
> > column chunk. Also, we may skip building bloom filter if there is an
> > existing dictionary.
> >
> > Why is the bloom filter header encrypted?
> > In a discussion with Gidon, we 'd like to follow the current way when
> > encrypting pages.
> >
> >
> > On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <rb...@netflix.com> wrote:
> > >
> > > Right now, I’m voting -1 on the spec as written because I don’t think
> > that it is clear enough to be correctly implemented from just the spec.
> > I’ll change that with a few corrections.
> > >
> > > Here are the items I think need to be fixed before I’ll change my vote
> > to support this spec:
> > >
> > > The algorithm should be more clear:
> > >
> > > How is the bloom filter block selected from the 32 most-significant bits
> > from of the hash function? These details must be in the spec and not in
> > papers linked from the spec.
> > > How is the number of blocks determined? From the overall filter size?
> > > I think that the exact procedure for a lookup in each block should be
> > covered in a section, followed by a section for how to perform a look up in
> > the multi-block filter. The wording also needs to be cleaned up so that it
> > is always clear whether the filter that is referenced is a block or the
> > multi-block filter.
> > >
> > > The spec should give more detail on how to choose the number of blocks
> > and on false positive rates. The sentence with “11.54 bits for each
> > distinct value inserted into the filter” is vague: is this the multi-block
> > filter? Why is a 1% false-positive rate “recommended”?
> > >
> > > I think it is okay to use 0.5% as each block’s false-positive rate, but
> > then this should state how to achieve an overall false-positive rate as a
> > function of the number of distinct values.
> > >
> > > This should be more clear about the scope of bloom filters. The only
> > reference right now is “The Bloom filter data of a column chunk, …”. If
> > each filter is for a column chunk, then this should be stated as a
> > requirement.
> > > The position of bloom filter headers and bloom filter bit sets is too
> > vague: “The Bloom filter data of a column chunk, which contains the size of
> > the filter in bytes, the algorithm, the hash function and the Bloom filter
> > bitset, is stored near the footer.”
> > >
> > > This should state that the column chunk offset points to the start of
> > the bloom filter header, which immediately precedes the bloom filter bytes.
> > The number of bytes must be stored as numBytes in the header. Include
> > layout diagrams like the ones in the encryption spec.
> > > The position of all bloom filters in relation to the page indexes should
> > be clear. Bloom filters should be written just before column indexes?
> > > The position of each bloom filter within the bloom filter data section
> > should be clear. Are bloom filters for a column located together, or bloom
> > filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2)
> > > Does the format allow locating bloom filter data anywhere else other
> > than just before the header? Maybe the locations are recommendations and
> > not requirements?
> > >
> > > Should the bloom filters support compression? If the strategy for a
> > lower false-positive rate is to under-fill the multi-block filter, then
> > would compression help reduce the storage cost?
> > > Why is the bloom filter header encrypted?
> > >
> > >
> > > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >>
> > >> Hi Wes,
> > >>
> > >> Thanks for the reply. It indeed does not have automation support in
> > >> the test. Since now we have the repo parquet-testing for the
> > >> integration test, I think we can do some automation base on that. I
> > >> will take some time to look at this and update the integration plan in
> > >> PARQUET-1326.
> > >>
> > >> Besides that, we can open a thread to discuss building automation
> > >> integration test framework for features WIP and in the future.
> > >>
> > >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> > >> >
> > >> > Wes,
> > >> >
> > >> > Since the v2 format is unfinished, I don't think anyone should be
> > writing
> > >> > v2 pages. In fact, I don't think v2 pages would be recommended in the
> > v2
> > >> > spec at this point. Do you know why Spark is writing this data? I
> > didn't
> > >> > know that Spark would write v2 by default.
> > >> >
> > >> > rb
> > >> >
> > >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > >> >
> > >> > > I'm not sure when I can have a more thorough look at this.
> > >> > >
> > >> > > To be honest, I'm personally struggling with the burden of
> > supporting
> > >> > > the existing feature set in the Parquet C++ library. The integration
> > >> > > testing strategy for this as well as other features that are being
> > >> > > added to both the Java and C++ libraries (e.g. encryption) make me
> > >> > > uncomfortable due to the lack of automation. As an example,
> > DataPageV2
> > >> > > support in parquet-cpp has been broken since the beginning of the
> > >> > > project (see PARQUET-458) but it's only recently become an issue
> > when
> > >> > > people have been trying to read such files produced by Spark. More
> > >> > > comprehensive integration testing would help ensure that the
> > libraries
> > >> > > remain compatible.
> > >> > >
> > >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >> > > >
> > >> > > > Dear Parquet developers
> > >> > > >
> > >> > > > We still need your vote!
> > >> > > >
> > >> > > >
> > >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > Hi @Ryan Blue  @Wes McKinney
> > >> > > > >
> > >> > > > > We need your valuable vote, any feedback is welcome as well.
> > >> > > > >
> > >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >> > > > > >
> > >> > > > > > Call for voting again.
> > >> > > > > >
> > >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > >> > > > > > >
> > >> > > > > > > Dear Parquet developers
> > >> > > > > > >
> > >> > > > > > > We need more votes, please help to vote on this.
> > >> > > > > > >
> > >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > >> > > > > > > <ga...@cloudera.com.invalid> wrote:
> > >> > > > > > > >
> > >> > > > > > > > After getting in PARQUET-1625 I vote again for having
> > bloom
> > >> > > filter spec and
> > >> > > > > > > > the thrift file update as is in parquet-format master.
> > >> > > > > > > > +1 (binding)
> > >> > > > > > > >
> > >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > Thanks Gabor, It's never too late to make it better. We
> > don't
> > >> > > have to
> > >> > > > > > > > > run it in a hurry, it has been developed for a long
> > time yet.:)
> > >> > > > > > > > >
> > >> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As
> > the
> > >> > > spec
> > >> > > > > > > > > defined, the bloom filter data is stored near the
> > footer which
> > >> > > means
> > >> > > > > > > > > we don't have to handle it like the page. Therefore, I
> > just
> > >> > > opened a
> > >> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> > >> > > structure, while
> > >> > > > > > > > > the BloomFitlerHeader is kept intentionally for
> > convenience.
> > >> > > Since the
> > >> > > > > > > > > spec and the thrift should be aligned with each other
> > >> > > eventually, so
> > >> > > > > > > > > the vote target is both of them.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > >> > > > > > > > > <ga...@cloudera.com.invalid> wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > Hi Junjie,
> > >> > > > > > > > > >
> > >> > > > > > > > > > Sorry for bringing up this a bit late but I have some
> > >> > > problems with the
> > >> > > > > > > > > > format update. The parquet.thrift file is updated to
> > have
> > >> > > the bloom
> > >> > > > > > > > > filters
> > >> > > > > > > > > > as a page (just as dictionaries and data pages).
> > Meanwhile,
> > >> > > the spec
> > >> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored
> > near
> > >> > > the footer.
> > >> > > > > > > > > So,
> > >> > > > > > > > > > if the bloom filter is not part of the row-groups
> > (like
> > >> > > column indexes) I
> > >> > > > > > > > > > would not add it as a page. See the struct
> > ColumnIndex in
> > >> > > the thrift
> > >> > > > > > > > > file.
> > >> > > > > > > > > > This struct is not referenced anywhere in it only
> > declared.
> > >> > > It was done
> > >> > > > > > > > > > this way because we don't parse it in the same way as
> > we
> > >> > > parse the pages.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Currently, I am not 100% sure about the target of
> > this vote.
> > >> > > If it is a
> > >> > > > > > > > > > vote about adding bloom filters in general then it is
> > a +1
> > >> > > (binding). If
> > >> > > > > > > > > it
> > >> > > > > > > > > > is about adding the bloom filters to parquet-format
> > as is
> > >> > > then, it is a
> > >> > > > > > > > > -1
> > >> > > > > > > > > > (binding) until we fix the issue above.
> > >> > > > > > > > > >
> > >> > > > > > > > > > Regards,
> > >> > > > > > > > > > Gabor
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > >> > > gg5070@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > +1 (non-binding)
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > >> > > <zi@cloudera.com.invalid
> > >> > > > > > > > > >
> > >> > > > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > +1 (binding)
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > >> > > wrote:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > I'd like to resume this vote, you can start to
> > vote
> > >> > > now. Thanks for
> > >> > > > > > > > > > > your
> > >> > > > > > > > > > > > time.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > >> > > cjjnjust@gmail.com> wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > >> > > > > > > > > > > <zi...@cloudera.com.invalid>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Hi Junjie,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Since there are ongoing improvements
> > addressing
> > >> > > review
> > >> > > > > > > > > comments, I
> > >> > > > > > > > > > > > would
> > >> > > > > > > > > > > > > > > hold off with the vote for a few more days
> > until
> > >> > > the
> > >> > > > > > > > > specification
> > >> > > > > > > > > > > > settles.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Br,
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Zoltan
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > >> > > cjjnjust@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > >> > > cjjnjust@gmail.com>
> > >> > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark
> > results,
> > >> > > such as the
> > >> > > > > > > > > > > > official
> > >> > > > > > > > > > > > > > > > > benchmark from xxhash site (
> > >> > > http://www.xxhash.com/) and
> > >> > > > > > > > > > > > published
> > >> > > > > > > > > > > > > > > > > comparison from smhasher project
> > >> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes
> > McKinney <
> > >> > > > > > > > > > > wesmckinn@gmail.com>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to
> > support
> > >> > > the choice of
> > >> > > > > > > > > hash
> > >> > > > > > > > > > > > function?
> > >> > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > >> > > cjjnjust@gmail.com>
> > >> > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> > >> > > update voting
> > >> > > > > > > > > content
> > >> > > > > > > > > > > > to the
> > >> > > > > > > > > > > > > > > > spec
> > >> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you
> > can
> > >> > > reply with +1
> > >> > > > > > > > > or -1.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈
> > <
> > >> > > > > > > > > cjjnjust@gmail.com>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been
> > developed
> > >> > > for a while,
> > >> > > > > > > > > per
> > >> > > > > > > > > > > > the
> > >> > > > > > > > > > > > > > > > discussion on the mail list, it's time to
> > call a
> > >> > > vote for
> > >> > > > > > > > > spec to
> > >> > > > > > > > > > > > move
> > >> > > > > > > > > > > > > > > > forward. The current spec can be found at
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md
> > .
> > >> > > > > > > > > > > > > > > > There are some different options about the
> > >> > > internal hash
> > >> > > > > > > > > choice
> > >> > > > > > > > > > > of
> > >> > > > > > > > > > > > Bloom
> > >> > > > > > > > > > > > > > > > filter and the PR is for that concern.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote
> > the spec
> > >> > > + hash
> > >> > > > > > > > > option,
> > >> > > > > > > > > > > for
> > >> > > > > > > > > > > > > > > > example:
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > >> > > > > > > > > > > > > > > > > > > > ...
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback
> > is
> > >> > > also welcome in
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > > > > discussion thread.
> > >> > > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > --
> > >> > > > > > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > --
> > >> > > > > > > > > Thanks & Best Regards
> > >> > > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > --
> > >> > > > > > > Thanks & Best Regards
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > --
> > >> > > > > > Thanks & Best Regards
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > --
> > >> > > > > Thanks & Best Regards
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Thanks & Best Regards
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > Ryan Blue
> > >> > Software Engineer
> > >> > Netflix
> > >>
> > >>
> > >>
> > >> --
> > >> Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> >
> >
> >
> > --
> > Thanks & Best Regards
> >



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
The main reason for bloom filter header encryption is tamper-proofing it
with AES GCM.
If not done, an attacker can alter the hash/algorithm fields, making the
readers lose relevant data on the one hand, and load/filter irrelevant data
on the other hand.
Bloom filters are encrypted differently from pages (GCM only, no page
ordinal, etc).

On Sun, Aug 4, 2019 at 4:47 PM 俊杰陈 <cj...@gmail.com> wrote:

> Many thanks for these useful comments!
>
> I will create a PR to update format according to these comments.
> Reply some of them inline:
>
> The spec should give more detail on how to choose the number of blocks
> and on false positive rates. The sentence with “11.54 bits for each
> distinct value inserted into the filter” is vague: is this the
> multi-block filter? Why is a 1% false-positive rate “recommended”? I
> think it is okay to use 0.5% as each block’s false-positive rate, but
> then this should state how to achieve an overall false-positive rate
> as a function of the number of distinct values.
> [junjie]: IIUC, 0.5% is overall FPR.  The reason to recommend 1% FPR
> is that it can use about 1.2MB bitset to store one million distinct
> values which should satisfy most cases, but I cannot guarantee that
> fit extreme cases such as only one column in a table, so I also
> recommend to use a configuration to set the FPR. In the implementation
> in the bloom-filter branch, I also set default max bloom filter size
> to 1MB.
>
> Should the bloom filters support compression?
> [junjie], The compression should help as you mentioned. I didn't add
> it because I think the size of a bloom filter should be small for a
> column chunk. Also, we may skip building bloom filter if there is an
> existing dictionary.
>
> Why is the bloom filter header encrypted?
> In a discussion with Gidon, we 'd like to follow the current way when
> encrypting pages.
>
>
> On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <rb...@netflix.com> wrote:
> >
> > Right now, I’m voting -1 on the spec as written because I don’t think
> that it is clear enough to be correctly implemented from just the spec.
> I’ll change that with a few corrections.
> >
> > Here are the items I think need to be fixed before I’ll change my vote
> to support this spec:
> >
> > The algorithm should be more clear:
> >
> > How is the bloom filter block selected from the 32 most-significant bits
> from of the hash function? These details must be in the spec and not in
> papers linked from the spec.
> > How is the number of blocks determined? From the overall filter size?
> > I think that the exact procedure for a lookup in each block should be
> covered in a section, followed by a section for how to perform a look up in
> the multi-block filter. The wording also needs to be cleaned up so that it
> is always clear whether the filter that is referenced is a block or the
> multi-block filter.
> >
> > The spec should give more detail on how to choose the number of blocks
> and on false positive rates. The sentence with “11.54 bits for each
> distinct value inserted into the filter” is vague: is this the multi-block
> filter? Why is a 1% false-positive rate “recommended”?
> >
> > I think it is okay to use 0.5% as each block’s false-positive rate, but
> then this should state how to achieve an overall false-positive rate as a
> function of the number of distinct values.
> >
> > This should be more clear about the scope of bloom filters. The only
> reference right now is “The Bloom filter data of a column chunk, …”. If
> each filter is for a column chunk, then this should be stated as a
> requirement.
> > The position of bloom filter headers and bloom filter bit sets is too
> vague: “The Bloom filter data of a column chunk, which contains the size of
> the filter in bytes, the algorithm, the hash function and the Bloom filter
> bitset, is stored near the footer.”
> >
> > This should state that the column chunk offset points to the start of
> the bloom filter header, which immediately precedes the bloom filter bytes.
> The number of bytes must be stored as numBytes in the header. Include
> layout diagrams like the ones in the encryption spec.
> > The position of all bloom filters in relation to the page indexes should
> be clear. Bloom filters should be written just before column indexes?
> > The position of each bloom filter within the bloom filter data section
> should be clear. Are bloom filters for a column located together, or bloom
> filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2)
> > Does the format allow locating bloom filter data anywhere else other
> than just before the header? Maybe the locations are recommendations and
> not requirements?
> >
> > Should the bloom filters support compression? If the strategy for a
> lower false-positive rate is to under-fill the multi-block filter, then
> would compression help reduce the storage cost?
> > Why is the bloom filter header encrypted?
> >
> >
> > On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <cj...@gmail.com> wrote:
> >>
> >> Hi Wes,
> >>
> >> Thanks for the reply. It indeed does not have automation support in
> >> the test. Since now we have the repo parquet-testing for the
> >> integration test, I think we can do some automation base on that. I
> >> will take some time to look at this and update the integration plan in
> >> PARQUET-1326.
> >>
> >> Besides that, we can open a thread to discuss building automation
> >> integration test framework for features WIP and in the future.
> >>
> >> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> >> >
> >> > Wes,
> >> >
> >> > Since the v2 format is unfinished, I don't think anyone should be
> writing
> >> > v2 pages. In fact, I don't think v2 pages would be recommended in the
> v2
> >> > spec at this point. Do you know why Spark is writing this data? I
> didn't
> >> > know that Spark would write v2 by default.
> >> >
> >> > rb
> >> >
> >> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > > I'm not sure when I can have a more thorough look at this.
> >> > >
> >> > > To be honest, I'm personally struggling with the burden of
> supporting
> >> > > the existing feature set in the Parquet C++ library. The integration
> >> > > testing strategy for this as well as other features that are being
> >> > > added to both the Java and C++ libraries (e.g. encryption) make me
> >> > > uncomfortable due to the lack of automation. As an example,
> DataPageV2
> >> > > support in parquet-cpp has been broken since the beginning of the
> >> > > project (see PARQUET-458) but it's only recently become an issue
> when
> >> > > people have been trying to read such files produced by Spark. More
> >> > > comprehensive integration testing would help ensure that the
> libraries
> >> > > remain compatible.
> >> > >
> >> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > >
> >> > > > Dear Parquet developers
> >> > > >
> >> > > > We still need your vote!
> >> > > >
> >> > > >
> >> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > > >
> >> > > > > Hi @Ryan Blue  @Wes McKinney
> >> > > > >
> >> > > > > We need your valuable vote, any feedback is welcome as well.
> >> > > > >
> >> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > Call for voting again.
> >> > > > > >
> >> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com>
> wrote:
> >> > > > > > >
> >> > > > > > > Dear Parquet developers
> >> > > > > > >
> >> > > > > > > We need more votes, please help to vote on this.
> >> > > > > > >
> >> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> >> > > > > > > <ga...@cloudera.com.invalid> wrote:
> >> > > > > > > >
> >> > > > > > > > After getting in PARQUET-1625 I vote again for having
> bloom
> >> > > filter spec and
> >> > > > > > > > the thrift file update as is in parquet-format master.
> >> > > > > > > > +1 (binding)
> >> > > > > > > >
> >> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com>
> wrote:
> >> > > > > > > >
> >> > > > > > > > > Thanks Gabor, It's never too late to make it better. We
> don't
> >> > > have to
> >> > > > > > > > > run it in a hurry, it has been developed for a long
> time yet.:)
> >> > > > > > > > >
> >> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As
> the
> >> > > spec
> >> > > > > > > > > defined, the bloom filter data is stored near the
> footer which
> >> > > means
> >> > > > > > > > > we don't have to handle it like the page. Therefore, I
> just
> >> > > opened a
> >> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> >> > > structure, while
> >> > > > > > > > > the BloomFitlerHeader is kept intentionally for
> convenience.
> >> > > Since the
> >> > > > > > > > > spec and the thrift should be aligned with each other
> >> > > eventually, so
> >> > > > > > > > > the vote target is both of them.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> >> > > > > > > > > <ga...@cloudera.com.invalid> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > Hi Junjie,
> >> > > > > > > > > >
> >> > > > > > > > > > Sorry for bringing up this a bit late but I have some
> >> > > problems with the
> >> > > > > > > > > > format update. The parquet.thrift file is updated to
> have
> >> > > the bloom
> >> > > > > > > > > filters
> >> > > > > > > > > > as a page (just as dictionaries and data pages).
> Meanwhile,
> >> > > the spec
> >> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored
> near
> >> > > the footer.
> >> > > > > > > > > So,
> >> > > > > > > > > > if the bloom filter is not part of the row-groups
> (like
> >> > > column indexes) I
> >> > > > > > > > > > would not add it as a page. See the struct
> ColumnIndex in
> >> > > the thrift
> >> > > > > > > > > file.
> >> > > > > > > > > > This struct is not referenced anywhere in it only
> declared.
> >> > > It was done
> >> > > > > > > > > > this way because we don't parse it in the same way as
> we
> >> > > parse the pages.
> >> > > > > > > > > >
> >> > > > > > > > > > Currently, I am not 100% sure about the target of
> this vote.
> >> > > If it is a
> >> > > > > > > > > > vote about adding bloom filters in general then it is
> a +1
> >> > > (binding). If
> >> > > > > > > > > it
> >> > > > > > > > > > is about adding the bloom filters to parquet-format
> as is
> >> > > then, it is a
> >> > > > > > > > > -1
> >> > > > > > > > > > (binding) until we fix the issue above.
> >> > > > > > > > > >
> >> > > > > > > > > > Regards,
> >> > > > > > > > > > Gabor
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> >> > > gg5070@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > +1 (non-binding)
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> >> > > <zi@cloudera.com.invalid
> >> > > > > > > > > >
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > +1 (binding)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <
> cjjnjust@gmail.com>
> >> > > wrote:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I'd like to resume this vote, you can start to
> vote
> >> > > now. Thanks for
> >> > > > > > > > > > > your
> >> > > > > > > > > > > > time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> >> > > cjjnjust@gmail.com> wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> >> > > > > > > > > > > <zi...@cloudera.com.invalid>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Junjie,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Since there are ongoing improvements
> addressing
> >> > > review
> >> > > > > > > > > comments, I
> >> > > > > > > > > > > > would
> >> > > > > > > > > > > > > > > hold off with the vote for a few more days
> until
> >> > > the
> >> > > > > > > > > specification
> >> > > > > > > > > > > > settles.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Br,
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Zoltan
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> >> > > cjjnjust@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> >> > > cjjnjust@gmail.com>
> >> > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Yes, there are some public benchmark
> results,
> >> > > such as the
> >> > > > > > > > > > > > official
> >> > > > > > > > > > > > > > > > > benchmark from xxhash site (
> >> > > http://www.xxhash.com/) and
> >> > > > > > > > > > > > published
> >> > > > > > > > > > > > > > > > > comparison from smhasher project
> >> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes
> McKinney <
> >> > > > > > > > > > > wesmckinn@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > Do you have any benchmark data to
> support
> >> > > the choice of
> >> > > > > > > > > hash
> >> > > > > > > > > > > > function?
> >> > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> >> > > cjjnjust@gmail.com>
> >> > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> >> > > update voting
> >> > > > > > > > > content
> >> > > > > > > > > > > > to the
> >> > > > > > > > > > > > > > > > spec
> >> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you
> can
> >> > > reply with +1
> >> > > > > > > > > or -1.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈
> <
> >> > > > > > > > > cjjnjust@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been
> developed
> >> > > for a while,
> >> > > > > > > > > per
> >> > > > > > > > > > > > the
> >> > > > > > > > > > > > > > > > discussion on the mail list, it's time to
> call a
> >> > > vote for
> >> > > > > > > > > spec to
> >> > > > > > > > > > > > move
> >> > > > > > > > > > > > > > > > forward. The current spec can be found at
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md
> .
> >> > > > > > > > > > > > > > > > There are some different options about the
> >> > > internal hash
> >> > > > > > > > > choice
> >> > > > > > > > > > > of
> >> > > > > > > > > > > > Bloom
> >> > > > > > > > > > > > > > > > filter and the PR is for that concern.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote
> the spec
> >> > > + hash
> >> > > > > > > > > option,
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > > > > example:
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> >> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> >> > > > > > > > > > > > > > > > > > > > ...
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback
> is
> >> > > also welcome in
> >> > > > > > > > > the
> >> > > > > > > > > > > > > > > > discussion thread.
> >> > > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > --
> >> > > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > --
> >> > > > > > > > > > > > > Thanks & Best Regards
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Thanks & Best Regards
> >> > > > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > --
> >> > > > > > > Thanks & Best Regards
> >> > > > > >
> >> > > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Thanks & Best Regards
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Thanks & Best Regards
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Thanks & Best Regards
> >> > >
> >> >
> >> >
> >> > --
> >> > Ryan Blue
> >> > Software Engineer
> >> > Netflix
> >>
> >>
> >>
> >> --
> >> Thanks & Best Regards
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
>
> --
> Thanks & Best Regards
>

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Many thanks for these useful comments!

I will create a PR to update format according to these comments.
Reply some of them inline:

The spec should give more detail on how to choose the number of blocks
and on false positive rates. The sentence with “11.54 bits for each
distinct value inserted into the filter” is vague: is this the
multi-block filter? Why is a 1% false-positive rate “recommended”? I
think it is okay to use 0.5% as each block’s false-positive rate, but
then this should state how to achieve an overall false-positive rate
as a function of the number of distinct values.
[junjie]: IIUC, 0.5% is overall FPR.  The reason to recommend 1% FPR
is that it can use about 1.2MB bitset to store one million distinct
values which should satisfy most cases, but I cannot guarantee that
fit extreme cases such as only one column in a table, so I also
recommend to use a configuration to set the FPR. In the implementation
in the bloom-filter branch, I also set default max bloom filter size
to 1MB.

Should the bloom filters support compression?
[junjie], The compression should help as you mentioned. I didn't add
it because I think the size of a bloom filter should be small for a
column chunk. Also, we may skip building bloom filter if there is an
existing dictionary.

Why is the bloom filter header encrypted?
In a discussion with Gidon, we 'd like to follow the current way when
encrypting pages.


On Sun, Aug 4, 2019 at 4:42 AM Ryan Blue <rb...@netflix.com> wrote:
>
> Right now, I’m voting -1 on the spec as written because I don’t think that it is clear enough to be correctly implemented from just the spec. I’ll change that with a few corrections.
>
> Here are the items I think need to be fixed before I’ll change my vote to support this spec:
>
> The algorithm should be more clear:
>
> How is the bloom filter block selected from the 32 most-significant bits from of the hash function? These details must be in the spec and not in papers linked from the spec.
> How is the number of blocks determined? From the overall filter size?
> I think that the exact procedure for a lookup in each block should be covered in a section, followed by a section for how to perform a look up in the multi-block filter. The wording also needs to be cleaned up so that it is always clear whether the filter that is referenced is a block or the multi-block filter.
>
> The spec should give more detail on how to choose the number of blocks and on false positive rates. The sentence with “11.54 bits for each distinct value inserted into the filter” is vague: is this the multi-block filter? Why is a 1% false-positive rate “recommended”?
>
> I think it is okay to use 0.5% as each block’s false-positive rate, but then this should state how to achieve an overall false-positive rate as a function of the number of distinct values.
>
> This should be more clear about the scope of bloom filters. The only reference right now is “The Bloom filter data of a column chunk, …”. If each filter is for a column chunk, then this should be stated as a requirement.
> The position of bloom filter headers and bloom filter bit sets is too vague: “The Bloom filter data of a column chunk, which contains the size of the filter in bytes, the algorithm, the hash function and the Bloom filter bitset, is stored near the footer.”
>
> This should state that the column chunk offset points to the start of the bloom filter header, which immediately precedes the bloom filter bytes. The number of bytes must be stored as numBytes in the header. Include layout diagrams like the ones in the encryption spec.
> The position of all bloom filters in relation to the page indexes should be clear. Bloom filters should be written just before column indexes?
> The position of each bloom filter within the bloom filter data section should be clear. Are bloom filters for a column located together, or bloom filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2, R2C1, R2C2)
> Does the format allow locating bloom filter data anywhere else other than just before the header? Maybe the locations are recommendations and not requirements?
>
> Should the bloom filters support compression? If the strategy for a lower false-positive rate is to under-fill the multi-block filter, then would compression help reduce the storage cost?
> Why is the bloom filter header encrypted?
>
>
> On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <cj...@gmail.com> wrote:
>>
>> Hi Wes,
>>
>> Thanks for the reply. It indeed does not have automation support in
>> the test. Since now we have the repo parquet-testing for the
>> integration test, I think we can do some automation base on that. I
>> will take some time to look at this and update the integration plan in
>> PARQUET-1326.
>>
>> Besides that, we can open a thread to discuss building automation
>> integration test framework for features WIP and in the future.
>>
>> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>> >
>> > Wes,
>> >
>> > Since the v2 format is unfinished, I don't think anyone should be writing
>> > v2 pages. In fact, I don't think v2 pages would be recommended in the v2
>> > spec at this point. Do you know why Spark is writing this data? I didn't
>> > know that Spark would write v2 by default.
>> >
>> > rb
>> >
>> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> > > I'm not sure when I can have a more thorough look at this.
>> > >
>> > > To be honest, I'm personally struggling with the burden of supporting
>> > > the existing feature set in the Parquet C++ library. The integration
>> > > testing strategy for this as well as other features that are being
>> > > added to both the Java and C++ libraries (e.g. encryption) make me
>> > > uncomfortable due to the lack of automation. As an example, DataPageV2
>> > > support in parquet-cpp has been broken since the beginning of the
>> > > project (see PARQUET-458) but it's only recently become an issue when
>> > > people have been trying to read such files produced by Spark. More
>> > > comprehensive integration testing would help ensure that the libraries
>> > > remain compatible.
>> > >
>> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > >
>> > > > Dear Parquet developers
>> > > >
>> > > > We still need your vote!
>> > > >
>> > > >
>> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > > >
>> > > > > Hi @Ryan Blue  @Wes McKinney
>> > > > >
>> > > > > We need your valuable vote, any feedback is welcome as well.
>> > > > >
>> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > > > >
>> > > > > > Call for voting again.
>> > > > > >
>> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > > > > >
>> > > > > > > Dear Parquet developers
>> > > > > > >
>> > > > > > > We need more votes, please help to vote on this.
>> > > > > > >
>> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
>> > > > > > > <ga...@cloudera.com.invalid> wrote:
>> > > > > > > >
>> > > > > > > > After getting in PARQUET-1625 I vote again for having bloom
>> > > filter spec and
>> > > > > > > > the thrift file update as is in parquet-format master.
>> > > > > > > > +1 (binding)
>> > > > > > > >
>> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
>> > > > > > > >
>> > > > > > > > > Thanks Gabor, It's never too late to make it better. We don't
>> > > have to
>> > > > > > > > > run it in a hurry, it has been developed for a long time yet.:)
>> > > > > > > > >
>> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As the
>> > > spec
>> > > > > > > > > defined, the bloom filter data is stored near the footer which
>> > > means
>> > > > > > > > > we don't have to handle it like the page. Therefore, I just
>> > > opened a
>> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
>> > > structure, while
>> > > > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
>> > > Since the
>> > > > > > > > > spec and the thrift should be aligned with each other
>> > > eventually, so
>> > > > > > > > > the vote target is both of them.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
>> > > > > > > > > <ga...@cloudera.com.invalid> wrote:
>> > > > > > > > > >
>> > > > > > > > > > Hi Junjie,
>> > > > > > > > > >
>> > > > > > > > > > Sorry for bringing up this a bit late but I have some
>> > > problems with the
>> > > > > > > > > > format update. The parquet.thrift file is updated to have
>> > > the bloom
>> > > > > > > > > filters
>> > > > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
>> > > the spec
>> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
>> > > the footer.
>> > > > > > > > > So,
>> > > > > > > > > > if the bloom filter is not part of the row-groups (like
>> > > column indexes) I
>> > > > > > > > > > would not add it as a page. See the struct ColumnIndex in
>> > > the thrift
>> > > > > > > > > file.
>> > > > > > > > > > This struct is not referenced anywhere in it only declared.
>> > > It was done
>> > > > > > > > > > this way because we don't parse it in the same way as we
>> > > parse the pages.
>> > > > > > > > > >
>> > > > > > > > > > Currently, I am not 100% sure about the target of this vote.
>> > > If it is a
>> > > > > > > > > > vote about adding bloom filters in general then it is a +1
>> > > (binding). If
>> > > > > > > > > it
>> > > > > > > > > > is about adding the bloom filters to parquet-format as is
>> > > then, it is a
>> > > > > > > > > -1
>> > > > > > > > > > (binding) until we fix the issue above.
>> > > > > > > > > >
>> > > > > > > > > > Regards,
>> > > > > > > > > > Gabor
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
>> > > gg5070@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > +1 (non-binding)
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
>> > > <zi@cloudera.com.invalid
>> > > > > > > > > >
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > +1 (binding)
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
>> > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Dear Parquet developers
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I'd like to resume this vote, you can start to vote
>> > > now. Thanks for
>> > > > > > > > > > > your
>> > > > > > > > > > > > time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
>> > > cjjnjust@gmail.com> wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
>> > > > > > > > > > > <zi...@cloudera.com.invalid>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Junjie,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Since there are ongoing improvements addressing
>> > > review
>> > > > > > > > > comments, I
>> > > > > > > > > > > > would
>> > > > > > > > > > > > > > > hold off with the vote for a few more days until
>> > > the
>> > > > > > > > > specification
>> > > > > > > > > > > > settles.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Br,
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Zoltan
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
>> > > cjjnjust@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Hi Parquet committers and developers
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
>> > > cjjnjust@gmail.com>
>> > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Yes, there are some public benchmark results,
>> > > such as the
>> > > > > > > > > > > > official
>> > > > > > > > > > > > > > > > > benchmark from xxhash site (
>> > > http://www.xxhash.com/) and
>> > > > > > > > > > > > published
>> > > > > > > > > > > > > > > > > comparison from smhasher project
>> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
>> > > > > > > > > > > wesmckinn@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Do you have any benchmark data to support
>> > > the choice of
>> > > > > > > > > hash
>> > > > > > > > > > > > function?
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
>> > > cjjnjust@gmail.com>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Dear Parquet developers
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
>> > > update voting
>> > > > > > > > > content
>> > > > > > > > > > > > to the
>> > > > > > > > > > > > > > > > spec
>> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
>> > > reply with +1
>> > > > > > > > > or -1.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Thanks for your participation.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
>> > > > > > > > > cjjnjust@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
>> > > for a while,
>> > > > > > > > > per
>> > > > > > > > > > > > the
>> > > > > > > > > > > > > > > > discussion on the mail list, it's time to call a
>> > > vote for
>> > > > > > > > > spec to
>> > > > > > > > > > > > move
>> > > > > > > > > > > > > > > > forward. The current spec can be found at
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
>> > > > > > > > > > > > > > > > There are some different options about the
>> > > internal hash
>> > > > > > > > > choice
>> > > > > > > > > > > of
>> > > > > > > > > > > > Bloom
>> > > > > > > > > > > > > > > > filter and the PR is for that concern.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
>> > > + hash
>> > > > > > > > > option,
>> > > > > > > > > > > for
>> > > > > > > > > > > > > > > > example:
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
>> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
>> > > > > > > > > > > > > > > > > > > > ...
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
>> > > also welcome in
>> > > > > > > > > the
>> > > > > > > > > > > > > > > > discussion thread.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > --
>> > > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > --
>> > > > > > > > > > > > > Thanks & Best Regards
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > --
>> > > > > > > > > Thanks & Best Regards
>> > > > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > --
>> > > > > > > Thanks & Best Regards
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Thanks & Best Regards
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Thanks & Best Regards
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Thanks & Best Regards
>> > >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>>
>>
>>
>> --
>> Thanks & Best Regards
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Ryan Blue <rb...@netflix.com>.
Thanks for working on this, Jim! I merged the current PR.

On Tue, Aug 6, 2019 at 8:39 AM Jim Apple <jb...@apache.org> wrote:

> On 2019/08/05 18:05:53, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > At least getting a compression union into the bloom filter header
> > will help us with compatibility later if we choose to add compression
> > schemes.
>
> That's very reasonable. I'll send a PR for that after
> https://github.com/apache/parquet-format/pull/147 is checked in to avoid
> rebase racing.
>
> At that time, I'll also send a PR to support filters with sizes that are
> not a power of 2. I'll use
> https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
> to avoid the expensive modulo operation.
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Jim Apple <jb...@apache.org>.
On 2019/08/05 18:05:53, Ryan Blue <rb...@netflix.com.INVALID> wrote: 
> At least getting a compression union into the bloom filter header
> will help us with compatibility later if we choose to add compression
> schemes.

That's very reasonable. I'll send a PR for that after https://github.com/apache/parquet-format/pull/147 is checked in to avoid rebase racing.

At that time, I'll also send a PR to support filters with sizes that are not a power of 2. I'll use https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ to avoid the expensive modulo operation.

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Gidon, thanks for the explanation about tampering. That sounds good to me.

Jim, I think we are definitely a bit closer with your PR. That explains a
lot of the algorithm clearly.

After reading through the requirements for bloom filter sizing, I think we
should probably look into adding compression, even if it is to signal that
the bloom filter is uncompressed. Because we require that the number of
blocks is a power of 2, there will probably be some room for compression to
work. At least getting a compression union into the bloom filter header
will help us with compatibility later if we choose to add compression
schemes. It think it may also be worth the overhead of naive compression in
some cases, though I didn't thoroughly read through that reference yet.

On Sun, Aug 4, 2019 at 7:56 PM Jim Apple <jb...@apache.org> wrote:

> On 2019/08/03 20:42:10, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> >    - Should the bloom filters support compression? If the strategy for a
> >    lower false-positive rate is to under-fill the multi-block filter,
> then
> >    would compression help reduce the storage cost?
>
> Eventually, it might be useful to support compression. The question of how
> to do so in the unique case of Bloom filters, though, deserves some thought
> and planning that I think would be most useful in a follow-on.
> Mitzenmacher's "Compressed Bloom Filters" [0] has some discussion of the
> issues.
>
> There are complex tradeoffs between compressed size, uncompressed size,
> number of hash functions, and the false positive rate, and this is only one
> direction we could go to reduce the false positive rate while keeping the
> storage space constant. For instance, there are several other filter
> algorithms that achieve the best possible false positive rate (for any
> filter, compressed or not), but that pay for it with complexity, query
> time, and construction time. These include Golomb filters, XORSAT filters,
> Porat's matrix filters, Pagh et. al's filters, etc.
>
> [0] http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
>


-- 
Ryan Blue
Software Engineer
Netflix

[VOTE] Parquet Bloom filter spec sign-off

Posted by Jim Apple <jb...@apache.org>.
On 2019/08/03 20:42:10, Ryan Blue <rb...@netflix.com.INVALID> wrote: 
>    - Should the bloom filters support compression? If the strategy for a
>    lower false-positive rate is to under-fill the multi-block filter, then
>    would compression help reduce the storage cost?

Eventually, it might be useful to support compression. The question of how to do so in the unique case of Bloom filters, though, deserves some thought and planning that I think would be most useful in a follow-on. Mitzenmacher's "Compressed Bloom Filters" [0] has some discussion of the issues.

There are complex tradeoffs between compressed size, uncompressed size, number of hash functions, and the false positive rate, and this is only one direction we could go to reduce the false positive rate while keeping the storage space constant. For instance, there are several other filter algorithms that achieve the best possible false positive rate (for any filter, compressed or not), but that pay for it with complexity, query time, and construction time. These include Golomb filters, XORSAT filters, Porat's matrix filters, Pagh et. al's filters, etc.

[0] http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Right now, I’m voting -1 on the spec as written because I don’t think that
it is clear enough to be correctly implemented from just the spec. I’ll
change that with a few corrections.

Here are the items I think need to be fixed before I’ll change my vote to
support this spec:

   - The algorithm should be more clear:
      - How is the bloom filter block selected from the 32 most-significant
      bits from of the hash function? These details must be in the spec and not
      in papers linked from the spec.
      - How is the number of blocks determined? From the overall filter
      size?
      - I think that the exact procedure for a lookup in each block should
      be covered in a section, followed by a section for how to
perform a look up
      in the multi-block filter. The wording also needs to be cleaned
up so that
      it is always clear whether the filter that is referenced is a
block or the
      multi-block filter.
   - The spec should give more detail on how to choose the number of blocks
   and on false positive rates. The sentence with “11.54 bits for each
   distinct value inserted into the filter” is vague: is this the multi-block
   filter? Why is a 1% false-positive rate “recommended”?
      - I think it is okay to use 0.5% as each block’s false-positive rate,
      but then this should state how to achieve an overall
false-positive rate as
      a function of the number of distinct values.
   - This should be more clear about the scope of bloom filters. The only
   reference right now is “The Bloom filter data of a column chunk, …”. If
   each filter is for a column chunk, then this should be stated as a
   requirement.
   - The position of bloom filter headers and bloom filter bit sets is too
   vague: “The Bloom filter data of a column chunk, which contains the size of
   the filter in bytes, the algorithm, the hash function and the Bloom filter
   bitset, is stored near the footer.”
      - This should state that the column chunk offset points to the start
      of the bloom filter header, which immediately precedes the bloom filter
      bytes. The number of bytes must be stored as numBytes in the header.
      Include layout diagrams like the ones in the encryption spec.
      - The position of all bloom filters in relation to the page indexes
      should be clear. Bloom filters should be written just before
column indexes?
      - The position of each bloom filter within the bloom filter data
      section should be clear. Are bloom filters for a column located together,
      or bloom filters for a row group? (R1C1, R2C1, R1C2, R2C2 vs R1C1, R1C2,
      R2C1, R2C2)
      - Does the format allow locating bloom filter data anywhere else
      other than just before the header? Maybe the locations are
recommendations
      and not requirements?
   - Should the bloom filters support compression? If the strategy for a
   lower false-positive rate is to under-fill the multi-block filter, then
   would compression help reduce the storage cost?
   - Why is the bloom filter header encrypted?


On Wed, Jul 31, 2019 at 10:01 AM 俊杰陈 <cj...@gmail.com> wrote:

> Hi Wes,
>
> Thanks for the reply. It indeed does not have automation support in
> the test. Since now we have the repo parquet-testing for the
> integration test, I think we can do some automation base on that. I
> will take some time to look at this and update the integration plan in
> PARQUET-1326.
>
> Besides that, we can open a thread to discuss building automation
> integration test framework for features WIP and in the future.
>
> On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
> >
> > Wes,
> >
> > Since the v2 format is unfinished, I don't think anyone should be writing
> > v2 pages. In fact, I don't think v2 pages would be recommended in the v2
> > spec at this point. Do you know why Spark is writing this data? I didn't
> > know that Spark would write v2 by default.
> >
> > rb
> >
> > On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > I'm not sure when I can have a more thorough look at this.
> > >
> > > To be honest, I'm personally struggling with the burden of supporting
> > > the existing feature set in the Parquet C++ library. The integration
> > > testing strategy for this as well as other features that are being
> > > added to both the Java and C++ libraries (e.g. encryption) make me
> > > uncomfortable due to the lack of automation. As an example, DataPageV2
> > > support in parquet-cpp has been broken since the beginning of the
> > > project (see PARQUET-458) but it's only recently become an issue when
> > > people have been trying to read such files produced by Spark. More
> > > comprehensive integration testing would help ensure that the libraries
> > > remain compatible.
> > >
> > > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > We still need your vote!
> > > >
> > > >
> > > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Hi @Ryan Blue  @Wes McKinney
> > > > >
> > > > > We need your valuable vote, any feedback is welcome as well.
> > > > >
> > > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Call for voting again.
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Dear Parquet developers
> > > > > > >
> > > > > > > We need more votes, please help to vote on this.
> > > > > > >
> > > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > >
> > > > > > > > After getting in PARQUET-1625 I vote again for having bloom
> > > filter spec and
> > > > > > > > the thrift file update as is in parquet-format master.
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > >
> > > > > > > > > Thanks Gabor, It's never too late to make it better. We
> don't
> > > have to
> > > > > > > > > run it in a hurry, it has been developed for a long time
> yet.:)
> > > > > > > > >
> > > > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> > > spec
> > > > > > > > > defined, the bloom filter data is stored near the footer
> which
> > > means
> > > > > > > > > we don't have to handle it like the page. Therefore, I just
> > > opened a
> > > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> > > structure, while
> > > > > > > > > the BloomFitlerHeader is kept intentionally for
> convenience.
> > > Since the
> > > > > > > > > spec and the thrift should be aligned with each other
> > > eventually, so
> > > > > > > > > the vote target is both of them.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Junjie,
> > > > > > > > > >
> > > > > > > > > > Sorry for bringing up this a bit late but I have some
> > > problems with the
> > > > > > > > > > format update. The parquet.thrift file is updated to have
> > > the bloom
> > > > > > > > > filters
> > > > > > > > > > as a page (just as dictionaries and data pages).
> Meanwhile,
> > > the spec
> > > > > > > > > > (BloomFilter.md) says that the bloom filter is stored
> near
> > > the footer.
> > > > > > > > > So,
> > > > > > > > > > if the bloom filter is not part of the row-groups (like
> > > column indexes) I
> > > > > > > > > > would not add it as a page. See the struct ColumnIndex in
> > > the thrift
> > > > > > > > > file.
> > > > > > > > > > This struct is not referenced anywhere in it only
> declared.
> > > It was done
> > > > > > > > > > this way because we don't parse it in the same way as we
> > > parse the pages.
> > > > > > > > > >
> > > > > > > > > > Currently, I am not 100% sure about the target of this
> vote.
> > > If it is a
> > > > > > > > > > vote about adding bloom filters in general then it is a
> +1
> > > (binding). If
> > > > > > > > > it
> > > > > > > > > > is about adding the bloom filters to parquet-format as is
> > > then, it is a
> > > > > > > > > -1
> > > > > > > > > > (binding) until we fix the issue above.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Gabor
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > > gg5070@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > > <zi@cloudera.com.invalid
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to resume this vote, you can start to vote
> > > now. Thanks for
> > > > > > > > > > > your
> > > > > > > > > > > > time.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > > cjjnjust@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Since there are ongoing improvements addressing
> > > review
> > > > > > > > > comments, I
> > > > > > > > > > > > would
> > > > > > > > > > > > > > > hold off with the vote for a few more days
> until
> > > the
> > > > > > > > > specification
> > > > > > > > > > > > settles.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Br,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Yes, there are some public benchmark
> results,
> > > such as the
> > > > > > > > > > > > official
> > > > > > > > > > > > > > > > > benchmark from xxhash site (
> > > http://www.xxhash.com/) and
> > > > > > > > > > > > published
> > > > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes
> McKinney <
> > > > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Do you have any benchmark data to support
> > > the choice of
> > > > > > > > > hash
> > > > > > > > > > > > function?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> > > update voting
> > > > > > > > > content
> > > > > > > > > > > > to the
> > > > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> > > reply with +1
> > > > > > > > > or -1.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been
> developed
> > > for a while,
> > > > > > > > > per
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > discussion on the mail list, it's time to
> call a
> > > vote for
> > > > > > > > > spec to
> > > > > > > > > > > > move
> > > > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > > > There are some different options about the
> > > internal hash
> > > > > > > > > choice
> > > > > > > > > > > of
> > > > > > > > > > > > Bloom
> > > > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the
> spec
> > > + hash
> > > > > > > > > option,
> > > > > > > > > > > for
> > > > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> > > also welcome in
> > > > > > > > > the
> > > > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
>
> --
> Thanks & Best Regards
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Hi Wes,

Thanks for the reply. It indeed does not have automation support in
the test. Since now we have the repo parquet-testing for the
integration test, I think we can do some automation base on that. I
will take some time to look at this and update the integration plan in
PARQUET-1326.

Besides that, we can open a thread to discuss building automation
integration test framework for features WIP and in the future.

On Thu, Aug 1, 2019 at 12:10 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
> Wes,
>
> Since the v2 format is unfinished, I don't think anyone should be writing
> v2 pages. In fact, I don't think v2 pages would be recommended in the v2
> spec at this point. Do you know why Spark is writing this data? I didn't
> know that Spark would write v2 by default.
>
> rb
>
> On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com> wrote:
>
> > I'm not sure when I can have a more thorough look at this.
> >
> > To be honest, I'm personally struggling with the burden of supporting
> > the existing feature set in the Parquet C++ library. The integration
> > testing strategy for this as well as other features that are being
> > added to both the Java and C++ libraries (e.g. encryption) make me
> > uncomfortable due to the lack of automation. As an example, DataPageV2
> > support in parquet-cpp has been broken since the beginning of the
> > project (see PARQUET-458) but it's only recently become an issue when
> > people have been trying to read such files produced by Spark. More
> > comprehensive integration testing would help ensure that the libraries
> > remain compatible.
> >
> > On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Dear Parquet developers
> > >
> > > We still need your vote!
> > >
> > >
> > > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Hi @Ryan Blue  @Wes McKinney
> > > >
> > > > We need your valuable vote, any feedback is welcome as well.
> > > >
> > > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Call for voting again.
> > > > >
> > > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > We need more votes, please help to vote on this.
> > > > > >
> > > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > > After getting in PARQUET-1625 I vote again for having bloom
> > filter spec and
> > > > > > > the thrift file update as is in parquet-format master.
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Gabor, It's never too late to make it better. We don't
> > have to
> > > > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > > > >
> > > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> > spec
> > > > > > > > defined, the bloom filter data is stored near the footer which
> > means
> > > > > > > > we don't have to handle it like the page. Therefore, I just
> > opened a
> > > > > > > > jira to remove bloom_filter_page_header in PageHeader
> > structure, while
> > > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
> > Since the
> > > > > > > > spec and the thrift should be aligned with each other
> > eventually, so
> > > > > > > > the vote target is both of them.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > > >
> > > > > > > > > Hi Junjie,
> > > > > > > > >
> > > > > > > > > Sorry for bringing up this a bit late but I have some
> > problems with the
> > > > > > > > > format update. The parquet.thrift file is updated to have
> > the bloom
> > > > > > > > filters
> > > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
> > the spec
> > > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
> > the footer.
> > > > > > > > So,
> > > > > > > > > if the bloom filter is not part of the row-groups (like
> > column indexes) I
> > > > > > > > > would not add it as a page. See the struct ColumnIndex in
> > the thrift
> > > > > > > > file.
> > > > > > > > > This struct is not referenced anywhere in it only declared.
> > It was done
> > > > > > > > > this way because we don't parse it in the same way as we
> > parse the pages.
> > > > > > > > >
> > > > > > > > > Currently, I am not 100% sure about the target of this vote.
> > If it is a
> > > > > > > > > vote about adding bloom filters in general then it is a +1
> > (binding). If
> > > > > > > > it
> > > > > > > > > is about adding the bloom filters to parquet-format as is
> > then, it is a
> > > > > > > > -1
> > > > > > > > > (binding) until we fix the issue above.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Gabor
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> > gg5070@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (non-binding)
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> > <zi@cloudera.com.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (binding)
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > >
> > > > > > > > > > > > I'd like to resume this vote, you can start to vote
> > now. Thanks for
> > > > > > > > > > your
> > > > > > > > > > > time.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> > cjjnjust@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Since there are ongoing improvements addressing
> > review
> > > > > > > > comments, I
> > > > > > > > > > > would
> > > > > > > > > > > > > > hold off with the vote for a few more days until
> > the
> > > > > > > > specification
> > > > > > > > > > > settles.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Br,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Yes, there are some public benchmark results,
> > such as the
> > > > > > > > > > > official
> > > > > > > > > > > > > > > > benchmark from xxhash site (
> > http://www.xxhash.com/) and
> > > > > > > > > > > published
> > > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Do you have any benchmark data to support
> > the choice of
> > > > > > > > hash
> > > > > > > > > > > function?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> > update voting
> > > > > > > > content
> > > > > > > > > > > to the
> > > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> > reply with +1
> > > > > > > > or -1.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
> > for a while,
> > > > > > > > per
> > > > > > > > > > > the
> > > > > > > > > > > > > > > discussion on the mail list, it's time to call a
> > vote for
> > > > > > > > spec to
> > > > > > > > > > > move
> > > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > > >
> > > > > > > > > > >
> > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > > There are some different options about the
> > internal hash
> > > > > > > > choice
> > > > > > > > > > of
> > > > > > > > > > > Bloom
> > > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
> > + hash
> > > > > > > > option,
> > > > > > > > > > for
> > > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> > also welcome in
> > > > > > > > the
> > > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Wes,

Since the v2 format is unfinished, I don't think anyone should be writing
v2 pages. In fact, I don't think v2 pages would be recommended in the v2
spec at this point. Do you know why Spark is writing this data? I didn't
know that Spark would write v2 by default.

rb

On Wed, Jul 31, 2019 at 8:28 AM Wes McKinney <we...@gmail.com> wrote:

> I'm not sure when I can have a more thorough look at this.
>
> To be honest, I'm personally struggling with the burden of supporting
> the existing feature set in the Parquet C++ library. The integration
> testing strategy for this as well as other features that are being
> added to both the Java and C++ libraries (e.g. encryption) make me
> uncomfortable due to the lack of automation. As an example, DataPageV2
> support in parquet-cpp has been broken since the beginning of the
> project (see PARQUET-458) but it's only recently become an issue when
> people have been trying to read such files produced by Spark. More
> comprehensive integration testing would help ensure that the libraries
> remain compatible.
>
> On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Dear Parquet developers
> >
> > We still need your vote!
> >
> >
> > On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Hi @Ryan Blue  @Wes McKinney
> > >
> > > We need your valuable vote, any feedback is welcome as well.
> > >
> > > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Call for voting again.
> > > >
> > > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > We need more votes, please help to vote on this.
> > > > >
> > > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > >
> > > > > > After getting in PARQUET-1625 I vote again for having bloom
> filter spec and
> > > > > > the thrift file update as is in parquet-format master.
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > > Thanks Gabor, It's never too late to make it better. We don't
> have to
> > > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > > >
> > > > > > > The thrift file is indeed a bit lag behind the spec. As the
> spec
> > > > > > > defined, the bloom filter data is stored near the footer which
> means
> > > > > > > we don't have to handle it like the page. Therefore, I just
> opened a
> > > > > > > jira to remove bloom_filter_page_header in PageHeader
> structure, while
> > > > > > > the BloomFitlerHeader is kept intentionally for convenience.
> Since the
> > > > > > > spec and the thrift should be aligned with each other
> eventually, so
> > > > > > > the vote target is both of them.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > > >
> > > > > > > > Hi Junjie,
> > > > > > > >
> > > > > > > > Sorry for bringing up this a bit late but I have some
> problems with the
> > > > > > > > format update. The parquet.thrift file is updated to have
> the bloom
> > > > > > > filters
> > > > > > > > as a page (just as dictionaries and data pages). Meanwhile,
> the spec
> > > > > > > > (BloomFilter.md) says that the bloom filter is stored near
> the footer.
> > > > > > > So,
> > > > > > > > if the bloom filter is not part of the row-groups (like
> column indexes) I
> > > > > > > > would not add it as a page. See the struct ColumnIndex in
> the thrift
> > > > > > > file.
> > > > > > > > This struct is not referenced anywhere in it only declared.
> It was done
> > > > > > > > this way because we don't parse it in the same way as we
> parse the pages.
> > > > > > > >
> > > > > > > > Currently, I am not 100% sure about the target of this vote.
> If it is a
> > > > > > > > vote about adding bloom filters in general then it is a +1
> (binding). If
> > > > > > > it
> > > > > > > > is about adding the bloom filters to parquet-format as is
> then, it is a
> > > > > > > -1
> > > > > > > > (binding) until we fix the issue above.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Gabor
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <
> gg5070@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi
> <zi@cloudera.com.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (binding)
> > > > > > > > > >
> > > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > >
> > > > > > > > > > > I'd like to resume this vote, you can start to vote
> now. Thanks for
> > > > > > > > > your
> > > > > > > > > > time.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <
> cjjnjust@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Since there are ongoing improvements addressing
> review
> > > > > > > comments, I
> > > > > > > > > > would
> > > > > > > > > > > > > hold off with the vote for a few more days until
> the
> > > > > > > specification
> > > > > > > > > > settles.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Br,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Zoltan
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Yes, there are some public benchmark results,
> such as the
> > > > > > > > > > official
> > > > > > > > > > > > > > > benchmark from xxhash site (
> http://www.xxhash.com/) and
> > > > > > > > > > published
> > > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Do you have any benchmark data to support
> the choice of
> > > > > > > hash
> > > > > > > > > > function?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > To simplify the voting, I 'd like to
> update voting
> > > > > > > content
> > > > > > > > > > to the
> > > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can
> reply with +1
> > > > > > > or -1.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > > cjjnjust@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed
> for a while,
> > > > > > > per
> > > > > > > > > > the
> > > > > > > > > > > > > > discussion on the mail list, it's time to call a
> vote for
> > > > > > > spec to
> > > > > > > > > > move
> > > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > > >
> > > > > > > > > >
> https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > > There are some different options about the
> internal hash
> > > > > > > choice
> > > > > > > > > of
> > > > > > > > > > Bloom
> > > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec
> + hash
> > > > > > > option,
> > > > > > > > > for
> > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Please help to vote, any feedback is
> also welcome in
> > > > > > > the
> > > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Wes McKinney <we...@gmail.com>.
I'm not sure when I can have a more thorough look at this.

To be honest, I'm personally struggling with the burden of supporting
the existing feature set in the Parquet C++ library. The integration
testing strategy for this as well as other features that are being
added to both the Java and C++ libraries (e.g. encryption) make me
uncomfortable due to the lack of automation. As an example, DataPageV2
support in parquet-cpp has been broken since the beginning of the
project (see PARQUET-458) but it's only recently become an issue when
people have been trying to read such files produced by Spark. More
comprehensive integration testing would help ensure that the libraries
remain compatible.

On Tue, Jul 30, 2019 at 9:17 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> Dear Parquet developers
>
> We still need your vote!
>
>
> On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Hi @Ryan Blue  @Wes McKinney
> >
> > We need your valuable vote, any feedback is welcome as well.
> >
> > On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Call for voting again.
> > >
> > > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > We need more votes, please help to vote on this.
> > > >
> > > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > > <ga...@cloudera.com.invalid> wrote:
> > > > >
> > > > > After getting in PARQUET-1625 I vote again for having bloom filter spec and
> > > > > the thrift file update as is in parquet-format master.
> > > > > +1 (binding)
> > > > >
> > > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > > Thanks Gabor, It's never too late to make it better. We don't have to
> > > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > > >
> > > > > > The thrift file is indeed a bit lag behind the spec. As the spec
> > > > > > defined, the bloom filter data is stored near the footer which means
> > > > > > we don't have to handle it like the page. Therefore, I just opened a
> > > > > > jira to remove bloom_filter_page_header in PageHeader structure, while
> > > > > > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > > > > > spec and the thrift should be aligned with each other eventually, so
> > > > > > the vote target is both of them.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > > >
> > > > > > > Hi Junjie,
> > > > > > >
> > > > > > > Sorry for bringing up this a bit late but I have some problems with the
> > > > > > > format update. The parquet.thrift file is updated to have the bloom
> > > > > > filters
> > > > > > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > > > > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > > > > > So,
> > > > > > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > > > > > would not add it as a page. See the struct ColumnIndex in the thrift
> > > > > > file.
> > > > > > > This struct is not referenced anywhere in it only declared. It was done
> > > > > > > this way because we don't parse it in the same way as we parse the pages.
> > > > > > >
> > > > > > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > > > > > vote about adding bloom filters in general then it is a +1 (binding). If
> > > > > > it
> > > > > > > is about adding the bloom filters to parquet-format as is then, it is a
> > > > > > -1
> > > > > > > (binding) until we fix the issue above.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Gabor
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > +1 (binding)
> > > > > > > > >
> > > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Parquet developers
> > > > > > > > > >
> > > > > > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > > > > > your
> > > > > > > > > time.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Junjie,
> > > > > > > > > > > >
> > > > > > > > > > > > Since there are ongoing improvements addressing review
> > > > > > comments, I
> > > > > > > > > would
> > > > > > > > > > > > hold off with the vote for a few more days until the
> > > > > > specification
> > > > > > > > > settles.
> > > > > > > > > > > >
> > > > > > > > > > > > Br,
> > > > > > > > > > > >
> > > > > > > > > > > > Zoltan
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > > > > official
> > > > > > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > > > > > published
> > > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > > wesmckinn@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > > > > > hash
> > > > > > > > > function?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > > > > > content
> > > > > > > > > to the
> > > > > > > > > > > > > spec
> > > > > > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > > > > > or -1.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > > cjjnjust@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > > > > > per
> > > > > > > > > the
> > > > > > > > > > > > > discussion on the mail list, it's time to call a vote for
> > > > > > spec to
> > > > > > > > > move
> > > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > > >
> > > > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > > There are some different options about the internal hash
> > > > > > choice
> > > > > > > > of
> > > > > > > > > Bloom
> > > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > > > > > option,
> > > > > > > > for
> > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > > > > > the
> > > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Dear Parquet developers

We still need your vote!


On Wed, Jul 24, 2019 at 9:30 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> Hi @Ryan Blue  @Wes McKinney
>
> We need your valuable vote, any feedback is welcome as well.
>
> On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Call for voting again.
> >
> > On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Dear Parquet developers
> > >
> > > We need more votes, please help to vote on this.
> > >
> > > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > > <ga...@cloudera.com.invalid> wrote:
> > > >
> > > > After getting in PARQUET-1625 I vote again for having bloom filter spec and
> > > > the thrift file update as is in parquet-format master.
> > > > +1 (binding)
> > > >
> > > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > > Thanks Gabor, It's never too late to make it better. We don't have to
> > > > > run it in a hurry, it has been developed for a long time yet.:)
> > > > >
> > > > > The thrift file is indeed a bit lag behind the spec. As the spec
> > > > > defined, the bloom filter data is stored near the footer which means
> > > > > we don't have to handle it like the page. Therefore, I just opened a
> > > > > jira to remove bloom_filter_page_header in PageHeader structure, while
> > > > > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > > > > spec and the thrift should be aligned with each other eventually, so
> > > > > the vote target is both of them.
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > > <ga...@cloudera.com.invalid> wrote:
> > > > > >
> > > > > > Hi Junjie,
> > > > > >
> > > > > > Sorry for bringing up this a bit late but I have some problems with the
> > > > > > format update. The parquet.thrift file is updated to have the bloom
> > > > > filters
> > > > > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > > > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > > > > So,
> > > > > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > > > > would not add it as a page. See the struct ColumnIndex in the thrift
> > > > > file.
> > > > > > This struct is not referenced anywhere in it only declared. It was done
> > > > > > this way because we don't parse it in the same way as we parse the pages.
> > > > > >
> > > > > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > > > > vote about adding bloom filters in general then it is a +1 (binding). If
> > > > > it
> > > > > > is about adding the bloom filters to parquet-format as is then, it is a
> > > > > -1
> > > > > > (binding) until we fix the issue above.
> > > > > >
> > > > > > Regards,
> > > > > > Gabor
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > Dear Parquet developers
> > > > > > > > >
> > > > > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > > > > your
> > > > > > > > time.
> > > > > > > > >
> > > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > > <zi...@cloudera.com.invalid>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi Junjie,
> > > > > > > > > > >
> > > > > > > > > > > Since there are ongoing improvements addressing review
> > > > > comments, I
> > > > > > > > would
> > > > > > > > > > > hold off with the vote for a few more days until the
> > > > > specification
> > > > > > > > settles.
> > > > > > > > > > >
> > > > > > > > > > > Br,
> > > > > > > > > > >
> > > > > > > > > > > Zoltan
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > > >
> > > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > > > official
> > > > > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > > > > published
> > > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > > wesmckinn@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > > > > hash
> > > > > > > > function?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > > > > content
> > > > > > > > to the
> > > > > > > > > > > > spec
> > > > > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > > > > or -1.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > > cjjnjust@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > > > > per
> > > > > > > > the
> > > > > > > > > > > > discussion on the mail list, it's time to call a vote for
> > > > > spec to
> > > > > > > > move
> > > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > > >
> > > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > > There are some different options about the internal hash
> > > > > choice
> > > > > > > of
> > > > > > > > Bloom
> > > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > > > > option,
> > > > > > > for
> > > > > > > > > > > > example:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > > > > the
> > > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards



--
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Hi @Ryan Blue  @Wes McKinney

We need your valuable vote, any feedback is welcome as well.

On Tue, Jul 23, 2019 at 1:24 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> Call for voting again.
>
> On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Dear Parquet developers
> >
> > We need more votes, please help to vote on this.
> >
> > On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> > <ga...@cloudera.com.invalid> wrote:
> > >
> > > After getting in PARQUET-1625 I vote again for having bloom filter spec and
> > > the thrift file update as is in parquet-format master.
> > > +1 (binding)
> > >
> > > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > > Thanks Gabor, It's never too late to make it better. We don't have to
> > > > run it in a hurry, it has been developed for a long time yet.:)
> > > >
> > > > The thrift file is indeed a bit lag behind the spec. As the spec
> > > > defined, the bloom filter data is stored near the footer which means
> > > > we don't have to handle it like the page. Therefore, I just opened a
> > > > jira to remove bloom_filter_page_header in PageHeader structure, while
> > > > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > > > spec and the thrift should be aligned with each other eventually, so
> > > > the vote target is both of them.
> > > >
> > > >
> > > >
> > > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > > <ga...@cloudera.com.invalid> wrote:
> > > > >
> > > > > Hi Junjie,
> > > > >
> > > > > Sorry for bringing up this a bit late but I have some problems with the
> > > > > format update. The parquet.thrift file is updated to have the bloom
> > > > filters
> > > > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > > > So,
> > > > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > > > would not add it as a page. See the struct ColumnIndex in the thrift
> > > > file.
> > > > > This struct is not referenced anywhere in it only declared. It was done
> > > > > this way because we don't parse it in the same way as we parse the pages.
> > > > >
> > > > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > > > vote about adding bloom filters in general then it is a +1 (binding). If
> > > > it
> > > > > is about adding the bloom filters to parquet-format as is then, it is a
> > > > -1
> > > > > (binding) until we fix the issue above.
> > > > >
> > > > > Regards,
> > > > > Gabor
> > > > >
> > > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Dear Parquet developers
> > > > > > > >
> > > > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > > > your
> > > > > > > time.
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > > <zi...@cloudera.com.invalid>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Junjie,
> > > > > > > > > >
> > > > > > > > > > Since there are ongoing improvements addressing review
> > > > comments, I
> > > > > > > would
> > > > > > > > > > hold off with the vote for a few more days until the
> > > > specification
> > > > > > > settles.
> > > > > > > > > >
> > > > > > > > > > Br,
> > > > > > > > > >
> > > > > > > > > > Zoltan
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > > >
> > > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > > official
> > > > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > > > published
> > > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > > wesmckinn@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > > > hash
> > > > > > > function?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > > > content
> > > > > > > to the
> > > > > > > > > > > spec
> > > > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > > > or -1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > > cjjnjust@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > > > per
> > > > > > > the
> > > > > > > > > > > discussion on the mail list, it's time to call a vote for
> > > > spec to
> > > > > > > move
> > > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > > >
> > > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > > There are some different options about the internal hash
> > > > choice
> > > > > > of
> > > > > > > Bloom
> > > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > > > option,
> > > > > > for
> > > > > > > > > > > example:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > > > the
> > > > > > > > > > > discussion thread.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Call for voting again.

On Fri, Jul 19, 2019 at 1:17 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> Dear Parquet developers
>
> We need more votes, please help to vote on this.
>
> On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
> <ga...@cloudera.com.invalid> wrote:
> >
> > After getting in PARQUET-1625 I vote again for having bloom filter spec and
> > the thrift file update as is in parquet-format master.
> > +1 (binding)
> >
> > On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Thanks Gabor, It's never too late to make it better. We don't have to
> > > run it in a hurry, it has been developed for a long time yet.:)
> > >
> > > The thrift file is indeed a bit lag behind the spec. As the spec
> > > defined, the bloom filter data is stored near the footer which means
> > > we don't have to handle it like the page. Therefore, I just opened a
> > > jira to remove bloom_filter_page_header in PageHeader structure, while
> > > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > > spec and the thrift should be aligned with each other eventually, so
> > > the vote target is both of them.
> > >
> > >
> > >
> > > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > > <ga...@cloudera.com.invalid> wrote:
> > > >
> > > > Hi Junjie,
> > > >
> > > > Sorry for bringing up this a bit late but I have some problems with the
> > > > format update. The parquet.thrift file is updated to have the bloom
> > > filters
> > > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > > So,
> > > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > > would not add it as a page. See the struct ColumnIndex in the thrift
> > > file.
> > > > This struct is not referenced anywhere in it only declared. It was done
> > > > this way because we don't parse it in the same way as we parse the pages.
> > > >
> > > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > > vote about adding bloom filters in general then it is a +1 (binding). If
> > > it
> > > > is about adding the bloom filters to parquet-format as is then, it is a
> > > -1
> > > > (binding) until we fix the issue above.
> > > >
> > > > Regards,
> > > > Gabor
> > > >
> > > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Dear Parquet developers
> > > > > > >
> > > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > > your
> > > > > > time.
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I see, will resume this next week.  Thanks.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > > <zi...@cloudera.com.invalid>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Junjie,
> > > > > > > > >
> > > > > > > > > Since there are ongoing improvements addressing review
> > > comments, I
> > > > > > would
> > > > > > > > > hold off with the vote for a few more days until the
> > > specification
> > > > > > settles.
> > > > > > > > >
> > > > > > > > > Br,
> > > > > > > > >
> > > > > > > > > Zoltan
> > > > > > > > >
> > > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Parquet committers and developers
> > > > > > > > > >
> > > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > > official
> > > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > > published
> > > > > > > > > > > comparison from smhasher project
> > > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > > wesmckinn@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Do you have any benchmark data to support the choice of
> > > hash
> > > > > > function?
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > > content
> > > > > > to the
> > > > > > > > > > spec
> > > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > > or -1.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > > cjjnjust@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > > per
> > > > > > the
> > > > > > > > > > discussion on the mail list, it's time to call a vote for
> > > spec to
> > > > > > move
> > > > > > > > > > forward. The current spec can be found at
> > > > > > > > > >
> > > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > > There are some different options about the internal hash
> > > choice
> > > > > of
> > > > > > Bloom
> > > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > > option,
> > > > > for
> > > > > > > > > > example:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > > the
> > > > > > > > > > discussion thread.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
>
>
>
> --
> Thanks & Best Regards



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Dear Parquet developers

We need more votes, please help to vote on this.

On Wed, Jul 17, 2019 at 3:42 PM Gabor Szadovszky
<ga...@cloudera.com.invalid> wrote:
>
> After getting in PARQUET-1625 I vote again for having bloom filter spec and
> the thrift file update as is in parquet-format master.
> +1 (binding)
>
> On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Thanks Gabor, It's never too late to make it better. We don't have to
> > run it in a hurry, it has been developed for a long time yet.:)
> >
> > The thrift file is indeed a bit lag behind the spec. As the spec
> > defined, the bloom filter data is stored near the footer which means
> > we don't have to handle it like the page. Therefore, I just opened a
> > jira to remove bloom_filter_page_header in PageHeader structure, while
> > the BloomFitlerHeader is kept intentionally for convenience. Since the
> > spec and the thrift should be aligned with each other eventually, so
> > the vote target is both of them.
> >
> >
> >
> > On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> > <ga...@cloudera.com.invalid> wrote:
> > >
> > > Hi Junjie,
> > >
> > > Sorry for bringing up this a bit late but I have some problems with the
> > > format update. The parquet.thrift file is updated to have the bloom
> > filters
> > > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > > (BloomFilter.md) says that the bloom filter is stored near the footer.
> > So,
> > > if the bloom filter is not part of the row-groups (like column indexes) I
> > > would not add it as a page. See the struct ColumnIndex in the thrift
> > file.
> > > This struct is not referenced anywhere in it only declared. It was done
> > > this way because we don't parse it in the same way as we parse the pages.
> > >
> > > Currently, I am not 100% sure about the target of this vote. If it is a
> > > vote about adding bloom filters in general then it is a +1 (binding). If
> > it
> > > is about adding the bloom filters to parquet-format as is then, it is a
> > -1
> > > (binding) until we fix the issue above.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> > wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > > your
> > > > > time.
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > I see, will resume this next week.  Thanks.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > > <zi...@cloudera.com.invalid>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Junjie,
> > > > > > > >
> > > > > > > > Since there are ongoing improvements addressing review
> > comments, I
> > > > > would
> > > > > > > > hold off with the vote for a few more days until the
> > specification
> > > > > settles.
> > > > > > > >
> > > > > > > > Br,
> > > > > > > >
> > > > > > > > Zoltan
> > > > > > > >
> > > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > > Hi Parquet committers and developers
> > > > > > > > >
> > > > > > > > > We are waiting for your important ballot:)
> > > > > > > > >
> > > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > > > >
> > > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > > official
> > > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > > published
> > > > > > > > > > comparison from smhasher project
> > > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > > wesmckinn@gmail.com>
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Do you have any benchmark data to support the choice of
> > hash
> > > > > function?
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > >
> > > > > > > > > > > > To simplify the voting, I 'd like to update voting
> > content
> > > > > to the
> > > > > > > > > spec
> > > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> > or -1.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> > cjjnjust@gmail.com>
> > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > > >
> > > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> > per
> > > > > the
> > > > > > > > > discussion on the mail list, it's time to call a vote for
> > spec to
> > > > > move
> > > > > > > > > forward. The current spec can be found at
> > > > > > > > >
> > > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > > There are some different options about the internal hash
> > choice
> > > > of
> > > > > Bloom
> > > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> > option,
> > > > for
> > > > > > > > > example:
> > > > > > > > > > > > >
> > > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > > ...
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> > the
> > > > > > > > > discussion thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > >
> >
> >
> >
> > --
> > Thanks & Best Regards
> >



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
After getting in PARQUET-1625 I vote again for having bloom filter spec and
the thrift file update as is in parquet-format master.
+1 (binding)

On Mon, Jul 15, 2019 at 3:23 PM 俊杰陈 <cj...@gmail.com> wrote:

> Thanks Gabor, It's never too late to make it better. We don't have to
> run it in a hurry, it has been developed for a long time yet.:)
>
> The thrift file is indeed a bit lag behind the spec. As the spec
> defined, the bloom filter data is stored near the footer which means
> we don't have to handle it like the page. Therefore, I just opened a
> jira to remove bloom_filter_page_header in PageHeader structure, while
> the BloomFitlerHeader is kept intentionally for convenience. Since the
> spec and the thrift should be aligned with each other eventually, so
> the vote target is both of them.
>
>
>
> On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
> <ga...@cloudera.com.invalid> wrote:
> >
> > Hi Junjie,
> >
> > Sorry for bringing up this a bit late but I have some problems with the
> > format update. The parquet.thrift file is updated to have the bloom
> filters
> > as a page (just as dictionaries and data pages). Meanwhile, the spec
> > (BloomFilter.md) says that the bloom filter is stored near the footer.
> So,
> > if the bloom filter is not part of the row-groups (like column indexes) I
> > would not add it as a page. See the struct ColumnIndex in the thrift
> file.
> > This struct is not referenced anywhere in it only declared. It was done
> > this way because we don't parse it in the same way as we parse the pages.
> >
> > Currently, I am not 100% sure about the target of this vote. If it is a
> > vote about adding bloom filters in general then it is a +1 (binding). If
> it
> > is about adding the bloom filters to parquet-format as is then, it is a
> -1
> > (binding) until we fix the issue above.
> >
> > Regards,
> > Gabor
> >
> > On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com>
> wrote:
> >
> > > +1 (non-binding)
> > >
> > > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi@cloudera.com.invalid
> >
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > > your
> > > > time.
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > I see, will resume this next week.  Thanks.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > > <zi...@cloudera.com.invalid>
> > > > wrote:
> > > > > > >
> > > > > > > Hi Junjie,
> > > > > > >
> > > > > > > Since there are ongoing improvements addressing review
> comments, I
> > > > would
> > > > > > > hold off with the vote for a few more days until the
> specification
> > > > settles.
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > >
> > > > > > > > Hi Parquet committers and developers
> > > > > > > >
> > > > > > > > We are waiting for your important ballot:)
> > > > > > > >
> > > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > > >
> > > > > > > > > Yes, there are some public benchmark results, such as the
> > > > official
> > > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > > published
> > > > > > > > > comparison from smhasher project
> > > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Do you have any benchmark data to support the choice of
> hash
> > > > function?
> > > > > > > > > >
> > > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > >
> > > > > > > > > > > To simplify the voting, I 'd like to update voting
> content
> > > > to the
> > > > > > > > spec
> > > > > > > > > > > with xxHash hash strategy. Now you can reply with +1
> or -1.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your participation.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <
> cjjnjust@gmail.com>
> > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > > >
> > > > > > > > > > > > Parquet Bloom filter has been developed for a while,
> per
> > > > the
> > > > > > > > discussion on the mail list, it's time to call a vote for
> spec to
> > > > move
> > > > > > > > forward. The current spec can be found at
> > > > > > > >
> > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > > There are some different options about the internal hash
> choice
> > > of
> > > > Bloom
> > > > > > > > filter and the PR is for that concern.
> > > > > > > > > > > >
> > > > > > > > > > > > So I 'd like to propose to vote the spec + hash
> option,
> > > for
> > > > > > > > example:
> > > > > > > > > > > >
> > > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > > ...
> > > > > > > > > > > >
> > > > > > > > > > > > Please help to vote, any feedback is also welcome in
> the
> > > > > > > > discussion thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > >
>
>
>
> --
> Thanks & Best Regards
>

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Thanks Gabor, It's never too late to make it better. We don't have to
run it in a hurry, it has been developed for a long time yet.:)

The thrift file is indeed a bit lag behind the spec. As the spec
defined, the bloom filter data is stored near the footer which means
we don't have to handle it like the page. Therefore, I just opened a
jira to remove bloom_filter_page_header in PageHeader structure, while
the BloomFitlerHeader is kept intentionally for convenience. Since the
spec and the thrift should be aligned with each other eventually, so
the vote target is both of them.



On Mon, Jul 15, 2019 at 7:48 PM Gabor Szadovszky
<ga...@cloudera.com.invalid> wrote:
>
> Hi Junjie,
>
> Sorry for bringing up this a bit late but I have some problems with the
> format update. The parquet.thrift file is updated to have the bloom filters
> as a page (just as dictionaries and data pages). Meanwhile, the spec
> (BloomFilter.md) says that the bloom filter is stored near the footer. So,
> if the bloom filter is not part of the row-groups (like column indexes) I
> would not add it as a page. See the struct ColumnIndex in the thrift file.
> This struct is not referenced anywhere in it only declared. It was done
> this way because we don't parse it in the same way as we parse the pages.
>
> Currently, I am not 100% sure about the target of this vote. If it is a
> vote about adding bloom filters in general then it is a +1 (binding). If it
> is about adding the bloom filters to parquet-format as is then, it is a -1
> (binding) until we fix the issue above.
>
> Regards,
> Gabor
>
> On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com> wrote:
>
> > +1 (non-binding)
> >
> > On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > I'd like to resume this vote, you can start to vote now. Thanks for
> > your
> > > time.
> > > >
> > > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > I see, will resume this next week.  Thanks.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> > <zi...@cloudera.com.invalid>
> > > wrote:
> > > > > >
> > > > > > Hi Junjie,
> > > > > >
> > > > > > Since there are ongoing improvements addressing review comments, I
> > > would
> > > > > > hold off with the vote for a few more days until the specification
> > > settles.
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > > Hi Parquet committers and developers
> > > > > > >
> > > > > > > We are waiting for your important ballot:)
> > > > > > >
> > > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Yes, there are some public benchmark results, such as the
> > > official
> > > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > > published
> > > > > > > > comparison from smhasher project
> > > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> > wesmckinn@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > > Do you have any benchmark data to support the choice of hash
> > > function?
> > > > > > > > >
> > > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Parquet developers
> > > > > > > > > >
> > > > > > > > > > To simplify the voting, I 'd like to update voting content
> > > to the
> > > > > > > spec
> > > > > > > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > > > > > > >
> > > > > > > > > > Thanks for your participation.
> > > > > > > > > >
> > > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com>
> > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Dear Parquet developers
> > > > > > > > > > >
> > > > > > > > > > > Parquet Bloom filter has been developed for a while, per
> > > the
> > > > > > > discussion on the mail list, it's time to call a vote for spec to
> > > move
> > > > > > > forward. The current spec can be found at
> > > > > > >
> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > > There are some different options about the internal hash choice
> > of
> > > Bloom
> > > > > > > filter and the PR is for that concern.
> > > > > > > > > > >
> > > > > > > > > > > So I 'd like to propose to vote the spec + hash option,
> > for
> > > > > > > example:
> > > > > > > > > > >
> > > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > Please help to vote, any feedback is also welcome in the
> > > > > > > discussion thread.
> > > > > > > > > > >
> > > > > > > > > > > Thanks & Best Regards
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> >



--
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Gabor Szadovszky <ga...@cloudera.com.INVALID>.
Hi Junjie,

Sorry for bringing up this a bit late but I have some problems with the
format update. The parquet.thrift file is updated to have the bloom filters
as a page (just as dictionaries and data pages). Meanwhile, the spec
(BloomFilter.md) says that the bloom filter is stored near the footer. So,
if the bloom filter is not part of the row-groups (like column indexes) I
would not add it as a page. See the struct ColumnIndex in the thrift file.
This struct is not referenced anywhere in it only declared. It was done
this way because we don't parse it in the same way as we parse the pages.

Currently, I am not 100% sure about the target of this vote. If it is a
vote about adding bloom filters in general then it is a +1 (binding). If it
is about adding the bloom filters to parquet-format as is then, it is a -1
(binding) until we fix the issue above.

Regards,
Gabor

On Mon, Jul 15, 2019 at 11:45 AM Gidon Gershinsky <gg...@gmail.com> wrote:

> +1 (non-binding)
>
> On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > +1 (binding)
> >
> > On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Dear Parquet developers
> > >
> > > I'd like to resume this vote, you can start to vote now. Thanks for
> your
> > time.
> > >
> > > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > I see, will resume this next week.  Thanks.
> > > >
> > > >
> > > >
> > > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi
> <zi...@cloudera.com.invalid>
> > wrote:
> > > > >
> > > > > Hi Junjie,
> > > > >
> > > > > Since there are ongoing improvements addressing review comments, I
> > would
> > > > > hold off with the vote for a few more days until the specification
> > settles.
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > > Hi Parquet committers and developers
> > > > > >
> > > > > > We are waiting for your important ballot:)
> > > > > >
> > > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Yes, there are some public benchmark results, such as the
> > official
> > > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> > published
> > > > > > > comparison from smhasher project
> > > > > > > (https://github.com/rurban/smhasher/).
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <
> wesmckinn@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > Do you have any benchmark data to support the choice of hash
> > function?
> > > > > > > >
> > > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > > >
> > > > > > > > > Dear Parquet developers
> > > > > > > > >
> > > > > > > > > To simplify the voting, I 'd like to update voting content
> > to the
> > > > > > spec
> > > > > > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > > > > > >
> > > > > > > > > Thanks for your participation.
> > > > > > > > >
> > > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com>
> > wrote:
> > > > > > > > > >
> > > > > > > > > > Dear Parquet developers
> > > > > > > > > >
> > > > > > > > > > Parquet Bloom filter has been developed for a while, per
> > the
> > > > > > discussion on the mail list, it's time to call a vote for spec to
> > move
> > > > > > forward. The current spec can be found at
> > > > > >
> > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > > There are some different options about the internal hash choice
> of
> > Bloom
> > > > > > filter and the PR is for that concern.
> > > > > > > > > >
> > > > > > > > > > So I 'd like to propose to vote the spec + hash option,
> for
> > > > > > example:
> > > > > > > > > >
> > > > > > > > > > +1 to spec and xxHash
> > > > > > > > > > +1 to spec and murmur3
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Please help to vote, any feedback is also welcome in the
> > > > > > discussion thread.
> > > > > > > > > >
> > > > > > > > > > Thanks & Best Regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
>

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Gidon Gershinsky <gg...@gmail.com>.
+1 (non-binding)

On Mon, Jul 15, 2019 at 12:08 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> +1 (binding)
>
> On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Dear Parquet developers
> >
> > I'd like to resume this vote, you can start to vote now. Thanks for your
> time.
> >
> > On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > I see, will resume this next week.  Thanks.
> > >
> > >
> > >
> > > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
> > > >
> > > > Hi Junjie,
> > > >
> > > > Since there are ongoing improvements addressing review comments, I
> would
> > > > hold off with the vote for a few more days until the specification
> settles.
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > > Hi Parquet committers and developers
> > > > >
> > > > > We are waiting for your important ballot:)
> > > > >
> > > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Yes, there are some public benchmark results, such as the
> official
> > > > > > benchmark from xxhash site (http://www.xxhash.com/) and
> published
> > > > > > comparison from smhasher project
> > > > > > (https://github.com/rurban/smhasher/).
> > > > > >
> > > > > >
> > > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com>
> wrote:
> > > > > > >
> > > > > > > Do you have any benchmark data to support the choice of hash
> function?
> > > > > > >
> > > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Dear Parquet developers
> > > > > > > >
> > > > > > > > To simplify the voting, I 'd like to update voting content
> to the
> > > > > spec
> > > > > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > > > > >
> > > > > > > > Thanks for your participation.
> > > > > > > >
> > > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com>
> wrote:
> > > > > > > > >
> > > > > > > > > Dear Parquet developers
> > > > > > > > >
> > > > > > > > > Parquet Bloom filter has been developed for a while, per
> the
> > > > > discussion on the mail list, it's time to call a vote for spec to
> move
> > > > > forward. The current spec can be found at
> > > > >
> https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > > There are some different options about the internal hash choice of
> Bloom
> > > > > filter and the PR is for that concern.
> > > > > > > > >
> > > > > > > > > So I 'd like to propose to vote the spec + hash option, for
> > > > > example:
> > > > > > > > >
> > > > > > > > > +1 to spec and xxHash
> > > > > > > > > +1 to spec and murmur3
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > Please help to vote, any feedback is also welcome in the
> > > > > discussion thread.
> > > > > > > > >
> > > > > > > > > Thanks & Best Regards
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
+1 (binding)

On Mon, Jul 15, 2019 at 9:57 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> Dear Parquet developers
>
> I'd like to resume this vote, you can start to vote now. Thanks for your time.
>
> On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > I see, will resume this next week.  Thanks.
> >
> >
> >
> > On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> > >
> > > Hi Junjie,
> > >
> > > Since there are ongoing improvements addressing review comments, I would
> > > hold off with the vote for a few more days until the specification settles.
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > > Hi Parquet committers and developers
> > > >
> > > > We are waiting for your important ballot:)
> > > >
> > > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Yes, there are some public benchmark results, such as the official
> > > > > benchmark from xxhash site (http://www.xxhash.com/) and published
> > > > > comparison from smhasher project
> > > > > (https://github.com/rurban/smhasher/).
> > > > >
> > > > >
> > > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
> > > > > >
> > > > > > Do you have any benchmark data to support the choice of hash function?
> > > > > >
> > > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Dear Parquet developers
> > > > > > >
> > > > > > > To simplify the voting, I 'd like to update voting content to the
> > > > spec
> > > > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > > > >
> > > > > > > Thanks for your participation.
> > > > > > >
> > > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Dear Parquet developers
> > > > > > > >
> > > > > > > > Parquet Bloom filter has been developed for a while, per the
> > > > discussion on the mail list, it's time to call a vote for spec to move
> > > > forward. The current spec can be found at
> > > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > > There are some different options about the internal hash choice of Bloom
> > > > filter and the PR is for that concern.
> > > > > > > >
> > > > > > > > So I 'd like to propose to vote the spec + hash option, for
> > > > example:
> > > > > > > >
> > > > > > > > +1 to spec and xxHash
> > > > > > > > +1 to spec and murmur3
> > > > > > > > ...
> > > > > > > >
> > > > > > > > Please help to vote, any feedback is also welcome in the
> > > > discussion thread.
> > > > > > > >
> > > > > > > > Thanks & Best Regards
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > > >
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Dear Parquet developers

I'd like to resume this vote, you can start to vote now. Thanks for your time.

On Wed, Jul 10, 2019 at 9:29 PM 俊杰陈 <cj...@gmail.com> wrote:
>
> I see, will resume this next week.  Thanks.
>
>
>
> On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> >
> > Hi Junjie,
> >
> > Since there are ongoing improvements addressing review comments, I would
> > hold off with the vote for a few more days until the specification settles.
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > > Hi Parquet committers and developers
> > >
> > > We are waiting for your important ballot:)
> > >
> > > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Yes, there are some public benchmark results, such as the official
> > > > benchmark from xxhash site (http://www.xxhash.com/) and published
> > > > comparison from smhasher project
> > > > (https://github.com/rurban/smhasher/).
> > > >
> > > >
> > > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
> > > > >
> > > > > Do you have any benchmark data to support the choice of hash function?
> > > > >
> > > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > To simplify the voting, I 'd like to update voting content to the
> > > spec
> > > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > > >
> > > > > > Thanks for your participation.
> > > > > >
> > > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > > >
> > > > > > > Dear Parquet developers
> > > > > > >
> > > > > > > Parquet Bloom filter has been developed for a while, per the
> > > discussion on the mail list, it's time to call a vote for spec to move
> > > forward. The current spec can be found at
> > > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > > There are some different options about the internal hash choice of Bloom
> > > filter and the PR is for that concern.
> > > > > > >
> > > > > > > So I 'd like to propose to vote the spec + hash option, for
> > > example:
> > > > > > >
> > > > > > > +1 to spec and xxHash
> > > > > > > +1 to spec and murmur3
> > > > > > > ...
> > > > > > >
> > > > > > > Please help to vote, any feedback is also welcome in the
> > > discussion thread.
> > > > > > >
> > > > > > > Thanks & Best Regards
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> > >
>
>
>
> --
> Thanks & Best Regards



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
I see, will resume this next week.  Thanks.



On Wed, Jul 10, 2019 at 5:26 PM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>
> Hi Junjie,
>
> Since there are ongoing improvements addressing review comments, I would
> hold off with the vote for a few more days until the specification settles.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> > Hi Parquet committers and developers
> >
> > We are waiting for your important ballot:)
> >
> > On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Yes, there are some public benchmark results, such as the official
> > > benchmark from xxhash site (http://www.xxhash.com/) and published
> > > comparison from smhasher project
> > > (https://github.com/rurban/smhasher/).
> > >
> > >
> > > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > Do you have any benchmark data to support the choice of hash function?
> > > >
> > > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > To simplify the voting, I 'd like to update voting content to the
> > spec
> > > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > > >
> > > > > Thanks for your participation.
> > > > >
> > > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > > >
> > > > > > Dear Parquet developers
> > > > > >
> > > > > > Parquet Bloom filter has been developed for a while, per the
> > discussion on the mail list, it's time to call a vote for spec to move
> > forward. The current spec can be found at
> > https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> > There are some different options about the internal hash choice of Bloom
> > filter and the PR is for that concern.
> > > > > >
> > > > > > So I 'd like to propose to vote the spec + hash option, for
> > example:
> > > > > >
> > > > > > +1 to spec and xxHash
> > > > > > +1 to spec and murmur3
> > > > > > ...
> > > > > >
> > > > > > Please help to vote, any feedback is also welcome in the
> > discussion thread.
> > > > > >
> > > > > > Thanks & Best Regards
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
> >



--
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Junjie,

Since there are ongoing improvements addressing review comments, I would
hold off with the vote for a few more days until the specification settles.

Br,

Zoltan

On Wed, Jul 10, 2019 at 9:32 AM 俊杰陈 <cj...@gmail.com> wrote:

> Hi Parquet committers and developers
>
> We are waiting for your important ballot:)
>
> On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Yes, there are some public benchmark results, such as the official
> > benchmark from xxhash site (http://www.xxhash.com/) and published
> > comparison from smhasher project
> > (https://github.com/rurban/smhasher/).
> >
> >
> > On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > Do you have any benchmark data to support the choice of hash function?
> > >
> > > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > To simplify the voting, I 'd like to update voting content to the
> spec
> > > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > > >
> > > > Thanks for your participation.
> > > >
> > > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > > >
> > > > > Dear Parquet developers
> > > > >
> > > > > Parquet Bloom filter has been developed for a while, per the
> discussion on the mail list, it's time to call a vote for spec to move
> forward. The current spec can be found at
> https://github.com/apache/parquet-format/blob/master/BloomFilter.md.
> There are some different options about the internal hash choice of Bloom
> filter and the PR is for that concern.
> > > > >
> > > > > So I 'd like to propose to vote the spec + hash option, for
> example:
> > > > >
> > > > > +1 to spec and xxHash
> > > > > +1 to spec and murmur3
> > > > > ...
> > > > >
> > > > > Please help to vote, any feedback is also welcome in the
> discussion thread.
> > > > >
> > > > > Thanks & Best Regards
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards
>

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Hi Parquet committers and developers

We are waiting for your important ballot:)

On Tue, Jul 9, 2019 at 10:21 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> Yes, there are some public benchmark results, such as the official
> benchmark from xxhash site (http://www.xxhash.com/) and published
> comparison from smhasher project
> (https://github.com/rurban/smhasher/).
>
>
> On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > Do you have any benchmark data to support the choice of hash function?
> >
> > On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Dear Parquet developers
> > >
> > > To simplify the voting, I 'd like to update voting content to the spec
> > > with xxHash hash strategy. Now you can reply with +1 or -1.
> > >
> > > Thanks for your participation.
> > >
> > > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > > >
> > > > Dear Parquet developers
> > > >
> > > > Parquet Bloom filter has been developed for a while, per the discussion on the mail list, it's time to call a vote for spec to move forward. The current spec can be found at https://github.com/apache/parquet-format/blob/master/BloomFilter.md. There are some different options about the internal hash choice of Bloom filter and the PR is for that concern.
> > > >
> > > > So I 'd like to propose to vote the spec + hash option, for example:
> > > >
> > > > +1 to spec and xxHash
> > > > +1 to spec and murmur3
> > > > ...
> > > >
> > > > Please help to vote, any feedback is also welcome in the discussion thread.
> > > >
> > > > Thanks & Best Regards
> > >
> > >
> > >
> > > --
> > > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Yes, there are some public benchmark results, such as the official
benchmark from xxhash site (http://www.xxhash.com/) and published
comparison from smhasher project
(https://github.com/rurban/smhasher/).


On Tue, Jul 9, 2019 at 5:25 AM Wes McKinney <we...@gmail.com> wrote:
>
> Do you have any benchmark data to support the choice of hash function?
>
> On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Dear Parquet developers
> >
> > To simplify the voting, I 'd like to update voting content to the spec
> > with xxHash hash strategy. Now you can reply with +1 or -1.
> >
> > Thanks for your participation.
> >
> > On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> > >
> > > Dear Parquet developers
> > >
> > > Parquet Bloom filter has been developed for a while, per the discussion on the mail list, it's time to call a vote for spec to move forward. The current spec can be found at https://github.com/apache/parquet-format/blob/master/BloomFilter.md. There are some different options about the internal hash choice of Bloom filter and the PR is for that concern.
> > >
> > > So I 'd like to propose to vote the spec + hash option, for example:
> > >
> > > +1 to spec and xxHash
> > > +1 to spec and murmur3
> > > ...
> > >
> > > Please help to vote, any feedback is also welcome in the discussion thread.
> > >
> > > Thanks & Best Regards
> >
> >
> >
> > --
> > Thanks & Best Regards



-- 
Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by Wes McKinney <we...@gmail.com>.
Do you have any benchmark data to support the choice of hash function?

On Wed, Jul 3, 2019 at 8:41 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> Dear Parquet developers
>
> To simplify the voting, I 'd like to update voting content to the spec
> with xxHash hash strategy. Now you can reply with +1 or -1.
>
> Thanks for your participation.
>
> On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
> >
> > Dear Parquet developers
> >
> > Parquet Bloom filter has been developed for a while, per the discussion on the mail list, it's time to call a vote for spec to move forward. The current spec can be found at https://github.com/apache/parquet-format/blob/master/BloomFilter.md. There are some different options about the internal hash choice of Bloom filter and the PR is for that concern.
> >
> > So I 'd like to propose to vote the spec + hash option, for example:
> >
> > +1 to spec and xxHash
> > +1 to spec and murmur3
> > ...
> >
> > Please help to vote, any feedback is also welcome in the discussion thread.
> >
> > Thanks & Best Regards
>
>
>
> --
> Thanks & Best Regards

Re: [VOTE] Parquet Bloom filter spec sign-off

Posted by 俊杰陈 <cj...@gmail.com>.
Dear Parquet developers

To simplify the voting, I 'd like to update voting content to the spec
with xxHash hash strategy. Now you can reply with +1 or -1.

Thanks for your participation.

On Tue, Jul 2, 2019 at 10:23 AM 俊杰陈 <cj...@gmail.com> wrote:
>
> Dear Parquet developers
>
> Parquet Bloom filter has been developed for a while, per the discussion on the mail list, it's time to call a vote for spec to move forward. The current spec can be found at https://github.com/apache/parquet-format/blob/master/BloomFilter.md. There are some different options about the internal hash choice of Bloom filter and the PR is for that concern.
>
> So I 'd like to propose to vote the spec + hash option, for example:
>
> +1 to spec and xxHash
> +1 to spec and murmur3
> ...
>
> Please help to vote, any feedback is also welcome in the discussion thread.
>
> Thanks & Best Regards



-- 
Thanks & Best Regards