Posted to dev@parquet.apache.org by 俊杰陈 <cj...@gmail.com> on 2018/08/29 02:45:06 UTC

[VOTE] Finalizing the design and moving forward to read/write implementation

Hi all

As discussed in the sync-up meeting, I'd like to propose a vote on the Bloom
filter design doc
<https://docs.google.com/document/d/1mIZ0W24Cr79QHJWN1sQ3dIUc4lAK5AVqozwSwtpFhW8/edit?usp=sharing>
and its corresponding parquet-format PR
<https://github.com/apache/parquet-format/pull/99>, so that we can move
forward to update the Parquet spec and implement the read and write sides.

What we have done so far:

    The PoC benchmark
<https://docs.google.com/spreadsheets/d/1yV3u-P_yY4DtfSty3LPrbhwuJx4cqm_YeK61s2v0OLU/edit?usp=sharing>.
It compares queries with and without the Bloom filter, and the Bloom filter
against the dictionary filter. The results show a promising improvement for
selective queries.

    Bloom filter utility class implementations in Java and C++.

This vote is to determine whether the Parquet committers accept the Bloom
filter design and its corresponding parquet-format changes.

+1: Accept the design and related changes of parquet-format
+0: ...
-1: Because ...


Thanks & Best Regards

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

Posted by 俊杰陈 <cj...@gmail.com>.
I agree with Jim that we might discover more while implementing the
reader and writer, but there should be no major changes to parquet-format,
because:

What type of Bloom filter to use?
We use a block-based Bloom filter now, and supporting other types later would
not need major changes: we would just add them to the union of defined
algorithms.

Where to add them in the file?
At the beginning of the row group. The location is given by an offset
specified in the column chunk metadata, so parquet-format would not need to
change even if we later want to put the filters somewhere else.

What should the Thrift object contain?
The Thrift definition now contains enough information to read a block-based
Bloom filter. It might need additional fields if we plan to support other
types of Bloom filter in the future.
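
For readers unfamiliar with the term, a block-based Bloom filter can be
sketched roughly as follows. This is a minimal Python illustration and not
the code from the Java or C++ patches: the class name BlockBloomFilter, the
256-bit block size, the salt constants, and the use of BLAKE2b as a 64-bit
hash are all assumptions made for the example.

```python
import hashlib

# Illustrative odd 32-bit multipliers (an assumption, not taken from this thread).
SALTS = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
         0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)


class BlockBloomFilter:
    """Hypothetical block-based filter: each value touches one 256-bit block."""

    def __init__(self, num_blocks):
        if num_blocks < 1:
            raise ValueError("num_blocks must be positive")
        self.num_blocks = num_blocks
        # Every block is eight 32-bit words, all initially zero.
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    @staticmethod
    def _hash64(value):
        # Stand-in 64-bit hash; the real hash function is a design decision.
        digest = hashlib.blake2b(value, digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def _locate(self, value):
        h = self._hash64(value)
        # The upper 32 bits pick the block, scaled into [0, num_blocks).
        block_index = ((h >> 32) * self.num_blocks) >> 32
        # The lower 32 bits pick one bit (0..31) in each of the eight words.
        key = h & 0xFFFFFFFF
        masks = [1 << (((key * salt) & 0xFFFFFFFF) >> 27) for salt in SALTS]
        return block_index, masks

    def insert(self, value):
        index, masks = self._locate(value)
        block = self.blocks[index]
        for word, mask in enumerate(masks):
            block[word] |= mask

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        index, masks = self._locate(value)
        block = self.blocks[index]
        return all(block[word] & mask for word, mask in enumerate(masks))
```

Because every lookup reads a single 32-byte block, a probe stays within one
cache line, which is the usual motivation for this layout.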

I can submit a reader/writer PR on the Java side to make this clear once we
finish the Bloom filter utility PR on the Java side.


Jim Apple <jb...@apache.org> wrote on Sat, Sep 1, 2018 at 12:26 AM:

> On 2018/08/30 19:41:59, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> > Jim, do you think that the implementation is going to make major changes to
> > the design of how bloom filters are stored in files?
>
> I don't foresee any problems with the current layout.
>


-- 
Thanks & Best Regards

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

Posted by Jim Apple <jb...@apache.org>.
On 2018/08/30 19:41:59, Ryan Blue <rb...@netflix.com.INVALID> wrote: 
> Jim, do you think that the implementation is going to make major changes to
> the design of how bloom filters are stored in files?

I don't foresee any problems with the current layout.

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Jim, do you think that the implementation is going to make major changes to
the design of how bloom filters are stored in files? I thought that
concerns about what type of bloom filter to use, where to add them in the
file, and what thrift object should contain the bytes were pretty much
decided.

On Thu, Aug 30, 2018 at 9:17 AM Jim Apple <jb...@apache.org> wrote:

> +0, non-binding.
>
> Junjie and I spent a lot of time getting the C++ code to where it is now,
> but all three patches (Java, -format, C++) could use some more work before
> I'm fully confident we're in a good place. In particular, the code for
> integrating the existing patches in with readers and writers is not even in
> code review yet.
>
> That could lead us to discover things about the -format patch, so I'd like
> to see things advance a bit before the -format patch makes it into a
> release.
>
> I'm not -1 because I don't see any current blockers, just some risk and
> unexplored territory.
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Finalizing the design and moving forward to read/write implementation

Posted by Jim Apple <jb...@apache.org>.
+0, non-binding.

Junjie and I spent a lot of time getting the C++ code to where it is now, but all three patches (Java, -format, C++) could use some more work before I'm fully confident we're in a good place. In particular, the code for integrating the existing patches in with readers and writers is not even in code review yet.

That could lead us to discover things about the -format patch, so I'd like to see things advance a bit before the -format patch makes it into a release.

I'm not -1 because I don't see any current blockers, just some risk and unexplored territory.