You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by Jim Apple <jb...@apache.org> on 2018/10/08 21:53:26 UTC

BitWeaving in Parquet?

The BitWeaving paper from a few years ago demonstrates some large performance wins in predicate evaluation based partially on reconfiguring the storage layout:

http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf

Is it technically possible for Parquet to support "Vertical Bit-Parallel" layout as an option?

Re: BitWeaving in Parquet?

Posted by Julien Le Dem <ju...@wework.com.INVALID>.
If you want (and if you don't already know him) I'm happy to ask Jignesh if
he wants an intro.
I think he would be happy to tell you about it.

On Mon, Oct 8, 2018 at 4:04 PM Jim Apple <jb...@apache.org> wrote:

> > That sounds like an interesting possibility. It's not that fresh in my
> mind
> > but I'd say from the storage perspective it's a variation of bit packing.
> > right?
>
> I'm not familiar with bit packing, so I'd have to look into that. I found
> the paper readable enough at the time that I didn't end up doing a lot of
> groundwork reading to understand the origins.
>
> > We would need an implementation of a runtime for this to make sense, so I
> > suppose that the impala team is looking into implementing this?
>
> The Impala community hasn't been discussing this, as far as I am aware. I
> came across it in another paper and thought it might be of interest to
> consumers of Parquet, including Impala, but this is the first place I'm
> shopping the idea around.
>

Re: BitWeaving in Parquet?

Posted by Jim Apple <jb...@apache.org>.
> That sounds like an interesting possibility. It's not that fresh in my mind
> but I'd say from the storage perspective it's a variation of bit packing.
> right?

I'm not familiar with bit packing, so I'd have to look into that. I found the paper readable enough at the time that I didn't end up doing a lot of groundwork reading to understand the origins.

> We would need an implementation of a runtime for this to make sense, so I
> suppose that the impala team is looking into implementing this?

The Impala community hasn't been discussing this, as far as I am aware. I came across it in another paper and thought it might be of interest to consumers of Parquet, including Impala, but this is the first place I'm shopping the idea around.

Re: BitWeaving in Parquet?

Posted by Jim Apple <jb...@apache.org>.
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it
> would be useful for Parquet is that it is really expensive to produce and
> really expensive to reconstruct values that aren't filtered out.

Julien, this would be a thing I think the list would love to hear from Jignesh about. What was his experience with reconstruction cost? What were the fast and slow ways to get it done?

Re: BitWeaving in Parquet?

Posted by Jim Apple <jb...@apache.org>.
> For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it
> would be useful for Parquet is that it is really expensive to produce and
> really expensive to reconstruct values that aren't filtered out.

Yes - you can see in Figure 12(a) that the aggregation time went up for the vertical methods by maybe 3x.

"As the column l_extendedprice is fairly wide (24 bits), BitWeaving/V spends more cycles extracting the matching values from the aggregation columns. As a result, for this particular query, BitWeaving/H is faster than BitWeaving/V."

http://quickstep.cs.wisc.edu/pubs/bitweaving-extended.pdf

They also show benchmarks for using vertical layout for only an index, which might be useful to some Parquet users, just like the Bloom filter might be useful to some.

Re: BitWeaving in Parquet?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I looked into this a while ago. Assuming that I remember correctly, the
conclusion I came to was that Horizontal Bit-Parallel (HBP) might be
helpful, but the vertical option was probably not appropriate.

HBP would allow Parquet readers to run predicates on multiple values at
once without needing to use SIMD instructions that aren't available to JVM
processes. (With SIMD instructions, you get even more value.) That would be
useful, but I think we'd have to change the bit packing encoding to lay out
values with the extra padding bit where predicate evaluation results end
up, because the benefit is only worth the work to reorder and pack if it is
reused.

For Vertical Bit-Parallel (VBP), I think the reason why I didn't think it
would be useful for Parquet is that it is really expensive to produce and
really expensive to reconstruct values that aren't filtered out. When
reconstructing more than just a few rows, as you would for large scans, it
would be much more expensive.

On Sun, Oct 14, 2018 at 1:26 PM Jim Apple <jb...@apache.org> wrote:

> On 2018/10/08 22:08:16, Julien Le Dem <ju...@wework.com.INVALID>
> wrote:
> > it's a variation of bit packing. right?
>
> I looked into it on
> https://github.com/apache/parquet-format/blob/master/Encodings.md and I
> believe that the Horizontal Bit-Parallel encoding in the paper is a variant
> on bit packing. There are three changes:
>
> 1. No code is split between words
> 2. Every code gets a padding bit
> 3. The order of the packing is not linear; code 1 is not packed in a word
> with code 2.
>
> The paper obviously has much more detail. :-)
>
> The various vertical encodings referenced in the paper (bit-slicing,
> vertical bit-parallel, and BitWeaving/V) look further afield from Parquet's
> bit packing.
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: BitWeaving in Parquet?

Posted by Jim Apple <jb...@apache.org>.
On 2018/10/08 22:08:16, Julien Le Dem <ju...@wework.com.INVALID> wrote: 
> it's a variation of bit packing. right?

I looked into it on https://github.com/apache/parquet-format/blob/master/Encodings.md and I believe that the Horizontal Bit-Parallel encoding in the paper is a variant on bit packing. There are three changes:

1. No code is split between words
2. Every code gets a padding bit
3. The order of the packing is not linear; code 1 is not packed in a word with code 2.

The paper obviously has much more detail. :-)

The various vertical encodings referenced in the paper (bit-slicing, vertical bit-parallel, and BitWeaving/V) look further afield from Parquet's bit packing.

Re: BitWeaving in Parquet?

Posted by Julien Le Dem <ju...@wework.com.INVALID>.
Hi Jim,
I remember chatting with Jignesh Patel about it at the time.
Since his company locomatix was acquired by twitter we had him as an
adviser for some time.
That sounds like an interesting possibility. It's not that fresh in my mind
but I'd say from the storage perspective it's a variation of bit packing.
right?
We would need an implementation of a runtime for this to make sense, so I
suppose that the impala team is looking into implementing this?
It would be interesting to have this type of "compressed" vector in Arrow
too. But I don't know if you're looking into Arrow on your end
Cheers,
Julien




On Mon, Oct 8, 2018 at 2:53 PM Jim Apple <jb...@apache.org> wrote:

> The BitWeaving paper from a few years ago demonstrates some large
> performance wins in predicate evaluation based partially on reconfiguring
> the storage layout:
>
> http://pages.cs.wisc.edu/~jignesh/publ/BitWeaving.pdf
>
> Is it technically possible for Parquet to support "Vertical Bit-Parallel"
> layout as an option?
>