Posted to user@hive.apache.org by Aleksei Statkevich <as...@rocketfuel.com> on 2016/06/18 05:31:35 UTC

Why does ORC use Deflater instead of native ZlibCompressor?

Hello,

I recently looked at ORC encoding and noticed that hive.ql.io.orc.ZlibCodec
uses java's java.util.zip.Deflater and not Hadoop's native ZlibCompressor.

Can someone please tell me what is the reason for it?

Also, how does performance of Deflater (which also uses native
implementation) compare to Hadoop's native zlib implementation?

Thanks,
Aleksei

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Owen O'Malley <om...@apache.org>.
For compression, I'm also interested in investigating the pure java
compression codecs that were done by the Presto project:

https://github.com/airlift/aircompressor

They've implemented LZ4, Snappy, and LZO in pure java.

On Thu, Jun 23, 2016 at 8:04 PM, Gopal Vijayaraghavan <go...@apache.org>
wrote:

> > Though, I'm also wondering about the performance difference between
> >the two. Since they both use native implementations, theoretically they
> >can be close in performance.
>
> ZlibCompressor block compression was extremely slow due to the non-JNI
> bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>
>
> When I last benchmarked after that issue was fixed, 86% of CPU samples
> were spent inside zlib.so in the perf traces - irrespective of which mode
> was used.
>
> The result of those profiles went into making ORC fit into Zlib better and
> avoid doing compression work twice - ORC already did its own versions of
> dictionary+rle+bit-packing.
>
> <http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>
>
> For instance, bit-packing data with values under 128 into 7 bits and then
> compressing it offered less compression (& cost more CPU) than leaving it
> at 8 bits without reduction. LZ77 worked much better, and the Huffman pass
> compressed the data by bit-packing anyway. The impact was more visible at
> higher bit-counts (e.g. 27 bits is way worse than 32 bits).
>
> And then there was turning off the bits of Zlib not necessary for some
> encoding patterns - Z_FILTERED, for instance, for numeric sequences,
> Z_TEXT for the string dicts, etc.
>
> Purely from a performance standpoint, I'm getting more interested in Zstd,
> because it brings a whole new way of fast bit-packing.
>
> <https://issues.apache.org/jira/browse/ORC-45>
>
>
> Cheers,
> Gopal
>
>
>

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> Though, I'm also wondering about the performance difference between
>the two. Since they both use native implementations, theoretically they
>can be close in performance.

ZlibCompressor block compression was extremely slow due to the non-JNI
bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>

When I last benchmarked after that issue was fixed, 86% of CPU samples
were spent inside zlib.so in the perf traces - irrespective of which mode
was used.

The result of those profiles went into making ORC fit into Zlib better and
avoid doing compression work twice - ORC already did its own versions of
dictionary+rle+bit-packing.

<http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>

For instance, bit-packing data with values under 128 into 7 bits and then
compressing it offered less compression (& cost more CPU) than leaving it
at 8 bits without reduction. LZ77 worked much better, and the Huffman pass
compressed the data by bit-packing anyway. The impact was more visible at
higher bit-counts (e.g. 27 bits is way worse than 32 bits).
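To make the 7-bit example concrete, here is a minimal sketch of such a packer (a hypothetical helper for illustration, not ORC's actual writer code). Packing 8 values into 7 bytes saves 12.5% before compression, but the packed stream loses the byte alignment that Deflate's LZ77 matcher relies on, which is the trade-off described above.

```java
// Hypothetical 7-bit packer: each input byte must hold a value < 128.
public class Pack7 {
    public static byte[] pack7(byte[] values) {
        // 7 bits per value, rounded up to whole bytes.
        byte[] out = new byte[(values.length * 7 + 7) / 8];
        int bitPos = 0;
        for (byte v : values) {
            for (int b = 6; b >= 0; b--) { // msb-first, 7 bits per value
                if (((v >> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (1 << (7 - bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] values = {1, 2, 3, 4, 5, 6, 7, 8};
        // 8 values * 7 bits = 56 bits = 7 bytes.
        System.out.println("packed length: " + Pack7.pack7(values).length);
    }
}
```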

And then there was turning off the bits of Zlib not necessary for some
encoding patterns - Z_FILTERED, for instance, for numeric sequences,
Z_TEXT for the string dicts, etc.

Purely from a performance standpoint, I'm getting more interested in Zstd,
because it brings a whole new way of fast bit-packing.

<https://issues.apache.org/jira/browse/ORC-45>


Cheers,
Gopal



Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Aleksei Statkevich <as...@rocketfuel.com>.
It might be a good idea. Though, I'm also wondering about the performance
difference between the two. Since they both use native implementations,
theoretically they can be close in performance. Are there any benchmarks
for them?

*Aleksei Statkevich *| Engineering Manager


On Thu, Jun 23, 2016 at 5:00 PM, Owen O'Malley <om...@apache.org> wrote:

> Actually, that should work. I'm a little concerned about the memory copy
> that the Hadoop ZlibCompressor does, but it should be a win. If you want to
> work on it, why don't you create a jira on the orc project? Don't forget
> that you'll need to handle the other options in CompressionCodec.modify.
>
> .. Owen
>
> On Thu, Jun 23, 2016 at 3:59 PM, Aleksei Statkevich <
> astatkevich@rocketfuel.com> wrote:
>
>> Hi Owen,
>>
>> Thanks for the response. I saw that DirectDecompressor will be used if
>> available and the difference was only in compression.
>> Keeping in mind what you said, I looked at the code again. I see that the
>> only specific piece that ORC uses is "nowrap" = true in Deflater. As far as
>> I understand from the description, it should directly correspond
>> to CompressionHeader.NO_HEADER in ZlibCompressor. In this case,
>> ZlibCompressor with the right setup can be a replacement for Deflater. What
>> do you think?
>>
>> Aleksei
>>
>> *Aleksei Statkevich *| Engineering Manager
>>
>>
>> On Thu, Jun 23, 2016 at 2:35 PM, Owen O'Malley <om...@apache.org>
>> wrote:
>>
>>>
>>>
>>> On Fri, Jun 17, 2016 at 11:31 PM, Aleksei Statkevich <
>>> astatkevich@rocketfuel.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently looked at ORC encoding and noticed
>>>> that hive.ql.io.orc.ZlibCodec uses java's java.util.zip.Deflater and not
>>>> Hadoop's native ZlibCompressor.
>>>>
>>>> Can someone please tell me what is the reason for it?
>>>>
>>>
>>> It is more subtle than that. The first piece to notice is that if your
>>> Hadoop has the direct decompression
>>> (org.apache.hadoop.io.compress.zlib.ZlibDirectDecompressor), it will be
>>> used. The reason that the ZlibCompressor isn't used is that ORC needs a
>>> different API. In particular, ORC doesn't use stream compression, but
>>> rather block compression. That is done so that it can jump over compression
>>> blocks for predicate push down. (If you are skipping over a lot of values,
>>> ORC doesn't need to decompress the bytes.)
>>>
>>> .. Owen
>>>
>>>
>>>
>>>>
>>>> Also, how does performance of Deflater (which also uses native
>>>> implementation) compare to Hadoop's native zlib implementation?
>>>>
>>>> Thanks,
>>>> Aleksei
>>>>
>>>>
>>>
>>
>

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Owen O'Malley <om...@apache.org>.
Actually, that should work. I'm a little concerned about the memory copy
that the Hadoop ZlibCompressor does, but it should be a win. If you want to
work on it, why don't you create a jira on the orc project? Don't forget
that you'll need to handle the other options in CompressionCodec.modify.

.. Owen

On Thu, Jun 23, 2016 at 3:59 PM, Aleksei Statkevich <
astatkevich@rocketfuel.com> wrote:

> Hi Owen,
>
> Thanks for the response. I saw that DirectDecompressor will be used if
> available and the difference was only in compression.
> Keeping in mind what you said, I looked at the code again. I see that the
> only specific piece that ORC uses is "nowrap" = true in Deflater. As far as
> I understand from the description, it should directly correspond
> to CompressionHeader.NO_HEADER in ZlibCompressor. In this case,
> ZlibCompressor with the right setup can be a replacement for Deflater. What
> do you think?
>
> Aleksei
>
> *Aleksei Statkevich *| Engineering Manager
>
>
> On Thu, Jun 23, 2016 at 2:35 PM, Owen O'Malley <om...@apache.org> wrote:
>
>>
>>
>> On Fri, Jun 17, 2016 at 11:31 PM, Aleksei Statkevich <
>> astatkevich@rocketfuel.com> wrote:
>>
>>> Hello,
>>>
>>> I recently looked at ORC encoding and noticed
>>> that hive.ql.io.orc.ZlibCodec uses java's java.util.zip.Deflater and not
>>> Hadoop's native ZlibCompressor.
>>>
>>> Can someone please tell me what is the reason for it?
>>>
>>
>> It is more subtle than that. The first piece to notice is that if your
>> Hadoop has the direct decompression
>> (org.apache.hadoop.io.compress.zlib.ZlibDirectDecompressor), it will be
>> used. The reason that the ZlibCompressor isn't used is that ORC needs a
>> different API. In particular, ORC doesn't use stream compression, but
>> rather block compression. That is done so that it can jump over compression
>> blocks for predicate push down. (If you are skipping over a lot of values,
>> ORC doesn't need to decompress the bytes.)
>>
>> .. Owen
>>
>>
>>
>>>
>>> Also, how does performance of Deflater (which also uses native
>>> implementation) compare to Hadoop's native zlib implementation?
>>>
>>> Thanks,
>>> Aleksei
>>>
>>>
>>
>

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Aleksei Statkevich <as...@rocketfuel.com>.
Hi Owen,

Thanks for the response. I saw that DirectDecompressor will be used if
available and the difference was only in compression.
Keeping in mind what you said, I looked at the code again. I see that the
only specific piece that ORC uses is "nowrap" = true in Deflater. As far as
I understand from the description, it should directly correspond
to CompressionHeader.NO_HEADER in ZlibCompressor. In this case,
ZlibCompressor with the right setup can be a replacement for Deflater. What
do you think?
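For reference, "nowrap" = true makes Deflater emit a raw DEFLATE stream with no zlib header or checksum, which is what would have to line up with NO_HEADER framing on the ZlibCompressor side. A minimal round trip showing the flag's effect (a sketch, assuming nothing about ORC's codec beyond that flag; note the Inflater must also be raw to match):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: raw-deflate round trip with nowrap = true on both ends.
public class NowrapDemo {
    public static byte[] roundTrip(byte[] input) throws Exception {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        def.setInput(input);
        def.finish();
        byte[] compressed = new byte[input.length * 2 + 64];
        int clen = def.deflate(compressed);
        def.end();

        Inflater inf = new Inflater(true); // raw inflate to match
        inf.setInput(compressed, 0, clen);
        byte[] restored = new byte[input.length];
        int rlen = inf.inflate(restored);
        inf.end();
        if (rlen != input.length) throw new IllegalStateException("short read");
        return restored;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "orc block".getBytes("UTF-8");
        System.out.println(java.util.Arrays.equals(data, roundTrip(data)));
    }
}
```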

Aleksei

*Aleksei Statkevich *| Engineering Manager


On Thu, Jun 23, 2016 at 2:35 PM, Owen O'Malley <om...@apache.org> wrote:

>
>
> On Fri, Jun 17, 2016 at 11:31 PM, Aleksei Statkevich <
> astatkevich@rocketfuel.com> wrote:
>
>> Hello,
>>
>> I recently looked at ORC encoding and noticed
>> that hive.ql.io.orc.ZlibCodec uses java's java.util.zip.Deflater and not
>> Hadoop's native ZlibCompressor.
>>
>> Can someone please tell me what is the reason for it?
>>
>
> It is more subtle than that. The first piece to notice is that if your
> Hadoop has the direct decompression
> (org.apache.hadoop.io.compress.zlib.ZlibDirectDecompressor), it will be
> used. The reason that the ZlibCompressor isn't used is that ORC needs a
> different API. In particular, ORC doesn't use stream compression, but
> rather block compression. That is done so that it can jump over compression
> blocks for predicate push down. (If you are skipping over a lot of values,
> ORC doesn't need to decompress the bytes.)
>
> .. Owen
>
>
>
>>
>> Also, how does performance of Deflater (which also uses native
>> implementation) compare to Hadoop's native zlib implementation?
>>
>> Thanks,
>> Aleksei
>>
>>
>

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Owen O'Malley <om...@apache.org>.
On Fri, Jun 17, 2016 at 11:31 PM, Aleksei Statkevich <
astatkevich@rocketfuel.com> wrote:

> Hello,
>
> I recently looked at ORC encoding and noticed
> that hive.ql.io.orc.ZlibCodec uses java's java.util.zip.Deflater and not
> Hadoop's native ZlibCompressor.
>
> Can someone please tell me what is the reason for it?
>

It is more subtle than that. The first piece to notice is that if your
Hadoop has the direct decompression
(org.apache.hadoop.io.compress.zlib.ZlibDirectDecompressor), it will be
used. The reason that the ZlibCompressor isn't used is that ORC needs a
different API. In particular, ORC doesn't use stream compression, but
rather block compression. That is done so that it can jump over compression
blocks for predicate push down. (If you are skipping over a lot of values,
ORC doesn't need to decompress the bytes.)
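A rough sketch of that block-compression pattern: each block is deflated independently and its compressed length recorded, so a reader doing predicate push-down can seek past whole blocks without inflating them. This shows the shape of the idea only, not ORC's actual on-disk format (which also carries per-block headers and an "uncompressed" escape for incompressible blocks).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Sketch: independent per-block deflate so blocks can be skipped on read.
public class BlockCompressSketch {
    public static List<byte[]> compressBlocks(byte[] data, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            d.reset();                 // each block is self-contained
            d.setInput(data, off, len);
            d.finish();
            byte[] out = new byte[len * 2 + 64];
            int n = d.deflate(out);
            blocks.add(java.util.Arrays.copyOf(out, n));
        }
        d.end();
        return blocks;
    }

    public static void main(String[] args) {
        // 10 bytes in 4-byte blocks -> 3 independent compressed blocks.
        System.out.println(compressBlocks(new byte[10], 4).size());
    }
}
```

A reader that stores each block's compressed length alongside it can skip a block with a single seek, which a continuous deflate stream cannot offer.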

.. Owen



>
> Also, how does performance of Deflater (which also uses native
> implementation) compare to Hadoop's native zlib implementation?
>
> Thanks,
> Aleksei
>
>

Re: Why does ORC use Deflater instead of native ZlibCompressor?

Posted by Aleksei Statkevich <as...@rocketfuel.com>.
Does anyone know?

*Aleksei Statkevich *| Engineering Manager


On Fri, Jun 17, 2016 at 10:31 PM, Aleksei Statkevich <
astatkevich@rocketfuel.com> wrote:

> Hello,
>
> I recently looked at ORC encoding and noticed
> that hive.ql.io.orc.ZlibCodec uses java's java.util.zip.Deflater and not
> Hadoop's native ZlibCompressor.
>
> Can someone please tell me what is the reason for it?
>
> Also, how does performance of Deflater (which also uses native
> implementation) compare to Hadoop's native zlib implementation?
>
> Thanks,
> Aleksei
>
>