Posted to dev@impala.apache.org by 孙清孟 <sq...@gmail.com> on 2017/06/11 00:41:05 UTC

how add lzma compression to Parquet in Impala

I have added an lzma codec (hadoop-xz) to Parquet for Hive (by modifying
parquet-format and parquet-mr), and it gives a higher compression ratio.

But how do I add a new codec to Impala?
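
For readers who want a quick feel for the compression-ratio claim above, here is a minimal sketch using Python's standard-library lzma and gzip modules. It is not the hadoop-xz or parquet-mr code path; the sample data and the helper name compressed_sizes are made up for illustration, and the actual ratio depends entirely on the data being compressed.

```python
import gzip
import lzma

# Synthetic, repetitive sample rows (an assumption for illustration only;
# real Parquet column pages will compress differently).
SAMPLE = b"".join(b"row-%06d,impala,parquet,lzma\n" % i for i in range(20000))


def compressed_sizes(data: bytes) -> dict:
    """Return the raw, gzip, and lzma (.xz container) sizes of `data` in bytes."""
    gz = gzip.compress(data)
    xz = lzma.compress(data)  # default format is the .xz container
    # Both codecs must round-trip losslessly before the sizes mean anything.
    assert gzip.decompress(gz) == data
    assert lzma.decompress(xz) == data
    return {"raw": len(data), "gzip": len(gz), "lzma": len(xz)}


for name, size in compressed_sizes(SAMPLE).items():
    print(f"{name}: {size} bytes")
```

Running it on a sample of your own table data, rather than the synthetic rows, gives a more honest comparison.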

Re: how add lzma compression to Parquet in Impala

Posted by Jim Apple <jb...@cloudera.com>.
If I am reading that discussion correctly, then there is public-domain
lzma code that can help you do what you would like to do. Thanks for
looking into this!

On Mon, Jun 12, 2017 at 4:22 PM, 孙清孟 <sq...@gmail.com> wrote:
> Hi Jim and Tim:
> Thanks for your reply.
> I know APL and GPL, here is some discusses about Hadoop supports for lzma:
> https://issues.apache.org/jira/browse/HADOOP-6837
> <https://issues.apache.org/jira/browse/HADOOP-6837.>
>
> 2017-06-12 23:40 GMT+08:00 Jim Apple <jb...@cloudera.com>:
>
>> Because Impala is part of the ASF, it cannot contain any GPL code.
>>
>> https://www.apache.org/legal/resolved.html
>>
>> "However, if the component is only needed for optional features, a
>> project can provide the user with instructions on how to obtain and
>> install the non-included work. Optional means that the component is
>> not required for standard use of the product or for the product to
>> achieve a desirable level of quality. The question to ask yourself in
>> this situation is: 'Will the majority of users want to use my product
>> without adding the optional components?'"
>>
>> As I understand it, this is the rule by which Impala can use
>> https://github.com/cloudera/impala-lzo
>>
>> On Mon, Jun 12, 2017 at 8:30 AM, Tim Armstrong <ta...@cloudera.com>
>> wrote:
>> > You would need to add a new codec to the Impala source tree. The codecs
>> are
>> > implemented in be/src/util/codec.h,  be/src/util/compress.h  and
>> > be/src/util/decompress.h. There are a few other places you may need to
>> > change. I would just "git grep -i gzip" to see how the gzip codec is
>> > implemented.
>> >
>> > For compressed text files you would also need to add support to the
>> > frontend, e.g. in
>> > fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
>> >
>> > I'm also not sure if there are any licensing issues here since the XZ
>> > library is GPL licensed.
>> >
>> > On Sat, Jun 10, 2017 at 5:41 PM, 孙清孟 <sq...@gmail.com> wrote:
>> >
>> >> I have added lzma codec (hadoop-xz) to parquet(modify the parquet-format
>> >> and parquet-mr)  for hive, and get a higher compression ratio.
>> >>
>> >> But how add a new codec for Impala?
>> >>
>>

Re: how add lzma compression to Parquet in Impala

Posted by 孙清孟 <sq...@gmail.com>.
Hi Jim and Tim:
Thanks for your reply.
I know about the APL and GPL. Here is some discussion about Hadoop support for lzma:
https://issues.apache.org/jira/browse/HADOOP-6837

2017-06-12 23:40 GMT+08:00 Jim Apple <jb...@cloudera.com>:

> Because Impala is part of the ASF, it cannot contain any GPL code.
>
> https://www.apache.org/legal/resolved.html
>
> "However, if the component is only needed for optional features, a
> project can provide the user with instructions on how to obtain and
> install the non-included work. Optional means that the component is
> not required for standard use of the product or for the product to
> achieve a desirable level of quality. The question to ask yourself in
> this situation is: 'Will the majority of users want to use my product
> without adding the optional components?'"
>
> As I understand it, this is the rule by which Impala can use
> https://github.com/cloudera/impala-lzo
>
> On Mon, Jun 12, 2017 at 8:30 AM, Tim Armstrong <ta...@cloudera.com>
> wrote:
> > You would need to add a new codec to the Impala source tree. The codecs
> are
> > implemented in be/src/util/codec.h,  be/src/util/compress.h  and
> > be/src/util/decompress.h. There are a few other places you may need to
> > change. I would just "git grep -i gzip" to see how the gzip codec is
> > implemented.
> >
> > For compressed text files you would also need to add support to the
> > frontend, e.g. in
> > fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
> >
> > I'm also not sure if there are any licensing issues here since the XZ
> > library is GPL licensed.
> >
> > On Sat, Jun 10, 2017 at 5:41 PM, 孙清孟 <sq...@gmail.com> wrote:
> >
> >> I have added lzma codec (hadoop-xz) to parquet(modify the parquet-format
> >> and parquet-mr)  for hive, and get a higher compression ratio.
> >>
> >> But how add a new codec for Impala?
> >>
>

Re: how add lzma compression to Parquet in Impala

Posted by Jim Apple <jb...@cloudera.com>.
Because Impala is part of the ASF, it cannot contain any GPL code.

https://www.apache.org/legal/resolved.html

"However, if the component is only needed for optional features, a
project can provide the user with instructions on how to obtain and
install the non-included work. Optional means that the component is
not required for standard use of the product or for the product to
achieve a desirable level of quality. The question to ask yourself in
this situation is: 'Will the majority of users want to use my product
without adding the optional components?'"

As I understand it, this is the rule by which Impala can use
https://github.com/cloudera/impala-lzo

On Mon, Jun 12, 2017 at 8:30 AM, Tim Armstrong <ta...@cloudera.com> wrote:
> You would need to add a new codec to the Impala source tree. The codecs are
> implemented in be/src/util/codec.h,  be/src/util/compress.h  and
> be/src/util/decompress.h. There are a few other places you may need to
> change. I would just "git grep -i gzip" to see how the gzip codec is
> implemented.
>
> For compressed text files you would also need to add support to the
> frontend, e.g. in
> fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
>
> I'm also not sure if there are any licensing issues here since the XZ
> library is GPL licensed.
>
> On Sat, Jun 10, 2017 at 5:41 PM, 孙清孟 <sq...@gmail.com> wrote:
>
>> I have added lzma codec (hadoop-xz) to parquet(modify the parquet-format
>> and parquet-mr)  for hive, and get a higher compression ratio.
>>
>> But how add a new codec for Impala?
>>

Re: how add lzma compression to Parquet in Impala

Posted by Tim Armstrong <ta...@cloudera.com>.
You would need to add a new codec to the Impala source tree. The codecs are
implemented in be/src/util/codec.h, be/src/util/compress.h, and
be/src/util/decompress.h. There are a few other places you may need to
change. I would just run "git grep -i gzip" to see how the gzip codec is
implemented.
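
The general shape of that change (find where gzip is wired in, then add a parallel entry for the new codec) can be sketched as below. This is only an illustration in Python: Impala's real codecs are C++ classes in the files named above, and the Codec tuple, CODECS table, register_codec, and round_trip are hypothetical names, not Impala APIs.

```python
import gzip
import lzma
from typing import Callable, Dict, NamedTuple


class Codec(NamedTuple):
    """A compress/decompress pair (hypothetical, not Impala's Codec class)."""
    compress: Callable[[bytes], bytes]
    decompress: Callable[[bytes], bytes]


# Hypothetical registry keyed by codec name, starting with the existing codec.
CODECS: Dict[str, Codec] = {
    "gzip": Codec(gzip.compress, gzip.decompress),
}


def register_codec(name: str, codec: Codec) -> None:
    """Add a codec under `name`, refusing to overwrite an existing entry."""
    if name in CODECS:
        raise ValueError(f"codec already registered: {name}")
    CODECS[name] = codec


# The new codec is added next to gzip, mirroring how gzip is registered.
register_codec("lzma", Codec(lzma.compress, lzma.decompress))


def round_trip(name: str, payload: bytes) -> bytes:
    """Compress and then decompress `payload` with the named codec."""
    codec = CODECS[name]
    return codec.decompress(codec.compress(payload))
```

In the real source tree the equivalent of each step here is spread over several files, which is why grepping for gzip is a practical way to find them all.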

For compressed text files you would also need to add support to the
frontend, e.g. in
fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java

I'm also not sure if there are any licensing issues here since the XZ
library is GPL licensed.

On Sat, Jun 10, 2017 at 5:41 PM, 孙清孟 <sq...@gmail.com> wrote:

> I have added lzma codec (hadoop-xz) to parquet(modify the parquet-format
> and parquet-mr)  for hive, and get a higher compression ratio.
>
> But how add a new codec for Impala?
>