You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Luke Cwik <lc...@google.com> on 2020/01/06 19:06:21 UTC

Re: Request for review of PR [Beam-8564]

Have you had a chance to update the PR?

On Mon, Dec 30, 2019 at 5:00 AM Amogh Tiwari <am...@google.com> wrote:

> Hi Luke,
>
> We have gone through shevek/lzo-java, but we chose to go with
> airflow/aircompressor for the following reasons:
>
> 1) shevek/lzo-java is internally using .jni, .c and .h files, hence the
> GNU licence, and that would leave us with only choice of putting this as an
> option dependency
>
> 2) performance of airlift/aircompressor was much better as compared to
> shevek/lzo-java in terms of compression ratio and time taken for
> compression/decompression
>
> 3) airflow/aircompressor is in pure java and is under Apache licence
>
> Therefore, we'd prefer to go with adding the current implementation as
> optional. We'd require your inputs on the same as we are unsure on where we
> are supposed to keep the required files and how the final directory
> structure would like. We have an idea and we'll update the current PR
> accordingly.
>
> Please do guide us on this.
>
>
> Regards,
>
> Amogh Tiwari
>
> On Wed, Dec 18, 2019 at 4:42 AM Luke Cwik <lc...@google.com> wrote:
>
>> Sorry for the long delay (was on vacation).
>>
>> Using org.apache.hadoop isn't part of the Apache Beam Core module but is
>> a dependency for those who depend on the Apache Beam Hadoop module. So I
>> don't think swapping the com.facebook.presto.hadoop version for the
>> org.apache.hadoop version will address Ismael's concern about including
>> hadoop dependencies as part of the core.
>>
>> I looked at shevek/lzo-java[1] and I think its our best choice since it
>> is:
>> * pure Java
>> * GPLv3 (would require marking the dependency optional and telling users
>> to add it explicitly which we have done in the past as well)
>> * small (<100kb)
>> * small dependency tree (commons-logging & findbugs annotations if we
>> only depend on lzo-core)
>> * performance (github page claims 500mb/s compression and 800mb/s
>> decompression)
>>
>> Alternatively we can make the LZO compression an extension module (with
>> the facebook dependency) and use a registrar to have it loaded dynamically.
>>
>> 1: https://github.com/shevek/lzo-java
>>
>> On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <am...@google.com> wrote:
>>
>>> While studying the code, we found that the airlift/ aircompressor
>>> library only requires some classes which are also present in apache hadoop
>>> common package. Therefore, we are now thinking that if we make changes in
>>> the airlift/ aircompressor package, remove the
>>> com.facebook.presto.hadoop and use the existing org.apache.hadoop
>>> <https://mvnrepository.com/artifact/org.apache.hadoop> package which is
>>> already included in beam. This will solve both #2 and #3 as the transitive
>>> dependency will be removed and the size will also be reduced by almost
>>> ~20mbs.
>>>
>>> But if we use this approach, we will have to manually change the util
>>> whenever any changes are made to the airlift library.
>>>
>>> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> Going with the Registrar/ServiceLoader route would allow for
>>>> alternative providers for the same compression algorithms so if they don't
>>>> like one they can always contribute a different one.
>>>>
>>>> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>
>>>>> (1) seems not to be the issue because it is Apache licensed.
>>>>> (2) and (3) are the big issues, because it requires a provided huge
>>>>> uber jar that essentially leaks Hadoop classes into core SDK [1] so it is
>>>>> definitely concerning.
>>>>>
>>>>> We discussed at some point during the PR that added ZStandard support
>>>>> about creating some sort of Registrar for compression algorithms [2] but we
>>>>> decided to not go ahead because we could achieve that for the zstd case via
>>>>> the optional dependencies of commons-compress. Maybe it is time to
>>>>> reconsider if such mechanism is worth. For example for users that may not
>>>>> care about having the hadoop leakage to be able to use LZO.
>>>>>
>>>>> Refs.
>>>>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
>>>>> [2] https://issues.apache.org/jira/browse/BEAM-6422
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <ro...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Is there a way to wrap this up as an optional dependency with multiple
>>>>>> possible providers, if there's no good library satisfying all of the
>>>>>> conditions (in particular (1))?
>>>>>>
>>>>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>>>>>> >
>>>>>> > I was hoping that someone in the community would provide some
>>>>>> alternatives since there are quite a few implementations.
>>>>>> >
>>>>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <am...@google.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Hi Luke,
>>>>>> >>
>>>>>> >> I agree with your thoughts and observations. But,
>>>>>> airlift:aircompressor is the only implementation of LZO in pure java. That
>>>>>> straight away solves #5.
>>>>>> >> The other implementations that I found either have licensing
>>>>>> issues (since LZO natively uses GNU GPL licence) or are implemented using
>>>>>> .c, .h and jni (which again make them dependent on the OS). Please refer
>>>>>> these: twitter/hadoop-lzo and shevek/lzo-java.
>>>>>> >> These were the main reasons why we based this on
>>>>>> airlift:aircompressor.
>>>>>> >>
>>>>>> >> Thanks and Regards,
>>>>>> >> Amogh
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <lc...@google.com> wrote:
>>>>>> >>>
>>>>>> >>> I took a look. My biggest concern is finding a good LZO
>>>>>> implementation. Looking for one that preferably has:
>>>>>> >>> 1) Apache license
>>>>>> >>> 2) Has zero transitive dependencies
>>>>>> >>> 3) Is small
>>>>>> >>> 4) Is performant
>>>>>> >>> 5) Is native java or supports execution on the three main OSs
>>>>>> (Windows, Linux, Mac)
>>>>>> >>>
>>>>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16
>>>>>> which doesn't meet item #2 and its transitive dependency fails #3.
>>>>>> >>>
>>>>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <am...@google.com>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi,
>>>>>> >>>> I have filed a PR for an extension that will enable Apache Beam
>>>>>> to work with LZO/LZOP compression. Please refer.
>>>>>> >>>> I would love it if someone can take this up and review it.
>>>>>> >>>> Please feel free to share your thoughts/suggestions.
>>>>>> >>>> Regards,
>>>>>> >>>> Amogh
>>>>>>
>>>>>

Re: Request for review of PR [Beam-8564]

Posted by Amogh Tiwari <am...@google.com>.
Apologies for a late reply. The PR has been updated now. Would love to get
your thoughts/suggestions.

On Tue, Jan 7, 2020 at 12:36 AM Luke Cwik <lc...@google.com> wrote:

> Have you had a chance to update the PR?
>
> On Mon, Dec 30, 2019 at 5:00 AM Amogh Tiwari <am...@google.com> wrote:
>
>> Hi Luke,
>>
>> We have gone through shevek/lzo-java, but we chose to go with
>> airflow/aircompressor for the following reasons:
>>
>> 1) shevek/lzo-java is internally using .jni, .c and .h files, hence the
>> GNU licence, and that would leave us with only choice of putting this as an
>> option dependency
>>
>> 2) performance of airlift/aircompressor was much better as compared to
>> shevek/lzo-java in terms of compression ratio and time taken for
>> compression/decompression
>>
>> 3) airflow/aircompressor is in pure java and is under Apache licence
>>
>> Therefore, we'd prefer to go with adding the current implementation as
>> optional. We'd require your inputs on the same as we are unsure on where we
>> are supposed to keep the required files and how the final directory
>> structure would like. We have an idea and we'll update the current PR
>> accordingly.
>>
>> Please do guide us on this.
>>
>>
>> Regards,
>>
>> Amogh Tiwari
>>
>> On Wed, Dec 18, 2019 at 4:42 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> Sorry for the long delay (was on vacation).
>>>
>>> Using org.apache.hadoop isn't part of the Apache Beam Core module but is
>>> a dependency for those who depend on the Apache Beam Hadoop module. So I
>>> don't think swapping the com.facebook.presto.hadoop version for the
>>> org.apache.hadoop version will address Ismael's concern about including
>>> hadoop dependencies as part of the core.
>>>
>>> I looked at shevek/lzo-java[1] and I think its our best choice since it
>>> is:
>>> * pure Java
>>> * GPLv3 (would require marking the dependency optional and telling users
>>> to add it explicitly which we have done in the past as well)
>>> * small (<100kb)
>>> * small dependency tree (commons-logging & findbugs annotations if we
>>> only depend on lzo-core)
>>> * performance (github page claims 500mb/s compression and 800mb/s
>>> decompression)
>>>
>>> Alternatively we can make the LZO compression an extension module (with
>>> the facebook dependency) and use a registrar to have it loaded dynamically.
>>>
>>> 1: https://github.com/shevek/lzo-java
>>>
>>> On Fri, Dec 6, 2019 at 5:09 AM Amogh Tiwari <am...@google.com> wrote:
>>>
>>>> While studying the code, we found that the airlift/ aircompressor
>>>> library only requires some classes which are also present in apache hadoop
>>>> common package. Therefore, we are now thinking that if we make changes in
>>>> the airlift/ aircompressor package, remove the
>>>> com.facebook.presto.hadoop and use the existing org.apache.hadoop
>>>> <https://mvnrepository.com/artifact/org.apache.hadoop> package which
>>>> is already included in beam. This will solve both #2 and #3 as the
>>>> transitive dependency will be removed and the size will also be reduced by
>>>> almost ~20mbs.
>>>>
>>>> But if we use this approach, we will have to manually change the util
>>>> whenever any changes are made to the airlift library.
>>>>
>>>> On Wed, Dec 4, 2019 at 10:13 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>>> Going with the Registrar/ServiceLoader route would allow for
>>>>> alternative providers for the same compression algorithms so if they don't
>>>>> like one they can always contribute a different one.
>>>>>
>>>>> On Wed, Dec 4, 2019 at 8:22 AM Ismaël Mejía <ie...@gmail.com> wrote:
>>>>>
>>>>>> (1) seems not to be the issue because it is Apache licensed.
>>>>>> (2) and (3) are the big issues, because it requires a provided huge
>>>>>> uber jar that essentially leaks Hadoop classes into core SDK [1] so it is
>>>>>> definitely concerning.
>>>>>>
>>>>>> We discussed at some point during the PR that added ZStandard support
>>>>>> about creating some sort of Registrar for compression algorithms [2] but we
>>>>>> decided to not go ahead because we could achieve that for the zstd case via
>>>>>> the optional dependencies of commons-compress. Maybe it is time to
>>>>>> reconsider if such mechanism is worth. For example for users that may not
>>>>>> care about having the hadoop leakage to be able to use LZO.
>>>>>>
>>>>>> Refs.
>>>>>> [1] https://mvnrepository.com/artifact/io.airlift/aircompressor/0.16
>>>>>> [2] https://issues.apache.org/jira/browse/BEAM-6422
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Dec 3, 2019 at 7:01 PM Robert Bradshaw <ro...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is there a way to wrap this up as an optional dependency with
>>>>>>> multiple
>>>>>>> possible providers, if there's no good library satisfying all of the
>>>>>>> conditions (in particular (1))?
>>>>>>>
>>>>>>> On Tue, Dec 3, 2019 at 9:47 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>> >
>>>>>>> > I was hoping that someone in the community would provide some
>>>>>>> alternatives since there are quite a few implementations.
>>>>>>> >
>>>>>>> > On Tue, Dec 3, 2019 at 8:20 AM Amogh Tiwari <am...@google.com>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> Hi Luke,
>>>>>>> >>
>>>>>>> >> I agree with your thoughts and observations. But,
>>>>>>> airlift:aircompressor is the only implementation of LZO in pure java. That
>>>>>>> straight away solves #5.
>>>>>>> >> The other implementations that I found either have licensing
>>>>>>> issues (since LZO natively uses GNU GPL licence) or are implemented using
>>>>>>> .c, .h and jni (which again make them dependent on the OS). Please refer
>>>>>>> these: twitter/hadoop-lzo and shevek/lzo-java.
>>>>>>> >> These were the main reasons why we based this on
>>>>>>> airlift:aircompressor.
>>>>>>> >>
>>>>>>> >> Thanks and Regards,
>>>>>>> >> Amogh
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Tue, Dec 3, 2019 at 2:59 AM Luke Cwik <lc...@google.com>
>>>>>>> wrote:
>>>>>>> >>>
>>>>>>> >>> I took a look. My biggest concern is finding a good LZO
>>>>>>> implementation. Looking for one that preferably has:
>>>>>>> >>> 1) Apache license
>>>>>>> >>> 2) Has zero transitive dependencies
>>>>>>> >>> 3) Is small
>>>>>>> >>> 4) Is performant
>>>>>>> >>> 5) Is native java or supports execution on the three main OSs
>>>>>>> (Windows, Linux, Mac)
>>>>>>> >>>
>>>>>>> >>> In your PR you suggested using io.airlift:aircompressor:0.16
>>>>>>> which doesn't meet item #2 and its transitive dependency fails #3.
>>>>>>> >>>
>>>>>>> >>> On Mon, Dec 2, 2019 at 12:16 PM Amogh Tiwari <am...@google.com>
>>>>>>> wrote:
>>>>>>> >>>>
>>>>>>> >>>> Hi,
>>>>>>> >>>> I have filed a PR for an extension that will enable Apache Beam
>>>>>>> to work with LZO/LZOP compression. Please refer.
>>>>>>> >>>> I would love it if someone can take this up and review it.
>>>>>>> >>>> Please feel free to share your thoughts/suggestions.
>>>>>>> >>>> Regards,
>>>>>>> >>>> Amogh
>>>>>>>
>>>>>>