You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by mtanski <gi...@git.apache.org> on 2016/05/18 04:09:25 UTC

[GitHub] flink pull request: Support for bz2 compression in flink-core

GitHub user mtanski opened a pull request:

    https://github.com/apache/flink/pull/2002

    Support for bz2 compression in flink-core

    Add support for bz2 compression to flink. Right now this requires using Hadoop InputFormats.
    
    Doesn't require any extra dependencies as flink-core already uses the Apache common-io package. Doesn't support splitting. the current compression support in Flink would need to be reworked to support that.
    
    It's possible that Flink should use use apache common-io for compression. This way it would get support support for snappy, bz2, xz, lzma in addition to gz & deflate without pulling in extra dependencies.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adfin/flink bz2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2002.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2002
    
----
commit 297aea88f7b5f832898b380e214f9d4080594092
Author: Milosz Tanski <mt...@gmail.com>
Date:   2016-05-18T04:04:36Z

    Support for bz2 compression in flink-core.
    
    Add support for bz2 compression to flink. Right now this requires using Hadoop
    InputFormats.
    
    Doesn't require any extra dependencies as flink-core already uses the Apache
    common-io package.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    Added support for XZ as well and update the documentation.
    
    Newer versions of commons compression (1.12) add support for snappy and lzma. But the version included as a (transitive) dependency of flink-core is stuck at 1.4. The commons-compression is pulled in via avro.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2002: Support for bz2 compression in flink-core

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/2002


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r63744808
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/Bzip2InputStreamFactory.java ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.api.common.io.compression;
    +
    +import org.apache.flink.annotation.Internal;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.util.Collection;
    +import java.util.Collections;
    +
    +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    +
    +@Internal
    +public class Bzip2InputStreamFactory implements InflaterInputStreamFactory<BZip2CompressorInputStream> {
    +
    +	private static Bzip2InputStreamFactory INSTANCE = null;
    --- End diff --
    
    I think this can be eagerly initialized, given that we can expect an initialization to always happen anyways when the class is used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    @StephanEwen would you like me to use the same version as the one being current include (transitive) or bump it to a recent version that include Snappy?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    Now that 1.1 one is out, is it possible to get this in?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    I was assuming that you wanted to add more codecs, given the discussion above.
    Pending an update of the documentation (as per Greg's pointer),  this can be merged in my opinion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/2002#issuecomment-220271991
  
    If this does not add any additional dependencies, I think it would be a nice addition to add some more codecs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2002: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r76288290
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/InflaterInputStreamFactory.java ---
    @@ -23,13 +23,12 @@
     import java.io.IOException;
     import java.io.InputStream;
     import java.util.Collection;
    -import java.util.zip.InflaterInputStream;
     
     /**
      * Creates a new instance of a certain subclass of {@link java.util.zip.InflaterInputStream}.
      */
     @Internal
    -public interface InflaterInputStreamFactory<T extends InflaterInputStream> {
    +public interface InflaterInputStreamFactory<T extends InputStream> {
    --- End diff --
    
    I think Greg raised a good point, the classname and comments should be adjusted as well. How about `DecompressingStreamFactory`, or `DecoratorStreamFactory` (if you want to keep it generic)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    I think this looks good. +1 to merge.
    
    As a followup, we can upgrade the `commons-compression` version via dependency management (if it is backwards compatible, which apache commons libs usually are).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r76121574
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/InflaterInputStreamFactory.java ---
    @@ -23,13 +23,12 @@
     import java.io.IOException;
     import java.io.InputStream;
     import java.util.Collection;
    -import java.util.zip.InflaterInputStream;
     
     /**
      * Creates a new instance of a certain subclass of {@link java.util.zip.InflaterInputStream}.
      */
     @Internal
    -public interface InflaterInputStreamFactory<T extends InflaterInputStream> {
    +public interface InflaterInputStreamFactory<T extends InputStream> {
    --- End diff --
    
    That's true, although calling it InputStreamFactory also does not seam right.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r63846803
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/Bzip2InputStreamFactory.java ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.api.common.io.compression;
    +
    +import org.apache.flink.annotation.Internal;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.util.Collection;
    +import java.util.Collections;
    +
    +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    +
    +@Internal
    +public class Bzip2InputStreamFactory implements InflaterInputStreamFactory<BZip2CompressorInputStream> {
    +
    +	private static Bzip2InputStreamFactory INSTANCE = null;
    --- End diff --
    
    Not strictly needed, but nice, since it seems cleaner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r81589240
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/InflaterInputStreamFactory.java ---
    @@ -23,13 +23,12 @@
     import java.io.IOException;
     import java.io.InputStream;
     import java.util.Collection;
    -import java.util.zip.InflaterInputStream;
     
     /**
      * Creates a new instance of a certain subclass of {@link java.util.zip.InflaterInputStream}.
      */
     @Internal
    -public interface InflaterInputStreamFactory<T extends InflaterInputStream> {
    +public interface InflaterInputStreamFactory<T extends InputStream> {
    --- End diff --
    
    This causes a build error due to binary backwards compatibility.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    This looks very handy. We should also update the formats table in the documentation.
    
    https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#read-compressed-files


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/2002#issuecomment-220097588
  
    Looks good to me, with one minor comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r63755827
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/Bzip2InputStreamFactory.java ---
    @@ -0,0 +1,50 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.flink.api.common.io.compression;
    +
    +import org.apache.flink.annotation.Internal;
    +
    +import java.io.IOException;
    +import java.io.InputStream;
    +import java.util.Collection;
    +import java.util.Collections;
    +
    +import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;
    +
    +@Internal
    +public class Bzip2InputStreamFactory implements InflaterInputStreamFactory<BZip2CompressorInputStream> {
    +
    +	private static Bzip2InputStreamFactory INSTANCE = null;
    --- End diff --
    
    I was following the (Deflate|Gzip)DeflateInflaterInputStreamFactory examples. I can change it if it's needed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    Sorry for the delay. This looks good to me, let's merge it and add further changes separately.
    
    Will merge this...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request #2002: Support for bz2 compression in flink-core

Posted by greghogan <gi...@git.apache.org>.
Github user greghogan commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2002#discussion_r76065257
  
    --- Diff: flink-core/src/main/java/org/apache/flink/api/common/io/compression/InflaterInputStreamFactory.java ---
    @@ -23,13 +23,12 @@
     import java.io.IOException;
     import java.io.InputStream;
     import java.util.Collection;
    -import java.util.zip.InflaterInputStream;
     
     /**
      * Creates a new instance of a certain subclass of {@link java.util.zip.InflaterInputStream}.
      */
     @Internal
    -public interface InflaterInputStreamFactory<T extends InflaterInputStream> {
    +public interface InflaterInputStreamFactory<T extends InputStream> {
    --- End diff --
    
    The interface changed but the documentation of and names within this class have not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    I tried changing the name from to InflaterInputStreamFactory -> DecompressingStreamFactory
    
    But now it will not build because of:
    [ERROR] Failed to execute goal com.github.siom79.japicmp:japicmp-maven-plugin:0.7.0:cmp (default) on project flink-core: Breaking the build because there is at least one binary incompatible class: org.apache.flink.api.common.io.BinaryInputFormat -> [Help 1]
    
    This is because this change causes prototype changes in a bunch of places flink.api.common.io.FileInputFormat
    
    The next change will have bz2, xz, docs & maven dep for commons compression. I'll also rebase it against master. Can we get it up after that.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    This has been sitting around for a month. Any chance to merge or clear todo in order to get this in shape.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #2002: Support for bz2 compression in flink-core

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/2002
  
    Lets start with the current version - we know that one works.
    
    Dependency upgrades tend to be more sensitive than they often appear, so I would prefer to have a separate issue for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: Support for bz2 compression in flink-core

Posted by mtanski <gi...@git.apache.org>.
Github user mtanski commented on the pull request:

    https://github.com/apache/flink/pull/2002#issuecomment-220115838
  
    My big question is: should I implement the other compressors from apache commons-io (snappy, xz). And if so should flink apache commons-io for gzip deflate as well?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---