Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/05/09 19:41:04 UTC

[jira] [Commented] (BEAM-1494) GcsFileSystem should check content encoding when setting IsReadSeekEfficient

    [ https://issues.apache.org/jira/browse/BEAM-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003388#comment-16003388 ] 

ASF GitHub Bot commented on BEAM-1494:
--------------------------------------

GitHub user dhalperi opened a pull request:

    https://github.com/apache/beam/pull/2998

    [BEAM-1494] Correctly handle content-encoding in GcsFileSystem, fixing reading of such files in CompressedSource

    R: @jkff  thoughts?
    CC: @chamikaramj

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhalperi/beam b1494-gcs-content-encoding

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/2998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2998
    
----
commit 7ef0f8afc88b292724228fb3507e6d0c77c0b1aa
Author: Dan Halperin <dh...@google.com>
Date:   2017-05-09T19:34:04Z

    FileBasedSource: isSplittable should not throw
    
    Throwing from isSplittable is a legacy design from Dataflow 1.x that was a
    poor choice. All the information needed to know whether a source is
    splittable should be known at source construction time, and if runtime
    behavior is needed it should result in a conservative choice, i.e., false.

commit 59e8e0ec27dfc498dacaaf425548681ed07a2d31
Author: Dan Halperin <dh...@google.com>
Date:   2017-05-09T19:36:10Z

    CompressedSource: only use delegate reader if the file is splittable
    
    Otherwise, the file is likely compressed.

commit b71f5dfed5b8e56dd01cca5a71e2fa72233ab363
Author: Dan Halperin <dh...@google.com>
Date:   2017-05-09T19:36:53Z

    GcsFileSystem: mark content-encoded files as not seekable
    
    This reflects reality (such files are actually compressed) and yields
    correct data when reading them in, e.g., TextIO.

----
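
The intent of the first two commits above can be illustrated with a minimal Java sketch. This is not the code from the pull request: `SplittableSketch`, `ReaderChoice`, `seekProbe`, and both method signatures are hypothetical stand-ins, not Beam's FileBasedSource or CompressedSource API. It assumes only what the commit messages state: a splittability check should answer false rather than throw, and the delegate reader should be used only when the file is splittable.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical stand-in; not Beam's FileBasedSource or CompressedSource. */
final class SplittableSketch {

  /** Which reader a compressed-aware source would pick (names are illustrative). */
  enum ReaderChoice { DELEGATE_READER, DECOMPRESSING_READER }

  /**
   * Never throws. If the runtime probe fails, fall back to the
   * conservative answer: not splittable.
   */
  static boolean isSplittable(BooleanSupplier seekProbe) {
    try {
      return seekProbe.getAsBoolean();
    } catch (RuntimeException e) {
      return false; // conservative choice when runtime information is unavailable
    }
  }

  /**
   * Use the delegate reader only for splittable files. A file that is not
   * splittable is treated as likely compressed and read sequentially.
   */
  static ReaderChoice chooseReader(boolean splittable) {
    return splittable ? ReaderChoice.DELEGATE_READER : ReaderChoice.DECOMPRESSING_READER;
  }
}
```

With this shape, a failing probe silently degrades to a single sequential read instead of failing the pipeline at split time.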


> GcsFileSystem should check content encoding when setting IsReadSeekEfficient
> ----------------------------------------------------------------------------
>
>                 Key: BEAM-1494
>                 URL: https://issues.apache.org/jira/browse/BEAM-1494
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Pei He
>            Assignee: Daniel Halperin
>
> It is incorrect to set IsReadSeekEfficient true for files with content encoding set to gzip. This is an inherited issue from GcsIOChannelFactory.
> https://cloud.google.com/storage/docs/transcoding#content-type_vs_content-encoding
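
A minimal sketch of the check this issue asks for, in Java. `ContentEncodingSketch` and its method are hypothetical and are not the GcsFileSystem code. The sketch treats any non-empty content encoding as ruling out efficient seeking, which is an assumption slightly broader than the gzip case named in the description.

```java
/** Hypothetical helper; not part of Beam's GcsFileSystem. */
final class ContentEncodingSketch {

  /**
   * Returns true only when the object carries no content encoding.
   * An object stored with a content encoding such as gzip is actually
   * compressed, so byte offsets into it are not meaningful for seeking.
   */
  static boolean isReadSeekEfficient(String contentEncoding) {
    // Assumption: a null or blank value means the object has no content encoding.
    return contentEncoding == null || contentEncoding.trim().isEmpty();
  }
}
```

For example, `isReadSeekEfficient("gzip")` is false while `isReadSeekEfficient(null)` is true, so a plain text object stays splittable and a gzip-encoded one is read sequentially.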



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)