Posted to commits@beam.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/05/09 19:41:04 UTC
[jira] [Commented] (BEAM-1494) GcsFileSystem should check content
encoding when setting IsReadSeekEfficient
[ https://issues.apache.org/jira/browse/BEAM-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003388#comment-16003388 ]
ASF GitHub Bot commented on BEAM-1494:
--------------------------------------
GitHub user dhalperi opened a pull request:
https://github.com/apache/beam/pull/2998
[BEAM-1494] Correctly handle content-encoding in GcsFileSystem, fixing reading of such files in CompressedSource
R: @jkff thoughts?
CC: @chamikaramj
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhalperi/beam b1494-gcs-content-encoding
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/2998.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2998
----
commit 7ef0f8afc88b292724228fb3507e6d0c77c0b1aa
Author: Dan Halperin <dh...@google.com>
Date: 2017-05-09T19:34:04Z
FileBasedSource: isSplittable should not throw
Throwing from isSplittable is a legacy design from Dataflow 1.x that was a poor choice.
All the information needed to know whether a source is splittable should
be known at source construction time, and if runtime behavior is needed
it should result in a conservative choice, i.e., returning false.
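For illustration only, the "do not throw, answer conservatively" rule in this commit message could be sketched as below. This is not the code from the pull request: the class ConservativeSplittable and the SeekProbe interface are hypothetical stand-ins for whatever runtime lookup a source performs, and the real FileBasedSource API is not reproduced here.

```java
import java.io.IOException;

public class ConservativeSplittable {
    /** Hypothetical runtime probe; stands in for any metadata lookup that can fail. */
    interface SeekProbe {
        boolean isReadSeekEfficient() throws IOException;
    }

    /**
     * Never throws. If the runtime lookup fails, fall back to the
     * conservative answer: treat the source as not splittable.
     */
    static boolean isSplittable(SeekProbe probe) {
        try {
            return probe.isReadSeekEfficient();
        } catch (IOException e) {
            // Unknown seekability: reading the file as one unit is always safe.
            return false;
        }
    }

    public static void main(String[] args) {
        // Lookup succeeds and reports efficient seeking.
        System.out.println(isSplittable(() -> true));
        // Lookup fails; the method reports false instead of propagating.
        System.out.println(isSplittable(() -> {
            throw new IOException("metadata unavailable");
        }));
    }
}
```

Returning false on failure costs some parallelism in the worst case, but it cannot produce wrong data, which appears to be the trade-off the commit message argues for.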
commit 59e8e0ec27dfc498dacaaf425548681ed07a2d31
Author: Dan Halperin <dh...@google.com>
Date: 2017-05-09T19:36:10Z
CompressedSource: only use delegate reader if the file is splittable
Otherwise, the file is likely compressed.
commit b71f5dfed5b8e56dd01cca5a71e2fa72233ab363
Author: Dan Halperin <dh...@google.com>
Date: 2017-05-09T19:36:53Z
GcsFileSystem: mark content-encoded files as not seekable
This is accurate, since such files are actually stored compressed, and it yields correct data
when reading them in, e.g., TextIO.
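A minimal sketch of the check this commit describes, under stated assumptions: ObjectInfo is a hypothetical stand-in for the object metadata that GCS returns, and SeekabilitySketch is not part of Beam. Only the gzip case named in BEAM-1494 is handled, and normalising the value with trim and lower-casing is an assumption of this sketch. The actual change in the pull request may differ.

```java
import java.util.Locale;

public class SeekabilitySketch {
    /** Hypothetical stand-in for the metadata of one stored object. */
    static final class ObjectInfo {
        final String name;
        final String contentEncoding; // null when no Content-Encoding is set

        ObjectInfo(String name, String contentEncoding) {
            this.name = name;
            this.contentEncoding = contentEncoding;
        }
    }

    /**
     * Seeking is only meaningful when the bytes a reader receives are the
     * bytes that are stored. An object with Content-Encoding gzip is stored
     * compressed, so it is reported as not seek-efficient.
     */
    static boolean isReadSeekEfficient(ObjectInfo info) {
        String encoding = info.contentEncoding;
        if (encoding == null) {
            return true;
        }
        // Assumption of this sketch: compare case-insensitively, ignoring padding.
        return !"gzip".equals(encoding.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        ObjectInfo plain = new ObjectInfo("logs/part-0.txt", null);
        ObjectInfo gzipped = new ObjectInfo("logs/part-1.txt", "gzip");
        System.out.println(plain.name + " seekable: " + isReadSeekEfficient(plain));
        System.out.println(gzipped.name + " seekable: " + isReadSeekEfficient(gzipped));
    }
}
```

A reader that consults such a flag can then avoid splitting the file at byte offsets and read it as a single unit instead.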
----
> GcsFileSystem should check content encoding when setting IsReadSeekEfficient
> ----------------------------------------------------------------------------
>
> Key: BEAM-1494
> URL: https://issues.apache.org/jira/browse/BEAM-1494
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-extensions
> Reporter: Pei He
> Assignee: Daniel Halperin
>
> It is incorrect to set IsReadSeekEfficient to true for files with content encoding set to gzip. This is an issue inherited from GcsIOChannelFactory.
> https://cloud.google.com/storage/docs/transcoding#content-type_vs_content-encoding
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)