Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/05/27 04:10:50 UTC

[GitHub] [beam] epicfaace opened a new pull request #11824: Add HttpIO / HttpFileSystem

epicfaace opened a new pull request #11824:
URL: https://github.com/apache/beam/pull/11824


   Add HttpIO (and a related HttpFileSystem and HttpsFileSystem), which can download files from a particular http:// or https:// URL. HttpIO cannot upload / write to files, though, because there's no standardized way to write to files using HTTP.
   
   Sample usage:
   
   ```python
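   import apache_beam as beam
   from apache_beam.io import ReadFromText, WriteToText
   # Assumes the http:// and https:// file systems added by this PR are available.
   with beam.Pipeline() as p:
     (
         p
         | ReadFromText("https://raw.githubusercontent.com/apache/beam/5ff5313f0913ec81d31ad306400ad30c0a928b34/NOTICE")
         | WriteToText("output.txt", shard_name_template="", num_shards=0)
     )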
   ```
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-651399432


   should I look into it? Will you? @epicfaace LMK : )





[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-634923054


   I'll be happy to take a look at this





[GitHub] [beam] stale[bot] commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-723500390


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
   





[GitHub] [beam] iemejia commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
iemejia commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-634907464


   @aaltay can you please review this one or pass to someone who can. thx!





[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-643507084









[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649867747


   okay let me figure out why that's happening...





[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-652626019


   I'll take a look!





[GitHub] [beam] pabloem commented on a change in pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on a change in pull request #11824:
URL: https://github.com/apache/beam/pull/11824#discussion_r434194515



##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read with files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client just does a GET request to retrieve the length.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)

Review comment:
       can you raise an IO error?

##########
File path: sdks/python/apache_beam/io/httpfilesystem.py
##########
@@ -0,0 +1,210 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""HTTP file system implementation for accessing files from a HTTP URL."""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+from future.utils import iteritems
+
+from apache_beam.io.aws import s3io

Review comment:
       remove this line?

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read with files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client just does a GET request to retrieve the length.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)

Review comment:
       can you raise an IO error so we won't have a fully undefined Exception?
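
A minimal sketch of what this comment asks for, not the PR's final code. It assumes an httplib2-style client whose `request()` returns a `(response, content)` pair, where the response is a dict of headers with `status` and `reason` attributes. `FakeResponse` and `FakeClient` are hypothetical stand-ins so the sketch runs without a network. Treating 405/501 as "HEAD not supported" is also an assumption; it replaces the broad `except Exception` fallback with an explicit status check.

```python
class FakeResponse(dict):
  """Hypothetical stand-in for an httplib2-style response (not from the PR)."""
  def __init__(self, status, reason, headers=None):
    super().__init__(headers or {})
    self.status = status
    self.reason = reason


class FakeClient(object):
  """Hypothetical client that replays one canned response per HTTP method."""
  def __init__(self, responses):
    self._responses = responses

  def request(self, uri, method='GET', headers=None):
    return self._responses[method], b''


def size(client, uri):
  """Returns the Content-Length for uri, raising IOError on HTTP failures."""
  # Ask for the non-gzipped representation so Content-Length is the file size.
  resp, _ = client.request(uri, method='HEAD', headers={'Accept-Encoding': ''})
  # Assumption: only these statuses mean the server does not support HEAD.
  if resp.status in (405, 501):
    resp, _ = client.request(uri, method='GET')
  if resp.status != 200:
    # A specific, documented error type instead of a bare Exception.
    raise IOError('HTTP %s (%s) for %s' % (resp.status, resp.reason, uri))
  return int(resp['content-length'])
```

Callers can then catch `IOError` (an alias of `OSError` on Python 3) without also swallowing unrelated programming errors.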

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read with files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)

Review comment:
       ```suggestion
         raise ValueError('Unsupported file open mode: %s for URI %s.' % (mode, uri))
   ```

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read with files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client just does a GET request to retrieve the length.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+
+  def exists(self, uri):
+    """Returns whether the file at the given HTTP URL exists.
+
+    The client attempts to make a HEAD request, and if that fails, a GET request.
+    If the server returns 404, this function returns false, and it returns
+    true only if the server returns 200.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      True if the file exists (HTTP 200), False if it does not (HTTP 404).
+    """
+    try:
+      resp, content = self._client.request(uri, method='HEAD')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)
+    except Exception:

Review comment:
       Same here, let's avoid catching and raising non-specified exceptions.
   
   Use as a resource: https://httplib2.readthedocs.io/en/latest/libhttplib2.html#httplib2.HttpLib2Error
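
One possible shape for `exists` without a catch-all handler, offered as a sketch under assumptions rather than the code that was eventually written. The stub built by `make_client` is hypothetical and only mimics an httplib2-style `request()` call. The 405/501 check for "HEAD not supported" is an assumption. Transport-level errors from the real client (for example subclasses of `httplib2.HttpLib2Error`) are deliberately not caught here, so they propagate to the caller unchanged.

```python
from types import SimpleNamespace


def make_client(statuses):
  """Hypothetical stub client: replays a canned status code per HTTP method."""
  def request(uri, method='GET', headers=None):
    return SimpleNamespace(status=statuses[method], reason='stub'), b''
  return SimpleNamespace(request=request)


def exists(client, uri):
  """Returns True on HTTP 200, False on 404, and raises IOError otherwise."""
  resp, _ = client.request(uri, method='HEAD')
  # Assumption: only these statuses mean the server does not support HEAD.
  if resp.status in (405, 501):
    resp, _ = client.request(uri, method='GET')
  if resp.status == 200:
    return True
  if resp.status == 404:
    return False
  # Any other status is reported as a specific error type, not a bare Exception.
  raise IOError('HTTP %s (%s) for %s' % (resp.status, resp.reason, uri))
```

With this structure an unexpected status such as 500 surfaces once as an `IOError` instead of silently triggering a second request.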

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client falls back to a GET request to retrieve the length.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+
+  def exists(self, uri):
+    """Returns whether the file at the given HTTP URL exists.
+
+    The client attempts to make a HEAD request, and if that fails, a GET request.
+    If the server returns 404, this function returns false, and it returns
+    true only if the server returns 200.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      True if the server returns 200, False if it returns 404.
+    """
+    try:
+      resp, content = self._client.request(uri, method='HEAD')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)

Review comment:
       Ditto

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client falls back to a GET request to retrieve the length.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+
+  def exists(self, uri):
+    """Returns whether the file at the given HTTP URL exists.
+
+    The client attempts to make a HEAD request, and if that fails, a GET request.
+    If the server returns 404, this function returns false, and it returns
+    true only if the server returns 200.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      True if the server returns 200, False if it returns 404.
+    """
+    try:
+      resp, content = self._client.request(uri, method='HEAD')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)
+    except Exception:

Review comment:
       Consider defining an Exception class specific to this module, so we won't expose the implementation detail that we use `httplib2`.
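
A minimal sketch of what this suggestion could look like, assuming a hypothetical `HttpIOError` class and a hypothetical helper `request_or_raise` (neither name comes from the PR). The `library_errors` parameter stands for whatever the underlying client raises, for example `httplib2.HttpLib2Error` if the client is an `httplib2.Http` instance; it is passed in so the sketch does not depend on a particular HTTP library.

```python
class HttpIOError(IOError):
  """Hypothetical module-level error; the name is illustrative only."""


def request_or_raise(client, uri, method, library_errors):
  """Issue a request and translate library-specific errors.

  Args:
    client: object with a request(uri, method=...) method returning
      a (response, content) pair.
    uri: URL to request.
    method: HTTP method name, such as 'HEAD' or 'GET'.
    library_errors: exception class (or tuple of classes) raised by the
      underlying HTTP library.
  """
  try:
    return client.request(uri, method=method)
  except library_errors as e:
    # Callers only ever see HttpIOError; the original error is kept as
    # __cause__ for debugging.
    raise HttpIOError('%s %s failed: %s' % (method, uri, e)) from e
```

With this shape, callers catch only `HttpIOError`, and swapping the HTTP library later would not change the public error contract.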

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client falls back to a GET request to retrieve the length.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+
+  def exists(self, uri):
+    """Returns whether the file at the given HTTP URL exists.
+
+    The client attempts to make a HEAD request, and if that fails, a GET request.
+    If the server returns 404, this function returns false, and it returns
+    true only if the server returns 200.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      True if the server returns 200, False if it returns 404.
+    """
+    try:
+      resp, content = self._client.request(uri, method='HEAD')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)
+    except Exception:
+      # Server doesn't support HEAD method;
+      # use GET method instead to prefetch the result.
+      resp, content = self._client.request(uri, method='GET')
+      if resp.status == 200:
+        return True
+      elif resp.status == 404:
+        return False
+      else:
+        raise Exception(resp.status, resp.reason)
+
+
+class HttpDownloader(Downloader):
+  def __init__(self, uri, client):
+    self._uri = uri
+    self._client = client
+
+    resp, content = self._client.request(self._uri, method='GET')
+    if resp.status != 200:
+      raise Exception(resp.status, resp.reason)

Review comment:
       Ditto

##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,167 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+
+
+class HttpIO(object):
+  """HTTP I/O."""
+
+  def __init__(self, client = None):
+    if sys.version_info[0] != 3:
+      raise RuntimeError("HttpIO only supports Python 3.")
+    self._client = client or get_new_http()
+    pass
+  
+  def open(
+      self,
+      uri,
+      mode='r',
+      read_buffer_size=16 * 1024 * 1024):
+      """Open a URL for reading or writing.
+
+      Args:
+        uri (str): HTTP URL in the form ``http://[path]`` or ``https://[path]``.
+        mode (str): ``'r'`` or ``'rb'`` for reading.
+        read_buffer_size (int): Buffer size to use during read operations.
+
+      Returns:
+        A file object representing the response.
+
+      Raises:
+        ValueError: Invalid open file mode.
+      """
+    if mode == 'r' or mode == 'rb':
+      downloader = HttpDownloader(uri, self._client)
+      return io.BufferedReader(
+        DownloaderStream(downloader, mode=mode), buffer_size=read_buffer_size)
+    else:
+      raise ValueError('Invalid file open mode: %s.' % mode)
+
+  def list_prefix(self, path):
+    """Lists files matching the prefix.
+    
+    Because there is no common standard for listing files at a given
+    HTTP URL, this method just returns a single file at the given URL.
+    This means that listing files only works with an exact path, not
+    with a glob expression.
+
+    Args:
+      path: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Dictionary of file name -> size.
+    """
+    return {path: self.size(path)}
+
+  def size(self, uri):
+    """Returns the size of a single file stored at a HTTP URL.
+
+    First, the client attempts to make a HEAD request for a non-gzipped version of the file,
+    and uses the Content-Length header to retrieve the size. If that fails because the server
+    does not support HEAD requests, the client falls back to a GET request to retrieve the length.
+
+    Args:
+      uri: HTTP URL in the form http://[path] or https://[path].
+
+    Returns:
+      Size of the HTTP file in bytes.
+    """
+    try:
+      # Pass in "" for "Accept-Encoding" because we want the non-gzipped content-length.
+      resp, content = self._client.request(uri, method='HEAD', headers={"Accept-Encoding": ""})
+      if resp.status != 200:
+        raise Exception(resp.status, resp.reason)
+      return int(resp["content-length"])
+    except Exception:

Review comment:
       Can you catch `httplib2.HttpLib2Error`? Or the specific error that may be thrown from the call?
   
   (see https://httplib2.readthedocs.io/en/latest/libhttplib2.html#httplib2.HttpLib2Error)
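
One way the HEAD-then-GET fallback might look with a narrowed `except` clause is sketched below. The names `http_size`, `HttpStatusError`, and `transport_errors` are hypothetical and not part of the PR; `transport_errors` would be `httplib2.HttpLib2Error` for an `httplib2.Http` client. The sketch assumes, as the PR code does, that the response behaves like a dict of headers with `status` and `reason` attributes.

```python
class HttpStatusError(Exception):
  """Hypothetical error for a non-200 response (not part of the PR)."""

  def __init__(self, status, reason):
    super().__init__(status, reason)
    self.status = status
    self.reason = reason


def http_size(client, uri, transport_errors):
  """Return the Content-Length of uri, trying HEAD first and then GET.

  transport_errors is the exception class (or tuple of classes) that the
  client raises on transport failures.
  """
  try:
    # Empty Accept-Encoding asks for the non-gzipped content length.
    resp, _ = client.request(
        uri, method='HEAD', headers={'Accept-Encoding': ''})
    if resp.status == 200 and 'content-length' in resp:
      return int(resp['content-length'])
  except transport_errors:
    # Only transport failures are swallowed here; anything else propagates.
    pass
  # HEAD was unusable (error, non-200, or no length), so fall back to GET.
  resp, _ = client.request(uri, method='GET')
  if resp.status != 200:
    raise HttpStatusError(resp.status, resp.reason)
  return int(resp['content-length'])
```

Unexpected errors such as a programming mistake in the client now surface immediately instead of silently triggering a second request.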




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-723522730


   Todo



[GitHub] [beam] epicfaace edited a comment on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace edited a comment on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649815907


   I'm not sure why exactly those failures are happening, since `HttpIO` _does_ take `client` in its constructor.



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-637821019


   ok just looking at this now...



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-643471642







[GitHub] [beam] stale[bot] commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-683484607


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.
   



[GitHub] [beam] TheNeuralBit commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
TheNeuralBit commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-777866124


   If the test failures are only Python 2 maybe this will just work now that we've dropped Python 2 support :)



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-683492904


   Still need to take a look -- not stale



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-641602134


   retest this please



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-639885325


   @epicfaace lmk what are your plans for this PR



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649815907


   I'm not sure why exactly those failures are happening, since `HttpIO` _does_ take `client` as a constructor.



[GitHub] [beam] iemejia commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
iemejia commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-634898482


   retest this please



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-639946797


   I'll be making changes soon.



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-640880406


   retest this please



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-644269789


   Run Python PreCommit



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-640123988


   @pabloem I've addressed your changes and also made the implementation a bit cleaner; please take a look.



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649895785


   Oh, it appears the test failures are only on Python 2 -- perhaps I didn't define the class properly for Python 2



[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-637831911


   retest this please



[GitHub] [beam] github-actions[bot] commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-1032573508


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-1032676586


   not stale
   
   --
   Ashwin Ramaswami
   
   
   On Tue, Feb 8, 2022 at 7:50 AM github-actions[bot] ***@***.***>
   wrote:
   
   > This pull request has been marked as stale due to 60 days of inactivity.
   > It will be closed in 1 week if no further activity occurs. If you think
   > that’s incorrect or this pull request requires a review, please simply
   > write any comment. If closed, you can revive the PR at any time and
   > @mention a reviewer or discuss it on the ***@***.*** list. Thank
   > you for your contributions.
   





[GitHub] [beam] iemejia commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
iemejia commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-634898332


   retest this please





[GitHub] [beam] epicfaace commented on a change in pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on a change in pull request #11824:
URL: https://github.com/apache/beam/pull/11824#discussion_r445836086



##########
File path: sdks/python/apache_beam/io/httpio.py
##########
@@ -0,0 +1,172 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""This class implements methods to interact with files at HTTP URLs.
+
+This I/O only implements methods to read files at HTTP URLs, because
+of the variability in methods by which HTTP content can be written
+to a server. If you need to write your results to an HTTP endpoint,
+you might want to make your own I/O or use another, more specific,
+I/O connector.
+
+"""
+
+# pytype: skip-file
+
+from __future__ import absolute_import
+
+import io
+from builtins import object
+
+from apache_beam.io.filesystem import BeamIOError
+from apache_beam.io.filesystemio import Downloader
+from apache_beam.io.filesystemio import DownloaderStream
+from apache_beam.internal.http_client import get_new_http
+import sys
+from httplib2 import HttpLib2Error
+
+REQUEST_FAILED_ERROR_MSG = "HTTP request failed for URL {}: {}"
+UNEXPECTED_STATUS_CODE_ERROR_MSG = "Unexpected status code received for URL {}: {} {}"
+
+
+class HttpIO(object):

Review comment:
       ```suggestion
   class HttpIO:
   ```
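
A minimal sketch, not taken from the PR, of what this suggestion changes. On Python 3 the two spellings produce the same kind of class. On Python 2, which the file above still targets (`from __future__ import absolute_import`, `from builtins import object`), a class written without an explicit base is an old-style class, and features such as `super()` and property setters behave differently there. Whether that explains the Python 2 test failures mentioned elsewhere in this thread is an assumption; the thread does not confirm it. The names `ImplicitBase`, `ExplicitBase`, and `direct_bases` are hypothetical.

```python
class ImplicitBase:
    """No explicit base: new-style on Python 3, old-style on Python 2."""


class ExplicitBase(object):
    """Explicit object base: new-style on both Python 2 and Python 3."""


def direct_bases(cls):
    """Return the tuple of direct base classes of cls."""
    return cls.__bases__


# On Python 3 both classes report object as their only direct base.
same_bases = direct_bases(ImplicitBase) == direct_bases(ExplicitBase) == (object,)
print(same_bases)
```

If Python 2 support has to be kept, the `class HttpIO(object):` spelling from the diff is the one that behaves the same on both interpreters.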







[GitHub] [beam] epicfaace commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
epicfaace commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649021127


   @pabloem anything you need from me on this one?





[GitHub] [beam] iemejia removed a comment on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
iemejia removed a comment on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-634898332


   retest this please





[GitHub] [beam] pabloem commented on pull request #11824: [BEAM-10101] Add HttpIO / HttpFileSystem (Python)

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #11824:
URL: https://github.com/apache/beam/pull/11824#issuecomment-649793262


   Sorry about the delay. It looks like many of the tests are failing: https://ci-beam.apache.org/job/beam_PreCommit_Python_Phrase/1920/
   Do you think you could fix those so we can review?

