You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/10/23 18:47:08 UTC

[GitHub] [beam] dandy10 opened a new pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

dandy10 opened a new pull request #13180:
URL: https://github.com/apache/beam/pull/13180


   I would like to be able to configure s3 compatible IO in order to be able to use alternative endpoints. https://github.com/apache/beam/pull/10560 began moving in this direction but has been stalled for a while. The main comment there was that pipeline options should be used for configuration. 
   
   Some open questions that I have are:
   
   * Is there a reason that a separate S3IO instance was created for every operation in the s3filesystem? If so I can save a reference to the pipeline options in the constructor and pass that in instead of saving the single S3IO.
   * Are there any naming considerations for the S3Options? Should they begin with a naming prefix to differentiate from other options provided on the command line?
   * Does it make sense to pass in tokens/keys through pipeline options? Should they instead be pulled from the environment?
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [x] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [x] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/i
 con)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](htt
 ps://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam
 .apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website | Whitespace | Typescript
   --- | --- | --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocker_Cron/lastCompletedBuild/) <br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonDocs_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/be
 am_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Whitespace_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Typescript_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | --- | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   
   
   GitHub Actions Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   [![Build python source distribution and wheels](https://github.com/apache/beam/workflows/Build%20python%20source%20distribution%20and%20wheels/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Build+python+source+distribution+and+wheels%22+branch%3Amaster+event%3Aschedule)
   [![Python tests](https://github.com/apache/beam/workflows/Python%20tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Python+Tests%22+branch%3Amaster+event%3Aschedule)
   [![Java tests](https://github.com/apache/beam/workflows/Java%20Tests/badge.svg?branch=master&event=schedule)](https://github.com/apache/beam/actions?query=workflow%3A%22Java+Tests%22+branch%3Amaster+event%3Aschedule)
   
   See [CI.md](https://github.com/apache/beam/blob/master/CI.md) for more information about GitHub Actions CI.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] dandy10 commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
dandy10 commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-717203676


   I think it is probably due to the incorrect processing of the boolean flag. I've changed to using an action on the argparser, and hopefully that should sort the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] aaltay commented on a change in pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
aaltay commented on a change in pull request #13180:
URL: https://github.com/apache/beam/pull/13180#discussion_r511134918



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1321,3 +1320,39 @@ def augment_options(cls, options):
       for name, value in override.items():
         setattr(options, name, value)
     return options
+
+
+class S3Options(PipelineOptions):
+  @classmethod
+  def _add_argparse_args(cls, parser):
+    # These options are passed to the S3 IO Client
+    parser.add_argument(
+        '--aws_access_key_id',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_secret_access_key',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_session_token',
+        default=None,
+        help='The session token to use when creating the s3 client.')
+    parser.add_argument(
+        '--endpoint_url',
+        default=None,
+        help='The complete URL to use for the constructed s3 client.')
+    parser.add_argument(
+        '--region_name',
+        default=None,
+        help='The name of the region associated with the client.')
+    parser.add_argument(
+        '--api_version', default=None, help='The API version to use.')
+    parser.add_argument(
+        '--verify',
+        default=None,
+        help='Whether or not to verify SSL certificates.')
+    parser.add_argument(
+        '--use_ssl',
+        default=True,
+        help='Whether or not to use SSL. By default, SSL is used.')

Review comment:
       +1. Please do not use endpoint_url, region_name, api_version, verify, use_ssl and so on without a prefix.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] dandy10 commented on a change in pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
dandy10 commented on a change in pull request #13180:
URL: https://github.com/apache/beam/pull/13180#discussion_r511150097



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1321,3 +1320,39 @@ def augment_options(cls, options):
       for name, value in override.items():
         setattr(options, name, value)
     return options
+
+
+class S3Options(PipelineOptions):
+  @classmethod
+  def _add_argparse_args(cls, parser):
+    # These options are passed to the S3 IO Client
+    parser.add_argument(
+        '--aws_access_key_id',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_secret_access_key',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_session_token',
+        default=None,
+        help='The session token to use when creating the s3 client.')
+    parser.add_argument(
+        '--endpoint_url',
+        default=None,
+        help='The complete URL to use for the constructed s3 client.')
+    parser.add_argument(
+        '--region_name',
+        default=None,
+        help='The name of the region associated with the client.')
+    parser.add_argument(
+        '--api_version', default=None, help='The API version to use.')
+    parser.add_argument(
+        '--verify',
+        default=None,
+        help='Whether or not to verify SSL certificates.')
+    parser.add_argument(
+        '--use_ssl',
+        default=True,
+        help='Whether or not to use SSL. By default, SSL is used.')

Review comment:
       Fair enough. I've moved them all to use the same s3 prefix which I think makes sense given they are collected under the S3Options class. @pabloem I believe the access keys can also be used for other AWS services, although I've never actually used them. I think it makes sense to consolidate the more generic aws options with the s3 options together for now given that this is the only use case at the moment. If there is another AWS service added in the future it could make sense then to split them up. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-715526817


   thanks. I can review this


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-717419386


   thanks @dandy10 ! this is great!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] ConverJens edited a comment on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
ConverJens edited a comment on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-762308020


   @dandy10 @pabloem 
   Great work with this PR!
   I'm trying to get s3 (Minio) to work for TFX, and I get it to work for all but the beam components where I get this strange error:
   
   ```
   Traceback (most recent call last):
     File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
     File "apache_beam/runners/common.py", line 742, in apache_beam.runners.common.PerWindowInvoker.invoke_process
     File "apache_beam/runners/common.py", line 867, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/iobase.py", line 1129, in process
       self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
       return fnc(self, *args, **kwargs)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 196, in open_writer
       return FileBasedSinkWriter(self, writer_path)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 417, in __init__
       self.temp_handle = self.sink.open(temp_shard_path)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
       return fnc(self, *args, **kwargs)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 138, in open
       return FileSystems.create(temp_path, self.mime_type, self.compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filesystems.py", line 229, in create
       return filesystem.create(path, mime_type, compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 171, in create
       return self._path_open(path, 'wb', mime_type, compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 151, in _path_open
       raw_file = s3io.S3IO(options=self._options).open(
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3io.py", line 63, in __init__
       raise ValueError('Must provide one of client or options')
   ValueError: Must provide one of client or options
   ```
   
   Do you have any idea what I'm doing wrong? 
   
   These are the beam pipeline args that I'm supplying and I know for sure that at least the multi process and nr_of_workers arguments are applied:
   ```
   '--direct_running_mode=multi_processing',
   f'--direct_num_workers={NR_OF_CPUS}',
   '--s3_endpoint_url=minio-service.kubeflow:9000',
   f'--s3_access_key={ACCESS_KEY}',
   f'--s3_secret_access_key={SECRET_ACCESS_KEY},
   '--s3_verify=False'
   ```
   
   Help would be greatly appreciated!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on a change in pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on a change in pull request #13180:
URL: https://github.com/apache/beam/pull/13180#discussion_r511132983



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1321,3 +1320,39 @@ def augment_options(cls, options):
       for name, value in override.items():
         setattr(options, name, value)
     return options
+
+
+class S3Options(PipelineOptions):
+  @classmethod
+  def _add_argparse_args(cls, parser):
+    # These options are passed to the S3 IO Client
+    parser.add_argument(
+        '--aws_access_key_id',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_secret_access_key',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_session_token',
+        default=None,
+        help='The session token to use when creating the s3 client.')
+    parser.add_argument(
+        '--endpoint_url',
+        default=None,
+        help='The complete URL to use for the constructed s3 client.')
+    parser.add_argument(
+        '--region_name',
+        default=None,
+        help='The name of the region associated with the client.')
+    parser.add_argument(
+        '--api_version', default=None, help='The API version to use.')
+    parser.add_argument(
+        '--verify',
+        default=None,
+        help='Whether or not to verify SSL certificates.')
+    parser.add_argument(
+        '--use_ssl',
+        default=True,
+        help='Whether or not to use SSL. By default, SSL is used.')

Review comment:
       You are correct that it's desirable to use a sort of namespace prefix. Perhaps `--aws_` or `--aws_s3_`. What do you think?
   
   You must know more than me about s3 and AWS - I wonder if `aws_session_token`, `aws_secret_access_key`, `aws_access_key_id`  in this context are specific for s3, or if they provide some sort of AWS-wide authentication?
   
   If they're s3-specific, then maybe we should namespace them as `aws_s3`? Let me know what you think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-715538228


   Run Portable_Python PreCommit


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] ConverJens commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
ConverJens commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-762308020


   @dandy10 @pabloem 
   Great work with this PR!
   I'm trying to get s3 (Minio) to work for TFX, and I get it to work for all but the beam components where I get this strange error:
   
   '''
   Traceback (most recent call last):
     File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
     File "apache_beam/runners/common.py", line 742, in apache_beam.runners.common.PerWindowInvoker.invoke_process
     File "apache_beam/runners/common.py", line 867, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/iobase.py", line 1129, in process
       self.writer = self.sink.open_writer(init_result, str(uuid.uuid4()))
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
       return fnc(self, *args, **kwargs)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 196, in open_writer
       return FileBasedSinkWriter(self, writer_path)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 417, in __init__
       self.temp_handle = self.sink.open(temp_shard_path)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/options/value_provider.py", line 135, in _f
       return fnc(self, *args, **kwargs)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filebasedsink.py", line 138, in open
       return FileSystems.create(temp_path, self.mime_type, self.compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/filesystems.py", line 229, in create
       return filesystem.create(path, mime_type, compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 171, in create
       return self._path_open(path, 'wb', mime_type, compression_type)
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3filesystem.py", line 151, in _path_open
       raw_file = s3io.S3IO(options=self._options).open(
     File "/usr/local/lib/python3.7/dist-packages/apache_beam/io/aws/s3io.py", line 63, in __init__
       raise ValueError('Must provide one of client or options')
   ValueError: Must provide one of client or options
   '''
   
   Do you have any idea what I'm doing wrong? 
   
   These are the beam pipeline args that I'm supplying and I know for sure that at least the multi process and nr_of_workers arguments are applied:
   '''
   '--direct_running_mode=multi_processing',
   f'--direct_num_workers={NR_OF_CPUS}',
   '--s3_endpoint_url=minio-service.kubeflow:9000',
   f'--s3_access_key={ACCESS_KEY}',
   f'--s3_secret_access_key={SECRET_ACCESS_KEY},
   '--s3_verify=False'
   '''
   
   Help would be greatly appreciated!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-716855030






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-717388974






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem merged pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem merged pull request #13180:
URL: https://github.com/apache/beam/pull/13180


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] dandy10 commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
dandy10 commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-715524704


   @chamikaramj @aaltay @udim @pabloem @charlesccychen, apologies for tagging you all, you're on the nearest OWNERS files and not sure who is most relevant.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] dandy10 commented on a change in pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
dandy10 commented on a change in pull request #13180:
URL: https://github.com/apache/beam/pull/13180#discussion_r511498443



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1321,3 +1320,39 @@ def augment_options(cls, options):
       for name, value in override.items():
         setattr(options, name, value)
     return options
+
+
+class S3Options(PipelineOptions):
+  @classmethod
+  def _add_argparse_args(cls, parser):
+    # These options are passed to the S3 IO Client
+    parser.add_argument(
+        '--aws_access_key_id',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_secret_access_key',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_session_token',
+        default=None,
+        help='The session token to use when creating the s3 client.')
+    parser.add_argument(
+        '--endpoint_url',
+        default=None,
+        help='The complete URL to use for the constructed s3 client.')
+    parser.add_argument(
+        '--region_name',
+        default=None,
+        help='The name of the region associated with the client.')
+    parser.add_argument(
+        '--api_version', default=None, help='The API version to use.')
+    parser.add_argument(
+        '--verify',
+        default=None,
+        help='Whether or not to verify SSL certificates.')
+    parser.add_argument(
+        '--use_ssl',
+        default=True,
+        help='Whether or not to use SSL. By default, SSL is used.')

Review comment:
       Will do. I tried to have a look at the failing tests and can't figure out which ones have actually failed (the formatting is quite difficult to parse). Unfortunately I don't have access to windows to run locally. The same pattern of failures seems to be affecting https://github.com/apache/beam/pull/13187 which is just a comment change, so perhaps the failures are unrelated?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] dandy10 commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
dandy10 commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-717217045


   the two failures in the windows test are `PermissionError: [WinError 32] The process cannot access the file because it is being used by another process:` so it seems that the tests are not being isolated sufficiently.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on a change in pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on a change in pull request #13180:
URL: https://github.com/apache/beam/pull/13180#discussion_r511497416



##########
File path: sdks/python/apache_beam/options/pipeline_options.py
##########
@@ -1321,3 +1320,39 @@ def augment_options(cls, options):
       for name, value in override.items():
         setattr(options, name, value)
     return options
+
+
+class S3Options(PipelineOptions):
+  @classmethod
+  def _add_argparse_args(cls, parser):
+    # These options are passed to the S3 IO Client
+    parser.add_argument(
+        '--aws_access_key_id',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_secret_access_key',
+        default=None,
+        help='The secret key to use when creating the s3 client.')
+    parser.add_argument(
+        '--aws_session_token',
+        default=None,
+        help='The session token to use when creating the s3 client.')
+    parser.add_argument(
+        '--endpoint_url',
+        default=None,
+        help='The complete URL to use for the constructed s3 client.')
+    parser.add_argument(
+        '--region_name',
+        default=None,
+        help='The name of the region associated with the client.')
+    parser.add_argument(
+        '--api_version', default=None, help='The API version to use.')
+    parser.add_argument(
+        '--verify',
+        default=None,
+        help='Whether or not to verify SSL certificates.')
+    parser.add_argument(
+        '--use_ssl',
+        default=True,
+        help='Whether or not to use SSL. By default, SSL is used.')

Review comment:
       Thanks! That makes sense to me. Can you fix the broken unittests? Other than that, the change looks great (and it's very welcome, as we'd been needing it).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] pabloem commented on pull request #13180: BEAM-9094 Configure S3 client for IO to s3 compatible object stores

Posted by GitBox <gi...@apache.org>.
pabloem commented on pull request #13180:
URL: https://github.com/apache/beam/pull/13180#issuecomment-716816340


   The precommit failures are from dataflow jobs trying to start the sdk worker: 
   ![image](https://user-images.githubusercontent.com/1301740/97227307-87722180-1792-11eb-98a0-50e1bff89ca4.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org