Posted to user@beam.apache.org by Pablo Estrada <pa...@google.com> on 2020/10/01 00:57:05 UTC

Re: Provide credentials for s3 writes

You may need to set those up in setup.py so that the code runs in every
worker at startup.
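
For example, a rough sketch of the idea (all values here are placeholders,
and the package metadata is minimal):

# setup.py -- top-level code here runs on every worker when the package
# is installed at startup (placeholder values throughout).
import os

os.environ['AWS_ROLE_ARN'] = 'arn:aws:iam::123456789012:role/your-role'
os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'

import setuptools

setuptools.setup(name='my-pipeline', version='0.0.1', packages=[])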

On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
ross.vandegrift@cleardata.com> wrote:

> I see - it'd be great if the s3 io code would accept a boto session, so
> the default credential process could be overridden.
>
> But it looks like the module lazy-loads boto3 and uses the default
> session.  So I think it'll work if we set up the SDK env vars before the
> pipeline code runs.
>
> i.e., we'll try something like:
>
> os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
>
> with beam.Pipeline(...) as p:
>     ...
>
> Ross
>
> On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> >
> > Hi Ross,
> > it seems that this feature is missing (e.g. passing a pipeline option
> > with authentication information for AWS). I'm sorry about that - that's
> > pretty annoying.
> > I wonder if you can use the setup.py file to add the default
> > configuration yourself until we have proper support for pipeline
> > option-based authentication. Could you try adding this default config in
> > setup.py?
> > Best,
> > -P.
> >
> > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > ross.vandegrift@cleardata.com> wrote:
> > > Hello all,
> > >
> > > I have a python pipeline that writes data to an s3 bucket.  On my
> > > laptop it picks up the SDK credentials from my boto3 config and works
> > > great.
> > >
> > > Is it possible to provide credentials explicitly?  I'd like to use
> > > remote Dataflow runners, which won't have implicit AWS credentials
> > > available.
> > >
> > > Thanks,
> > > Ross
> > >

Re: Provide credentials for s3 writes

Posted by Ross Vandegrift <ro...@cleardata.com>.
I've worked through adapting this to Dataflow; it's simple enough once
you've tried all of the things that don't work. :)

In setup.py, write out an identity token file and a boto3 config file.
File-based config was essential; I couldn't get env vars working.

Here's a sample.  Be careful!  This can clobber your local boto3 config.
All of this goes in the top-level scope of setup.py:


import os
import pathlib

import google.auth.transport.requests
import google.oauth2.id_token

# Fetch a Google ID token to use as the AWS web identity token.
request = google.auth.transport.requests.Request()
id_token = google.oauth2.id_token.fetch_id_token(request, 'your-audience')
with open('/tmp/id_token', 'w') as f:
    f.write(id_token)

# Create the AWS SDK config directory if it doesn't already exist.
home = os.getenv('HOME', '/tmp')
dotaws = pathlib.Path(home) / '.aws'
dotaws.mkdir(exist_ok=True)

# Back up any existing config file before clobbering it.
awsconfig = dotaws / 'config'
if awsconfig.exists():
    awsconfig.rename(dotaws / 'config.bak')

# Point the default profile at the assumed role and identity token.
with awsconfig.open('w') as f:
    f.write('[profile default]\n')
    f.write('role_arn = your-role-arn\n')
    f.write('web_identity_token_file = /tmp/id_token\n')


You'll need to substitute appropriate values for 'your-audience' and 'your-role-arn'.
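
For anyone following along: this setup.py only reaches the workers if the
pipeline ships it, e.g. via the setup_file pipeline option.  A minimal
sketch (the project, region, and bucket values are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# setup_file makes the runner build and pip-install this package on every
# worker, which executes the top-level setup.py code above at startup.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-gcp-project',            # placeholder
    region='us-central1',                  # placeholder
    temp_location='gs://your-bucket/tmp',  # placeholder
    setup_file='./setup.py',
)

with beam.Pipeline(options=options) as p:
    ...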

Ross



Re: Provide credentials for s3 writes

Posted by Ross Vandegrift <ro...@cleardata.com>.
Can you explain that a little bit?  Right now, our pipeline code is structured
like this:

  if __name__ == '__main__':
      setup_credentials()  # exports env vars for default boto session
      run_pipeline()       # runs all the beam stuff


So I expect every worker to set up its environment before running any beam
code.  This seems to work fine.  Is there an issue lurking here?
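
(For context, setup_credentials() is just a thin wrapper around the env-var
exports from earlier in the thread - something like this, with placeholder
values:)

import os

def setup_credentials():
    # Export the env vars boto3's default session reads for web-identity
    # role assumption (placeholder values).
    os.environ['AWS_ROLE_ARN'] = 'arn:aws:iam::123456789012:role/your-role'
    os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
    os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'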

Ross
