Posted to user@beam.apache.org by Ross Vandegrift <ro...@cleardata.com> on 2020/09/29 18:16:08 UTC

Provide credentials for s3 writes

Hello all,

I have a python pipeline that writes data to an s3 bucket.  On my laptop it
picks up the SDK credentials from my boto3 config and works great.

Is it possible to provide credentials explicitly?  I'd like to use remote
dataflow runners, which won't have implicit AWS credentials available.

Thanks,
Ross

Re: Provide credentials for s3 writes

Posted by Ross Vandegrift <ro...@cleardata.com>.
I've worked through adapting this to Dataflow; it's simple enough once you try
all of the things that don't work. :)

In setup.py, write out config files with an identity token and a boto3 config
file.  File-based config was essential; I couldn't get env vars working.

Here's a sample.  Be careful!  This can clobber your local boto3 config.  All
of this is at the top level of setup.py:


import pathlib
import os

import google.oauth2.id_token
import google.auth.transport.requests

# Get google id token
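# (fetch_id_token uses whatever Google credentials are ambient - on Dataflow
# workers that's the metadata server - to mint a token for the given audience)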
request = google.auth.transport.requests.Request()
id_token = google.oauth2.id_token.fetch_id_token(request, 'your-audience')
with open('/tmp/id_token', 'w') as f:
    f.write(id_token)

# create aws sdk config
home = os.getenv('HOME', '/tmp')
dotaws = pathlib.Path(home) / pathlib.Path('.aws')
try:
    dotaws.mkdir()
except FileExistsError:
    pass

awsconfig = dotaws / pathlib.Path('config')
if awsconfig.exists():
    cfgbackup = awsconfig.parent / pathlib.Path('config.bak')
    awsconfig.rename(cfgbackup)

with awsconfig.open('w') as f:
    f.write('[profile default]\n')
    f.write('role_arn = your-role-arn\n')
    f.write('web_identity_token_file = /tmp/id_token\n')


You need to substitute appropriate values for 'your-audience' and 'your-role-arn'.
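
For the workers to actually run this, the job also has to ship setup.py - e.g.
by pointing the Dataflow runner at it with --setup_file.  A rough sketch of the
launch options (the project/region/bucket values are placeholders):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# --setup_file is what makes each worker install this package at startup and
# therefore execute the top-level code above; the other values are placeholders.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=your-gcp-project',
    '--region=your-region',
    '--temp_location=gs://your-bucket/tmp',
    '--setup_file=./setup.py',
])

with beam.Pipeline(options=options) as p:
    ...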

Ross


On Thu, 2020-10-01 at 15:47 +0000, Ross Vandegrift wrote:
> Can you explain that a little bit?  Right now, our pipeline code is
> structured like this:
> 
>   if __name__ == '__main__':
>       setup_credentials()  # exports env vars for default boto session
>       run_pipeline()       # runs all the beam stuff
> 
> 
> So I expect every worker to set up their environment before running any beam
> code.  This seems to work fine.  Is there an issue lurking here?
> 
> Ross
> 
> On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> > You may need to set those up in setup.py so that the code runs in every
> > worker at startup.
> > 
> > On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> > ross.vandegrift@cleardata.com> wrote:
> > > I see - it'd be great if the s3 io code would accept a boto session, so
> > > the default process could be overridden.
> > > 
> > > But it looks like the module lazy loads boto3 and uses the default
> > > session.  So I think it'll work if we set up SDK env vars before the
> > > pipeline code.
> > > 
> > > i.e., we'll try something like:
> > > 
> > > os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> > > os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> > > os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> > > 
> > > with beam.Pipeline(...) as p:
> > >     ...
> > > 
> > > Ross
> > > 
> > > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > > Hi Ross,
> > > > it seems that this feature is missing (e.g. passing a pipeline option with
> > > > authentication information for AWS). I'm sorry about that - that's pretty
> > > > annoying.
> > > > I wonder if you can use the setup.py file to add the default configuration
> > > > yourself until we have appropriate support for pipeline option-based
> > > > authentication. Could you try adding this default config in setup.py?
> > > > Best
> > > > -P.
> > > > 
> > > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > > ross.vandegrift@cleardata.com> wrote:
> > > > > Hello all,
> > > > > 
> > > > > I have a python pipeline that writes data to an s3 bucket.  On my laptop
> > > > > it picks up the SDK credentials from my boto3 config and works great.
> > > > > 
> > > > > Is it possible to provide credentials explicitly?  I'd like to use remote
> > > > > dataflow runners, which won't have implicit AWS credentials available.
> > > > > 
> > > > > Thanks,
> > > > > Ross
> > > > > 

Re: Provide credentials for s3 writes

Posted by Ross Vandegrift <ro...@cleardata.com>.
Can you explain that a little bit?  Right now, our pipeline code is structured
like this:

  if __name__ == '__main__':
      setup_credentials()  # exports env vars for default boto session
      run_pipeline()       # runs all the beam stuff


So I expect every worker to set up their environment before running any beam
code.  This seems to work fine.  Is there an issue lurking here?
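
(For reference, setup_credentials just exports the web identity variables that
boto3's default session looks for - roughly the sketch below; the values are
placeholders.)

import os

def setup_credentials():
    # Placeholder values - substitute a real role ARN and token path.
    os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
    os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
    os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'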

Ross

On Wed, 2020-09-30 at 17:57 -0700, Pablo Estrada wrote:
> You may need to set those up in setup.py so that the code runs in every
> worker at startup.
> 
> On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
> ross.vandegrift@cleardata.com> wrote:
> > I see - it'd be great if the s3 io code would accept a boto session, so
> > the default process could be overridden.
> > 
> > But it looks like the module lazy loads boto3 and uses the default
> > session.  So I think it'll work if we set up SDK env vars before the
> > pipeline code.
> > 
> > i.e., we'll try something like:
> > 
> > os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> > os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> > os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
> > 
> > with beam.Pipeline(...) as p:
> >     ...
> > 
> > Ross
> > 
> > On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > > Hi Ross,
> > > it seems that this feature is missing (e.g. passing a pipeline option with
> > > authentication information for AWS). I'm sorry about that - that's pretty
> > > annoying.
> > > I wonder if you can use the setup.py file to add the default configuration
> > > yourself until we have appropriate support for pipeline option-based
> > > authentication. Could you try adding this default config in setup.py?
> > > Best
> > > -P.
> > > 
> > > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > > ross.vandegrift@cleardata.com> wrote:
> > > > Hello all,
> > > > 
> > > > I have a python pipeline that writes data to an s3 bucket.  On my laptop
> > > > it picks up the SDK credentials from my boto3 config and works great.
> > > > 
> > > > Is it possible to provide credentials explicitly?  I'd like to use remote
> > > > dataflow runners, which won't have implicit AWS credentials available.
> > > > 
> > > > Thanks,
> > > > Ross
> > > > 
> > > 

Re: Provide credentials for s3 writes

Posted by Pablo Estrada <pa...@google.com>.
You may need to set those up in setup.py so that the code runs in every
worker at startup.

On Wed, Sep 30, 2020, 10:16 AM Ross Vandegrift <
ross.vandegrift@cleardata.com> wrote:

> I see - it'd be great if the s3 io code would accept a boto session, so the
> default process could be overridden.
>
> But it looks like the module lazy loads boto3 and uses the default
> session.  So I think it'll work if we set up SDK env vars before the
> pipeline code.
>
> i.e., we'll try something like:
>
> os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
> os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
> os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'
>
> with beam.Pipeline(...) as p:
>     ...
>
> Ross
>
> On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> > Hi Ross,
> > it seems that this feature is missing (e.g. passing a pipeline option with
> > authentication information for AWS). I'm sorry about that - that's pretty
> > annoying.
> > I wonder if you can use the setup.py file to add the default configuration
> > yourself until we have appropriate support for pipeline option-based
> > authentication. Could you try adding this default config in setup.py?
> > Best
> > -P.
> >
> > On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> > ross.vandegrift@cleardata.com> wrote:
> > > Hello all,
> > >
> > > I have a python pipeline that writes data to an s3 bucket.  On my laptop
> > > it picks up the SDK credentials from my boto3 config and works great.
> > >
> > > Is it possible to provide credentials explicitly?  I'd like to use remote
> > > dataflow runners, which won't have implicit AWS credentials available.
> > >
> > > Thanks,
> > > Ross
> > >
> >
>

Re: Provide credentials for s3 writes

Posted by Ross Vandegrift <ro...@cleardata.com>.
I see - it'd be great if the s3 io code would accept a boto session, so the
default process could be overridden.

But it looks like the module lazy loads boto3 and uses the default
session.  So I think it'll work if we set up SDK env vars before the pipeline
code.

i.e., we'll try something like:

os.environ['AWS_ROLE_ARN'] = 'arn:aws:...'
os.environ['AWS_ROLE_SESSION_NAME'] = 'my-beam-pipeline'
os.environ['AWS_WEB_IDENTITY_TOKEN_FILE'] = '/path/to/id_token'

with beam.Pipeline(...) as p:
    ...

Ross

On Tue, 2020-09-29 at 14:29 -0700, Pablo Estrada wrote:
> Hi Ross,
> it seems that this feature is missing (e.g. passing a pipeline option with
> authentication information for AWS). I'm sorry about that - that's pretty
> annoying.
> I wonder if you can use the setup.py file to add the default configuration
> yourself until we have appropriate support for pipeline option-based
> authentication. Could you try adding this default config in setup.py?
> Best
> -P.
> 
> On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
> ross.vandegrift@cleardata.com> wrote:
> > Hello all,
> > 
> > I have a python pipeline that writes data to an s3 bucket.  On my laptop it
> > picks up the SDK credentials from my boto3 config and works great.
> > 
> > Is it possible to provide credentials explicitly?  I'd like to use remote
> > dataflow runners, which won't have implicit AWS credentials available.
> > 
> > Thanks,
> > Ross
> > 
> 

Re: Provide credentials for s3 writes

Posted by Pablo Estrada <pa...@google.com>.
Hi Ross,
it seems that this feature is missing (e.g. passing a pipeline option with
authentication information for AWS). I'm sorry about that - that's pretty
annoying.
I wonder if you can use the setup.py file to add the default configuration
yourself until we have appropriate support for pipeline option-based
authentication. Could you try adding this default config in setup.py?
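
A rough sketch of what I mean (package name and version are placeholders):
anything at module level in setup.py runs when the package is installed, so it
would also run on each worker at startup.

import setuptools

# ... write the default AWS config / credential files here, at module level ...

setuptools.setup(
    name='my-pipeline',
    version='0.0.1',
    packages=setuptools.find_packages(),
)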
Best
-P.

On Tue, Sep 29, 2020 at 11:16 AM Ross Vandegrift <
ross.vandegrift@cleardata.com> wrote:

> Hello all,
>
> I have a python pipeline that writes data to an s3 bucket.  On my laptop it
> picks up the SDK credentials from my boto3 config and works great.
>
> Is it possible to provide credentials explicitly?  I'd like to use remote
> dataflow runners, which won't have implicit AWS credentials available.
>
> Thanks,
> Ross
>