Posted to user@flink.apache.org by Ricardo Cardante <ri...@tutanota.com> on 2020/06/17 23:46:34 UTC

Interact with different S3 buckets from a shared Flink cluster

Hi!

We are working on a use case where multiple teams deploy their jobs to a shared Flink cluster. With this strategy, we are facing a challenge in how the jobs interact with S3. Since S3 is already configured for the state backend (through flink-conf.yaml), every time we use API functions that touch the file system (e.g., DataStream readFile), the application-level configuration appears to be overridden by the cluster's configuration when talking to external S3 buckets.
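To make the conflict concrete, here is a hypothetical job of the kind that misbehaves (bucket and job names invented):

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class CrossBucketReadJob {
    public static void main(String[] args) throws Exception {
      // The cluster's flink-conf.yaml already points the state backend at
      // one bucket (e.g. state.checkpoints.dir: s3://checkpoints-bucket/flink),
      // while this job reads from another team's bucket:
      StreamExecutionEnvironment env =
          StreamExecutionEnvironment.getExecutionEnvironment();
      env.readTextFile("s3://external-team-bucket/input/").print();
      // Both accesses resolve through the same cluster-level S3 filesystem
      // configuration, so application-level credentials appear to be ignored.
      env.execute("cross-bucket-read");
    }
  }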

What we've thought so far:

1. Provide a core-site.xml resource file targeting the external S3 buckets we want to interact with (a sketch of what we mean follows this list). We've tested this, and the credentials ultimately seem to be ignored in favor of the IAM roles that are pre-loaded on the instances;

2. Load the cluster instances with multiple IAM roles. The problem with this is that it would allow every job to interact with out-of-scope buckets;

3. Spin up multiple clusters with different configurations. We would like to avoid this, since we started from the premise of sharing a single cluster per context.
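
For reference, option 1 looked roughly like this (bucket name and key values invented), using Hadoop's per-bucket s3a settings:

  <property>
    <name>fs.s3a.bucket.external-team-bucket.access.key</name>
    <value>AKIA-EXAMPLE</value>
  </property>
  <property>
    <name>fs.s3a.bucket.external-team-bucket.secret.key</name>
    <value>EXAMPLE-SECRET</value>
  </property>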

What would be a clean/recommended solution to interact with multiple S3 buckets with different security policies from a shared Flink cluster? 


Thanks in advance.

Re: Interact with different S3 buckets from a shared Flink cluster

Posted by Steven Wu <st...@gmail.com>.
Internally, we have our own ConfigurableCredentialsProvider. Based on the
config in core-site.xml, it performs assume-role with the proper IAM role
using STSAssumeRoleSessionCredentialsProvider. We just need to grant the
instance credentials permission to assume the IAM role for bucket access.
A single core-site.xml lays out all the mappings:

  <property>
    <name>aws.iam.role.arn.${BUCKET_NAME}</name>
    <value>arn:aws:iam::${ACCOUNT_NUMBER}:role/${BUCKET_ROLE_NAME}</value>
  </property>
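
Roughly, the provider can look like the following, a simplified sketch
rather than our exact code; it assumes the AWS SDK v1 and that s3a
instantiates the provider through a (URI, Configuration) constructor when
set via fs.s3a.aws.credentials.provider:

  import java.net.URI;
  import com.amazonaws.auth.AWSCredentials;
  import com.amazonaws.auth.AWSCredentialsProvider;
  import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
  import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider;
  import org.apache.hadoop.conf.Configuration;

  // Simplified sketch, not the production implementation.
  public class ConfigurableCredentialsProvider
      implements AWSCredentialsProvider {

    private final AWSCredentialsProvider delegate;

    public ConfigurableCredentialsProvider(URI uri, Configuration conf) {
      String bucket = uri.getHost();
      // Look up the role mapped to this bucket in core-site.xml.
      String roleArn = conf.get("aws.iam.role.arn." + bucket);
      // Fall back to the plain instance credentials if no role is mapped.
      delegate = (roleArn == null)
          ? new DefaultAWSCredentialsProviderChain()
          : new STSAssumeRoleSessionCredentialsProvider.Builder(
                roleArn, "flink-" + bucket).build();
    }

    @Override
    public AWSCredentials getCredentials() {
      return delegate.getCredentials();
    }

    @Override
    public void refresh() {
      delegate.refresh();
    }
  }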

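The permission grant means each bucket role's trust policy must allow the
instance role as a principal; schematically (account and role names
invented):

  {
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/flink-instance-role" },
      "Action": "sts:AssumeRole"
    }]
  }
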
On Mon, Jun 22, 2020 at 7:07 AM Arvid Heise <ar...@ververica.com> wrote:

> [...]

Re: Interact with different S3 buckets from a shared Flink cluster

Posted by Arvid Heise <ar...@ververica.com>.
Hi Ricardo,

one option is to use s3p (Presto) for checkpointing and s3a for custom
applications, and to attach a different configuration to each scheme.
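
For example (bucket name invented; please check the exact keys against the
S3 documentation of your Flink version), checkpointing can be pinned to the
Presto filesystem in flink-conf.yaml:

  state.checkpoints.dir: s3p://checkpoints-bucket/flink/checkpoints

The s3a side is then free to carry its own settings, e.g. Hadoop's
per-bucket keys of the form fs.s3a.bucket.<bucket>.<option> in
core-site.xml, without touching the checkpointing path.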

In general, I'd recommend using a cluster per application precisely to
avoid such issues. I'd use K8s and attach the respective IAM role to each
application pod (e.g., with kiam).
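
With kiam, assigning the role comes down to an annotation on the pod; a
hypothetical sketch (names and image invented; kiam also requires the
namespace to permit the role):

  apiVersion: v1
  kind: Pod
  metadata:
    name: team-a-flink-taskmanager
    annotations:
      iam.amazonaws.com/role: team-a-flink-role
  spec:
    containers:
      - name: taskmanager
        image: flink:1.10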

On Thu, Jun 18, 2020 at 1:46 AM Ricardo Cardante <
ricardocardante@tutanota.com> wrote:

> [...]


-- 

Arvid Heise | Senior Java Developer

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng