Posted to dev@beam.apache.org by Hai Lu <lh...@apache.org> on 2019/05/17 03:01:45 UTC

Re: Enable security for data channels in portability

Hi Lukasz and Ankur,

Here is the PR that implements the idea:
https://github.com/apache/beam/pull/8597

Would appreciate it if you could take a look.

Thanks,
Hai

On Tue, Apr 30, 2019 at 9:13 AM Hai Lu <lh...@gmail.com> wrote:

> One thing to clarify is that we do not use Docker. I don't have much
> experience with Docker; I assume Docker itself already provides network
> isolation, and that's why it was never necessary to enable security in
> the portable runner before?
>
> Because we simply use processes, we need this extra secret (shared through
> the file system) for authentication.
>
> Let me create a ticket and send a PR, which should explain my intention
> better.
>
> Thanks,
> Hai
>
> On Mon, Apr 29, 2019 at 1:03 PM Lukasz Cwik <lc...@google.com> wrote:
>
>> Changing the address to be loopback based upon how the environment is
>> started (docker container/process/external/...) makes sense.
>>
>> How would the SDK and runner support storing/sharing this secret? (For
>> example, in the docker container, how would the secret get there?)
>>
>> On Mon, Apr 29, 2019 at 9:23 AM Hai Lu <lh...@gmail.com> wrote:
>>
>>> Hi Lukasz and Ankur,
>>>
>>> Thank you so much for your response! This is what we're
>>> doing/implementing in our internal fork right now:
>>>
>>>    1. We assume that the Java process and Python process *are always
>>>    colocated on the same host*, so first of all we use the "loopback"
>>>    address instead of the "any address" that's currently used on the Java
>>>    side. That way, the traffic between the SDK worker and the runner is
>>>    limited to the host and not exposed to the network.
>>>    2. Because of the multi-tenant nature of our environment, we still
>>>    want authentication even on localhost, so that random processes cannot
>>>    connect to the data ports. Because different jobs run under their own
>>>    user names, it's sufficient to *use the file system to store an ad-hoc
>>>    secret*, which can be shared by both the Python SDK and the Java runner.
>>>    Then the runner uses this secret to authenticate the worker (by using a
>>>    gRPC interceptor for this custom auth).
>>>    3. With the two steps above, we *no longer need transport layer
>>>    security* (SSL/TLS). So we are abandoning our initial plan to enable
>>>    SSL/TLS.
>>>
>>> Above is the high-level plan that I'm implementing. I would like to have
>>> a similar solution in open source that can be merged with our internal
>>> fork. Let me know what you think. If this sounds OK, I will create a ticket
>>> for myself and will first send out a short write-up in a Google Doc to
>>> collect comments soon.
>>>
>>> Thanks,
>>> Hai
>>>
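
A minimal sketch of steps 1 and 2 of the plan above, using only the Python standard library rather than gRPC or Beam code. The function names, the file name `channel.secret`, and the token format are assumptions made for illustration; the actual change is in the PR linked at the top of the thread. In a gRPC setup, a check like `is_authorized` would presumably run inside a server interceptor against a token sent as call metadata.

```python
import hmac
import os
import secrets
import socket
import stat
import tempfile


def write_secret(path: str) -> str:
    """Create a fresh random token in a file only the owning user can access."""
    token = secrets.token_hex(32)
    # O_EXCL refuses to reuse a file that another process may have pre-created.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(token)
    return token


def read_secret(path: str) -> str:
    """Read the token back, as the worker process would."""
    with open(path) as f:
        return f.read().strip()


def is_authorized(presented: str, expected: str) -> bool:
    """Compare tokens in constant time so timing does not leak a partial match."""
    return hmac.compare_digest(presented.encode(), expected.encode())


def bind_loopback() -> socket.socket:
    """Listen on 127.0.0.1 only (not 0.0.0.0), on a port chosen by the OS."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    return server


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as job_dir:
        secret_path = os.path.join(job_dir, "channel.secret")  # hypothetical name
        expected = write_secret(secret_path)
        mode = stat.S_IMODE(os.stat(secret_path).st_mode)
        print("secret file mode:", oct(mode))

        server = bind_loopback()
        print("listening on:", server.getsockname()[0])
        server.close()

        print("worker with file access:", is_authorized(read_secret(secret_path), expected))
        print("process guessing a token:", is_authorized("guess", expected))
```

The approach relies on each job running under its own user name, as described in step 2: file permissions keep other tenants from reading the token, and the loopback binding keeps the port off the network.
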
>>> On Fri, Apr 26, 2019 at 5:24 PM Ankur Goenka <go...@google.com> wrote:
>>>
>>>> In an offline chat with Hai, it seems useful for users to be able to
>>>> provide custom authentication, such as a secret that is distributed out
>>>> of band by the infrastructure and provided via the file system, an RPC to
>>>> another service, etc.
>>>> gRPC already has some mechanisms for standard and custom
>>>> authentication[1].
>>>> Instrumenting the gRPC channel using a command line option or environment
>>>> variable on the worker machines can be useful.
>>>>
>>>> [1] https://grpc.io/docs/guides/auth/
>>>>
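
One way a worker might locate an out-of-band secret of the kind described above is a command line option with an environment variable fallback. The sketch below is illustrative only: the flag `--auth_secret_file` and the variable `EXAMPLE_AUTH_SECRET_FILE` are hypothetical names, not existing Beam options.

```python
import argparse
import os
import sys

# Hypothetical environment variable name, chosen for this example only.
ENV_VAR = "EXAMPLE_AUTH_SECRET_FILE"


def resolve_secret_path(argv, environ):
    """Return the secret file path from the flag, else the env var, else None."""
    parser = argparse.ArgumentParser(add_help=False)
    # Hypothetical flag name, chosen for this example only.
    parser.add_argument("--auth_secret_file", default=None)
    # parse_known_args ignores any other options the worker was started with.
    args, _unknown = parser.parse_known_args(argv)
    return args.auth_secret_file or environ.get(ENV_VAR)


if __name__ == "__main__":
    path = resolve_secret_path(sys.argv[1:], os.environ)
    print("secret file:", path if path else "(none configured)")
```

Passing a file path rather than the secret itself keeps the token out of process listings and environment dumps.
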
>>>> On Fri, Apr 26, 2019 at 4:33 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>
>>>>> The link to the ApiServiceDescriptor is
>>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/model/pipeline/src/main/proto/endpoints.proto#L31
>>>>>
>>>>> On Fri, Apr 26, 2019 at 4:32 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I had originally taken a look at this a while ago but not much has
>>>>>> progressed since then. The original idea was that the ApiServiceDescriptor
>>>>>> would be extended to support secure ways of authentication/communication. I
>>>>>> was prototyping with an OAuth2 client credentials grant at the time but
>>>>>> dropped it as other things were more important. The only currently
>>>>>> supported mode across all SDKs is an implicit authenticated/secure mode
>>>>>> where all communication is assumed to already be encrypted/private (e.g.
>>>>>> over VPN that is managed externally with trusted services) and hence the
>>>>>> gRPC channel itself is insecure and there is no authentication being
>>>>>> performed.
>>>>>>
>>>>>> Even though sdk_worker.py seems like it supports credentials, no one
>>>>>> invokes the constructor with credentials enabled as can be seen by this
>>>>>> comment by Robert[1].
>>>>>>
>>>>>> For SSL/TLS support it seems like we need some way to configure a
>>>>>> runner to be told to use SSL/TLS (potentially with a custom private key and
>>>>>> trust chain). Do you have some suggestions on how we add support for
>>>>>> passing around channel/call[2] credentials?
>>>>>>
>>>>>> 1:
>>>>>> https://github.com/apache/beam/blob/476e17ed6badd4d5c06c4caf8a824805f40a8e7a/sdks/python/apache_beam/runners/worker/sdk_worker_main.py#L139
>>>>>> 2: https://grpc.io/docs/guides/auth/
>>>>>>
>>>>>> On Tue, Apr 23, 2019 at 5:06 PM Hai Lu <lh...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is Hai from LinkedIn. Daniel and I have been working on
>>>>>>> productionizing the Samza portable runner. BTW, Daniel didn't mention in
>>>>>>> his previous email that he has enabled and validated Python 3 for the
>>>>>>> Samza runner and it worked smoothly. Kudos to the team!
>>>>>>>
>>>>>>> Here I have a few security-related questions about portability. At
>>>>>>> LinkedIn, we enable SSL/TLS and ACLs for Kafka data and any data exchange.
>>>>>>> In the case of the portable runner, we're required to secure the data
>>>>>>> channels between the Java and Python processes as well because our Samza
>>>>>>> jobs run in a multi-tenant environment. While I'm currently working on
>>>>>>> this on our internal branch, I do want to keep it clean and consistent
>>>>>>> with the master branch.
>>>>>>>
>>>>>>> My questions are: were there any plans/thoughts around security for
>>>>>>> portability? I see that sdk_worker.py does have some code to create
>>>>>>> secured gRPC channels; is anyone actually leveraging that code? I don't
>>>>>>> see any work done on the Java side, though.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hai Lu
>>>>>>>
>>>>>>

Re: Enable security for data channels in portability

Posted by Ankur Goenka <go...@google.com>.
Hi Hai,

Thanks for the PR.
Added a couple of comments. Will take a detailed look later.

Thanks,
Ankur
