Posted to dev@beam.apache.org by Pablo Estrada <pa...@google.com> on 2018/03/29 00:13:08 UTC

Adding a StepMetadataRegistry for Python SDK

Hello all,
I've filed https://issues.apache.org/jira/browse/BEAM-3955 to consider
adding a facility that translates between the different step names used by
the SDK and the runners.
This is currently a problem in Dataflow, where steps can have different
names in the backend and in the SDK.
This is observable in Beam code, where different parts of the
SDK/worker/runners use different names in their metrics:

- Logging uses Beam transform names (e.g. Foo/Bar)
- Metrics uses operation_name (e.g. s2)
- Statesampler uses operation_name.
- The Dataflow worker sets step_name to operation_name after creating the
operation.

I'd like to propose the following design outline:

   - Create an *execution context* that will allow runners to provide their
   specific functionality.
   - The execution context will be able to provide multiple pieces of
   runner-specific functionality (e.g. side input fetchers).
   - In this case, the execution contexts can have a StepNameRegistry, or
   StepRegistry, or StepMetadataRegistry of some kind, where step names and
   other metadata can be enrolled.
   - Runners can pass their execution contexts to operations, logging, and
   other modules.
   - Beam core can then switch to use Beam step names, and each runner's
   specific monitoring / metrics / etc classes can have their own logic for
   accessing these.
   - This would also allow us to remove the LoggingContext tracking, and
   rely only on statesampler for context tracking.
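
To make the outline above a bit more concrete, here is a minimal sketch of
what such a registry might look like. Every name in it (StepMetadata,
StepMetadataRegistry, ExecutionContext, register, lookup, beam_name_for) is
hypothetical and chosen only for illustration; none of it is taken from the
Beam codebase, and the real design could differ substantially.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class StepMetadata:
    """Names under which a single step is known (hypothetical structure)."""
    beam_name: str       # Beam transform name, e.g. "Foo/Bar"
    operation_name: str  # runner/backend name, e.g. "s2"


class StepMetadataRegistry:
    """Maps runner-side operation names to step metadata (hypothetical)."""

    def __init__(self) -> None:
        self._by_operation: Dict[str, StepMetadata] = {}

    def register(self, metadata: StepMetadata) -> None:
        # Enroll a step under its runner-side operation name.
        self._by_operation[metadata.operation_name] = metadata

    def lookup(self, operation_name: str) -> Optional[StepMetadata]:
        return self._by_operation.get(operation_name)

    def beam_name_for(self, operation_name: str) -> str:
        # Fall back to the operation name when the step was never enrolled.
        metadata = self.lookup(operation_name)
        return metadata.beam_name if metadata else operation_name


@dataclass
class ExecutionContext:
    """Runner-provided context passed to operations, logging, etc.
    (hypothetical; other runner-specific facilities would live here too)."""
    step_registry: StepMetadataRegistry = field(
        default_factory=StepMetadataRegistry)


# A runner enrolls its steps once, then hands the context around.
context = ExecutionContext()
context.step_registry.register(
    StepMetadata(beam_name="Foo/Bar", operation_name="s2"))

print(context.step_registry.beam_name_for("s2"))
print(context.step_registry.beam_name_for("s7"))
```

With something along these lines, logging and metrics code could both
resolve names through the context instead of each carrying its own
convention.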

Eventually, all of this should be fully contained in the portability API
and runners won't have to deal with these issues, but for now it seems like
a good compromise.

If this sounds good, I'll start working on an implementation.
Note that this is only a rough description, and I'm open to reconsidering
any and all aspects.

Best
-P.
-- 
Got feedback? go/pabloem-feedback

Re: Adding a StepMetadataRegistry for Python SDK

Posted by Lukasz Cwik <lc...@google.com>.

+1 on minimizing new code that will later be deleted, but if it gets us to
that goal faster it can still be worthwhile.

On Thu, Mar 29, 2018 at 5:51 PM Robert Bradshaw <ro...@google.com> wrote:

> If I understand correctly, this is something runner-specific that would
> live solely on the runner side, right? That is, over the Fn API we'd still
> have a single name for operations, rather than pushing this complexity
> into that protocol as well, which I'd really like to avoid. If that's the
> case, then it's a bit unclear what we'd be doing on the Python side, as all
> the non-SDK worker code is going to be thrown away in the new world and I'd
> like to avoid investing too much more there.

Re: Adding a StepMetadataRegistry for Python SDK

Posted by Robert Bradshaw <ro...@google.com>.

If I understand correctly, this is something runner-specific that would
live solely on the runner side, right? That is, over the Fn API we'd still
have a single name for operations, rather than pushing this complexity
into that protocol as well, which I'd really like to avoid. If that's the
case, then it's a bit unclear what we'd be doing on the Python side, as all
the non-SDK worker code is going to be thrown away in the new world and I'd
like to avoid investing too much more there.
