Posted to dev@beam.apache.org by Andrea Foegler <fo...@google.com> on 2018/08/14 17:43:52 UTC

External resources in Beam pipelines

Hi folks -

Many of you don't know me, as I don't contribute directly to Beam.  But I
do a lot of work around the periphery, in particular considering how to
manage and monitor Beam pipelines.

I think there's room in Beam to greatly improve both the management and
monitoring story, especially around external resources.  By far the most
common external resources in a pipeline are the data sources and sinks.
Nothing mentioned here is limited to those; it should be considered
equally valuable for any sort of RPC or other external connection made in a
pipeline.  But I will stick to I/O here to keep the discussion focused.

The two questions I'd like Beamers to think about are:
1.  How could I easily monitor a Beam pipeline AND all of its external
dependencies in a single monitoring experience?  How could I easily
distinguish the external dependencies of a Beam pipeline?

2.  How could I make a pipeline's data source configurable so that I
could easily launch existing pipelines with a different data source?  Is it
possible to do this even if the type of source changes?  <<Note: A great
answer to this question might require rethinking templates a bit.  More on
that later :) >>

I'm attaching a doc with these questions / ideas fleshed out a bit more.  I
would love to hear your thoughts.  And if we end up with some consensus,
I'd love your help in creating a plan to engineer some solutions to these
ideas.

Thanks!
Andrea

Re: External resources in Beam pipelines

Posted by Andrea Foegler <fo...@google.com>.

Here's the raw doc content, for your convenience.

Beam I/O representation

In the current Beam programming model, sources and sinks are virtually
indistinguishable from other transforms.  From a composition point of view,
this is great. But from an integration point of view, special handling of
external data sources would be very useful.  I don't intend to propose any
particular solution, but I will outline two use cases that would be great
to support in Beam. I will also note that generalizing these ideas to cover
all external dependencies and not just data sources seems like a good idea.


   1. Externally configurable sources.

      a. Each I/O type should be describable using a configuration proto
         specific to that I/O type.  This would make configuring sources
         work the same way across SDK languages.  It would also make it
         obvious what parameterization is supported and how to use it, for
         example selecting a subset of BigQuery partitions.

      b. Tools that help a user construct a pipeline will likely already
         have some representation of the user's available data sources,
         something like a Data Lake or Data Hub.  The easiest initial
         integration with these tools would be canned pipelines that can be
         configured with any type of data source.  To support this, we
         would need pipelines that could be configured at runtime with both
         the type of source and its configuration.  This would look
         something like a fully generic "Source" transform that can accept
         a runtime configuration for any supported input type (PubSub,
         BigQuery, TextIO, etc.).

      c. Supporting cross-language pipelines likely looks much like what is
         described in 'a', where the configured source is whatever type of
         in-between collection representation Beam decides to use for
         passing data between the two runtimes.  We should use the changes
         made to support these cross-language pipelines to move us toward
         simpler, cleaner I/O configuration for all pipelines.
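To make the generic "Source" idea above a bit more concrete, here is a
minimal sketch in plain Python.  It does not use the Beam SDK, and every
name in it (SourceConfig, register_source, generic_source, the "inline" and
"text" source types) is made up for illustration.  A real design would
presumably use per-I/O configuration protos rather than a dict of strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class SourceConfig:
    """Hypothetical runtime configuration: a source type plus its parameters."""
    source_type: str
    params: Dict[str, str] = field(default_factory=dict)


Reader = Callable[[Dict[str, str]], Iterable[str]]

# Registry mapping a source type name to the function that reads it.
_READERS: Dict[str, Reader] = {}


def register_source(source_type: str) -> Callable[[Reader], Reader]:
    """Decorator that registers a reader under the given source type."""
    def decorator(fn: Reader) -> Reader:
        _READERS[source_type] = fn
        return fn
    return decorator


@register_source("inline")
def read_inline(params: Dict[str, str]) -> Iterable[str]:
    # Records are passed directly in the configuration, comma separated.
    return params["values"].split(",")


@register_source("text")
def read_text(params: Dict[str, str]) -> Iterable[str]:
    # Records are the lines of a local text file.
    with open(params["path"], encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def generic_source(config: SourceConfig) -> List[str]:
    """Pick the reader at runtime based on the configured source type."""
    try:
        reader = _READERS[config.source_type]
    except KeyError:
        raise ValueError(f"unsupported source type: {config.source_type!r}") from None
    return list(reader(config.params))


def record_lengths(config: SourceConfig) -> List[int]:
    """A 'canned pipeline' that never needs to know which source it reads."""
    return [len(record) for record in generic_source(config)]
```

The same canned pipeline can then be launched against a different source,
or a different type of source, by changing only the configuration, e.g.
record_lengths(SourceConfig("inline", {"values": "a,bb,ccc"})).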



   2. Cross-pipeline monitoring

      a. When users are monitoring their pipelines, they often need to
         monitor their data sources as well for quota, growing backlog or
         other issues.  Right now, digging into the pipeline representation
         to find information about data sources is quite tricky.  It would
         be great if the Beam Pipeline proto could make external data
         sources a first-class citizen so that they could be easily
         extracted by monitoring systems.  Presumably, the representation
         presented in the proto could be the same one used for
         configuration.  The data source description should make it clear
         in which transform it is accessed.  Additionally, we should avoid
         introducing a second copy of this data for this purpose; for the
         sake of correctness and consistency, the operation of the pipeline
         and the consumption of this config for monitoring should access
         the same description.

      b. In addition to a clear description of the data sources in the
         pipeline, it would be great for the Beam runtime to emit details
         about the data source when it is actually accessed, as additional
         monitoring data.  Since the exact data source may not be available
         in the description and may only be determined at runtime, Beam
         should export these details via monitoring data.  Additionally,
         Beam should emit monitoring data to confirm access to the data
         sources at runtime even if the description fully described the
         source.
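As a rough illustration of the monitoring use case, the sketch below models
a pipeline description in which each transform lists the external resources
it touches, so a monitoring system can extract them without digging through
the transforms themselves.  It is plain Python, not the Beam Pipeline
proto, and all names (ExternalResource, Transform, PipelineDescription,
AccessLog, run) and resource identifiers are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class ExternalResource:
    """Hypothetical description of one external dependency."""
    kind: str   # e.g. "pubsub" or "bigquery"
    name: str   # resource identifier


@dataclass
class Transform:
    name: str
    resources: Tuple[ExternalResource, ...] = ()


@dataclass
class PipelineDescription:
    transforms: List[Transform]


def external_resources(
    pipeline: PipelineDescription,
) -> List[Tuple[str, ExternalResource]]:
    """Return (transform name, resource) pairs for a monitoring system."""
    return [(t.name, r) for t in pipeline.transforms for r in t.resources]


@dataclass
class AccessLog:
    """Stand-in for monitoring data emitted by the runtime."""
    events: List[Tuple[str, ExternalResource]] = field(default_factory=list)

    def record(self, transform_name: str, resource: ExternalResource) -> None:
        self.events.append((transform_name, resource))


def run(pipeline: PipelineDescription, log: AccessLog) -> None:
    """Walk the same description the monitor reads and log each access."""
    for transform in pipeline.transforms:
        for resource in transform.resources:
            # A real runner would open the connection here; this sketch only
            # records that the access happened.
            log.record(transform.name, resource)
```

Because external_resources and run both read the one PipelineDescription,
there is no second copy of the source information to drift out of sync, and
the AccessLog events confirm at runtime what the description declared.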

