Posted to dev@beam.apache.org by Andrea Foegler <fo...@google.com> on 2018/08/14 17:43:52 UTC
External resources in Beam pipelines
Hi folks -
Many of you don't know me, as I don't contribute directly to Beam. But I
do a lot of work around the periphery, in particular considering how to
manage and monitor Beam pipelines.
I think there's room in Beam to greatly improve both the management and
monitoring story, especially around external resources. By far the most
common external resources in a pipeline are the data sources and sinks.
Nothing mentioned here is limited to those; it should be considered
equally valuable for any sort of RPC or other external connection made in a
pipeline. But I will concentrate on I/O here to keep the discussion focused.
The two questions I'd like Beamers to think about are:
1. How could I easily monitor a Beam pipeline AND all of its external
dependencies in a single monitoring experience? How could I easily
distinguish the external dependencies of a Beam pipeline?
2. How could I make a pipeline data source easily configurable so that I
could launch existing pipelines with a different data source? Is it
possible to do this even if the type of source changes? <<Note: A great
answer to this question might require rethinking templates a bit. More on
that later :) >>
I'm attaching a doc with these questions / ideas fleshed out a bit more. I
would love to hear your thoughts. And if we end up with some consensus,
I'd love your help in creating a plan to engineer some solutions to these
ideas.
Thanks!
Andrea
Re: External resources in Beam pipelines
Posted by Andrea Foegler <fo...@google.com>.
Here's the raw doc content, for your convenience.

Beam I/O representation
In the current Beam programming model, sources and sinks are virtually
indistinguishable from other transforms. From a composition point of view,
this is great. But from an integration point of view, special handling of
external data sources would be very useful. I don't intend to propose any
particular solution, but I will outline two use cases that would be great
to support in Beam. I will also note that generalizing these ideas to cover
all external dependencies and not just data sources seems like a good idea.
1. Externally configurable sources.
   a. Each I/O type should be describable using a configuration proto
      specific to that I/O type. This would make configuring sources work
      the same between different SDK languages. It would also make it
      obvious what parameterization is supported and how to use it, for
      example selecting a subset of BigQuery partitions.
   b. Tools that help a user construct a pipeline will likely already have
      some representation of the user's available data sources, something
      like a Data Lake or Data Hub. The easiest initial integration with
      these tools would be canned pipelines that support configuring with
      any type of data source. To support this, we would need pipelines
      that could be configured at runtime with both the type of source and
      the configuration. This would look something like a fully generic
      "Source" transform that can accept a runtime configuration for any
      supported input type (PubSub, BigQuery, TextIO, etc.).
   c. Supporting cross-language pipelines likely looks just like described
      in 'a', where the configured source is whatever type of in-between
      collection representation Beam decides to use for passing data
      between the two runtimes. We should use changes to support these
      cross-language pipelines to move us toward simpler, cleaner I/O
      configuration for all pipelines.
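
To make the "fully generic Source" idea a little more concrete, here is a
minimal sketch in plain Python. Everything in it is hypothetical and is not
an existing Beam API: SourceConfig stands in for the per-I/O-type
configuration proto, and the factories return description strings where a
real version would return the matching read transform.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


# Hypothetical stand-in for a per-I/O-type configuration proto.
@dataclass(frozen=True)
class SourceConfig:
    source_type: str  # e.g. "text" or "pubsub"
    params: Dict[str, Any] = field(default_factory=dict)


# Registry mapping a source type to a factory. In a real pipeline each
# factory would build a read transform; here it returns a description.
_SOURCE_FACTORIES: Dict[str, Callable[[Dict[str, Any]], str]] = {}


def register_source(source_type: str):
    """Decorator that registers a factory for one source type."""
    def decorator(factory):
        _SOURCE_FACTORIES[source_type] = factory
        return factory
    return decorator


@register_source("text")
def _text_source(params: Dict[str, Any]) -> str:
    return f"text source reading {params['file_pattern']}"


@register_source("pubsub")
def _pubsub_source(params: Dict[str, Any]) -> str:
    return f"pubsub source subscribed to {params['topic']}"


def build_source(config: SourceConfig) -> str:
    """Pick the source implementation from a runtime-supplied config."""
    try:
        factory = _SOURCE_FACTORIES[config.source_type]
    except KeyError:
        raise ValueError(
            f"unsupported source type: {config.source_type!r}") from None
    return factory(config.params)
```

With something of this shape, the same canned pipeline could be launched
with SourceConfig("text", {...}) one day and SourceConfig("pubsub", {...})
the next, without changing the pipeline code itself.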
2. Cross-pipeline monitoring
   a. When users are monitoring their pipelines, they often need to monitor
      their data sources as well for quota, growing backlog or other
      issues. Right now, digging into the pipeline representation to find
      information about data sources is quite tricky. It would be great if
      the Beam Pipeline proto could make external data sources a first
      class citizen so that they could be easily extracted by monitoring
      systems. Presumably, the representation presented in the proto could
      be the same one used for configuration. The data source description
      should make it clear in which transform it is accessed. Additionally,
      we should avoid introducing a second copy of this data for this
      purpose; for the sake of correctness and consistency, the operation
      of the pipeline and consumption of this config for monitoring should
      access the same description.
   b. In addition to a clear description of the data sources in the
      pipeline, it would be great for the Beam runtime to emit details
      around the data source when it is actually accessed as additional
      monitoring data. Since the exact data source may not be available in
      the description and may only be determined at runtime, Beam should
      export these details via monitoring data. Additionally, Beam should
      emit monitoring data to confirm access to the data sources at runtime
      even if the description fully described the source.
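
As a rough illustration of what "easily extracted by monitoring systems"
could mean, the sketch below walks a toy pipeline description and lists each
external resource together with the transform that accesses it. The
ExternalResource and TransformDescription types are assumptions made up for
this example; they are not the Beam Pipeline proto.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


# Hypothetical description of one external dependency of a transform.
@dataclass(frozen=True)
class ExternalResource:
    kind: str        # e.g. "pubsub" or "bigquery"
    identifier: str  # e.g. a topic or table name


# Hypothetical, simplified pipeline node; composites hold children.
@dataclass
class TransformDescription:
    name: str
    resources: List[ExternalResource] = field(default_factory=list)
    children: List["TransformDescription"] = field(default_factory=list)


def external_resources(
    root: TransformDescription, prefix: str = ""
) -> Iterator[Tuple[str, ExternalResource]]:
    """Yield (transform path, resource) for every declared resource."""
    path = f"{prefix}/{root.name}" if prefix else root.name
    for resource in root.resources:
        yield path, resource
    for child in root.children:
        yield from external_resources(child, path)
```

A monitoring tool could call external_resources on the same description the
runner executes, so there is only one copy of the data source information.
Sources that are only resolved at runtime would still need the emitted
monitoring data described in 'b'.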