You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by lo...@apache.org on 2023/02/08 22:19:07 UTC

[beam] 01/02: [prism] Add initial README

This is an automated email from the ASF dual-hosted git repository.

lostluck pushed a commit to branch prism
in repository https://gitbox.apache.org/repos/asf/beam.git

commit a8a9e57e6bdd4f5478e78ee08aa0dc584eab9749
Author: lostluck <13...@users.noreply.github.com>
AuthorDate: Wed Feb 8 13:10:01 2023 -0800

    [prism] Add initial README
---
 sdks/go/pkg/beam/runners/prism/README.md | 169 +++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)

diff --git a/sdks/go/pkg/beam/runners/prism/README.md b/sdks/go/pkg/beam/runners/prism/README.md
new file mode 100644
index 00000000000..fbd73d124c2
--- /dev/null
+++ b/sdks/go/pkg/beam/runners/prism/README.md
@@ -0,0 +1,169 @@
+# Apache Beam Go Prism Runner
+
+Prism is a local portable Apache Beam runner authored in Go.
+
+* Local, for fast startup and ease of testing on a single machine.
+* Portable, in that it uses the Beam FnAPI to communicate with Beam SDKs of any language.
+* Go simple concurrency enables clear structures for testing batch through streaming jobs.
+
+It's intended to replace the current Go Direct runner, but also be for general
+single machine use.
+
+For Go SDK users:
+  - Short term: set runner to "prism" to use it, or invoke directly.
+  - Medium term: switch the default from "direct" to "prism". 
+  - Long term: alias "direct" to "prism", and delete legacy Go direct runner.
+
+Prisms allow breaking apart and separating a beam of light into
+it's component wavelengths, as well as recombining them together.
+
+The Prism Runner leans on this metaphor with the goal of making it
+easier for users and Beam SDK developers alike to test and validate
+aspects of Beam that are presently under represented.
+
+## Configurability
+
+Prism is configurable using YAML, which is eagerly validated on startup.
+The configuration contains a set of variants to specify execution behavior, 
+either to support specific testing goals, or to emulate different runners.
+
+Beam's implementation contains a number of details that are hidden from
+users, and to date, no runner implements the same set of features. This
+can make SDK or pipeline development difficult, since exactly what is
+being tested will vary on the runner being used. 
+
+At the top level the configuration contains "variants", and the variants
+configure the behaviors of different "handlers" in Prism. 
+
+Jobs will be able to provide a pipeline option to select which variant to
+use. Multiple jobs on the same prism instance can use different variants.
+Jobs which don't provide a variant will default to testing behavior.
+
+All variants should execute the Beam Model faithfully and correctly, 
+and with few exceptions it should not be possible for there to be an
+invalid execution. The machine's the limit.
+
+It's not expected that all handler options are useful for pipeline authors, 
+These options should remain useful for SDK developers,
+or more precise issue reproduction.
+
+For more detail on the motivation, see Robert Burke's (@lostluck) Beam Summit 2022 talk:
+https://2022.beamsummit.org/sessions/portable-go-beam-runner/.
+
+Here's a non-exhaustive set of variants.
+
+### Variant Highlight: "default"
+
+The "default" variant is testing focused, intending to route out issues at development
+time, rather than discovering them on production runners. Notably, this mode should 
+never use fusion, executing each Transform individually and independantly, one at a time.
+
+This variant should be able to execute arbitrary pipelines, correctly, with clarity and
+precision when an error occurs. Other features supported by the SDK should be enabled by default to
+ensure good coverage, such as caches, or RPC reductions like sending elements in 
+ProcessBundleRequest and Response, as they should not affect correctness. Composite
+transforms like Splitable DoFns and Combines should be expanded to ensure coverage.
+
+Additional validations may be added as time goes on.
+
+Does not retry or provide other resilience features, which may mask errors. 
+
+To ensure coverage, there may be sibling variants that use mutually exclusive alternative
+executions.
+
+### Variant Highlight: "fast"
+
+Not Yet Implemented - Illustrative goal.
+
+The "fast" variant is performance focused, intended for local scale execution.
+A psuedo production execution. Fusion optimizations should be performed. 
+Large PCollection should be offloaded to persistent disk. Bundles should be 
+dynamically split. Multiple Bundles should be executed simultaneously. And so on.
+
+Pipelines should execute as swiftly as possible within the bounds of correct
+execution.
+
+### Variant Hightlight: "flink" "dataflow" "spark" AKA Emulations
+
+Not Yet Implemented - Illustrative goal.
+
+Emulation variants have the goal of replicating on the local scale,
+the behaviors of other runners. Flink execution never "lifts" Combines, and
+doesn't dynamically split. Dataflow has different characteristics for batch
+and streaming execution with certain execution charateristics enabled or
+disabled.
+
+As Prism is intended to implement all facets of Beam Model execution, the handlers
+can have features selectively disabled to ensure 
+
+## Current Limitations
+
+* Experimental and testing use only.
+* Executing docker containers isn't yet implemented.
+    * This precludes running the Java and Python SDKs, or their transforms for Cross Language.
+* In Memory Only
+    * Not yet suitable for larger jobs, which may have intermediate data that exceeds memory bounds.
+    * Doesn't yet support sufficient intermediate data garbage collection for indefinite stream processing.
+* Doesn't yet execute all beam pipeline features.
+* No UI for job status inspection.
+
+## Implemented so far.
+
+* DoFns
+    * Side Inputs
+    * Multiple Outputs
+* Flattens
+* GBKs
+    * Includes handling session windows.
+    * Global Window 
+    * Interval Windowing
+    * Session Windows.
+* Combines lifted and unlifted.
+* Expands Splittable DoFns
+* Limited support for Process Continuations
+  * Residuals are rescheduled for execution immeadiately.
+  * The transform must be finite (and eventually return a stop process continuation)
+* Basic Metrics support
+
+## Next feature short list (unordered)
+
+See https://github.com/apache/beam/issues/24789 for current status.
+
+* Resolve watermark advancement for Process Continuations
+* Test Stream
+* Triggers & Complex Windowing Strategy execution.
+* State
+* Timers
+* "PubSub" Transform
+* Support SDK Containers via Testcontainers
+  * Cross Language Transforms
+* FnAPI Optimizations
+  * Fusion
+  * Data with ProcessBundleRequest & Response
+* Progess tracking
+    * Channel Splitting
+    * Dynamic Splitting
+* Stand alone execution support 
+* UI reporting of in progress jobs
+
+This is not a comprehensive feature set, but a set of goals to best
+support users of the Go SDK in testing their pipelines.
+
+## How to contribute
+
+Until additional structure is necessary, check the main issue
+https://github.com/apache/beam/issues/24789 for the current
+status, file an issue for the feature or bug to fix with `[prism]`
+in the title, and refer to the main issue, before begining work
+to avoid duplication of effort.
+
+If a feature will take a long time, please send a PR to
+link to your issue from this README to help others discover it.
+
+Otherwise, ordinary [Beam contribution guidelines apply](https://beam.apache.org/contribute/).
+
+# Long Term Goals
+
+Once support for containers is implemented, Prism should become a target
+for the Java Runner Validation tests, which are the current specification
+for correct runner behavior. This will inform further feature developement.