Posted to commits@beam.apache.org by ni...@apache.org on 2022/06/27 22:33:38 UTC

[beam] branch master updated: Issue#20877 Updated Interactive Beam README (#22034)

This is an automated email from the ASF dual-hosted git repository.

ningk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 141e9507437 Issue#20877 Updated Interactive Beam README (#22034)
141e9507437 is described below

commit 141e95074370e2e1dde40605d05e653503768e14
Author: Ning Kang <ni...@gmail.com>
AuthorDate: Mon Jun 27 15:33:30 2022 -0700

    Issue#20877 Updated Interactive Beam README (#22034)
    
    fixes #20877
---
 .../apache_beam/runners/interactive/README.md      | 146 ++++++++++++++++++---
 1 file changed, 125 insertions(+), 21 deletions(-)

diff --git a/sdks/python/apache_beam/runners/interactive/README.md b/sdks/python/apache_beam/runners/interactive/README.md
index 519e33b1a2e..15c1d6b3e95 100644
--- a/sdks/python/apache_beam/runners/interactive/README.md
+++ b/sdks/python/apache_beam/runners/interactive/README.md
@@ -27,38 +27,142 @@ exploration much faster and easier. It provides nice features including
 
 1.  Graphical representation
 
-    When a pipeline is executed on a Jupyter notebook, it instantly displays the
-    pipeline as a directed acyclic graph. Sampled PCollection results will be
-    added to the graph as the pipeline execution proceeds.
+    Visualize the Pipeline DAG:
 
-2.  Fetching PCollections as list
+    ```python
+    import apache_beam.runners.interactive.interactive_beam as ib
+    from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
+
+    p = beam.Pipeline(InteractiveRunner())
+    # ... add transforms
+    ib.show_graph(p)
+    ```
 
-    PCollections can be fetched as a list from the pipeline result. This unique
-    feature of
-    [InteractiveRunner](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_runner.py)
-    makes it much easier to integrate Beam pipeline into data analysis.
+    Visualize elements in a PCollection:
 
     ```python
-    p = beam.Pipeline(interactive_runner.InteractiveRunner())
-    pcoll = p | SomePTransform | AnotherPTransform
-    result = p.run().wait_until_finish()
-    pcoll_list = result.get(pcoll)  # This returns a list!
+    pcoll = p | beam.Create([1, 2, 3])
+    # include_window_info displays windowing information
+    # visualize_data visualizes data with https://pair-code.github.io/facets/
+    ib.show(pcoll, include_window_info=True, visualize_data=True)
     ```
+    For more details, see the docstrings of the `interactive_beam` module.
+
+2.  Support of streaming record/replay and dynamic visualization
 
-3.  Faster re-execution
+    For streaming pipelines, Interactive Beam automatically records a subset
+    of the data from unbounded sources in the pipeline so that it can be
+    replayed as the pipeline changes during prototyping.
 
-    [InteractiveRunner](https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/interactive_runner.py)
-    caches PCollection results of pipeline executed previously and re-uses it
-    when the same pipeline is submitted again.
+    There are a few knobs to tune the source recording:
 
     ```python
-    p = beam.Pipeline(interactive_runner.InteractiveRunner())
-    pcoll = p | SomePTransform | AnotherPTransform
-    result = p.run().wait_until_finish()
+    # Set how long to record data from unbounded sources.
+    ib.options.recording_duration = '10m'
+
+    # Set the recording size limit to 1 GB.
+    ib.options.recording_size_limit = 1e9
+
+    # Visualization updates dynamically as data streams in.
+    # n=100 displays at most 100 elements.
+    # duration=60 displays at most 60 seconds' worth of data generated by
+    # unbounded sources.
+    ib.show(pcoll, include_window_info=True, n=100, duration=60)
+
+    # duration can also be a string.
+    ib.show(pcoll, include_window_info=True, duration='1m')
 
-    pcoll2 = pcoll | YetAnotherPTransform
-    result = p.run().wait_until_finish()  # <- only executes YetAnotherPTransform
+    # If neither n nor duration is provided, the display continues indefinitely
+    # until the current machine's recording usage hits the threshold set by
+    # ib.options.
+    ib.show(pcoll, include_window_info=True)
     ```
+    For more details, see the docstrings of the `interactive_beam` module.
+
+3.  Fetching PCollections as pandas.DataFrame
+
+    PCollections can be collected as a pandas.DataFrame:
+
+    ```python
+    pcoll_df = ib.collect(pcoll)  # This returns a pandas.DataFrame!
+    ```
+
+4.  Faster execution and re-execution
+
+    Interactive Beam analyzes the pipeline graph based on the PCollection
+    you want to inspect and builds a pipeline fragment that computes only
+    the necessary data.
+
+    ```python
+    pcoll = p | PTransformA | PTransformB
+    pcoll2 = p | PTransformC | PTransformD
+
+    ib.collect(pcoll)  # <- only executes PTransformA and PTransformB
+    ib.collect(pcoll2)  # <- only executes PTransformC and PTransformD
+    ```
+
+    Interactive Beam caches previously inspected PCollections and reuses them
+    while the data is still in scope.
+
+    ```python
+    pcoll = p | PTransformA
+    # pcoll2 depends on pcoll
+    pcoll2 = pcoll | PTransformB
+    ib.collect(pcoll2)  # <- caches data for both pcoll and pcoll2
+
+    pcoll3 = pcoll2 | PTransformC
+    ib.collect(pcoll3)  # <- reuses data of pcoll2 and only executes PTransformC
+
+    pcoll4 = pcoll | PTransformD
+    ib.collect(pcoll4)  # <- reuses data of pcoll and only executes PTransformD
+    ```
+
+5. Supports global and local scopes
+
+   Interactive Beam automatically watches the `__main__` scope for pipeline and
+   PCollection definitions, so they are tracked without any explicit setup.
+
+   ```python
+   # In a script or in a notebook
+   p = beam.Pipeline(InteractiveRunner())
+   pcoll = p | SomeTransform
+   pcoll2 = pcoll | SomeOtherTransform
+
+   # p, pcoll and pcoll2 are all known to Interactive Beam.
+   ib.collect(pcoll)
+   ib.collect(pcoll2)
+   ib.show_graph(p)
+   ```
+
+   You have to explicitly watch pipelines and PCollections in your local scope.
+   Otherwise, Interactive Beam doesn't know about them and won't handle them
+   with interactive features.
+
+   ```python
+   def a_func():
+     p = beam.Pipeline(InteractiveRunner())
+     pcoll = p | SomeTransform
+     pcoll2 = pcoll | SomeOtherTransform
+
+     # Watch everything defined locally before this line.
+     ib.watch(locals())
+     # Or explicitly watch them.
+     ib.watch({
+         'p': p,
+         'pcoll': pcoll,
+         'pcoll2': pcoll2})
+
+     # p, pcoll and pcoll2 are all known to Interactive Beam.
+     ib.collect(pcoll)
+     ib.collect(pcoll2)
+     ib.show_graph(p)
+
+     return p, pcoll, pcoll2
+
+   # Or return them to the main scope.
+   p, pcoll, pcoll2 = a_func()
+   ib.collect(pcoll)  # Also works!
+   ```
 
 ## Status