Posted to commits@beam.apache.org by bh...@apache.org on 2022/10/28 16:00:25 UTC

[beam] branch master updated: adding examples in schema transforms section of programming guide for python (changes for issue #21022) (#23224)

This is an automated email from the ASF dual-hosted git repository.

bhulette pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new a9531951649  adding examples in schema transforms section of programming guide for python  (changes for issue #21022) (#23224)
a9531951649 is described below

commit a9531951649a474386b659134af1746f2d180664
Author: smeet07 <81...@users.noreply.github.com>
AuthorDate: Fri Oct 28 21:30:15 2022 +0530

     adding examples in schema transforms section of programming guide for python  (changes for issue #21022) (#23224)
    
    * changes for issue #21022
    
    @yeandy
    In the section "Using Schema Transforms" of the Python programming guide, there are missing examples.
    I've written the examples for top-level fields, nested fields and wildcards
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    
    * Update programming-guide.md
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    
    * whitespace changes
    
    * top level, nested fields and wildcards is shown only for java
    
    I've updated the read me to only show nested fields and wildcards for java as it is not available for python at the moment
    
    * added paragraphs for python
    
    I've added paragraphs for 6.6 section  to avoid confusing users for which python SDK support hasn't been developed yet
    
    * adding texts for python and Go SDKs
    
    * changes for grouping aggregations and joins
    
    * replacing highlight python by highlight py
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * replacing all python by py
    
    * using string argument approach
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * using appropriate functions
    
    * changes as combine functions are exposed now
    
    * Add TODO
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * spelling error
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * table shown for both java and python
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * minor changes
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * Update website/www/site/content/en/documentation/programming-guide.md
    
    Co-authored-by: Brian Hulette <hu...@gmail.com>
    
    * whitespace changes
    
    * should pass all test cases
    
    Co-authored-by: Andy Ye <an...@gmail.com>
    Co-authored-by: Brian Hulette <hu...@gmail.com>
---
 .../content/en/documentation/programming-guide.md  | 162 ++++++++++++++++++++-
 1 file changed, 160 insertions(+), 2 deletions(-)
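
Reviewer note: the Python `GroupBy(...).aggregate_field(...)` example added in this diff can be read against the plain-Python sketch below. It only illustrates the semantics described in the "Grouping aggregations" section (group by a key field, then write each aggregation result to a new output field). It does not use the Beam API; the helper `group_and_aggregate` and the sample records are hypothetical.

```python
from collections import defaultdict


def group_and_aggregate(rows, key_field, aggregations):
    """Group dict rows by key_field, then apply each aggregation.

    aggregations is a list of (source_field, fn, output_field) tuples,
    loosely mirroring the arguments of aggregate_field in the diff.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_field]].append(row)

    results = []
    for key, members in groups.items():
        out = {key_field: key}
        for source_field, fn, output_field in aggregations:
            # Each aggregation sees all values of its source field in the group.
            out[output_field] = fn([m[source_field] for m in members])
        results.append(out)
    return results


# Hypothetical sample data shaped like the purchase records in the guide.
purchases = [
    {"user_id": "u1", "item_id": "a", "cost_cents": 250},
    {"user_id": "u1", "item_id": "b", "cost_cents": 100},
    {"user_id": "u2", "item_id": "c", "cost_cents": 700},
]

summary = group_and_aggregate(
    purchases,
    "user_id",
    [
        ("item_id", len, "num_purchases"),
        ("cost_cents", sum, "total_spend_cents"),
        # Stand-in for a "top 10" combiner: the ten largest values, descending.
        ("cost_cents", lambda xs: sorted(xs, reverse=True)[:10], "top_purchases"),
    ],
)
```

Each element of `summary` has one field for the grouping key plus one field per aggregation, which is the shape the guide describes for the grouped output schema.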

diff --git a/website/www/site/content/en/documentation/programming-guide.md b/website/www/site/content/en/documentation/programming-guide.md
index 3dbea18d82b..1f249ed3da1 100644
--- a/website/www/site/content/en/documentation/programming-guide.md
+++ b/website/www/site/content/en/documentation/programming-guide.md
@@ -3792,39 +3792,89 @@ the user ids from a `PCollection` of purchases one would write (using the `Selec
 purchases.apply(Select.fieldNames("userId"));
 {{< /highlight >}}
 
+{{< highlight py >}}
+input_pc = ... # {"user_id": ...,"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select("user_id")
+{{< /highlight >}}
+
 ##### **Nested fields**
 
+{{< paragraph class="language-py" >}}
+Support for nested fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for nested fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 Individual nested fields can be specified using the dot operator. For example, to select just the postal code from the
  shipping address one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.postCode"));
 {{< /highlight >}}
 
+<!-- {{< highlight py >}}
+input_pc = ... # {"user_id": ..., "shipping_address": {"post_code": ...}, "bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select(post_code=lambda item: str(item["shipping_address.post_code"]))
+{{< /highlight >}} -->
 ##### **Wildcards**
 
+{{< paragraph class="language-py" >}}
+Support for wildcards hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for wildcards hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 The * operator can be specified at any nesting level to represent all fields at that level. For example, to select all
 shipping-address fields one would write
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("shippingAddress.*"));
 {{< /highlight >}}
 
+<!--
+{{< highlight py >}}
+# TODO(https://github.com/apache/beam/issues/23275): Add support for projecting nested fields
+input_pc = ... # {"user_id": ..., "shipping_address": {"post_code": ...}, "bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.Select("shipping_address.*")
+{{< /highlight >}} -->
 ##### **Arrays**
 
+{{< paragraph class="language-java" >}}
 An array field, where the array element type is a row, can also have subfields of the element type addressed. When
 selected, the result is an array of the selected subfield type. For example
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+Support for array fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for array fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
 
 {{< highlight java >}}
 purchases.apply(Select.fieldNames("transactions[].bank"));
 {{< /highlight >}}
 
+{{< paragraph class="language-java" >}}
 Will result in a row containing an array field with element-type string, containing the list of banks for each
 transaction.
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 While the use of  [] brackets in the selector is recommended, to make it clear that array elements are being selected,
 they can be omitted for brevity. In the future, array slicing will be supported, allowing selection of portions of the
 array.
+{{< /paragraph >}}
+
 
 ##### **Maps**
 
@@ -3858,6 +3908,14 @@ The following
 purchasesByType.apply(Select.fieldNames("purchases{}.userId"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for map fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for map fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 Will result in a row containing a map field with key-type string and value-type string. The selected map will contain
 all of the keys from the original map, and the values will be the userId contained in the purchase record.
 
@@ -3882,6 +3940,14 @@ could select only the userId and streetAddress fields as follows
 purchases.apply(Select.fieldNames("userId", "shippingAddress.streetAddress"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for nested fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for nested fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 The resulting `PCollection` will have the following schema
 
 <table>
@@ -3910,6 +3976,14 @@ The same is true for wildcard selections. The following
 purchases.apply(Select.fieldNames("userId", "shippingAddress.*"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for wildcards hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for wildcards hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 Will result in the following schema
 
 <table>
@@ -3956,6 +4030,15 @@ selected field will appear as its own array field. For example
 purchases.apply(Select.fieldNames( "transactions.bank", "transactions.purchaseAmount"));
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for nested fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for nested fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 Will result in the following schema
 <table>
   <thead>
@@ -3976,6 +4059,7 @@ Will result in the following schema
   </tbody>
 </table>
 <br/>
+{{< /paragraph >}}
 
 Wildcard selections are equivalent to separately selecting each field.
 
@@ -3993,6 +4077,15 @@ Another use of the Select transform is to flatten a nested schema into a single
 purchases.apply(Select.flattenedSchema());
 {{< /highlight >}}
 
+{{< paragraph class="language-py" >}}
+Support for nested fields hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for nested fields hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 Will result in the following schema
 <table>
   <thead>
@@ -4045,21 +4138,48 @@ Will result in the following schema
   </tbody>
 </table>
 <br/>
+{{< /paragraph >}}
 
 ##### **Grouping aggregations**
 
+{{< paragraph class="language-java" >}}
 The `Group` transform allows simply grouping data by any number of fields in the input schema, applying aggregations to
 those groupings, and storing the result of those aggregations in a new schema field. The output of the `Group` transform
 has a schema with one field corresponding to each aggregation performed.
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+The `GroupBy` transform allows simply grouping data by any number of fields in the input schema, applying aggregations to
+those groupings, and storing the result of those aggregations in a new schema field. The output of the `GroupBy` transform
+has a schema with one field corresponding to each aggregation performed.
+{{< /paragraph >}}
 
+{{< paragraph class="language-java" >}}
 The simplest usage of `Group` specifies no aggregations, in which case all inputs matching the provided set of fields
 are grouped together into an `ITERABLE` field. For example
+{{< /paragraph >}}
+
+{{< paragraph class="language-py" >}}
+The simplest usage of `GroupBy` specifies no aggregations, in which case all inputs matching the provided set of fields
+are grouped together into an `ITERABLE` field. For example
+{{< /paragraph >}}
 
 {{< highlight java >}}
-purchases.apply(Group.byFieldNames("userId", "shippingAddress.streetAddress"));
+purchases.apply(Group.byFieldNames("userId", "bank"));
 {{< /highlight >}}
 
+{{< highlight py >}}
+input_pc = ... # {"user_id": ...,"bank": ..., "purchase_amount": ...}
+output_pc = input_pc | beam.GroupBy('user_id','bank')
+{{< /highlight >}}
+
+{{< paragraph class="language-go" >}}
+Support for schema-aware grouping hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-java" >}}
 The output schema of this is:
+{{< /paragraph >}}
 
 <table>
   <thead>
@@ -4071,7 +4191,7 @@ The output schema of this is:
   <tbody>
     <tr>
       <td>key</td>
-      <td>ROW{userId:STRING, streetAddress:STRING}</td>
+      <td>ROW{userId:STRING, bank:STRING}</td>
     </tr>
     <tr>
       <td>values</td>
@@ -4104,6 +4224,18 @@ purchases.apply(Group.byFieldNames("userId")
     .aggregateField("costCents", Top.<Long>largestLongsFn(10), "topPurchases"));
 {{< /highlight >}}
 
+{{< highlight py >}}
+input_pc = ... # {"user_id": ..., "item_id": ..., "cost_cents": ...}
+output_pc = (input_pc | beam.GroupBy("user_id")
+    .aggregate_field("item_id", beam.combiners.CountCombineFn(), "num_purchases")
+    .aggregate_field("cost_cents", sum, "total_spend_cents")
+    .aggregate_field("cost_cents", beam.combiners.TopCombineFn(10), "top_purchases"))
+{{< /highlight >}}
+
+{{< paragraph class="language-go" >}}
+Support for schema-aware grouping hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 The result of this aggregation will have the following schema:
 <table>
   <thead>
@@ -4135,6 +4267,14 @@ that are likely associated with that transaction (both the user and product matc
 "natural join" - one in which the same field names are used on both the left-hand and right-hand sides of the join -
 and is specified with the `using` keyword:
 
+{{< paragraph class="language-py" >}}
+Support for joins hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for joins hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 {{< highlight java >}}
 PCollection<Transaction> transactions = readTransactions();
 PCollection<Review> reviews = readReviews();
@@ -4142,6 +4282,7 @@ PCollection<Row> joined = transactions.apply(
     Join.innerJoin(reviews).using("userId", "productId"));
 {{< /highlight >}}
 
+{{< paragraph class="language-java" >}}
 The resulting schema is the following:
 <table>
   <thead>
@@ -4162,12 +4303,21 @@ The resulting schema is the following:
   </tbody>
 </table>
 <br/>
+{{< /paragraph >}}
 
 Each resulting row contains one Transaction and one Review that matched the join condition.
 
 If the fields to match in the two schemas have different names, then the on function can be used. For example, if the
 Review schema named those fields differently than the Transaction schema, then we could write the following:
 
+{{< paragraph class="language-py" >}}
+Support for joins hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for joins hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 {{< highlight java >}}
 PCollection<Row> joined = transactions.apply(
     Join.innerJoin(reviews).on(
@@ -4188,6 +4338,14 @@ can optionally be expanded - providing individual joined records, as in the `Joi
 processed in unexpanded format - providing the join key along with Iterables of all records from each input that matched
 that key.
 
+{{< paragraph class="language-py" >}}
+Support for joins hasn't been developed for the Python SDK yet.
+{{< /paragraph >}}
+
+{{< paragraph class="language-go" >}}
+Support for joins hasn't been developed for the Go SDK yet.
+{{< /paragraph >}}
+
 ##### **Filtering events**
 
 The `Filter` transform can be configured with a set of predicates, each one based one specified fields. Only records for