Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/08/03 19:11:17 UTC

[GitHub] [beam] lostluck commented on a change in pull request #12448: [BEAM-9679] Add Additional Parameters lesson to Go SDK Katas

lostluck commented on a change in pull request #12448:
URL: https://github.com/apache/beam/pull/12448#discussion_r464583093



##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~     http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing, software
+  ~ distributed under the License is distributed on an "AS IS" BASIS,
+  ~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  ~ See the License for the specific language governing permissions and
+  ~ limitations under the License.
+  -->
+
+# Additional Parameters - Window and Timestamp

Review comment:
       Additional Parameters is a weird name for this section. It's simply how this is implemented in the Go SDK.  There are also other additional parameters that aren't being covered which may get confusing as we add them to the SDK (Pane, Timers, State...)
   
   "Windowing" or "Windows and Timestamps" stands fairly well on its own.
   
   

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.
+Bounded data is of a fixed size such as a file or database query.  Unbounded data comes
+from a continuously updated source such as a subscription or stream.
+
+A window is a view into a fixed beginning and fixed end to a set of data.  In the beam model, windowing subdivides 
+a PCollection according to the timestamps of its individual elements.  This is useful
+for unbounded data because it allows the model to work with fixed element sizes.  Note that windowing
+is not unique to unbounded data.  The beam model windows all data whether it is bounded or unbounded.
+Yet, when you read from a fixed size source such as a file, beam applies the same timestamp to all the elements.

Review comment:
       Beam itself doesn't specify timestamps; that is transform- or runner-dependent. If the framework receives timestamps, it propagates them or updates them as the transforms require.
   
   e.g. "The reading transform applies a timestamp..." not "beam applies the timestamp"
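
To illustrate the distinction, here is a minimal sketch in plain Go (standard library only, not the Beam SDK). Every name in it (`element`, `commit`, `readWithSingleTimestamp`, `readWithRecordTimestamps`, `shout`) is hypothetical and only stands in for the idea that the reading step chooses the timestamp, while later steps carry it along.

```go
package main

import (
	"fmt"
	"strings"
)

// element is a hypothetical stand-in for a timestamped PCollection element.
type element struct {
	value     string
	eventTime int64 // Unix seconds
}

// commit is a hypothetical input record that carries its own time.
type commit struct {
	message     string
	committedAt int64 // Unix seconds
}

// readWithSingleTimestamp mimics a reading step that stamps every
// element with the same timestamp ts.
func readWithSingleTimestamp(lines []string, ts int64) []element {
	out := make([]element, 0, len(lines))
	for _, l := range lines {
		out = append(out, element{value: l, eventTime: ts})
	}
	return out
}

// readWithRecordTimestamps mimics a reading step that uses the time
// stored in each record instead.
func readWithRecordTimestamps(cs []commit) []element {
	out := make([]element, 0, len(cs))
	for _, c := range cs {
		out = append(out, element{value: c.message, eventTime: c.committedAt})
	}
	return out
}

// shout is a downstream step: it changes the value and propagates the
// timestamp it received unchanged.
func shout(in []element) []element {
	out := make([]element, 0, len(in))
	for _, e := range in {
		out = append(out, element{value: strings.ToUpper(e.value), eventTime: e.eventTime})
	}
	return out
}

func main() {
	same := readWithSingleTimestamp([]string{"fix typo", "add kata"}, 100)
	own := readWithRecordTimestamps([]commit{{"fix typo", 10}, {"add kata", 20}})
	fmt.Println(shout(same))
	fmt.Println(shout(own))
}
```

In both cases the downstream step is identical; only the reading step decides which timestamps the elements carry.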

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.
+Bounded data is of a fixed size such as a file or database query.  Unbounded data comes
+from a continuously updated source such as a subscription or stream.
+
+A window is a view into a fixed beginning and fixed end to a set of data.  In the beam model, windowing subdivides 
+a PCollection according to the timestamps of its individual elements.  This is useful
+for unbounded data because it allows the model to work with fixed element sizes.  Note that windowing
+is not unique to unbounded data.  The beam model windows all data whether it is bounded or unbounded.
+Yet, when you read from a fixed size source such as a file, beam applies the same timestamp to all the elements.
+
+Beam will include information about the window and timestamp to your elements in your DoFn.  All your previous
+lessons' DoFn had this information provided, yet you never made use of it in your DoFn parameters.  In this 
+lesson you will.  The simple toy dataset has five git commit messages and their timestamps 
+from the [Apache Beam public repository](https://github.com/apache/beam).  Their timestamps have been
+applied to the PCollection input to simulate an unbounded dataset.

Review comment:
       Probably repeating myself now, but bounded datasets can have timestamps as well.
   
   Speaking outside of the context of this lesson:
   Suppose you have a stream of data from pubsub or something. Each element has its publishing time associated with it. However, data can be late*, which means you might emit less-than-accurate results if you want to maintain your ~1 minute averages or similar. To have the daily graphs be correct after the fact, you could preserve the incoming datastream, timestamps and all, in some files. Later you could run the same pipeline against those files to get the correct running averages throughout the day, just by replacing the streaming source transform with the batch source transform, along with the respective sinks. Fun eh?
   
   *which you can configure Beam to handle, but that's not implemented in the Go SDK yet.
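
A rough sketch of that replay idea, in plain Go rather than the Beam SDK: the same per-minute averaging function runs over what arrived on time and over the archived data that also contains the late element. The `reading` type, the `minuteAverages` helper, and the sample values are all made up for illustration, and timestamps are assumed to be non-negative Unix seconds.

```go
package main

import "fmt"

// reading is a hypothetical archived element: a value plus the event
// time it was published with.
type reading struct {
	eventTime int64 // Unix seconds, assumed non-negative
	value     float64
}

// minuteAverages groups readings by the minute their event time falls
// in and returns the average per minute, keyed by the minute's start.
// The result depends only on the timestamps, not on arrival order.
func minuteAverages(rs []reading) map[int64]float64 {
	sums := make(map[int64]float64)
	counts := make(map[int64]int)
	for _, r := range rs {
		start := r.eventTime - r.eventTime%60
		sums[start] += r.value
		counts[start]++
	}
	avgs := make(map[int64]float64, len(sums))
	for start, s := range sums {
		avgs[start] = s / float64(counts[start])
	}
	return avgs
}

func main() {
	// What the streaming run saw before the late element showed up.
	live := []reading{{10, 2}, {70, 10}}
	// The archive, which also holds the element that arrived late.
	archive := []reading{{10, 2}, {70, 10}, {50, 4}}

	fmt.Println(minuteAverages(live)[0], minuteAverages(archive)[0])
	// prints: 2 3
}
```

The first-minute average changes from 2 to 3 once the late element is included, which is the correction the batch rerun would give you.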
   

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.
+Bounded data is of a fixed size such as a file or database query.  Unbounded data comes
+from a continuously updated source such as a subscription or stream.
+
+A window is a view into a fixed beginning and fixed end to a set of data.  In the beam model, windowing subdivides 
+a PCollection according to the timestamps of its individual elements.  This is useful
+for unbounded data because it allows the model to work with fixed element sizes.  Note that windowing
+is not unique to unbounded data.  The beam model windows all data whether it is bounded or unbounded.
+Yet, when you read from a fixed size source such as a file, beam applies the same timestamp to all the elements.
+
+Beam will include information about the window and timestamp to your elements in your DoFn.  All your previous
+lessons' DoFn had this information provided, yet you never made use of it in your DoFn parameters.  In this 

Review comment:
       I'd say "available" rather than "provided".

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.

Review comment:
       Bounded vs Unbounded is orthogonal to windowing/event times. There's no need to understand it to understand the other. Windowing is useful and available to both kinds of PCollection. I'd recommend not mentioning it at all at this juncture.

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.
+Bounded data is of a fixed size such as a file or database query.  Unbounded data comes
+from a continuously updated source such as a subscription or stream.
+
+A window is a view into a fixed beginning and fixed end to a set of data.  In the beam model, windowing subdivides 
+a PCollection according to the timestamps of its individual elements.  This is useful
+for unbounded data because it allows the model to work with fixed element sizes.  Note that windowing

Review comment:
       WRT elements, size refers to how many bytes an element takes up. You probably mean counts.
   Windowing doesn't set things to fixed sizes or element sizes, or even counts. 
   
   WRT bounded/unbounded, note that the text is saying "It's true for A!" "It's also true for not A!" "It's true for both A and not A!"
   So my recommendation is to not mention A at all.
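
As a small illustration of the "not even counts" point, here is a sketch in plain Go (not the Beam SDK) of assigning timestamps to fixed 60-second windows. The helper names `windowStart` and `countPerWindow` and the sample timestamps are hypothetical. The window length in time is fixed, but the number of elements per window is whatever the timestamps dictate.

```go
package main

import "fmt"

// windowStart returns the start (in Unix seconds) of the fixed window
// of length sizeSec that contains ts. sizeSec must be positive.
func windowStart(ts, sizeSec int64) int64 {
	r := ts % sizeSec
	if r < 0 {
		r += sizeSec // round down for timestamps before the epoch
	}
	return ts - r
}

// countPerWindow assigns each timestamp to its fixed window and counts
// how many elements land in each one.
func countPerWindow(timestamps []int64, sizeSec int64) map[int64]int {
	counts := make(map[int64]int)
	for _, ts := range timestamps {
		counts[windowStart(ts, sizeSec)]++
	}
	return counts
}

func main() {
	// Made-up event times, in seconds.
	ts := []int64{5, 12, 59, 61, 185}
	counts := countPerWindow(ts, 60)
	fmt.Println(counts[0], counts[60], counts[120], counts[180])
	// prints: 3 1 0 1
}
```

Three elements fall in the first window, one in the second, none in the third, and one in the fourth, so neither the count nor the byte size per window is fixed.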
   
   

##########
File path: learning/katas/go/core_transforms/additional_parameters/additional_parameters/task.md
##########
@@ -0,0 +1,84 @@
+
+# Additional Parameters - Window and Timestamp
+
+This lesson introduces the concept of windowing and timestamped PCollection elements.
+Before discussing windowing, we need to distinguish bounded from unbounded data.
+Bounded data is of a fixed size such as a file or database query.  Unbounded data comes
+from a continuously updated source such as a subscription or stream.
+
+A window is a view into a fixed beginning and fixed end to a set of data.  In the beam model, windowing subdivides 
+a PCollection according to the timestamps of its individual elements.  This is useful
+for unbounded data because it allows the model to work with fixed element sizes.  Note that windowing
+is not unique to unbounded data.  The beam model windows all data whether it is bounded or unbounded.
+Yet, when you read from a fixed size source such as a file, beam applies the same timestamp to all the elements.
+
+Beam will include information about the window and timestamp to your elements in your DoFn.  All your previous

Review comment:
       Beam can pass information about....



