Posted to commits@beam.apache.org by jk...@apache.org on 2017/08/10 23:55:47 UTC
[beam-site] 01/03: Expands the sections on Coders and Validation in
Style Guide
This is an automated email from the ASF dual-hosted git repository.
jkff pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit 0f5e5e746b98cc6596cb7ab74eb752878e8f153c
Author: Eugene Kirpichov <ki...@google.com>
AuthorDate: Thu Aug 3 23:13:07 2017 -0700
Expands the sections on Coders and Validation in Style Guide
* Provides the new guidance that a transform should set coders
on all collections
* Provides guidance on choosing and inferring coders
* Recommends less verbose validation error messages
* Clarifies what validation goes in expand() vs validate()
---
src/contribute/ptransform-style-guide.md | 73 ++++++++++++++++++++++++--------
1 file changed, 55 insertions(+), 18 deletions(-)
diff --git a/src/contribute/ptransform-style-guide.md b/src/contribute/ptransform-style-guide.md
index 3624b02..db69c5a 100644
--- a/src/contribute/ptransform-style-guide.md
+++ b/src/contribute/ptransform-style-guide.md
@@ -452,39 +452,76 @@ public static class FooV2 {
#### Validation
-* Validate individual parameters in `.withBlah()` methods. Error messages should mention the method being called, the actual value and the range of valid values.
-* Validate inter-parameter invariants in the `PTransform`'s `.validate()` method.
+* Validate individual parameters in `.withBlah()` methods using `checkArgument()`. Error messages should mention the name of the parameter, the actual value, and the range of valid values.
+* Validate parameter combinations and missing required parameters in the `PTransform`'s `.expand()` method.
+* Validate parameters that the `PTransform` takes from `PipelineOptions` in the `PTransform`'s `.validate(PipelineOptions)` method.
+ These validations will be executed when the pipeline is already fully constructed/expanded and is about to be run with a particular `PipelineOptions`.
+ Most `PTransform`s do not use `PipelineOptions` and thus don't need a `validate()` method - instead, they should perform their validation via the two other methods above.
```java
@AutoValue
public abstract class TwiddleThumbs
extends PTransform<PCollection<Foo>, PCollection<Bar>> {
abstract int getMoo();
- abstract int getBoo();
+ abstract String getBoo();
...
// Validating individual parameters
public TwiddleThumbs withMoo(int moo) {
- checkArgument(moo >= 0 && moo < 100,
- "TwiddleThumbs.withMoo() called with an invalid moo of %s. "
- + "Valid values are 0 (exclusive) to 100 (exclusive)",
- moo);
- return toBuilder().setMoo(moo).build();
+ checkArgument(
+ moo >= 0 && moo < 100,
+ "Moo must be between 0 (inclusive) and 100 (exclusive), but was: %s",
+ moo);
+ return toBuilder().setMoo(moo).build();
}
- // Validating cross-parameter invariants
- public void validate(PCollection<Foo> input) {
- checkArgument(getMoo() == 0 || getBoo() == 0,
- "TwiddleThumbs created with both .withMoo(%s) and .withBoo(%s). "
- + "Only one of these must be specified.",
- getMoo(), getBoo());
+ public TwiddleThumbs withBoo(String boo) {
+ checkArgument(boo != null, "Boo cannot be null");
+ checkArgument(!boo.isEmpty(), "Boo cannot be empty");
+ return toBuilder().setBoo(boo).build();
+ }
+
+ @Override
+ public void validate(PipelineOptions options) {
+ int woo = options.as(TwiddleThumbsOptions.class).getWoo();
+ checkArgument(
+ woo > getMoo(),
+ "Woo (%s) must be smaller than moo (%s)",
+ woo, getMoo());
+ }
+
+ @Override
+ public PCollection<Bar> expand(PCollection<Foo> input) {
+ // Validating that a required parameter is present
+ checkArgument(getBoo() != null, "Must specify boo");
+
+ // Validating a combination of parameters
+ checkArgument(
+ getMoo() == 0 || getBoo() == null,
+ "Must specify at most one of moo or boo, but was: moo = %s, boo = %s",
+ getMoo(), getBoo());
+
+ ...
}
}
```
#### Coders
-* Use `Coder`s only for setting the coder on a `PCollection` or a mutable state cell.
-* When available, use a specific most efficient coder for the datatype (e.g. `StringUtf8Coder.of()` for strings, `ByteArrayCoder.of()` for byte arrays, etc.), rather than using a generic coder like `SerializableCoder`. Develop efficient coders for types that can be elements of `PCollection`s.
-* Do not use coders as a general serialization or parsing mechanism for arbitrary raw byte data. (anti-examples that should be fixed: `TextIO`, `KafkaIO`).
-* In general, any transform that outputs a user-controlled type (that is not its input type) needs to accept a coder in the transform configuration (example: the `Create.of()` transform). This gives the user the ability to control the coder no matter how the transform is structured: e.g., purely letting the user specify the coder on the output `PCollection` of the transform is insufficient in case the transform internally uses intermediate `PCollection`s of this type.
+`Coder`s are a way for a Beam runner to materialize intermediate data or transmit it between workers when necessary. `Coder` should not be used as a general-purpose API for parsing or writing binary formats because the particular binary encoding of a `Coder` is intended to be its private implementation detail.
+
+##### Providing default coders for types
+
+Provide default `Coder`s for all new data types. Use `@DefaultCoder` annotations or `CoderProviderRegistrar` classes annotated with `@AutoService`: see usages of these classes in the SDK for examples. If performance is not important, you can use `SerializableCoder` or `AvroCoder`. Otherwise, develop an efficient custom coder (subclass `AtomicCoder` for concrete types, `StructuredCoder` for generic types).
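As a minimal sketch of the annotation route (`MyRecord` is a hypothetical user type, not part of the SDK):

```java
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;

// Hypothetical element type. @DefaultCoder tells Beam's coder inference to
// use AvroCoder whenever a PCollection<MyRecord> needs a coder, so users
// never have to set one by hand.
@DefaultCoder(AvroCoder.class)
public class MyRecord {
  public String id;       // AvroCoder reflects over fields like these
  public long timestamp;  // and requires a no-arg constructor (implicit here).
}
```

For performance-sensitive types, the same registration can instead point at a hand-written `AtomicCoder` subclass.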
+
+##### Setting coders on output collections
+
+All `PCollection`s created by your `PTransform` (both output and intermediate collections) must have a `Coder` set on them: a user should never need to call `.setCoder()` to "fix up" a coder on a `PCollection` produced by your `PTransform` (in fact, Beam intends to eventually deprecate `setCoder`). In some cases, coder inference will be sufficient to achieve this; in other cases, your transform will need to explicitly call `setCoder` on its collections.
+
+If the collection is of a concrete type, that type usually has a corresponding coder. Use the most specific and efficient coder available (e.g. `StringUtf8Coder.of()` for strings, `ByteArrayCoder.of()` for byte arrays, etc.), rather than a general-purpose coder like `SerializableCoder`.
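For example (a hedged sketch; `ExtractKeyFn` and the surrounding `expand()` are hypothetical), an intermediate collection of a concrete type can get its coder set explicitly:

```java
import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Inside a hypothetical PTransform's expand(): the transform, not the user,
// sets the coder on the collection it creates.
PCollection<KV<String, byte[]>> keyed =
    input
        .apply("ExtractKey", ParDo.of(new ExtractKeyFn()))
        .setCoder(KvCoder.of(StringUtf8Coder.of(), ByteArrayCoder.of()));
```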
+
+If the type of the collection involves generic type variables, the situation is more complex:
+* If it coincides with the transform's input type or is a simple wrapper over it, you can reuse the coder of the input `PCollection`, available via `input.getCoder()`.
+* Otherwise, attempt to infer the coder via `input.getPipeline().getCoderRegistry().getCoder(TypeDescriptor)`. Use utilities in `TypeDescriptors` to obtain the `TypeDescriptor` for the generic type. For an example of this approach, see the implementation of `AvroIO.parseGenericRecords()`. However, coder inference for generic types is best-effort and in some cases it may fail due to Java type erasure.
+* Always make it possible for the user to explicitly specify a `Coder` for the relevant type variable(s) as a configuration parameter of your `PTransform` (e.g. `AvroIO.<T>parseGenericRecords().withCoder(Coder<T>)`). Fall back to inference if the coder was not explicitly specified.
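The last two bullets combine into a common pattern: prefer the user-specified coder, fall back to inference, and fail with an actionable message otherwise. A hedged sketch, assuming a hypothetical `PTransform<PCollection<String>, PCollection<T>>` whose builder exposes `withCoder(Coder<T>)` and accessors `getCoder()` / `getTypeDescriptor()`:

```java
import org.apache.beam.sdk.coders.CannotProvideCoderException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.values.PCollection;

// Hypothetical helper called from expand() to resolve the output coder.
private Coder<T> resolveCoder(PCollection<String> input) {
  Coder<T> coder = getCoder();  // null unless the user called withCoder()
  if (coder == null) {
    try {
      // Best-effort inference; may fail for generic types due to erasure.
      coder =
          input.getPipeline().getCoderRegistry().getCoder(getTypeDescriptor());
    } catch (CannotProvideCoderException e) {
      throw new IllegalArgumentException(
          "Unable to infer a coder for T; please specify one via withCoder()",
          e);
    }
  }
  return coder;
}
```

`expand()` would then end with `...setCoder(resolveCoder(input))`, so every collection the transform produces carries a coder regardless of how it was configured.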
+