Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/03 19:17:19 UTC

[GitHub] [beam] kennknowles opened a new issue, #18538: Spec out what mutations are allowed to a constructed model pipeline, particularly coders

kennknowles opened a new issue, #18538:
URL: https://github.com/apache/beam/issues/18538

   Context: presume an SDK has constructed a pipeline or sub-pipeline, and sent it - as a model proto - to another party, which could be a runner or another SDK.
   
   Question to be resolved: What mutations are allowed to this pipeline?
   
   For example, depending on how an SDK harness is implemented, some coders (aka wire formats) can be swapped while leaving the language-level types compatible, such as "urn:beam:coder:varlong" and "urn:beam:coder:bigendianlong". It may also be possible to add or remove length prefixes in some situations.
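
To illustrate why such a swap leaves the language-level type untouched, here is a minimal Python sketch. It assumes the varlong-style format is a base-128 varint over the 64-bit two's-complement value and the bigendianlong-style format is a fixed 8-byte big-endian signed integer; the function names are hypothetical and not part of any SDK.

```python
import struct

MASK_64 = 0xFFFFFFFFFFFFFFFF


def encode_varint(value: int) -> bytes:
    """Base-128 varint of a 64-bit integer (assumed varlong-style format)."""
    value &= MASK_64  # negative values are encoded via two's complement
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint; returns a signed 64-bit integer."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    if result >= 1 << 63:  # reinterpret as signed
        result -= 1 << 64
    return result


def encode_big_endian_long(value: int) -> bytes:
    """Fixed 8-byte big-endian signed integer (assumed bigendianlong-style format)."""
    return struct.pack(">q", value)


def decode_big_endian_long(data: bytes) -> int:
    return struct.unpack(">q", data)[0]
```

Both pairs round-trip the same Python `int`, so a UDF consuming the decoded value cannot tell which wire format was used; only the bytes on the wire differ.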
   
   What we mean by _coder_ is a wire format specification for a _stream_ of elements, specified by a `FunctionSpec` proto and its component coders (and so on recursively).
   
   For many coders, if the encoding is not known to a party, then the boundaries of elements cannot be discerned. But there are lots of situations where boundaries need to be known without full decoding - particularly by runners, but also at some point for SDK-to-SDK transmission.
   
   *Possibility 1*: insist that a coder...
   
   ```
   
   Coder {
     spec: FunctionSpec { urn: "beam:coder:my_whatever_coder" }
   }
   
   ```
   
   
   ... is always allowed to be replaced by the same coder, wrapped with an added length prefix ...
   
   ```
   
   Coder {
     spec: FunctionSpec { urn: "beam:coder:add_length_prefix" }
     component_coders: [
       Coder {
         spec: FunctionSpec { urn: "beam:coder:my_whatever_coder" }
       }
     ]
   }
   
   ```
   
   
   This puts a responsibility on each SDK harness to understand this coder and to be able to execute the same UDFs with the decoded values. This is already somewhat implicit in how the Fn API produces ProcessBundleDescriptors, since a runner can never assume that it understands SDK coders.
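
The rewrite in Possibility 1 can be sketched as a small transformation over the coder tree. This is only an illustration: the `Coder` dataclass below is a hypothetical stand-in for the model proto (it flattens `spec.urn` into a single `urn` field), and leaving an already-wrapped coder untouched is an assumption, not something decided here.

```python
from dataclasses import dataclass, field
from typing import List

# URN as written in this issue; the helper below is hypothetical.
LENGTH_PREFIX_URN = "beam:coder:add_length_prefix"


@dataclass
class Coder:
    """Simplified stand-in for the Coder proto: a URN plus component coders."""
    urn: str
    component_coders: List["Coder"] = field(default_factory=list)


def wrap_with_length_prefix(coder: Coder) -> Coder:
    """Return the coder wrapped in a length prefix, unless it already is one."""
    if coder.urn == LENGTH_PREFIX_URN:
        return coder  # assumption: no need to prefix twice
    return Coder(urn=LENGTH_PREFIX_URN, component_coders=[coder])
```

For example, `wrap_with_length_prefix(Coder("beam:coder:my_whatever_coder"))` yields the second proto shown above, with the original coder as the only component.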
   
   *Possibility 2*: allow optimization by indicating a way to determine element boundaries
   
   It may be that even for a coder that cannot be understood, the element boundaries can be easily discerned. For example, if a coder _already_ puts a length prefix in a known format at the start of each element, you just need to read that prefix. This means that for an unknown coder, you can save the computation and space of adding a length prefix. (If you can understand "urn:beam:coder:add_length_prefix" then that special case is already handled.)
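
As a rough sketch of what "finding boundaries without decoding" means, the Python below splits a stream into opaque elements using only the length prefixes. It assumes the prefix is an unsigned base-128 varint holding the payload size in bytes; that format and the function names are assumptions made for illustration.

```python
from typing import List


def encode_length(n: int) -> bytes:
    """Unsigned base-128 varint for a non-negative length (assumed prefix format)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def length_prefix(payload: bytes) -> bytes:
    """Prepend the payload's length so boundaries are visible to any reader."""
    return encode_length(len(payload)) + payload


def split_elements(stream: bytes) -> List[bytes]:
    """Split a stream into elements without interpreting the payload bytes."""
    elements = []
    pos = 0
    while pos < len(stream):
        length = 0
        shift = 0
        while True:  # read the varint length prefix
            if pos >= len(stream):
                raise ValueError("truncated length prefix")
            byte = stream[pos]
            pos += 1
            length |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        end = pos + length
        if end > len(stream):
            raise ValueError("truncated element")
        elements.append(stream[pos:end])  # payload stays opaque
        pos = end
    return elements
```

A runner that only knows the prefix format can forward or buffer each element returned by `split_elements` while leaving the payload for the SDK harness to decode.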
   
   It might look something like this:
   
   ```
   
   Coder {
     spec: FunctionSpec { urn: "beam:coder:my_whatever_coder" }
     also_decodes_as: Coder {
       spec: FunctionSpec { urn: "beam:coder:add_length_prefix" }
       component_coders: [
         Coder {
           spec: FunctionSpec { urn: "beam:coder:uninterpretable_bytes" }
         }
       ]
     }
   }
   
   ```
   
   
   The extra coder in `also_decodes_as` must be completely wire-compatible and should always be composed of completely standardized coders, so element boundaries can always be ascertained. An annoyance here is the possibility for silly protos where this recurses. Since the main implementation we expect is a length prefix, it could just be a flag, or just a coder for the length prefix itself.
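
One way a receiving party might guard against the recursive case is a validity check on the hint. The sketch below is hypothetical throughout: `also_decodes_as` is the field proposed in this issue rather than an existing proto field, the `Coder` dataclass is a simplified stand-in for the proto, and `STANDARD_URNS` is a placeholder for whatever set of standardized coders is eventually agreed on.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Placeholder set; the real list of standardized coders is not defined here.
STANDARD_URNS = {
    "beam:coder:add_length_prefix",
    "beam:coder:uninterpretable_bytes",
}


@dataclass
class Coder:
    """Simplified stand-in for the Coder proto with the proposed extra field."""
    urn: str
    component_coders: List["Coder"] = field(default_factory=list)
    also_decodes_as: Optional["Coder"] = None


def is_valid_boundary_hint(hint: Coder) -> bool:
    """A hint must use only standard coders and must not carry its own hint."""
    if hint.also_decodes_as is not None:
        return False  # reject the recursive "silly proto" case
    if hint.urn not in STANDARD_URNS:
        return False
    return all(is_valid_boundary_hint(c) for c in hint.component_coders)
```

A party that receives an invalid hint could simply ignore it and fall back to wrapping the coder in a length prefix as in Possibility 1.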
   
   Imported from Jira [BEAM-3203](https://issues.apache.org/jira/browse/BEAM-3203). Original Jira may contain additional context.
   Reported by: kenn.

