You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by "nbali (via GitHub)" <gi...@apache.org> on 2023/03/08 05:26:59 UTC

[GitHub] [beam] nbali opened a new issue, #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

nbali opened a new issue, #25758:
URL: https://github.com/apache/beam/issues/25758

   ### What happened?
   
   Scenario:
   - Java
   - using Dataflow
   - 2.45.0
   - streaming
   - streaming engine disabled
   - using `GroupIntoBatches.WithShardedKey` (for example `BigQueryIO.Write.withAutoSharding`)
   
   The pipeline fails saying it requires Streaming Engine. Based on my knowledge it should work without Streaming Engine as well.
   
   1. Using `GroupIntoBatches.WithShardedKey` triggers https://github.com/apache/beam/blob/v2.45.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L554-L558
   2. The referenced `StreamingGroupIntoBatchesWithShardedKeyOverrideFactory` replaces the transform with `StreamingGroupIntoBatchesWithShardedKey`
   3. Inside that it calls `DataflowRunner.maybeRecordPCollectionWithAutoSharding` https://github.com/apache/beam/blob/v2.45.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/GroupIntoBatchesOverride.java#L335
   4. ... and finally that runner method contains the code that always fails if Streaming Engine isn't enabled. https://github.com/apache/beam/blob/v2.45.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1697-L1706
   
   I don't think that check is necessary. We already have a `DataflowRunner.verifyDoFnSupported` that should cover checking if the used state functionality is supported or not. 
   https://github.com/apache/beam/blob/v2.45.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L2442-L2487
   
   I can take this issue and contribute a fix if my assumption is right. I personally can't really see the reasoning behind that limitation (I also checked the original PR https://github.com/apache/beam/pull/13208) - but maybe I missed something.
   
   
   
   ### Issue Priority
   
   Priority: 2 (default / most bugs should be filed as P2)
   
   ### Issue Components
   
   - [ ] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [X] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] nbali commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "nbali (via GitHub)" <gi...@apache.org>.

nbali commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1541027829

   (I would prefer to implement this while I'm still having available hours for OSS contributions, so any feedback would be appreciated to my previous questions.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] nbali commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "nbali (via GitHub)" <gi...@apache.org>.

nbali commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1579559437

   Well, I tried.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] nbali commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "nbali (via GitHub)" <gi...@apache.org>.

nbali commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1485074709

   I'm not exactly sure where to check that (the service-side)... but even if a runner can't handle it in any custom way _at all_, using a count and memory limited `GroupIntoBatches` is superior compared to the static sharding. If `.withShardedKey()` can't work without Streaming Engine, than it should be turned off if Streaming Engine isn't enabled, but the `GroupIntoBatches` should be allowed to be used in `BigQueryIO.Write` even without Streaming Engine, because it is simply superior. (For static sharding the shard count has to be guessed ahead of time, and it can still throw OOME if too small, or can have unnecessarily high performance overhead if unnecessarily high shard count is being used - not to mention it also incurs additional GCS costs -, meanwhile GIB avoids OOME, and does the minimally required GCS actions.)
   
   What I don't get necessarily is, even if it doesn't do anything _extra_ why would https://github.com/apache/beam/blob/fef98f7597f11aaa93388d547f58bf0482ed74f0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L278-L315 fail? I can write literally the same code and it works - just not as optimally I guess -. I can only assume it always tries to handle this specific PTranform in a custom way (I assume https://github.com/apache/beam/blob/fef98f7597f11aaa93388d547f58bf0482ed74f0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslator.java#L703-L705 is for that), and that doesn't work without Streaming Engine, but then the problem is this and the special handling should be disabled without Streaming Engine.
   
   So what I see is either don't try to handle `WithShardedKey` in any special way without streaming engine, or replace that with another PTransform that doesn't trigger any custom handling, but still keep the functionality - so most likely a simple GIB. Am I wrong in my assumptions that both of these would work?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Abacn commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "Abacn (via GitHub)" <gi...@apache.org>.

Abacn commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1461075063

   Recently there is a similar issue #25675 that was a guard added in the runner side causing existing tests breaking
   
   I personally agree that these implementation specific flags that does not affect the functionality, if not supported, could be ignored rather than fail. Users may try their pipeline with different settings, these pipeline embedded configurations then need to change every time when switching runner v1/v2, streaming engine, etc, if these kind of guard are present
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] nbali commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "nbali (via GitHub)" <gi...@apache.org>.

nbali commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1461189729

   Triggered the builds with https://github.com/apache/beam/pull/25770
   
   Although I'm curious what part of the code of `GroupIntoBatches.WithShardedKey` makes this so risky. I'm guessing there most be some hidden stuff in the runner, because the logic should be a fairly simple one, that just shards with a better algo, and decreases the chance of OOME.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Abacn commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "Abacn (via GitHub)" <gi...@apache.org>.

Abacn commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1461077755

   However I am not sure in this case specifically removing the check here then it will work, or lead to undefined behavior. One can run an integration test with modified code to check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] kennknowles commented on issue #25758: [Bug]: Decreased Dataflow runner functionality due to feature availability checks

Posted by "kennknowles (via GitHub)" <gi...@apache.org>.

kennknowles commented on issue #25758:
URL: https://github.com/apache/beam/issues/25758#issuecomment-1476847107

   Yes, the Dataflow service executes this in a custom way, where the runner determines the sharding. So the Dataflow-side code path matters for this. The check that you found in the Beam code really ought to be done service-side since that is where the implementation details are, and that is what decides if it works or not.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org