Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 01:45:27 UTC

[GitHub] [beam] kennknowles opened a new issue, #19444: Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches

kennknowles opened a new issue, #19444:
URL: https://github.com/apache/beam/issues/19444

   I have a streaming job that inserts records into an Elasticsearch cluster. I set the batch size to a suitably large value, but it has no effect at all: every element is inserted in a batch of 1 or 2 elements.
   
   The reason seems to be that this is a streaming pipeline, which may result in tiny bundles. Since ElasticsearchIO uses `@FinishBundle` to flush a batch, this will result in equally small batches.
   
   This results in a huge number of bulk requests with just one element each, grinding the Elasticsearch cluster to a halt.
   
   I have now been able to work around this by using a `GroupIntoBatches` operation before the insert, but this results in 3 steps (mapping to a key, applying GroupIntoBatches, stripping key and outputting all collected elements), making the process quite awkward.
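
The three-step workaround described above might look roughly like the following fragment. It is only a sketch, not the reporter's actual code: the class name, `NUM_SHARDS`, `BATCH_SIZE`, and the hash-based key choice are all assumptions made for illustration, and it needs the Beam Java SDK on the classpath.

```java
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.Values;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BatchBeforeWrite {
  // Illustrative values (assumptions, not taken from the issue).
  static final int NUM_SHARDS = 4;
  static final long BATCH_SIZE = 1000;

  /** Regroups JSON documents so that downstream steps see them in larger groups. */
  static PCollection<String> rebatch(PCollection<String> docs) {
    return docs
        // Step 1: map each document to a key; GroupIntoBatches requires KV input.
        .apply("AssignKey",
            WithKeys.of((String doc) -> Math.floorMod(doc.hashCode(), NUM_SHARDS))
                .withKeyType(TypeDescriptors.integers()))
        // Step 2: buffer per key until BATCH_SIZE elements are collected
        // (or the window ends, whichever comes first).
        .apply("Batch", GroupIntoBatches.<Integer, String>ofSize(BATCH_SIZE))
        // Step 3: strip the key and output all collected elements.
        .apply("DropKey", Values.<Iterable<String>>create())
        .apply("Unbatch", Flatten.<String>iterables());
  }
}
```

The output of `rebatch` would then be passed to the ElasticsearchIO write. The number of distinct keys bounds the parallelism of the batching step, which is one reason this workaround is awkward to tune.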
   
   A much better approach would be to internalize this into the ElasticsearchIO write transform: use a timer that flushes the batch when it reaches the batch size or at the end of the window, not at the end of a bundle.
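
To make the difference between the two flush policies concrete, here is a small plain-Java simulation. It does not use Beam at all; the class and method names are hypothetical, and bundles are modelled simply as element counts. It only contrasts "flush at the end of every bundle" with "keep the buffer across bundles and flush at batch size or at the end of the window".

```java
import java.util.ArrayList;
import java.util.List;

public class FlushPolicySim {
  /** Flushes when the buffer is full and again at the end of every bundle. */
  static List<Integer> flushPerBundle(int[] bundleSizes, int maxBatchSize) {
    List<Integer> batches = new ArrayList<>();
    for (int bundleSize : bundleSizes) {
      int buffered = 0; // the buffer does not survive past the bundle
      for (int i = 0; i < bundleSize; i++) {
        buffered++;
        if (buffered == maxBatchSize) {
          batches.add(buffered);
          buffered = 0;
        }
      }
      if (buffered > 0) {
        batches.add(buffered); // end-of-bundle flush
      }
    }
    return batches;
  }

  /** Keeps the buffer across bundles; flushes when full and once at the end. */
  static List<Integer> flushAcrossBundles(int[] bundleSizes, int maxBatchSize) {
    List<Integer> batches = new ArrayList<>();
    int buffered = 0; // the buffer persists across bundles
    for (int bundleSize : bundleSizes) {
      for (int i = 0; i < bundleSize; i++) {
        buffered++;
        if (buffered == maxBatchSize) {
          batches.add(buffered);
          buffered = 0;
        }
      }
    }
    if (buffered > 0) {
      batches.add(buffered); // stands in for the end-of-window flush
    }
    return batches;
  }
}
```

With bundle sizes `{1, 2, 1, 2, 1}` and a maximum batch size of 3, the per-bundle policy issues five requests of 1 or 2 elements, while the cross-bundle policy issues batches of 3, 3, and 1. In a real pipeline the cross-bundle buffer would have to live in Beam state with a timer for the final flush; the simulation leaves that part out.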
   
   Imported from Jira [BEAM-6886](https://issues.apache.org/jira/browse/BEAM-6886). Original Jira may contain additional context.
   Reported by: MadEgg.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] egalpin commented on issue #19444: Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches

Posted by GitBox <gi...@apache.org>.
egalpin commented on issue #19444:
URL: https://github.com/apache/beam/issues/19444#issuecomment-1149066319

   The original is a bit of an old ticket, but this is now addressed as of https://github.com/apache/beam/pull/14347. TL;DR: one can now use `.withUseStatefulBatches(true)` to employ GroupIntoBatches internally, so that bulk API requests will have `maxBatchSize` elements (sometimes fewer if also using non-global windowing).
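
As a rough configuration sketch, enabling the option could look like the fragment below. The address, index name, type, and batch size are placeholder assumptions, and the three-argument `ConnectionConfiguration.create` overload is used here only for illustration; check the Javadoc of the Beam version in use.

```java
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;

public class StatefulBatchWrite {
  /** Builds a write transform that batches with state instead of per bundle. */
  static ElasticsearchIO.Write statefulWrite() {
    return ElasticsearchIO.write()
        .withConnectionConfiguration(
            // Placeholder cluster address, index, and type (assumptions).
            ElasticsearchIO.ConnectionConfiguration.create(
                new String[] {"http://localhost:9200"}, "my-index", "_doc"))
        // Target number of documents per bulk request (illustrative value).
        .withMaxBatchSize(1000L)
        // Batch across bundles via GroupIntoBatches inside the transform.
        .withUseStatefulBatches(true);
  }
}
```

The returned transform is applied to a `PCollection<String>` of JSON documents, so the separate keying and `GroupIntoBatches` steps are no longer needed in user code.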
   
   IMO this ticket can be closed because the feature is implemented 🙂 




[GitHub] [beam] kennknowles closed issue #19444: Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches

Posted by GitBox <gi...@apache.org>.
kennknowles closed issue #19444: Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches
URL: https://github.com/apache/beam/issues/19444




[GitHub] [beam] kennknowles commented on issue #19444: Change batch handling in ElasticsearchIO to avoid necessity for GroupIntoBatches

Posted by GitBox <gi...@apache.org>.
kennknowles commented on issue #19444:
URL: https://github.com/apache/beam/issues/19444#issuecomment-1149098164

   Fabulous. Thanks!

