Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/11/08 14:23:07 UTC

[GitHub] [beam] amontoli commented on issue #22809: [Bug]: Python SDK gets stuck when using Unbounded PCollection in streaming mode on GroupByKey after ReadFromKafka on DirectRunner, FlinkRunner and DataflowRunner

amontoli commented on issue #22809:
URL: https://github.com/apache/beam/issues/22809#issuecomment-1307301545

   I have a very similar issue, but instead of using the Kafka module in Beam I am using the [Kafka module in beam_nuggets](http://mohaseeb.com/beam-nuggets/beam_nuggets.io.kafkaio.html), a wrapper around the [kafka-python](https://pypi.org/project/kafka-python/) client. With this source I have to attach the timestamps by hand using `beam.window.TimestampedValue`.
   
   I tried to inspect the data after applying the window transformation by using the AnalyzeElement class defined [here (Example 2)](https://beam.apache.org/documentation/transforms/python/elementwise/pardo/). Each element is correctly assigned to a window, but `GroupByKey` never emits anything.
   I have tried both the Direct Runner and the portable Flink runner, as well as a non-default trigger (`trigger=trigger.AfterWatermark()`).
   
   I do not know if it is related, but I have also tried reading from a file with `ReadFromText` and the streaming pipeline option: data is processed line by line up to the `GroupByKey` step, but the grouping fires only after the whole file has been read, as if the window trigger does not activate until the PCollection has ended.

