Posted to dev@beam.apache.org by Michał Walenia <mi...@polidea.com> on 2020/01/29 11:17:35 UTC

Problems with Flink cluster running load tests

Hi there,
I want to add some load tests for core Beam operations (GBK, CoGBK,
Combine, ParDo, SideInput) on portable Flink in Java. For some reason, some
of my test scenarios for Combine seem to crash the whole Flink cluster.
The PR introducing the tests is here:
https://github.com/apache/beam/pull/10386
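Roughly speaking, the scenarios apply a Combine to synthetic KV records
with a configurable fanout. A minimal, self-contained sketch of that
general shape (a hypothetical illustration only, not the actual
CombineLoadTest from the PR, which reads synthetic data and takes the
fanout from pipeline options) would look something like this:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

public class CombineFanoutSketch {
  public static void main(String[] args) {
    // The runner (e.g. the portable Flink runner) is picked via --runner on the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("Tiny synthetic input", Create.of(
            KV.of("a", 1L), KV.of("a", 2L), KV.of("b", 3L)))
        // Hot-key fanout of 16 as a stand-in for the fanout=16 scenario; the
        // real load test wires its fanout differently, this is just the shape.
        .apply("Sum with fanout",
            Combine.<String, Long, Long>perKey(Sum.ofLongs()).withHotKeyFanout(16));

    p.run().waitUntilFinish();
  }
}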
As you can see in the logs of the failing Jenkins jobs, the scenario combining
2GB of records on 5 workers works well, whereas 2GB fanned out on 16
workers fails with one of two cryptic messages: either
"org.apache.beam.vendor.grpc.v1p26p0.io.grpc.StatusRuntimeException:
CANCELLED: cancelled before receiving half close"
or
"java.util.concurrent.TimeoutException: The heartbeat of TaskManager with
id container_e01_1580289509522_0001_01_000002 timed out."

link to logs with first error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/9/console

link to logs with second error:
https://builds.apache.org/job/beam_LoadTests_Java_Combine_Portable_Flink_Batch_PR/11/console

I wonder what the issue might be here. When I ran the same load test on a
Flink cluster created in an identical way on a separate GCP project,
everything went well, which makes me think there may be something wrong
with the Dataproc setup.

Another important note: the Combine load tests work and pass with the Python
portable runner.

I will be grateful for any help you can provide - I'm not a Flink cluster
expert and I have no idea how to change the configuration so that it works.

Thanks and have a good day!
Michal

-- 

Michał Walenia
Polidea <https://www.polidea.com/> | Software Engineer

M: +48 791 432 002
E: michal.walenia@polidea.com

Unique Tech
Check out our projects! <https://www.polidea.com/our-work>

Re: Problems with Flink cluster running load tests

Posted by Maximilian Michels <mx...@apache.org>.
Hi Michał,

You might want to get the logs of the container 
(container_e01_1580289509522_0001_01_000002) to see what happened there.
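If YARN log aggregation is enabled on the Dataproc cluster, something along
the lines of

  yarn logs -applicationId application_1580289509522_0001 -containerId container_e01_1580289509522_0001_01_000002

run on the master node should fetch them. The application id above is just
my guess derived from the container id, so please double-check it against
what YARN reports.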

Cheers,
Max
