Posted to dev@flink.apache.org by "Zhilong Hong (Jira)" <ji...@apache.org> on 2021/09/15 15:23:00 UTC

[jira] [Created] (FLINK-24300) MultipleInputOperator is running much more slowly in TPCDS

Zhilong Hong created FLINK-24300:
------------------------------------

             Summary: MultipleInputOperator is running much more slowly in TPCDS
                 Key: FLINK-24300
                 URL: https://issues.apache.org/jira/browse/FLINK-24300
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.14.0, 1.15.0
            Reporter: Zhilong Hong
         Attachments: 64570e4c56955713ca599fd1d7ae7be891a314c6.png, detail-of-the-job.png, e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png, jstack.txt

When running TPCDS on release 1.14, we found that the job with MultipleInputOperator runs much more slowly than before. With a binary search among the commits, we found that the issue may have been introduced by FLINK-23408.

At commit 64570e4c56955713ca599fd1d7ae7be891a314c6, the job runs normally in TPCDS, as the image below shows:

!64570e4c56955713ca599fd1d7ae7be891a314c6.png|width=600!

At commit e3010c16947ed8da2ecb7d89a3aa08dacecc524a, the job q2.sql gets stuck for more than half an hour, as the image below shows:

!e3010c16947ed8da2ecb7d89a3aa08dacecc524a.png|width=600!

The details of the job are shown below:

!detail-of-the-job.png|width=600!

The job uses a {{MultipleInputOperator}} with one normal input and two chained FileSources. It has finished reading the normal input and has started to read the chained sources. Each chained source has one source data fetcher.

We captured a jstack of the stuck tasks and attached it as [^jstack.txt]. It shows that the main thread is blocked waiting for a lock, and that the lock is held by a source data fetcher. The source data fetcher is still running, but its stack stays in {{CompletableFuture.cleanStack}}.
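For anyone who wants to reproduce this kind of diagnosis without reading a full jstack, the lock-holder relationship can also be queried through the JDK's {{ThreadMXBean}}. The following is only a minimal sketch, not Flink code: the class name {{BlockedThreadReport}} and the thread names {{demo-main}} and {{demo-fetcher}} are hypothetical stand-ins for the task's main thread and the source data fetcher, and a plain monitor is assumed as the lock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class BlockedThreadReport {

    /** Returns one line per thread that is waiting for a lock owned by another thread. */
    public static List<String> describeBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // true, true: include locked monitors and ownable synchronizers in the dump.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            // getLockOwnerName() is null unless another thread owns the awaited lock.
            if (info.getLockOwnerName() != null) {
                report.add(info.getThreadName() + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return report;
    }

    /** Hypothetical reproduction: "demo-fetcher" holds a monitor that "demo-main" needs. */
    public static List<String> demo() {
        final Object lock = new Object();
        final CountDownLatch held = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);
        Thread fetcher = new Thread(() -> {
            synchronized (lock) {
                held.countDown();
                try {
                    release.await(); // keep holding the lock until the demo is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-fetcher");
        Thread main = new Thread(() -> {
            synchronized (lock) {
                // nothing to do; acquiring the lock is the point
            }
        }, "demo-main");
        try {
            fetcher.start();
            held.await();
            main.start();
            // Wait until demo-main is actually parked on the monitor.
            while (main.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            return describeBlockedThreads();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            release.countDown(); // let both demo threads finish
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

This only identifies which thread owns the lock. Whether the owner is making progress still has to be judged from its stack trace, as in the attached jstack.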

This issue happens in a batch job. However, judging from where the task gets blocked, it seems likely to affect streaming jobs as well.

For reference, the TPCDS code we are running is located at [https://github.com/ververica/flink-sql-benchmark/tree/dev].
