Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/07/05 22:38:00 UTC
[jira] [Updated] (FLINK-24293) Tasks from the same job on a machine share user jar

     [ https://issues.apache.org/jira/browse/FLINK-24293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-24293:
-----------------------------------
    Labels: auto-deprioritized-major pull-request-available stale-minor  (was: auto-deprioritized-major pull-request-available)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned, and neither it nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label, or in 7 days the issue will be deprioritized.


> Tasks from the same job on a machine share user jar 
> ----------------------------------------------------
>
>                 Key: FLINK-24293
>                 URL: https://issues.apache.org/jira/browse/FLINK-24293
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huntercc
>            Priority: Minor
>              Labels: auto-deprioritized-major, pull-request-available, stale-minor
>         Attachments: image-2021-09-15-20-43-11-758.png, image-2021-09-15-20-43-17-304.png
>
>
> In the current blob storage design, tasks executed by the same TaskExecutor share a BLOB storage directory, while tasks executed by different TaskExecutors use different directories. As a result, a TaskExecutor has to download the user jar even if another TaskExecutor on the same machine has already downloaded it. We believe there is no need to download multiple copies of the same user jar to one machine; doing so exposes two main problems:
>  # The NIC bandwidth on the distributing side may become a bottleneck  !image-2021-09-15-20-43-17-304.png|width=695,height=193! 
> As shown in the figure above, 24640 Mbps of the total 25000 Mbps NIC bandwidth was used when we launched a Flink job with 4000 TaskManagers, which causes long deployment times and Akka timeout exceptions.
>  # More disk space is taken up
> We propose to optimize the sharing mechanism for the user jar by allowing tasks from the same job on a machine to share a blob storage directory and, more specifically, the user jar in that directory. Only one task deployed to the machine downloads the user jar from the BLOB server or distributed file storage, and subsequent tasks simply use the localized copy. In this way, the user jar of a job only needs to be downloaded once per machine. Here is a comparison of job startup time before and after the optimization.
> ||Number of TMs||Startup time before optimization||Startup time after optimization||
> |1000|62s|37s|
> |2000|104s|40s|
> |3000|170s|43s|
> |4000|211s|45s|
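
The "download once per machine, reuse afterwards" idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation from the linked pull request and not Flink's actual BLOB cache API: the class `SharedJarCache`, the interface `JarFetcher`, and the directory layout (`<sharedDir>/<jobId>/<jarKey>.jar`) are all hypothetical names introduced here. It assumes every TaskExecutor process on a machine can read and write a common local directory, and it uses a file lock so that only one process performs the download.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a machine-wide, per-job jar cache; not a Flink class. */
final class SharedJarCache {

    /** Hypothetical stand-in for reading the jar from the BLOB server or distributed storage. */
    @FunctionalInterface
    interface JarFetcher {
        InputStream open() throws IOException;
    }

    // File locks are held per JVM, so threads inside one JVM are serialized separately.
    private static final Object JVM_LOCK = new Object();

    private final Path machineSharedDir;

    SharedJarCache(Path machineSharedDir) {
        this.machineSharedDir = machineSharedDir;
    }

    /** Returns the local jar path, downloading it only if no copy exists on this machine yet. */
    Path getOrDownload(String jobId, String jarKey, JarFetcher fetcher) throws IOException {
        Path jobDir = Files.createDirectories(machineSharedDir.resolve(jobId));
        Path target = jobDir.resolve(jarKey + ".jar");
        if (Files.exists(target)) {
            return target; // fast path: someone already localized the jar
        }
        Path lockFile = jobDir.resolve(jarKey + ".lock");
        synchronized (JVM_LOCK) {
            try (FileChannel channel = FileChannel.open(
                        lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock ignored = channel.lock()) { // blocks while another process downloads
                if (Files.exists(target)) {
                    return target; // another process finished while we were waiting
                }
                // Download into a temp file first so readers never see a partial jar.
                Path tmp = Files.createTempFile(jobDir, jarKey, ".tmp");
                try (InputStream in = fetcher.open()) {
                    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                }
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                return target;
            }
        }
    }
}
```

With a cache like this, each TaskExecutor would call `getOrDownload(jobId, jarKey, fetcher)` before loading user code; only the first caller on a machine invokes the fetcher. The sketch leaves out cleanup of the shared directory when the job finishes and recovery of temp files left behind by a crashed download, both of which a real design would need to address.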
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)