Posted to user@spark.apache.org by Nikhil Goyal <no...@gmail.com> on 2022/11/22 15:30:25 UTC

Driver takes long time to finish once job ends

Hi folks,
We are running a job on our on-prem Kubernetes cluster but writing the output
to S3. We noticed that all the executors finish in < 1h, but the driver
takes another 5h to finish. Logs:

22/11/22 02:08:29 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.42.145.11:39001 in memory (size: 7.3 KiB, free: 9.4 GiB)
22/11/22 *02:08:29* INFO BlockManagerInfo: Removed broadcast_3_piece0 on 10.42.137.10:33425 in memory (size: 7.3 KiB, free: 9.4 GiB)
22/11/22 *04:57:46* INFO FileFormatWriter: Write Job 4f0051fc-dda9-457f-a072-26311fd5e132 committed.
22/11/22 04:57:46 INFO FileFormatWriter: Finished processing stats for write job 4f0051fc-dda9-457f-a072-26311fd5e132.
22/11/22 04:57:47 INFO FileUtils: Creating directory if it doesn't exist: s3://rbx.usr/masked/dw_pii/creator_analytics_user_universe_first_playsession_dc_ngoyal/ds=2022-10-21
22/11/22 04:57:48 INFO SessionState: Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.
22/11/22 *04:57:48* INFO SessionState: Could not get hdfsEncryptionShim, it is only applicable to hdfs filesystem.
22/11/22 *07:20:20* WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/11/22 07:20:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/11/22 07:20:22 INFO MemoryStore: MemoryStore cleared
22/11/22 07:20:22 INFO BlockManager: BlockManager stopped
22/11/22 07:20:22 INFO BlockManagerMaster: BlockManagerMaster stopped
22/11/22 07:20:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/11/22 07:20:22 INFO SparkContext: Successfully stopped SparkContext
22/11/22 07:20:22 INFO ShutdownHookManager: Shutdown hook called
22/11/22 07:20:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-d9aa302f-86f2-4668-9c01-07b3e71cba82
22/11/22 07:20:22 INFO ShutdownHookManager: Deleting directory /var/data/spark-5295849e-a0f3-4355-9a6a-b510616aefaa/spark-43772336-8c86-4e2b-839e-97b2442b2959
22/11/22 07:20:22 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/11/22 07:20:22 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/11/22 07:20:22 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.

It seems the job is taking a long time to write to S3. Any idea how to fix
this issue?

Thanks

Re: Driver takes long time to finish once job ends

Posted by Pralabh Kumar <pr...@gmail.com>.
What are the cores and memory settings of the driver?
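For reference, driver resources on Kubernetes are usually set through
spark-submit configuration. A minimal sketch (the master URL and values
are placeholders, not recommendations):

```shell
# Hypothetical spark-submit fragment showing where the driver's
# cores and memory are configured for a K8s deployment.
spark-submit \
  --master k8s://https://my-k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.driver.cores=4 \
  --conf spark.driver.memory=8g \
  --class com.example.MyJob \
  local:///opt/spark/jars/my-job.jar
```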


Re: Driver takes long time to finish once job ends

Posted by Pralabh Kumar <pr...@gmail.com>.
How many cores and how much memory are you running the driver with?


Re: EXT: Driver takes long time to finish once job ends

Posted by Vibhor Gupta <Vi...@walmart.com.INVALID>.
Hi Nikhil,

You might be using the v1 file output commit protocol.
http://www.openkb.info/2019/04/what-is-difference-between.html

Regards,
Vibhor.
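If the v1 commit protocol is indeed the cause, a common mitigation is to
switch to the v2 algorithm or, on S3A, to a committer that avoids renames
entirely. A hedged sketch (the flags are the standard Hadoop/Spark
configuration names; whether they apply depends on your Spark/Hadoop
versions and classpath):

```shell
# v2 algorithm: tasks move output into place as they commit, so the
# driver's single-threaded job-commit rename pass is avoided.
# Trade-off: the job commit is less atomic than with v1.
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
  ...

# Alternatively, the S3A "magic" committer skips renames altogether
# (requires the S3A filesystem and the spark-hadoop-cloud module):
#   --conf spark.hadoop.fs.s3a.committer.name=magic \
#   --conf spark.sql.sources.commitProtocolClass=org.apache.spark.internal.io.cloud.PathOutputCommitProtocol \
#   --conf spark.sql.parquet.output.committer.class=org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```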
