You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Rong Rong <wa...@gmail.com> on 2020/01/09 15:51:51 UTC

Re: Completed job wasn't saved to archive

Hi Pavel,

Sorry for bringing this thread up so late. I was digging into the usage of
the Flink history server and I found one situation where there would be no
logs and no failure/success message from the cluster:
In very rare case in our Flink-YARN session cluster: if an application
master (JobManager running container) fails and being restarted as a YARN
2nd attempt (we haven't enable HA) - then there will be no logs of
archiving being logged whatsoever. However in this case the there would be
a completely new AM container brought up running the JM again (e.g. new log
files)

I am not exactly sure whether this suites your scenarios. Could you
describe a bit more on how your cluster was configured?

Thanks,
Rong

On Mon, Nov 25, 2019 at 10:49 AM Chesnay Schepler <ch...@apache.org>
wrote:

> I'm afraid I can't think of a solution. I don't see a way how this
> operation can succeed or fail without anything being logged.
>
> Is the cluster behaving normally afterwards? Could you check whether the numRunningJobs
> ticks down properly after the job was canceled?
>
>
> On 22/11/2019 13:27, Pavel Potseluev wrote:
>
> Hi Chesnay,
>
> We archive jobs on s3 file system. We don't configure a throttling for
> write operations and afaik it isn't possible now and will be implemented in
> FLINK-13251 <https://issues.apache.org/jira/browse/FLINK-13251>. And
> other write operations (like checkpoints saving) work fine. But I don't see
> archived job or message about archiving failure at all. It looks like Flink
> just didn't try to save job to archive.
>
> 21.11.2019, 17:17, "Chesnay Schepler" <ch...@apache.org>
> <ch...@apache.org>:
>
> If the archiving fails there should be some log message, like "Failed to
> archive job" or "Could not archive completed job..." .
> If nothing of the sort is logged my first instinct would be that the
> operation is being slowed down, _a lot_.
>
> Where are you archiving them to? Could it be the write operation is being
> throttled heavily?
>
> On 21/11/2019 13:48, Pavel Potseluev wrote:
>
> Hi Vino,
>
> Usually Flink archives jobs correctly and the problem is rarely
> reproduced. So I think it isn't a problem with configuration.
>
> Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294 [Checkpoint
> Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
> Triggering checkpoint 1872 @ 1574092333218 for job
> 5ec264a20bb5005cdbd8e23a5e59f136.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260
> [flink-akka.actor.default-dispatcher-30] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
> checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136 (568048140 bytes
> in 23541 ms).
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314 [Checkpoint
> Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
> Triggering checkpoint 1873 @ 1574092393218 for job
> 5ec264a20bb5005cdbd8e23a5e59f136.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job
> bureau-user-offers-statistics-AUTORU-USERS_AUTORU
> (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to
> CANCELLING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom
> File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from RUNNING
> to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
> kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2)
> (47656a3c4fc70e19622acca31267e41f) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
> kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2)
> (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  -
> user-offers-statistics-init-from-file -> Map (1/2)
> (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  -
> user-offers-statistics-init-from-file -> Map (2/2)
> (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce ->
> Map -> Sink: user-offers-statistics-autoru (1/2)
> (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
> [flink-akka.actor.default-dispatcher-40] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce ->
> Map -> Sink: user-offers-statistics-autoru (2/2)
> (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source: Custom
> File Source (1/1) (934d89cf3d7999b40225dd8009b5493c) switched from
> CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
> kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (1/2)
> (47656a3c4fc70e19622acca31267e41f) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
> kafka-source-moderation-update-journal-autoru -> Filter -> Flat Map (2/2)
> (be3c4562e65d3d6bdfda4f1632017c6c) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  -
> user-offers-statistics-init-from-file -> Map (2/2)
> (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  -
> user-offers-statistics-init-from-file -> Map (1/2)
> (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce ->
> Map -> Sink: user-offers-statistics-autoru (1/2)
> (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed Reduce ->
> Map -> Sink: user-offers-statistics-autoru (2/2)
> (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job
> bureau-user-offers-statistics-AUTORU-USERS_AUTORU
> (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING to
> CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Stopping
> checkpoint coordinator for job 5ec264a20bb5005cdbd8e23a5e59f136.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
> [flink-akka.actor.default-dispatcher-2] INFO
> o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Shutting down
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore  - Removing
> /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136 from
> ZooKeeper
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966
> [cluster-io-thread-6] INFO
> org.apache.flink.runtime.checkpoint.CompletedCheckpoint  - Checkpoint with
> ID 1872 at
> 's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872'
> not discarded.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044
> [flink-akka.actor.default-dispatcher-2] INFO
> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Shutting down.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045
> [flink-akka.actor.default-dispatcher-2] INFO
> o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  - Removing
> /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from ZooKeeper
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259
> [flink-akka.actor.default-dispatcher-2] INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
> 5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state CANCELED.
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085
> [flink-akka.actor.default-dispatcher-31] INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle
> slot [0553df66161f5d78f4b41d8c8c32c21f].
>
> 771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085
> [flink-akka.actor.default-dispatcher-31] INFO
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  - Releasing idle
> slot [498b9bf0c0f2188ff739d72e6df288dc].
>
>
> 21.11.2019, 06:07, "vino yang" <ya...@gmail.com>
> <ya...@gmail.com>:
>
> If everything is OK(your config options about archive dir and history
> server is correct), Flink should archive the completed job.
>
> You said you did not find any exceptions in the log about failing to
> archive. But any other exceptions? Can you share the logs about your scene?
>
> Best,
> Vino
>
> Pavel Potseluev <po...@yandex-team.ru> 于2019年11月21日周四 上午2:25写道:
>
> Hi all,
>
> We see occasionally that flink doesn't save information about canceled job
> to archive directory (configured by jobmanager.archive.fs.dir property). And
> there are no exceptions in the log about failing archiving. It's a problem
> in our use case because our script for deploying jobs relies on flink
> history server to find latest checkpoint for some job. Does flink guarantee
> saving data to archive? If so, any ideas why it doesn't work
> sometimes? Flink version is 1.8.0.
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>
>
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>
>
>
>
>
> --
> Best regards,
> Pavel Potseluev
> Software developer, Yandex.Classifieds LLC
>
>
>
>