Posted to dev@kylin.apache.org by "Gabor Arki (Jira)" <ji...@apache.org> on 2020/08/06 11:01:00 UTC
[jira] [Created] (KYLIN-4689) Deadlock in Kylin job execution
Gabor Arki created KYLIN-4689:
---------------------------------
Summary: Deadlock in Kylin job execution
Key: KYLIN-4689
URL: https://issues.apache.org/jira/browse/KYLIN-4689
Project: Kylin
Issue Type: Bug
Components: Job Engine
Reporter: Gabor Arki
h4. Reproduction steps
* Install Kylin 3.1.0
* Deploy a streaming cube
* Enable the cube having historical data present in the Kafka topic
* Note: in our case, we had 3 cubes deployed, each consuming ~20 hourly segments from Kafka when the cubes were enabled
h4. Expected result
* Kylin is starting to process stream segments with stream jobs, eventually processing the older segments and catching up with the stream
h4. Actual result
* After a short time, all jobs are completely stuck without any progress. Some in running state, some in pending state.
* The following logs are continuously written:
{code:java}
2020-08-06 06:16:22 INFO [Scheduler 116797841 Job 12750aea-3b96-c817-64e8-bf893d8c120f-254] MapReduceExecutable:409 - 12750aea-3b96-c817-64e8-bf893d8c120f-00, parent lock path(/cube_job_lock/cube_vm) is locked by other job result is true ,ephemeral lock path :/cube_job_ephemeral_lock/cube_vm is locked by other job result is true,will try after one minute
2020-08-06 06:16:33 WARN [FetcherRunner 787667774-43] FetcherRunner:56 - There are too many jobs running, Job Fetch will wait until next schedule time
{code}
* Zookeeper indicates the following locks are in place:
{code:java}
ls /kylin/kylin_metadata/cube_job_ephemeral_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_cm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_vm
[]
ls /kylin/kylin_metadata/cube_job_ephemeral_lock/cube_jm
[]
ls /kylin/kylin_metadata/cube_job_lock
[cube_cm, cube_vm, cube_jm]
ls /kylin/kylin_metadata/cube_job_lock/cube_cm
[f888380e-9ff4-98f5-2df4-1ae71e045f93]
ls /kylin/kylin_metadata/cube_job_lock/cube_vm
[fc186bd9-1186-6ed4-e58c-bbbf6dd8ef74]
ls /kylin/kylin_metadata/cube_job_lock/cube_jm
[d1a6475a-9ab2-5ee4-6714-f395e20cfc01]
{code}
* The job IDs of the 10 running jobs in the cluster, none of which shows any progress:
** 169f75fa-a02f-221b-fc48-037bc7a842d0
** 0b5dae1b-6faf-66c5-71dc-86f5b820f1c4
** 00924699-8b51-8091-6e71-34ccfeba3a98
** 4620192a-71e1-16dd-3b05-44d7f9144ad4
** 416355c2-a3d7-57eb-55c6-c042aa256510
** 12750aea-3b96-c817-64e8-bf893d8c120f
** 42819dde-5857-fd6b-b075-439952f47140
** 00128937-bd4a-d6c1-7a4e-744dee946f67
** 46a0233f-217e-9155-725b-c815ad77ba2c
** 062150ba-bacd-6644-4801-3a51b260d1c5
As you can see, the 10 jobs that are actually running do not possess the locks and thus cannot make any progress. On the other hand, the 3 jobs that do possess the locks are not running and thus can never proceed and release them. This is a deadlock that completely stalls the cluster.
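The stuck state described above can be sketched as a small model. This is only an illustration of the reported deadlock condition, not Kylin's actual scheduler code; the class name, the {{MAX_RUNNING_JOBS}} capacity of 10, and the job names are assumptions taken from the observations in this report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class KylinDeadlockSketch {
    // Assumed scheduler capacity, matching the 10 stuck running jobs observed
    static final int MAX_RUNNING_JOBS = 10;

    /**
     * A job can make progress only if it is RUNNING and holds its cube lock,
     * or if it is PENDING and a scheduler slot is free so it can start.
     */
    static boolean anyJobCanProgress(Set<String> runningJobs, Set<String> lockHolders) {
        for (String job : runningJobs) {
            if (lockHolders.contains(job)) {
                return true; // running AND holding its lock: can proceed
            }
        }
        // a pending lock holder could start only if the scheduler had a free slot
        return runningJobs.size() < MAX_RUNNING_JOBS;
    }

    public static void main(String[] args) {
        // 10 RUNNING jobs, none of which holds a cube lock
        Set<String> running = new HashSet<>();
        for (int i = 0; i < MAX_RUNNING_JOBS; i++) {
            running.add("running-" + i);
        }
        // the 3 cube locks are held by jobs that are still PENDING
        Set<String> lockHolders =
                new HashSet<>(Arrays.asList("pending-a", "pending-b", "pending-c"));

        // prints "any job can progress: false" -> the cluster is deadlocked
        System.out.println("any job can progress: " + anyJobCanProgress(running, lockHolders));
    }
}
```

Under this model no job ever satisfies either condition: the running jobs retry the lock forever, and the lock holders never get a slot, which matches the repeated "will try after one minute" and "too many jobs running" log lines above.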
We have been observing this behavior in 3.0.0 (where rolling back https://issues.apache.org/jira/browse/KYLIN-4165 resolved the issue), and now in 3.1.0 as well. It was originally reported in the comments of https://issues.apache.org/jira/browse/KYLIN-4348.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)