Posted to issues@kylin.apache.org by "hujiahua (Jira)" <ji...@apache.org> on 2022/02/27 10:26:00 UTC

[jira] [Updated] (KYLIN-5163) Global dictionary build job may produce incomplete dictionary file

     [ https://issues.apache.org/jira/browse/KYLIN-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hujiahua updated KYLIN-5163:
----------------------------
    Summary: Global dictionary build job may produce incomplete dictionary file  (was: Global dictionary build job may produced incomplete dictionary file)

> Global dictionary build job may produce incomplete dictionary file
> ------------------------------------------------------------------
>
>                 Key: KYLIN-5163
>                 URL: https://issues.apache.org/jira/browse/KYLIN-5163
>             Project: Kylin
>          Issue Type: Bug
>          Components: Job Engine
>    Affects Versions: v4.0.1
>            Reporter: hujiahua
>            Priority: Major
>
> The current dictionary Spark build job uses the function `NBucketDictionary.saveBucketDict` to write the dictionary files (the CURR file and the PREV file) for each partition. However, it does not account for the possibility of multiple concurrent tasks for the same partition, which happens with task retries or speculative tasks. Concurrent tasks for one partition can leave behind an incomplete dictionary file, and we have encountered this issue in production.
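To make the failure mode concrete, here is a minimal Python sketch, assuming a writer that deletes any existing file for the partition before writing a new one (the behavior attributed to E1 in the timeline). `save_bucket_dict_naive`, the `CURR` path, and the `crash_after` parameter are hypothetical and are not Kylin's actual code; a killed executor is imitated by stopping partway through the write.

```python
import os
import tempfile


def save_bucket_dict_naive(path, entries, crash_after=None):
    """Hypothetical stand-in for a per-partition dictionary writer.

    Deletes any existing file, recreates it, and writes one entry per line.
    If crash_after is set, the "task" stops after writing that many entries,
    imitating an executor that is killed mid-write.
    """
    if os.path.exists(path):
        os.remove(path)  # wipes whatever another attempt already wrote
    with open(path, "w") as f:
        for i, entry in enumerate(entries):
            if crash_after is not None and i == crash_after:
                return False  # simulated kill: the file is left truncated
            f.write(entry + "\n")
    return True


def demo():
    entries = [f"value-{i}" for i in range(5)]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "CURR")  # hypothetical file for partition 0
        save_bucket_dict_naive(path, entries)                 # retry attempt completes
        save_bucket_dict_naive(path, entries, crash_after=2)  # late original attempt is killed
        with open(path) as f:
            return f.read().splitlines()


if __name__ == "__main__":
    print(demo())  # ['value-0', 'value-1']
```

Running it prints only the first two entries: the late attempt removed the complete file before it was stopped.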
> Here is the issue as a timeline:
> 1. During the dictionary building phase, an executor, E1, was preparing to build the dictionary file for partition 0.
> 2. The driver sent E1 a shutdown message because of YARN resource preemption. The driver then marked the task for partition 0 as failed and launched a retry task on another executor, E2.
> 3. E2 began processing the task and finished it shortly afterwards.
> 4. After E2 finished, E1 began processing the task: it deleted the complete dictionary file created by E2 and created a new dictionary file to write to.
> 5. E1 then received the driver's shutdown message and killed itself, leaving behind an unfinished, incomplete dictionary file.
> 6. After the other partitions finished, the stage was marked successful.
> 7. In the next phase, table encoding uses the incomplete dictionary file, and that stage fails.
>  
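One common way to make such a write safe against duplicate attempts, offered here only as a sketch and not as Kylin's fix, is to have each attempt write to its own temporary file and publish it with an atomic rename once the write is complete. The Python below replays the timeline above with that approach; `save_bucket_dict_atomic`, `attempt_id`, the temporary-file naming, and `crash_after` are all hypothetical. It relies on `os.replace`, which is atomic on a local POSIX filesystem; HDFS rename semantics differ (for example, renaming onto an existing path does not overwrite it by default), so a real fix in the Spark job would need to be adapted.

```python
import os
import tempfile


def save_bucket_dict_atomic(path, entries, attempt_id, crash_after=None):
    """Hypothetical attempt-safe writer (not Kylin's implementation).

    Each attempt writes to its own temporary file next to the target and only
    publishes it with os.replace() after every entry has been written, so an
    attempt killed mid-write never touches the final file.
    """
    tmp_path = f"{path}.attempt-{attempt_id}.tmp"
    with open(tmp_path, "w") as f:
        for i, entry in enumerate(entries):
            if crash_after is not None and i == crash_after:
                return False  # simulated kill: only the temp file is incomplete
            f.write(entry + "\n")
        f.flush()
        os.fsync(f.fileno())  # make the data durable before publishing it
    os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    return True


def demo():
    entries = [f"value-{i}" for i in range(5)]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "CURR")  # hypothetical file for partition 0
        save_bucket_dict_atomic(path, entries, attempt_id=2)                 # E2: retry completes
        save_bucket_dict_atomic(path, entries, attempt_id=1, crash_after=2)  # E1: late attempt is killed
        with open(path) as f:
            return f.read().splitlines()


if __name__ == "__main__":
    print(demo())  # ['value-0', 'value-1', 'value-2', 'value-3', 'value-4']
```

The late, killed attempt leaves only its own temporary file behind, so the published file still holds all five entries.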



--
This message was sent by Atlassian Jira
(v8.20.1#820001)