Posted to issues@kylin.apache.org by "Dayue Gao (JIRA)" <ji...@apache.org> on 2016/09/14 07:46:21 UTC
[jira] [Commented] (KYLIN-2012) more robust approach to hive schema changes

    [ https://issues.apache.org/jira/browse/KYLIN-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15489712#comment-15489712 ] 

Dayue Gao commented on KYLIN-2012:
----------------------------------

Committed 17569f6 to master.

SchemaChecker is the main workhorse; it prevents dangerous reloads according to the following rules:
* If the table has been used as a fact table, columns used in a cube can't be changed. This means:
** removing or renaming a used column is not allowed
** changing the type of a used column is generally not allowed, except for {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}
** adding, removing, or changing an unused column is OK
* If the table has been used as a lookup table, the old and new schemas must be the same, except for the type changes {{float<=>double}} and {{tinyint<=>smallint<=>integer<=>bigint}}. This means:
** adding, removing, renaming, or reordering a column is not allowed

(PS: I'm aware that KYLIN-1985 could allow some degree of schema change on lookup tables, so the above rule for lookup tables may be too strict.)
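
The rules above can be sketched roughly as follows. This is only an illustration of the checks as described in this comment, not the actual SchemaChecker code from the commit; the class name {{SchemaRuleSketch}}, its method names, and the representation of a schema as an ordered name-to-type map are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the rules described above; NOT the real SchemaChecker.
final class SchemaRuleSketch {
    // Type families within which a change is tolerated (names taken from the comment).
    private static final Set<String> FLOAT_FAMILY =
            new HashSet<>(Arrays.asList("float", "double"));
    private static final Set<String> INT_FAMILY =
            new HashSet<>(Arrays.asList("tinyint", "smallint", "integer", "bigint"));

    // True if the type is unchanged or both types belong to the same family.
    static boolean isCompatibleType(String oldType, String newType) {
        String a = oldType.toLowerCase(Locale.ROOT);
        String b = newType.toLowerCase(Locale.ROOT);
        if (a.equals(b)) {
            return true;
        }
        return (FLOAT_FAMILY.contains(a) && FLOAT_FAMILY.contains(b))
                || (INT_FAMILY.contains(a) && INT_FAMILY.contains(b));
    }

    // Fact table rule: every column used by a cube must still exist under the
    // same name with a compatible type; unused columns are ignored.
    static boolean factTableOk(Map<String, String> oldCols,
                               Map<String, String> newCols,
                               Set<String> usedCols) {
        for (String col : usedCols) {
            String oldType = oldCols.get(col);
            String newType = newCols.get(col);
            if (oldType == null || newType == null || !isCompatibleType(oldType, newType)) {
                return false;
            }
        }
        return true;
    }

    // Lookup table rule: same column names in the same order, compatible types.
    static boolean lookupTableOk(LinkedHashMap<String, String> oldCols,
                                 LinkedHashMap<String, String> newCols) {
        if (oldCols.size() != newCols.size()) {
            return false;
        }
        Iterator<Map.Entry<String, String>> oldIt = oldCols.entrySet().iterator();
        Iterator<Map.Entry<String, String>> newIt = newCols.entrySet().iterator();
        while (oldIt.hasNext()) {
            Map.Entry<String, String> o = oldIt.next();
            Map.Entry<String, String> n = newIt.next();
            if (!o.getKey().equals(n.getKey())
                    || !isCompatibleType(o.getValue(), n.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```

In this sketch, appending an unused column to a fact table passes {{factTableOk}}, while the same change to a lookup table fails {{lookupTableOk}} because the column count differs.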

When a non-empty cube violates these rules, no reloading will be performed. An error message containing details about the violation is shown.

When only empty cubes violate these rules, reloading will succeed. All violating cubes are changed to {{DESCBROKEN}} status (see CubeManager#reloadCubeLocalAt). The status is shown as an orange warning label in the front end so that users can easily find all broken cubes.
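
The two outcomes described above (reject the reload, or reload and mark cubes as broken) could be modelled along these lines. This is a simplified sketch under stated assumptions, not the committed implementation: {{ReloadDecisionSketch}}, its {{Cube}} holder, and the local {{Status}} enum are hypothetical stand-ins, and "empty" is taken here to be a plain boolean flag.

```java
import java.util.List;

// Hypothetical sketch of the reload decision; NOT the actual Kylin code.
final class ReloadDecisionSketch {
    enum Outcome { REJECTED, RELOADED }

    // Local stand-in for the cube statuses mentioned in this comment.
    enum Status { READY, DISABLED, DESCBROKEN }

    static final class Cube {
        final String name;
        final boolean violatesRules; // result of the schema rule check
        final boolean empty;         // assumption: the cube holds no built data
        Status status;

        Cube(String name, boolean violatesRules, boolean empty, Status status) {
            this.name = name;
            this.violatesRules = violatesRules;
            this.empty = empty;
            this.status = status;
        }
    }

    static Outcome reload(List<Cube> cubes) {
        // Any non-empty violating cube blocks the reload; nothing is modified.
        for (Cube c : cubes) {
            if (c.violatesRules && !c.empty) {
                return Outcome.REJECTED;
            }
        }
        // Otherwise the reload proceeds and every violating (empty) cube is marked broken.
        for (Cube c : cubes) {
            if (c.violatesRules) {
                c.status = Status.DESCBROKEN;
            }
        }
        return Outcome.RELOADED;
    }
}
```

The check runs as two passes so that a rejected reload leaves every cube's status untouched.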

Users can edit or drop a broken cube, but can't disable/enable/build/copy it. After the user fixes all the problems in the cube (and model), the cube goes back to DISABLED status. As before, an attempt to save a cube that is still broken will not succeed. Therefore, the DESCBROKEN status can only appear after reloading a changed table.

[~Shaofengshi] [~yimingliu] Do you have time to review the code?

[~zhongjian] I'm not a front-end expert; could you also review the front-end changes?

> more robust approach to hive schema changes
> -------------------------------------------
>
>                 Key: KYLIN-2012
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2012
>             Project: Kylin
>          Issue Type: Bug
>          Components: Metadata, REST Service, Web 
>    Affects Versions: v1.5.3
>            Reporter: Dayue Gao
>            Assignee: Dayue Gao
>
> Our users occasionally want to change an existing cube, for example by adding, renaming, or removing a dimension. Some of these changes require modifications to the source Hive table. So a user changes the table schema and reloads its metadata in Kylin, and then several issues can arise depending on what was changed.
> I ran some schema change tests on 1.5.3; the results after reloading the table are listed below.
> || type of changes || fact table || lookup table ||
> | *minor* | both query and build still work | query can fail or return wrong answers |
> | *major* | fail to load related cube | fail to load related cube |
> {{minor}} changes are those that don't change columns used in cubes, such as inserting/appending a new column or removing/changing an unused column.
> {{major}} changes are the opposite, such as removing, renaming, or changing the type of a used column.
> As the table shows, reloading a changed table is problematic in certain cases. KYLIN-1536 reports a similar problem.
> So what can we do to support this kind of iterative development process (load -> define cube -> build -> reload -> change cube -> rebuild)?
> My first thought was to simply detect and prohibit reloading a used table. The user should be able to see which cube is preventing the reload, and could then drop and recreate the cube after reloading. However, defining a cube is not an easy task (consider editing 100 measures). Forcing users to recreate their cubes over and over again will certainly not make them happy.
> A better idea is to keep a cube editable even if it is broken because some columns changed after reloading. A broken cube can't be built or queried; it can only be edited or dropped. In fact, there is a cube status called {{RealizationStatusEnum.DESCBROKEN}} in the code, but it was never used. We should take advantage of it.
> An enabled cube shouldn't allow schema changes; otherwise an unintentional reload could make it unavailable. Similarly, a disabled but unpurged cube shouldn't allow schema changes, since it still has data in it.


