Posted to issues@kylin.apache.org by "Shaofeng SHI (JIRA)" <ji...@apache.org> on 2019/01/29 08:41:00 UTC

[jira] [Commented] (KYLIN-2800) All dictionaries should be built based on the flat hive table

    [ https://issues.apache.org/jira/browse/KYLIN-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754735#comment-16754735 ] 

Shaofeng SHI commented on KYLIN-2800:
-------------------------------------

The root cause is in the merge cuboid step. When Kylin detects that a column's dictionary comes from a lookup table, it does not decode the values with the old dictionary and re-encode them with the new dictionary, because it always assumes the lookup table is incremental. If the lookup table is not incremental, this merge produces incorrect data after the segments are merged.

 

See:

https://github.com/apache/kylin/blob/2.0.x/engine-mr/src/main/java/org/apache/kylin/engine/mr/steps/MergeCuboidMapper.java#L273
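
To make the failure mode concrete, here is a minimal sketch of what goes wrong when encoded IDs are copied across dictionaries without re-encoding. It assumes a dictionary assigns IDs by sorted position of the values; the class and method names (DictionaryMergeSketch, buildDictionary, reencode) are hypothetical and are not Kylin's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Kylin's dictionary implementation.
class DictionaryMergeSketch {

    // Assumption: a dictionary maps each distinct value to its position in sorted order.
    static List<String> buildDictionary(Collection<String> values) {
        return new ArrayList<>(new TreeSet<>(values));
    }

    static int encode(List<String> dict, String value) {
        return dict.indexOf(value);
    }

    static String decode(List<String> dict, int id) {
        return dict.get(id);
    }

    // Safe merge path: decode with the old dictionary, then encode with the new one.
    static int reencode(List<String> oldDict, List<String> newDict, int oldId) {
        return encode(newDict, decode(oldDict, oldId));
    }

    public static void main(String[] args) {
        // Lookup table PKs when the old segment was built.
        List<String> oldDict = buildDictionary(Arrays.asList("A", "C"));
        // A record with PK "B" is inserted later; "B" is not the max value.
        List<String> newDict = buildDictionary(Arrays.asList("A", "B", "C"));

        int oldId = encode(oldDict, "C");

        // Copying the ID unchanged silently turns "C" into "B".
        System.out.println("copied as-is: " + decode(newDict, oldId));
        // Re-encoding preserves the original value.
        System.out.println("re-encoded:   " + decode(newDict, reencode(oldDict, newDict, oldId)));
    }
}
```

In this sketch the old ID for "C" is 1, but ID 1 in the new dictionary refers to "B", so skipping the decode and re-encode step changes the stored value.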

> All dictionaries should be built based on the flat hive table
> -------------------------------------------------------------
>
>                 Key: KYLIN-2800
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2800
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: zhengdong
>            Assignee: zhengdong
>            Priority: Major
>             Fix For: v2.2.0
>
>         Attachments: 0001-KYLIN-2800-All-dictionaries-should-be-built-based-on.patch
>
>
> After KYLIN-2457, we still sometimes got wrong query results after a merge job finished.
> We finally realized that the root cause is that we always use the lookup table as the source data when building dictionaries for FK columns.
> However, an incremental lookup table does not imply a sequential, incremental PK. If a new record is inserted into the lookup table and its PK value is not the maximum, the IDs in the new dictionary can change for every PK value larger than the newly inserted one. What's more, using the lookup table as the source data for an FK column's dictionary may have a performance advantage for the merge job, but it can also run into the too-big-dictionary problem for large lookup tables. And we would have to add some validation rules to ensure the PK values are sequential and incremental.
> On the other hand, we could simply use the flat hive table as the data source for all dictionaries.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)