You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kylin.apache.org by dong wang <el...@gmail.com> on 2015/03/18 12:11:58 UTC

Too high cardinality is not suitable for dictionary

I rebuilt the source code with the fix for issue 630, then, I select 2
segments to merge, and each segment contains just one day's data, however,
the following error occurs,

org.apache.kylin.job.exception.ExecuteException:
org.apache.kylin.job.exception.ExecuteException:
java.lang.IllegalArgumentException: Too high cardinality is not suitable
for dictionary -- DEFAULT.TEST.MYDATE cardinality: 7304854

is there any suggestion about it?

Re: Too high cardinality is not suitable for dictionary

Posted by Li Yang <li...@apache.org>.

I suspect the intermediate table was not read correctly.

Look at the 2nd build step, it extract distinct values of every column on
fact table that requires dictionary. Related code is
FactDistinctColumnsMapper, FactDistinctColumnsReducer. In the output dir,
find the text file (named after column name) containing all your date
strings. Confirm if the content is correct or not.

If bad date is there, then FactDistinctColumnsMapper had problem reading
the intermediate hive table. Further troubleshooting will be writing a unit
test to read table use hcatalog and debug through it.

Cheers
Yang

On Thu, Mar 19, 2015 at 9:57 PM, Shi, Shaofeng <sh...@ebay.com> wrote:

> Thanks; so far I¹m not sure whether it is an regression of KYLIN-630; Will
> discuss this with Yang tomorrow;
>
> On 3/19/15, 6:00 PM, "dong wang" <el...@gmail.com> wrote:
>
> >all the data are dumped from MySQL, and load into HIVE successfully, thus,
> >the data format should be right~ and after applying the fix of issue-630,
> >I
> >cannot even build one day's data incrementally due to the error mentioned
> >above~
> >
> >2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
> >
> >> 1, the data in kylin_intermediate_* are like:
> >> 2015-02-02      0       409     13619   432     1267    2       13
> >>34
> >>      59      0       39      0
> >> 2015-02-02      0       5534    13943   432     1259    1       17
> >>40
> >>      73      0       1       0
> >> 2015-02-02      0       845     14194   461     1245    1       17
> >>38
> >>      66      0       1       0
> >> 2015-02-02      0       409     13617   432     1227    2       13
> >>34
> >>      59      0       276     2
> >> 2015-02-02      0       19      11539   387     1084    2       15
> >>35
> >>      67      0       7       0
> >> 2015-02-02      0       1221    12985   387     1079    2       15
> >>39
> >>      62      0       12      0
> >> 2015-02-02      0       51      11152   387     1076    2       15
> >>35
> >>      67      0       770     0
> >> 2015-02-02      0       166     11148   387     1057    2       15
> >>35
> >>      67      0       282     0
> >> 2015-02-02      0       5157    11295   397     810     2       15
> >>35
> >>      67      0       5       0
> >> 2015-02-02      0       151     11659   397     807     1       17
> >>36
> >>      63      0       1       0
> >>
> >> 2, select * from test where mydate='8549-07-10' limit 10;  return empty.
> >>
> >> 3, another thing is that we haven't found incorrect data for the sum
> >> result for each day, it only affects segment merging and building a new
> >> segment
> >>
>
>