You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kylin.apache.org by dong wang <el...@gmail.com> on 2015/03/18 12:11:58 UTC

Too high cardinality is not suitable for dictionary

I rebuilt the source code with the fix for issue 630, then, I select 2
segments to merge, and each segment contains just one day's data, however,
the following error occurs,

org.apache.kylin.job.exception.ExecuteException:
org.apache.kylin.job.exception.ExecuteException:
java.lang.IllegalArgumentException: Too high cardinality is not suitable
for dictionary -- DEFAULT.TEST.MYDATE cardinality: 7304854

is there any suggestion about it?

Re: Too high cardinality is not suitable for dictionary

Posted by Li Yang <li...@apache.org>.
I suspect the intermediate table was not read correctly.

Look at the 2nd build step, it extract distinct values of every column on
fact table that requires dictionary. Related code is
FactDistinctColumnsMapper, FactDistinctColumnsReducer. In the output dir,
find the text file (named after column name) containing all your date
strings. Confirm if the content is correct or not.

If bad date is there, then FactDistinctColumnsMapper had problem reading
the intermediate hive table. Further troubleshooting will be writing a unit
test to read table use hcatalog and debug through it.

Cheers
Yang

On Thu, Mar 19, 2015 at 9:57 PM, Shi, Shaofeng <sh...@ebay.com> wrote:

> Thanks; so far I¹m not sure whether it is an regression of KYLIN-630; Will
> discuss this with Yang tomorrow;
>
> On 3/19/15, 6:00 PM, "dong wang" <el...@gmail.com> wrote:
>
> >all the data are dumped from MySQL, and load into HIVE successfully, thus,
> >the data format should be right~ and after applying the fix of issue-630,
> >I
> >cannot even build one day's data incrementally due to the error mentioned
> >above~
> >
> >2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
> >
> >> 1, the data in kylin_intermediate_* are like:
> >> 2015-02-02      0       409     13619   432     1267    2       13
> >>34
> >>      59      0       39      0
> >> 2015-02-02      0       5534    13943   432     1259    1       17
> >>40
> >>      73      0       1       0
> >> 2015-02-02      0       845     14194   461     1245    1       17
> >>38
> >>      66      0       1       0
> >> 2015-02-02      0       409     13617   432     1227    2       13
> >>34
> >>      59      0       276     2
> >> 2015-02-02      0       19      11539   387     1084    2       15
> >>35
> >>      67      0       7       0
> >> 2015-02-02      0       1221    12985   387     1079    2       15
> >>39
> >>      62      0       12      0
> >> 2015-02-02      0       51      11152   387     1076    2       15
> >>35
> >>      67      0       770     0
> >> 2015-02-02      0       166     11148   387     1057    2       15
> >>35
> >>      67      0       282     0
> >> 2015-02-02      0       5157    11295   397     810     2       15
> >>35
> >>      67      0       5       0
> >> 2015-02-02      0       151     11659   397     807     1       17
> >>36
> >>      63      0       1       0
> >>
> >> 2, select * from test where mydate='8549-07-10' limit 10;  return empty.
> >>
> >> 3, another thing is that we haven't found incorrect data for the sum
> >> result for each day, it only affects segment merging and building a new
> >> segment
> >>
>
>

Re: Too high cardinality is not suitable for dictionary

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Thanks; so far I¹m not sure whether it is an regression of KYLIN-630; Will
discuss this with Yang tomorrow;

On 3/19/15, 6:00 PM, "dong wang" <el...@gmail.com> wrote:

>all the data are dumped from MySQL, and load into HIVE successfully, thus,
>the data format should be right~ and after applying the fix of issue-630,
>I
>cannot even build one day's data incrementally due to the error mentioned
>above~
>
>2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
>
>> 1, the data in kylin_intermediate_* are like:
>> 2015-02-02      0       409     13619   432     1267    2       13
>>34
>>      59      0       39      0
>> 2015-02-02      0       5534    13943   432     1259    1       17
>>40
>>      73      0       1       0
>> 2015-02-02      0       845     14194   461     1245    1       17
>>38
>>      66      0       1       0
>> 2015-02-02      0       409     13617   432     1227    2       13
>>34
>>      59      0       276     2
>> 2015-02-02      0       19      11539   387     1084    2       15
>>35
>>      67      0       7       0
>> 2015-02-02      0       1221    12985   387     1079    2       15
>>39
>>      62      0       12      0
>> 2015-02-02      0       51      11152   387     1076    2       15
>>35
>>      67      0       770     0
>> 2015-02-02      0       166     11148   387     1057    2       15
>>35
>>      67      0       282     0
>> 2015-02-02      0       5157    11295   397     810     2       15
>>35
>>      67      0       5       0
>> 2015-02-02      0       151     11659   397     807     1       17
>>36
>>      63      0       1       0
>>
>> 2, select * from test where mydate='8549-07-10' limit 10;  return empty.
>>
>> 3, another thing is that we haven't found incorrect data for the sum
>> result for each day, it only affects segment merging and building a new
>> segment
>>


Re: Too high cardinality is not suitable for dictionary

Posted by dong wang <el...@gmail.com>.
all the data are dumped from MySQL, and load into HIVE successfully, thus,
the data format should be right~ and after applying the fix of issue-630, I
cannot even build one day's data incrementally due to the error mentioned
above~

2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:

> 1, the data in kylin_intermediate_* are like:
> 2015-02-02      0       409     13619   432     1267    2       13      34
>      59      0       39      0
> 2015-02-02      0       5534    13943   432     1259    1       17      40
>      73      0       1       0
> 2015-02-02      0       845     14194   461     1245    1       17      38
>      66      0       1       0
> 2015-02-02      0       409     13617   432     1227    2       13      34
>      59      0       276     2
> 2015-02-02      0       19      11539   387     1084    2       15      35
>      67      0       7       0
> 2015-02-02      0       1221    12985   387     1079    2       15      39
>      62      0       12      0
> 2015-02-02      0       51      11152   387     1076    2       15      35
>      67      0       770     0
> 2015-02-02      0       166     11148   387     1057    2       15      35
>      67      0       282     0
> 2015-02-02      0       5157    11295   397     810     2       15      35
>      67      0       5       0
> 2015-02-02      0       151     11659   397     807     1       17      36
>      63      0       1       0
>
> 2, select * from test where mydate='8549-07-10' limit 10;  return empty.
>
> 3, another thing is that we haven't found incorrect data for the sum
> result for each day, it only affects segment merging and building a new
> segment
>

Re: Too high cardinality is not suitable for dictionary

Posted by dong wang <el...@gmail.com>.
1, the data in kylin_intermediate_* are like:
2015-02-02      0       409     13619   432     1267    2       13      34
     59      0       39      0
2015-02-02      0       5534    13943   432     1259    1       17      40
     73      0       1       0
2015-02-02      0       845     14194   461     1245    1       17      38
     66      0       1       0
2015-02-02      0       409     13617   432     1227    2       13      34
     59      0       276     2
2015-02-02      0       19      11539   387     1084    2       15      35
     67      0       7       0
2015-02-02      0       1221    12985   387     1079    2       15      39
     62      0       12      0
2015-02-02      0       51      11152   387     1076    2       15      35
     67      0       770     0
2015-02-02      0       166     11148   387     1057    2       15      35
     67      0       282     0
2015-02-02      0       5157    11295   397     810     2       15      35
     67      0       5       0
2015-02-02      0       151     11659   397     807     1       17      36
     63      0       1       0

2, select * from test where mydate='8549-07-10' limit 10;  return empty.

3, another thing is that we haven't found incorrect data for the sum result
for each day, it only affects segment merging and building a new segment

Re: Too high cardinality is not suitable for dictionary

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
From the log we see the data that Kylin read is not standard date, I doubt
there is dirty data:
8549-07-10

8621-07-06

9994-04-05

…

What’s the file format of your source hive table? If you run “select *
from  kylin_intermediate_*”, are the columns be separated clearly?


On 3/19/15, 4:44 PM, "dong wang" <el...@gmail.com> wrote:

>Hi shaofeng, the following log should be what you mentioned:
>
>[pool-7-thread-1]:[2015-03-19
>16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDiction
>aryFromValueList(DictionaryGenerator.java:75)]
>- Dictionary value samples: 8549-07-10
>=>3122651, 8621-07-06=>3148944, 9994-04-05=>3650330, 9808-04-14=>3582404,
>5012-02-14=>1830641
>[pool-7-thread-1]:[2015-03-19
>16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDiction
>aryFromValueList(DictionaryGenerator.java:76)]
>- Dictionary cardinality 7304854
>
>2015-03-19 13:20 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:
>
>> Just before this exception, there should be some log saying "Dictionary
>> value samples: “; Could you please find and paste that log msg?
>>
>> On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:
>>
>> >for  DEFAULT.TEST.MYDATE,  select count(distinct(mydate)) from test,
>> >returns 533,  which represents all the days till now
>> >for kylin_intermediate_*.MYDATE,   select count(distinct())) from
>> >kylin_intermediate_*, returns 517, which indicates the first segment of
>> >the
>> >cube, and the segment contains 517 days' data
>> >
>> >2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
>> >
>> >> can you use hive to get the distinct count values in
>>DEFAULT.TEST.MYDATE
>> >> and kylin_intermediate_*.MYDATE?
>> >>
>> >> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
>> >>wrote:
>> >>
>> >> > 1, in Hive, my_date is indeeded DATE type, it means which day the
>> >>records
>> >> > belong to
>> >> > 2, it is certain that the pattern for this column "mydate" is
>> >> "yyyy-MM-dd",
>> >> >  no "HH", "MM", "SS" at all for my_date
>> >> > 3, for the kylin_intermediate_* table, I'm sure the data for the
>> >>column
>> >> is
>> >> > like "2015-03-17"
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >>
>> >> *Bin Mahone | 马洪宾*
>> >> Apache Kylin: http://kylin.io
>> >> Github: https://github.com/binmahone
>> >>
>>
>>


Re: Too high cardinality is not suitable for dictionary

Posted by dong wang <el...@gmail.com>.
Hi shaofeng, the following log should be what you mentioned:

[pool-7-thread-1]:[2015-03-19
16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:75)]
- Dictionary value samples: 8549-07-10
=>3122651, 8621-07-06=>3148944, 9994-04-05=>3650330, 9808-04-14=>3582404,
5012-02-14=>1830641
[pool-7-thread-1]:[2015-03-19
16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:76)]
- Dictionary cardinality 7304854

2015-03-19 13:20 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:

> Just before this exception, there should be some log saying "Dictionary
> value samples: “; Could you please find and paste that log msg?
>
> On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:
>
> >for  DEFAULT.TEST.MYDATE,  select count(distinct(mydate)) from test,
> >returns 533,  which represents all the days till now
> >for kylin_intermediate_*.MYDATE,   select count(distinct())) from
> >kylin_intermediate_*, returns 517, which indicates the first segment of
> >the
> >cube, and the segment contains 517 days' data
> >
> >2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
> >
> >> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
> >> and kylin_intermediate_*.MYDATE?
> >>
> >> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
> >>wrote:
> >>
> >> > 1, in Hive, my_date is indeeded DATE type, it means which day the
> >>records
> >> > belong to
> >> > 2, it is certain that the pattern for this column "mydate" is
> >> "yyyy-MM-dd",
> >> >  no "HH", "MM", "SS" at all for my_date
> >> > 3, for the kylin_intermediate_* table, I'm sure the data for the
> >>column
> >> is
> >> > like "2015-03-17"
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >>
> >> *Bin Mahone | 马洪宾*
> >> Apache Kylin: http://kylin.io
> >> Github: https://github.com/binmahone
> >>
>
>

Re: Too high cardinality is not suitable for dictionary

Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Just before this exception, there should be some log saying "Dictionary
value samples: “; Could you please find and paste that log msg?

On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:

>for  DEFAULT.TEST.MYDATE,  select count(distinct(mydate)) from test,
>returns 533,  which represents all the days till now
>for kylin_intermediate_*.MYDATE,   select count(distinct())) from
>kylin_intermediate_*, returns 517, which indicates the first segment of
>the
>cube, and the segment contains 517 days' data
>
>2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
>
>> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
>> and kylin_intermediate_*.MYDATE?
>>
>> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
>>wrote:
>>
>> > 1, in Hive, my_date is indeeded DATE type, it means which day the
>>records
>> > belong to
>> > 2, it is certain that the pattern for this column "mydate" is
>> "yyyy-MM-dd",
>> >  no "HH", "MM", "SS" at all for my_date
>> > 3, for the kylin_intermediate_* table, I'm sure the data for the
>>column
>> is
>> > like "2015-03-17"
>> >
>>
>>
>>
>> --
>> Regards,
>>
>> *Bin Mahone | 马洪宾*
>> Apache Kylin: http://kylin.io
>> Github: https://github.com/binmahone
>>


Re: Too high cardinality is not suitable for dictionary

Posted by dong wang <el...@gmail.com>.
for  DEFAULT.TEST.MYDATE,  select count(distinct(mydate)) from test,
returns 533,  which represents all the days till now
for kylin_intermediate_*.MYDATE,   select count(distinct())) from
kylin_intermediate_*, returns 517, which indicates the first segment of the
cube, and the segment contains 517 days' data

2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:

> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
> and kylin_intermediate_*.MYDATE?
>
> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com> wrote:
>
> > 1, in Hive, my_date is indeeded DATE type, it means which day the records
> > belong to
> > 2, it is certain that the pattern for this column "mydate" is
> "yyyy-MM-dd",
> >  no "HH", "MM", "SS" at all for my_date
> > 3, for the kylin_intermediate_* table, I'm sure the data for the column
> is
> > like "2015-03-17"
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>

Re: Too high cardinality is not suitable for dictionary

Posted by hongbin ma <ma...@apache.org>.
can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
and kylin_intermediate_*.MYDATE?

On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com> wrote:

> 1, in Hive, my_date is indeeded DATE type, it means which day the records
> belong to
> 2, it is certain that the pattern for this column "mydate" is "yyyy-MM-dd",
>  no "HH", "MM", "SS" at all for my_date
> 3, for the kylin_intermediate_* table, I'm sure the data for the column is
> like "2015-03-17"
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone

Re: Too high cardinality is not suitable for dictionary

Posted by dong wang <el...@gmail.com>.
1, in Hive, my_date is indeeded DATE type, it means which day the records
belong to
2, it is certain that the pattern for this column "mydate" is "yyyy-MM-dd",
 no "HH", "MM", "SS" at all for my_date
3, for the kylin_intermediate_* table, I'm sure the data for the column is
like "2015-03-17"

Re: Too high cardinality is not suitable for dictionary

Posted by "Zhou, Qianhao" <qi...@ebay.com>.
Hi dong
    MYDATE looks like a date to me. And if you are using date like
Œyyyy-MM-dd¹, there won¹t be so high cardinality. So I assume you are
introducing hours even seconds in the date column.
    The question is what are you using ³MYDATE² for? Do you really need
the data to be so specific?
    As a workaround, you can split the date into several derived column,
like year month date, hour(hour can be aggregated into 4 groups, like 0-6
7-12 13-18 19-24)

Best Regard
Zhou QianHao





On 3/18/15, 7:11 PM, "dong wang" <el...@gmail.com> wrote:

>I rebuilt the source code with the fix for issue 630, then, I select 2
>segments to merge, and each segment contains just one day's data, however,
>the following error occurs,
>
>org.apache.kylin.job.exception.ExecuteException:
>org.apache.kylin.job.exception.ExecuteException:
>java.lang.IllegalArgumentException: Too high cardinality is not suitable
>for dictionary -- DEFAULT.TEST.MYDATE cardinality: 7304854
>
>is there any suggestion about it?