You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kylin.apache.org by dong wang <el...@gmail.com> on 2015/03/18 12:11:58 UTC
Too high cardinality is not suitable for dictionary
I rebuilt the source code with the fix for issue 630, then, I select 2
segments to merge, and each segment contains just one day's data, however,
the following error occurs,
org.apache.kylin.job.exception.ExecuteException:
org.apache.kylin.job.exception.ExecuteException:
java.lang.IllegalArgumentException: Too high cardinality is not suitable
for dictionary -- DEFAULT.TEST.MYDATE cardinality: 7304854
is there any suggestion about it?
Re: Too high cardinality is not suitable for dictionary
Posted by Li Yang <li...@apache.org>.
I suspect the intermediate table was not read correctly.
Look at the 2nd build step, it extract distinct values of every column on
fact table that requires dictionary. Related code is
FactDistinctColumnsMapper, FactDistinctColumnsReducer. In the output dir,
find the text file (named after column name) containing all your date
strings. Confirm if the content is correct or not.
If bad date is there, then FactDistinctColumnsMapper had problem reading
the intermediate hive table. Further troubleshooting will be writing a unit
test to read table use hcatalog and debug through it.
Cheers
Yang
On Thu, Mar 19, 2015 at 9:57 PM, Shi, Shaofeng <sh...@ebay.com> wrote:
> Thanks; so far I¹m not sure whether it is an regression of KYLIN-630; Will
> discuss this with Yang tomorrow;
>
> On 3/19/15, 6:00 PM, "dong wang" <el...@gmail.com> wrote:
>
> >all the data are dumped from MySQL, and load into HIVE successfully, thus,
> >the data format should be right~ and after applying the fix of issue-630,
> >I
> >cannot even build one day's data incrementally due to the error mentioned
> >above~
> >
> >2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
> >
> >> 1, the data in kylin_intermediate_* are like:
> >> 2015-02-02 0 409 13619 432 1267 2 13
> >>34
> >> 59 0 39 0
> >> 2015-02-02 0 5534 13943 432 1259 1 17
> >>40
> >> 73 0 1 0
> >> 2015-02-02 0 845 14194 461 1245 1 17
> >>38
> >> 66 0 1 0
> >> 2015-02-02 0 409 13617 432 1227 2 13
> >>34
> >> 59 0 276 2
> >> 2015-02-02 0 19 11539 387 1084 2 15
> >>35
> >> 67 0 7 0
> >> 2015-02-02 0 1221 12985 387 1079 2 15
> >>39
> >> 62 0 12 0
> >> 2015-02-02 0 51 11152 387 1076 2 15
> >>35
> >> 67 0 770 0
> >> 2015-02-02 0 166 11148 387 1057 2 15
> >>35
> >> 67 0 282 0
> >> 2015-02-02 0 5157 11295 397 810 2 15
> >>35
> >> 67 0 5 0
> >> 2015-02-02 0 151 11659 397 807 1 17
> >>36
> >> 63 0 1 0
> >>
> >> 2, select * from test where mydate='8549-07-10' limit 10; return empty.
> >>
> >> 3, another thing is that we haven't found incorrect data for the sum
> >> result for each day, it only affects segment merging and building a new
> >> segment
> >>
>
>
Re: Too high cardinality is not suitable for dictionary
Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Thanks; so far I¹m not sure whether it is an regression of KYLIN-630; Will
discuss this with Yang tomorrow;
On 3/19/15, 6:00 PM, "dong wang" <el...@gmail.com> wrote:
>all the data are dumped from MySQL, and load into HIVE successfully, thus,
>the data format should be right~ and after applying the fix of issue-630,
>I
>cannot even build one day's data incrementally due to the error mentioned
>above~
>
>2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
>
>> 1, the data in kylin_intermediate_* are like:
>> 2015-02-02 0 409 13619 432 1267 2 13
>>34
>> 59 0 39 0
>> 2015-02-02 0 5534 13943 432 1259 1 17
>>40
>> 73 0 1 0
>> 2015-02-02 0 845 14194 461 1245 1 17
>>38
>> 66 0 1 0
>> 2015-02-02 0 409 13617 432 1227 2 13
>>34
>> 59 0 276 2
>> 2015-02-02 0 19 11539 387 1084 2 15
>>35
>> 67 0 7 0
>> 2015-02-02 0 1221 12985 387 1079 2 15
>>39
>> 62 0 12 0
>> 2015-02-02 0 51 11152 387 1076 2 15
>>35
>> 67 0 770 0
>> 2015-02-02 0 166 11148 387 1057 2 15
>>35
>> 67 0 282 0
>> 2015-02-02 0 5157 11295 397 810 2 15
>>35
>> 67 0 5 0
>> 2015-02-02 0 151 11659 397 807 1 17
>>36
>> 63 0 1 0
>>
>> 2, select * from test where mydate='8549-07-10' limit 10; return empty.
>>
>> 3, another thing is that we haven't found incorrect data for the sum
>> result for each day, it only affects segment merging and building a new
>> segment
>>
Re: Too high cardinality is not suitable for dictionary
Posted by dong wang <el...@gmail.com>.
all the data are dumped from MySQL, and load into HIVE successfully, thus,
the data format should be right~ and after applying the fix of issue-630, I
cannot even build one day's data incrementally due to the error mentioned
above~
2015-03-19 17:54 GMT+08:00 dong wang <el...@gmail.com>:
> 1, the data in kylin_intermediate_* are like:
> 2015-02-02 0 409 13619 432 1267 2 13 34
> 59 0 39 0
> 2015-02-02 0 5534 13943 432 1259 1 17 40
> 73 0 1 0
> 2015-02-02 0 845 14194 461 1245 1 17 38
> 66 0 1 0
> 2015-02-02 0 409 13617 432 1227 2 13 34
> 59 0 276 2
> 2015-02-02 0 19 11539 387 1084 2 15 35
> 67 0 7 0
> 2015-02-02 0 1221 12985 387 1079 2 15 39
> 62 0 12 0
> 2015-02-02 0 51 11152 387 1076 2 15 35
> 67 0 770 0
> 2015-02-02 0 166 11148 387 1057 2 15 35
> 67 0 282 0
> 2015-02-02 0 5157 11295 397 810 2 15 35
> 67 0 5 0
> 2015-02-02 0 151 11659 397 807 1 17 36
> 63 0 1 0
>
> 2, select * from test where mydate='8549-07-10' limit 10; return empty.
>
> 3, another thing is that we haven't found incorrect data for the sum
> result for each day, it only affects segment merging and building a new
> segment
>
Re: Too high cardinality is not suitable for dictionary
Posted by dong wang <el...@gmail.com>.
1, the data in kylin_intermediate_* are like:
2015-02-02 0 409 13619 432 1267 2 13 34
59 0 39 0
2015-02-02 0 5534 13943 432 1259 1 17 40
73 0 1 0
2015-02-02 0 845 14194 461 1245 1 17 38
66 0 1 0
2015-02-02 0 409 13617 432 1227 2 13 34
59 0 276 2
2015-02-02 0 19 11539 387 1084 2 15 35
67 0 7 0
2015-02-02 0 1221 12985 387 1079 2 15 39
62 0 12 0
2015-02-02 0 51 11152 387 1076 2 15 35
67 0 770 0
2015-02-02 0 166 11148 387 1057 2 15 35
67 0 282 0
2015-02-02 0 5157 11295 397 810 2 15 35
67 0 5 0
2015-02-02 0 151 11659 397 807 1 17 36
63 0 1 0
2, select * from test where mydate='8549-07-10' limit 10; return empty.
3, another thing is that we haven't found incorrect data for the sum result
for each day, it only affects segment merging and building a new segment
Re: Too high cardinality is not suitable for dictionary
Posted by "Shi, Shaofeng" <sh...@ebay.com>.
From the log we see the data that Kylin read is not standard date, I doubt
there is dirty data:
8549-07-10
8621-07-06
9994-04-05
…
What’s the file format of your source hive table? If you run “select *
from kylin_intermediate_*”, are the columns be separated clearly?
On 3/19/15, 4:44 PM, "dong wang" <el...@gmail.com> wrote:
>Hi shaofeng, the following log should be what you mentioned:
>
>[pool-7-thread-1]:[2015-03-19
>16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDiction
>aryFromValueList(DictionaryGenerator.java:75)]
>- Dictionary value samples: 8549-07-10
>=>3122651, 8621-07-06=>3148944, 9994-04-05=>3650330, 9808-04-14=>3582404,
>5012-02-14=>1830641
>[pool-7-thread-1]:[2015-03-19
>16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDiction
>aryFromValueList(DictionaryGenerator.java:76)]
>- Dictionary cardinality 7304854
>
>2015-03-19 13:20 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:
>
>> Just before this exception, there should be some log saying "Dictionary
>> value samples: “; Could you please find and paste that log msg?
>>
>> On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:
>>
>> >for DEFAULT.TEST.MYDATE, select count(distinct(mydate)) from test,
>> >returns 533, which represents all the days till now
>> >for kylin_intermediate_*.MYDATE, select count(distinct())) from
>> >kylin_intermediate_*, returns 517, which indicates the first segment of
>> >the
>> >cube, and the segment contains 517 days' data
>> >
>> >2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
>> >
>> >> can you use hive to get the distinct count values in
>>DEFAULT.TEST.MYDATE
>> >> and kylin_intermediate_*.MYDATE?
>> >>
>> >> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
>> >>wrote:
>> >>
>> >> > 1, in Hive, my_date is indeeded DATE type, it means which day the
>> >>records
>> >> > belong to
>> >> > 2, it is certain that the pattern for this column "mydate" is
>> >> "yyyy-MM-dd",
>> >> > no "HH", "MM", "SS" at all for my_date
>> >> > 3, for the kylin_intermediate_* table, I'm sure the data for the
>> >>column
>> >> is
>> >> > like "2015-03-17"
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Regards,
>> >>
>> >> *Bin Mahone | 马洪宾*
>> >> Apache Kylin: http://kylin.io
>> >> Github: https://github.com/binmahone
>> >>
>>
>>
Re: Too high cardinality is not suitable for dictionary
Posted by dong wang <el...@gmail.com>.
Hi shaofeng, the following log should be what you mentioned:
[pool-7-thread-1]:[2015-03-19
16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:75)]
- Dictionary value samples: 8549-07-10
=>3122651, 8621-07-06=>3148944, 9994-04-05=>3650330, 9808-04-14=>3582404,
5012-02-14=>1830641
[pool-7-thread-1]:[2015-03-19
16:42:02,980][INFO][org.apache.kylin.dict.DictionaryGenerator.buildDictionaryFromValueList(DictionaryGenerator.java:76)]
- Dictionary cardinality 7304854
2015-03-19 13:20 GMT+08:00 Shi, Shaofeng <sh...@ebay.com>:
> Just before this exception, there should be some log saying "Dictionary
> value samples: “; Could you please find and paste that log msg?
>
> On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:
>
> >for DEFAULT.TEST.MYDATE, select count(distinct(mydate)) from test,
> >returns 533, which represents all the days till now
> >for kylin_intermediate_*.MYDATE, select count(distinct())) from
> >kylin_intermediate_*, returns 517, which indicates the first segment of
> >the
> >cube, and the segment contains 517 days' data
> >
> >2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
> >
> >> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
> >> and kylin_intermediate_*.MYDATE?
> >>
> >> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
> >>wrote:
> >>
> >> > 1, in Hive, my_date is indeeded DATE type, it means which day the
> >>records
> >> > belong to
> >> > 2, it is certain that the pattern for this column "mydate" is
> >> "yyyy-MM-dd",
> >> > no "HH", "MM", "SS" at all for my_date
> >> > 3, for the kylin_intermediate_* table, I'm sure the data for the
> >>column
> >> is
> >> > like "2015-03-17"
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >>
> >> *Bin Mahone | 马洪宾*
> >> Apache Kylin: http://kylin.io
> >> Github: https://github.com/binmahone
> >>
>
>
Re: Too high cardinality is not suitable for dictionary
Posted by "Shi, Shaofeng" <sh...@ebay.com>.
Just before this exception, there should be some log saying "Dictionary
value samples: “; Could you please find and paste that log msg?
On 3/19/15, 1:12 PM, "dong wang" <el...@gmail.com> wrote:
>for DEFAULT.TEST.MYDATE, select count(distinct(mydate)) from test,
>returns 533, which represents all the days till now
>for kylin_intermediate_*.MYDATE, select count(distinct())) from
>kylin_intermediate_*, returns 517, which indicates the first segment of
>the
>cube, and the segment contains 517 days' data
>
>2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
>
>> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
>> and kylin_intermediate_*.MYDATE?
>>
>> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com>
>>wrote:
>>
>> > 1, in Hive, my_date is indeeded DATE type, it means which day the
>>records
>> > belong to
>> > 2, it is certain that the pattern for this column "mydate" is
>> "yyyy-MM-dd",
>> > no "HH", "MM", "SS" at all for my_date
>> > 3, for the kylin_intermediate_* table, I'm sure the data for the
>>column
>> is
>> > like "2015-03-17"
>> >
>>
>>
>>
>> --
>> Regards,
>>
>> *Bin Mahone | 马洪宾*
>> Apache Kylin: http://kylin.io
>> Github: https://github.com/binmahone
>>
Re: Too high cardinality is not suitable for dictionary
Posted by dong wang <el...@gmail.com>.
for DEFAULT.TEST.MYDATE, select count(distinct(mydate)) from test,
returns 533, which represents all the days till now
for kylin_intermediate_*.MYDATE, select count(distinct())) from
kylin_intermediate_*, returns 517, which indicates the first segment of the
cube, and the segment contains 517 days' data
2015-03-19 1:20 GMT+08:00 hongbin ma <ma...@apache.org>:
> can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
> and kylin_intermediate_*.MYDATE?
>
> On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com> wrote:
>
> > 1, in Hive, my_date is indeeded DATE type, it means which day the records
> > belong to
> > 2, it is certain that the pattern for this column "mydate" is
> "yyyy-MM-dd",
> > no "HH", "MM", "SS" at all for my_date
> > 3, for the kylin_intermediate_* table, I'm sure the data for the column
> is
> > like "2015-03-17"
> >
>
>
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>
Re: Too high cardinality is not suitable for dictionary
Posted by hongbin ma <ma...@apache.org>.
can you use hive to get the distinct count values in DEFAULT.TEST.MYDATE
and kylin_intermediate_*.MYDATE?
On Wed, Mar 18, 2015 at 6:05 AM, dong wang <el...@gmail.com> wrote:
> 1, in Hive, my_date is indeeded DATE type, it means which day the records
> belong to
> 2, it is certain that the pattern for this column "mydate" is "yyyy-MM-dd",
> no "HH", "MM", "SS" at all for my_date
> 3, for the kylin_intermediate_* table, I'm sure the data for the column is
> like "2015-03-17"
>
--
Regards,
*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
Re: Too high cardinality is not suitable for dictionary
Posted by dong wang <el...@gmail.com>.
1, in Hive, my_date is indeeded DATE type, it means which day the records
belong to
2, it is certain that the pattern for this column "mydate" is "yyyy-MM-dd",
no "HH", "MM", "SS" at all for my_date
3, for the kylin_intermediate_* table, I'm sure the data for the column is
like "2015-03-17"
Re: Too high cardinality is not suitable for dictionary
Posted by "Zhou, Qianhao" <qi...@ebay.com>.
Hi dong
MYDATE looks like a date to me. And if you are using date like
Œyyyy-MM-dd¹, there won¹t be so high cardinality. So I assume you are
introducing hours even seconds in the date column.
The question is what are you using ³MYDATE² for? Do you really need
the data to be so specific?
As a workaround, you can split the date into several derived column,
like year month date, hour(hour can be aggregated into 4 groups, like 0-6
7-12 13-18 19-24)
Best Regard
Zhou QianHao
On 3/18/15, 7:11 PM, "dong wang" <el...@gmail.com> wrote:
>I rebuilt the source code with the fix for issue 630, then, I select 2
>segments to merge, and each segment contains just one day's data, however,
>the following error occurs,
>
>org.apache.kylin.job.exception.ExecuteException:
>org.apache.kylin.job.exception.ExecuteException:
>java.lang.IllegalArgumentException: Too high cardinality is not suitable
>for dictionary -- DEFAULT.TEST.MYDATE cardinality: 7304854
>
>is there any suggestion about it?