Posted to dev@kylin.apache.org by "lifan.su" <su...@lvwan.com> on 2019/02/13 08:30:58 UTC
Unexpected behavior when joining streaming table and hive table
Hello, I am evaluating Kylin and tried to join a streaming table with a Hive
table, but got unexpected behavior.
All the scripts can be found at
https://gist.github.com/OstCollector/a4ac396e3169aa42a416d96db3021195
(you may need to modify some scripts to match your environment).
Environment:
CentOS 7
Hadoop on CDH-5.8
dedicated Kafka-2.1 (not included in CDH)
How to reproduce this problem:
1. Run gen_station.pl to generate the dimension table data.
2. Run import-data.sh to build the dimension table in Hive.
3. Run factdata.pl and pipe its output into Kafka.
4. Create the tables TEST_WEATHER.STATION_INFO (Hive) and
TEST_WEATHER.WEATHER (streaming) in Kylin.
5. Create the model and cube in Kylin, with the join WEATHER.STATION_ID = STATION.ID.
6. Build the cube.
Expected behavior:
The cube is built correctly and I can get data when querying.
Actual behavior:
On apache-kylin-2.6.0-bin-cdh57: the build failed at step #2 (Create
Intermediate Flat Hive Table).
On apache-kylin-2.5.2-bin-cdh57: got an empty cube.
I also tried this case without streaming, with the format of the timestamp
column changed to "%Y-%m-%d %H:%M:%S" and an additional table to store the
mapping from the timestamp to {hour,day,month,year}_start.
In that case, the cube was built as expected.
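Roughly, the mapping table I mean simply truncates the event time to each period boundary (illustrative Python, not the actual scripts from the gist):

```python
from datetime import datetime, date

def derive_time_columns(ts: datetime) -> dict:
    """Truncate an event timestamp to each period boundary: a sketch of
    the {hour,day,month,quarter,year}_start mapping described above."""
    first_month_of_quarter = 3 * ((ts.month - 1) // 3) + 1
    return {
        "HOUR_START": ts.replace(minute=0, second=0, microsecond=0),
        "DAY_START": ts.date(),
        "MONTH_START": date(ts.year, ts.month, 1),
        "QUARTER_START": date(ts.year, first_month_of_quarter, 1),
        "YEAR_START": date(ts.year, 1, 1),
    }

# 2009-12-31 07:00 falls in Q4, so QUARTER_START is 2009-10-01,
# matching the rows shown in the intermediate table below
print(derive_time_columns(datetime(2009, 12, 31, 7, 0, 0)))
```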
In both failed cases, the intermediate fact table built in Hive at step #2
seems to have the wrong column order.
For example, on version 2.5.2-cdh57, the schema and content of the temp table
are shown below:
CREATE EXTERNAL TABLE IF NOT EXISTS
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact
(
DAY_START date
,YEAR_START date
,STATION_ID string
,QUARTER_START date
,MONTH_START date
,TEMPERATURE bigint
,HOUR_START timestamp
)
STORED AS SEQUENCEFILE
LOCATION
'hdfs://hz-dev-hdfs-service/user/admin/kylin-2/kylin_metadata/kylin-5dbe40eb-55ba-2245-c0b5-1e9efcb67937/kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact';
ALTER TABLE
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact SET
TBLPROPERTIES('auto.purge'='true');
hive> select * from
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact limit
10;
OK
NULL 2010-01-01 2010-01-01 2010-01-01 2010-01-01 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2010-01-01 2010-01-01 2010-01-01 2010-01-01 NULL NULL
NULL 2010-01-01 2010-01-01 2010-01-01 2010-01-01 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2009-01-01 2009-10-01 2009-12-01 2009-12-31 NULL NULL
NULL 2010-01-01 2010-01-01 2010-01-01 2010-01-01 NULL NULL
Time taken: 0.421 seconds, Fetched: 10 row(s)
While the content of the temp file is:
# hdfs dfs -text hdfs://hz-dev-hdfs-service/user/admin/kylin-2/kylin_metadata/kylin-5dbe40eb-55ba-2245-c0b5-1e9efcb67937/kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact/part-m-00001 | head -n 10
19/02/13 11:44:12 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
0030322010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001706
0075762010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:002605
0113882010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:002963
0214922010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001769
0303062010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00432
0377712010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00808
0443462010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001400
0500512010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00342
0537982010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001587
0597122010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00-1309
(the '\x01' delimiter characters were lost when copying)
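For what it's worth, my reading of the mismatch: the raw dump looks like the fields were persisted as STATION_ID, YEAR_START, QUARTER_START, MONTH_START, DAY_START, HOUR_START, TEMPERATURE, while the DDL declares a different order. A small Python sketch (my own inference from the dump, not Hive internals; the row values are hypothetical) of how Hive's purely positional field-to-column mapping then produces the NULLs seen in the query:

```python
# Column order declared in the CREATE EXTERNAL TABLE statement above
declared_order = ["DAY_START", "YEAR_START", "STATION_ID", "QUARTER_START",
                  "MONTH_START", "TEMPERATURE", "HOUR_START"]

# One persisted row, fields joined by the '\x01' delimiter, in the order
# the data actually appears in the dump (inferred, not confirmed)
row = "\x01".join(["003032", "2009-01-01", "2009-10-01", "2009-12-01",
                   "2009-12-31", "2009-12-31 07:00:00", "1706"])

# Hive reads field i as declared column i, so DAY_START receives the
# station id (not a valid date -> NULL), the string column STATION_ID
# receives a date, and TEMPERATURE/HOUR_START receive unparseable
# values -> NULL, exactly the pattern in the query output above
read_back = dict(zip(declared_order, row.split("\x01")))
print(read_back["DAY_START"], read_back["STATION_ID"], read_back["HOUR_START"])
```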
So what am I doing wrong?
--
Sent from: http://apache-kylin.74782.x6.nabble.com/
Re: Unexpected behavior when joining streaming table and hive table
Posted by "lifan.su" <su...@lvwan.com>.
Thank you for your reply. The gist contains the corresponding model
description JSON file, but it was not listed together with the cube.
I have updated the gist with the timestamp column renamed to epoch, to avoid
the syntax error during the query.
But the problem is still reproducible.
Lifan Su
--
Sent from: http://apache-kylin.74782.x6.nabble.com/
Re: Unexpected behavior when joining streaming table and hive table
Posted by ShaoFeng Shi <sh...@apache.org>.
Using hour as the partition column should be fine. From the data, it seems
the declared column sequence does not match the persisted data.
Lifan, I see you posted the cube JSON; could you please also provide the
model's JSON? That would help to analyze the problem. Thank you!
Best regards,
Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/
Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org
Xiaoxiang Yu <xi...@kyligence.io> wrote on Thursday, February 14, 2019 at 10:34 AM:
> Hi, Lifan
>
> After checking your model.json, I found that you use "HOUR_START" as your
> partition_date_column, which is not correct.
> I think you should change it to "timestamp" and try again.
>
> Source code at
> https://github.com/apache/kylin/blob/master/source-kafka/src/main/java/org/apache/kylin/source/kafka/TimedJsonStreamParser.java#L111
>
> If you find any mistake, please let me know.
>
> ----------------
> Best wishes,
> Xiaoxiang Yu
Re: Unexpected behavior when joining streaming table and hive table
Posted by Xiaoxiang Yu <xi...@kyligence.io>.
Hi, Lifan
After checking your model.json, I found that you use "HOUR_START" as your partition_date_column, which is not correct.
I think you should change it to "timestamp" and try again.
Source code at https://github.com/apache/kylin/blob/master/source-kafka/src/main/java/org/apache/kylin/source/kafka/TimedJsonStreamParser.java#L111
If you find any mistake, please let me know.
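Roughly, the parser behaves like this (a simplified Python sketch; the non-timestamp field names are made up for illustration, and this is not the actual implementation linked above):

```python
import json
from datetime import datetime, timezone

# tsColName defaults to "timestamp": the parser pulls this field (epoch
# milliseconds) out of each Kafka JSON message and derives the time
# columns from it, so the partition column setting must refer to this
# source field rather than to a derived column like HOUR_START.
TS_COL = "timestamp"

def parse_message(raw: bytes) -> dict:
    """Sketch of parsing one streaming message into a fact row."""
    msg = json.loads(raw)
    ts = datetime.fromtimestamp(msg[TS_COL] / 1000, tz=timezone.utc)
    return {
        "STATION_ID": msg.get("station_id"),    # hypothetical field name
        "TEMPERATURE": msg.get("temperature"),  # hypothetical field name
        "HOUR_START": ts.replace(minute=0, second=0, microsecond=0),
        "DAY_START": ts.date(),
    }

# 1262329200000 ms = 2010-01-01 07:00:00 UTC
row = parse_message(b'{"timestamp": 1262329200000, '
                    b'"station_id": "003032", "temperature": 1706}')
print(row)
```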
----------------
Best wishes,
Xiaoxiang Yu