Posted to dev@kylin.apache.org by "lifan.su" <su...@lvwan.com> on 2019/02/13 08:30:58 UTC

Unexpected behavior when joining streaming table and hive table

Hello, I am evaluating Kylin and tried to join a streaming table and a Hive
table, but got unexpected behavior.

All the scripts can be found in
https://gist.github.com/OstCollector/a4ac396e3169aa42a416d96db3021195
(you may need to modify some scripts to match your environment)

Environment: 
CentOS 7
Hadoop on CDH 5.8
dedicated Kafka 2.1 (not included in CDH)

How to reproduce this problem:

1. run gen_station.pl to generate the dim table data
2. run import-data.sh to build the dim table in Hive
3. run factdata.pl and pipe its output into Kafka
4. create tables TEST_WEATHER.STATION_INFO (Hive) and
TEST_WEATHER.WEATHER (streaming) in Kylin
5. create a model and cube in Kylin, join WEATHER.STATION_ID = STATION.ID
6. build the cube

Expected behavior:
The cube is built correctly and I can get data when querying.

Actual behavior:
On apache-kylin-2.6.0-bin-cdh57: build failed at step #2 (Create
Intermediate Flat Hive Table)
On apache-kylin-2.5.2-bin-cdh57: got empty cube

I also tried this case without streaming, with the format of the timestamp
column changed to "%Y-%m-%d %H:%M:%S" and an additional table storing the
mapping of timestamp to {hour,day,month,year}_start.
In that case, the cube was built as expected.
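The mapping that the extra table stores can be sketched as follows. This is an illustration only; the `derived_starts` helper is hypothetical, and the column names follow the intermediate-table DDL quoted later in this thread:

```python
from datetime import datetime

# Sketch of the mapping the extra table stores: rolling an event timestamp
# (format "%Y-%m-%d %H:%M:%S") up to the *_START values Kylin derives for
# streaming tables.
def derived_starts(ts: str) -> dict:
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    quarter_month = 3 * ((t.month - 1) // 3) + 1  # first month of the quarter
    return {
        "HOUR_START": t.replace(minute=0, second=0).strftime("%Y-%m-%d %H:%M:%S"),
        "DAY_START": t.strftime("%Y-%m-%d"),
        "MONTH_START": t.replace(day=1).strftime("%Y-%m-%d"),
        "QUARTER_START": t.replace(month=quarter_month, day=1).strftime("%Y-%m-%d"),
        "YEAR_START": t.replace(month=1, day=1).strftime("%Y-%m-%d"),
    }

print(derived_starts("2009-12-31 23:15:42"))
```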


In both failed cases, the intermediate fact table built on Hive in step #2
seems to have the wrong column order.
E.g. on version 2.5.2-cdh57, the schema and content of the temp table are
shown below:

CREATE EXTERNAL TABLE IF NOT EXISTS
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact
(
DAY_START date
,YEAR_START date
,STATION_ID string
,QUARTER_START date
,MONTH_START date
,TEMPERATURE bigint
,HOUR_START timestamp
)
STORED AS SEQUENCEFILE
LOCATION
'hdfs://hz-dev-hdfs-service/user/admin/kylin-2/kylin_metadata/kylin-5dbe40eb-55ba-2245-c0b5-1e9efcb67937/kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact';
ALTER TABLE
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact SET
TBLPROPERTIES('auto.purge'='true');

hive> select * from
kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact limit
10;
OK
NULL    2010-01-01      2010-01-01      2010-01-01      2010-01-01      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2010-01-01      2010-01-01      2010-01-01      2010-01-01      NULL    NULL
NULL    2010-01-01      2010-01-01      2010-01-01      2010-01-01      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2009-01-01      2009-10-01      2009-12-01      2009-12-31      NULL    NULL
NULL    2010-01-01      2010-01-01      2010-01-01      2010-01-01      NULL    NULL
Time taken: 0.421 seconds, Fetched: 10 row(s)

While the content of the temp file is:
# hdfs dfs -text
hdfs://hz-dev-hdfs-service/user/admin/kylin-2/kylin_metadata/kylin-5dbe40eb-55ba-2245-c0b5-1e9efcb67937/kylin_intermediate_weather_f32241e6_53c6_2949_b737_d9a88a4618df_fact/part-m-00001
| head -n 10
19/02/13 11:44:12 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
19/02/13 11:44:12 INFO compress.CodecPool: Got brand-new decompressor
[.deflate]
        0030322010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001706
        0075762010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:002605
        0113882010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:002963
        0214922010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001769
        0303062010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00432
        0377712010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00808
        0443462010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001400
        0500512010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00342
        0537982010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:001587
        0597122010-01-012010-01-012010-01-012010-01-012010-01-01 07:00:00-1309
(the '\x01' separator characters were lost when copied from the terminal)
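To illustrate the mismatch, here is a hypothetical reconstruction (with the lost '\x01' delimiters restored) of how Hive would split the first raw row against the declared schema. The raw field order here is inferred from the dump above: the first field is the Kafka-sourced station id, while the DDL declares DAY_START first, so every column is shifted and the date/bigint/timestamp columns fail to parse, which would produce exactly the NULL pattern seen in the query output:

```python
# Column order as declared in the intermediate table's DDL.
declared_schema = [
    "DAY_START", "YEAR_START", "STATION_ID",
    "QUARTER_START", "MONTH_START", "TEMPERATURE", "HOUR_START",
]

# Hypothetical reconstruction of the first raw row, with Hive's default
# '\x01' field delimiter restored between the values seen in the dump.
raw_row = "\x01".join([
    "003032", "2010-01-01", "2010-01-01", "2010-01-01",
    "2010-01-01", "2010-01-01 07:00:00", "1706",
])

# Split the row the way Hive's SerDe would and pair each value with the
# column the DDL says it belongs to -- the pairing is visibly shifted.
fields = raw_row.split("\x01")
for name, value in zip(declared_schema, fields):
    print(f"{name:13s} -> {value!r}")
```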

So what am I doing wrong?

--
Sent from: http://apache-kylin.74782.x6.nabble.com/

Re: Unexpected behavior when joining streaming table and hive table

Posted by "lifan.su" <su...@lvwan.com>.
Thank you for your reply. The gist contains the corresponding model
description JSON file, but it is not listed together with the cube.
I have updated the gist with the timestamp column renamed to epoch to avoid
the syntax error during query, but the problem is still reproduced.

Lifan Su


Re: Unexpected behavior when joining streaming table and hive table

Posted by ShaoFeng Shi <sh...@apache.org>.
Using hour as the partition column should be fine. From the data, it seems
the declared column sequence does not match the persisted data.

Lifan, I see you posted the cube JSON; could you please also provide the
model's JSON? That would help to analyze the problem. Thank you!

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Work email: shaofeng.shi@kyligence.io
Kyligence Inc: https://kyligence.io/

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Xiaoxiang Yu <xi...@kyligence.io> wrote on Thu, Feb 14, 2019 at 10:34 AM:

> Hi, Lifan
>
> After checking your model.json, I found you used "HOUR_START" as your
> partition_date_column, which is not correct.
> I think you should change it to "timestamp" and try again.
>
> Source code at
> https://github.com/apache/kylin/blob/master/source-kafka/src/main/java/org/apache/kylin/source/kafka/TimedJsonStreamParser.java#L111
>
> If you find any mistake, please let me know.
>
> ----------------
> Best wishes,
> Xiaoxiang Yu
>
>

Re: Unexpected behavior when joining streaming table and hive table

Posted by Xiaoxiang Yu <xi...@kyligence.io>.
Hi, Lifan

After checking your model.json, I found you used "HOUR_START" as your partition_date_column, which is not correct.
I think you should change it to "timestamp" and try again.

Source code at https://github.com/apache/kylin/blob/master/source-kafka/src/main/java/org/apache/kylin/source/kafka/TimedJsonStreamParser.java#L111

If you find any mistake, please let me know.
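For reference, the suggested change might look like this in the model descriptor. This is a sketch only: the field names are assumed from Kylin's model JSON format, and "WEATHER.TIMESTAMP" assumes the streaming table and column names used earlier in this thread.

```json
{
  "partition_desc": {
    "partition_date_column": "WEATHER.TIMESTAMP",
    "partition_type": "APPEND"
  }
}
```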

----------------
Best wishes,
Xiaoxiang Yu 
 
