Posted to dev@kylin.apache.org by ShaoFeng Shi <sh...@apache.org> on 2016/07/03 10:05:48 UTC

Re: Some discussion on using Kafka as a data source in Kylin

Hi Copperfield,

To be honest, Kylin's current streaming implementation is closer to a
proof of concept; we plan to make it more scalable and robust. Would you
mind filing a JIRA for each requirement or bug you found? We appreciate
all the input, thank you!

On July 1, 2016 at 4:40 PM, Amuro Copperfield <xw...@gmail.com> wrote:

>
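> Hello,
>
>         There are quite a few issues and I was afraid I could not
> express them clearly in English, so I am writing in Chinese; please
> forgive me. If that is inconvenient, I can follow up with an English
> version.
>
>         I have recently been using Kafka as a data source to build cubes
> in Kylin, and I ran into some problems:
>
>         1.
> I already mentioned this on WeChat: once Kafka data has been imported
> through the Web UI, it can be neither deleted nor modified, whereas Hive
> tables at least have an Unload option. This is extremely inconvenient.
> If I fill in something wrong while loading the data, for example a data
> type, I have to start over under a different name or edit the entries
> under metadata.
>
>         2.
> I studied the code of the Kafka module, and the Input part looks
> problematic to me. It is not a bug or an exception; I just think it is
> not robust enough. In KafkaStreamingInput.java, each thread corresponds
> to one Kafka partition, and a thread exits (ends its loop) only when the
> timestamp-column value of a fetched message is greater than the cube
> build's end time + margin.
>        I have run into the following cases:
>        (1) When data is unevenly distributed across the Kafka cluster,
> the slowest partition sets the pace (the barrel effect): some threads
> are still consuming data long after the expected finish time, so the
> whole build cannot complete on time.
>        (2) If the Kafka cluster is unstable to the point that some
> partition has no data at all, Kylin's build process falls into an
> infinite loop.
>
>        For real-time workloads, delays like these make latency
> unpredictable.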
> --
> Best regards,
> Amuro Copperfield
>
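
To make point 2 in the quoted message concrete, here is a minimal sketch of
a per-partition loop that keeps the exit condition described above
(timestamp > end + margin) and adds a wall-clock deadline, so a slow or
empty partition cannot block the build indefinitely. This is not Kylin's
KafkaStreamingInput code: the class name, the fetch callback, the injected
clock, and maxWaitMillis are all hypothetical and exist only for
illustration.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class BoundedPartitionReader {

    /** Why the loop stopped. */
    public enum StopReason { PASSED_END, DEADLINE }

    /**
     * Hypothetical sketch of one partition's consume loop.
     *
     * @param fetch         returns the timestamp of the next message, or null
     *                      when the partition currently has no data (a stand-in
     *                      for a real Kafka fetch)
     * @param clock         wall-clock source in milliseconds (injected so the
     *                      loop can be exercised without real waiting)
     * @param endTime       end of the cube build's time range
     * @param margin        extra slack allowed past endTime
     * @param maxWaitMillis assumed upper bound on how long the loop may run
     */
    public static StopReason consume(Supplier<Long> fetch, LongSupplier clock,
                                     long endTime, long margin, long maxWaitMillis) {
        long deadline = clock.getAsLong() + maxWaitMillis;
        while (clock.getAsLong() < deadline) {
            Long ts = fetch.get();
            if (ts == null) {
                // Empty partition: keep polling, but only until the deadline.
                // Real code would sleep or back off here instead of spinning.
                continue;
            }
            if (ts > endTime + margin) {
                // The exit condition described in the quoted message.
                return StopReason.PASSED_END;
            }
            // ts <= endTime + margin: the message would be handed to the build here.
        }
        // Slow or empty partition: give up instead of looping forever.
        return StopReason.DEADLINE;
    }
}
```

Whether a build that stops on the deadline should fail, retry, or proceed
with partial data is a separate design decision that this sketch leaves
open.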



-- 
Best regards,

Shaofeng Shi