Posted to user@kylin.apache.org by Andras Nagy <an...@gmail.com> on 2019/06/13 10:01:34 UTC

Kylin streaming questions

Greetings,

I have a few questions related to the new streaming (real-time OLAP)
implementation.

1) Is there a way to have data reprocessed from kafka? E.g. I change a cube
definition and drop the cube (or add a new cube definition) and want to
have data that is still available on kafka to be reprocessed to build the
changed cube (or new cube)? Is this possible?

2) Does the hybrid model work with streaming cubes (to combine two cubes)?

3) What is minimum kafka version required? The tutorial asks to install
Kafka 1.0, is this the minimum required version?

Thank you very much,
Andras

Re: Kylin streaming questions

Posted by Ma Gang <mg...@163.com>.
For lambda mode, it is better to build batch segments only after the related real-time segments have already been saved to HBase, or when you are sure the batch segment is stable enough and no new streaming events will fall into that segment.

If the build cannot be triggered from the UI for a lambda cube, that should be a bug.




Ma Gang
Email: mg4work@163.com

Signature customized by NetEase Mail Master

On 06/27/2019 11:31, Xiaoxiang Yu wrote:
Hi,
   As far as I know, the "best practice" (in my mind) for lambda mode looks like this. Here I use "batch segment" to refer to a segment whose source is Hive, and "streaming segment" for one whose source is Kafka.
   1. To deal with (most) late messages, you should set "kylin.stream.cube.duration" to a reasonable value; if 99.5% of messages arrive no more than 2 hours late,
        you may set kylin.stream.cube.duration to 7200.
   2. For cases such as the following, you may need to use Kylin's REST API to build a "batch segment" that replaces a "streaming segment" (a sketch follows below):
        1. You want to normalize some values, e.g. some "price" values are in dollars ($) and others in euros (€).
        2. You want to correct mistakes in the Kafka messages, e.g. some values are upper case while others are capitalized.
        3. Some messages were discarded by the Streaming Receiver because they fell outside the scope of "kylin.stream.cube.duration", but you really need them.
   So I think building a "batch segment" is only necessary when you find something wrong and want to overwrite a "streaming segment". When you do overwrite, overwrite the whole streaming segment and use an exactly matching segment range; done this way, I think the segment overlap problem will never happen.
    If I have misunderstood anything, please let me know. Thank you.
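
To make the overwrite step concrete, here is a minimal sketch of triggering such a "batch segment" build over an exactly matching segment range through Kylin's REST API. The host, credentials, cube name, and the /cubes/{cubeName}/rebuild endpoint with startTime/endTime/buildType fields are assumptions based on the generic Kylin cube-build API, so verify them against the REST documentation of your Kylin release before relying on this.

# Hypothetical sketch: replace the streaming segment "201906290000_201906290100"
# with a batch-built segment that covers exactly the same range.
from datetime import datetime, timezone

import requests

KYLIN_API = "http://localhost:7070/kylin/api"    # assumed Kylin host and port
CUBE_NAME = "my_lambda_cube"                     # hypothetical cube name


def segment_ts_to_millis(ts: str) -> int:
    """Convert a segment-range timestamp such as '201906290000' to epoch milliseconds (UTC assumed)."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)


payload = {
    "startTime": segment_ts_to_millis("201906290000"),
    "endTime": segment_ts_to_millis("201906290100"),
    "buildType": "BUILD",
}

resp = requests.put(f"{KYLIN_API}/cubes/{CUBE_NAME}/rebuild",
                    json=payload,
                    auth=("ADMIN", "KYLIN"))     # default credentials, change as needed
resp.raise_for_status()
print(resp.json())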


-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-26 19:36:44, "Andras Nagy" <an...@gmail.com> wrote:

Hi Xiaoxiang, ShaoFeng,


Thank you for your answers!


Regarding the segment overlap between batch and streaming, my point was that it seems to be different to how I understand segment overlap to work in streaming OLAP.


That is, assuming I build a "batch" segment from 2019-06-25 00:00:00.0 to 2019-06-26 00:00:00.0 (1 day).
Then if a late event comes in for the same period (e.g. event timestamp field contains 2019-06-25 12:34:56), but after this batch segment has already been built, it will not show up in the query result unless I set up a mechanism to detect the late event and trigger the rebuilding of the batch segment. This is because the results from the batch segments overwrite the results from the streaming segments.
On the other hand, for segments built by the streaming engine, my understanding is that they can have overlapping time periods and the query engine will merge the results.


I understand this behaviour is actually useful in optimizing the query path in case there were many overlapping segments created by the streaming cube build, since with the batch-built segment, the results can be served from a single segment and don't need to be merged from multiple overlapping segments.



I guess the solution here is to ensure that the batch segment is always built for a time period from which we practically don't expect late events anymore.
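
One way to pick such a period, sketched below as my own illustration rather than anything Kylin prescribes, is to batch-build only days whose end already lies beyond the assumed lateness horizon (here two hours, matching a kylin.stream.cube.duration of 7200 seconds).

from datetime import datetime, timedelta, timezone

LATENESS = timedelta(hours=2)     # assumed maximum lateness of events
now = datetime.now(timezone.utc)

# Midnight at or before (now - LATENESS): every day that ends at or before this
# point has had its full lateness window elapse, so a batch segment built for it
# should not be missing late events.
settled_end = (now - LATENESS).replace(hour=0, minute=0, second=0, microsecond=0)
segment_start = settled_end - timedelta(days=1)

print(f"safe daily batch segment: {segment_start:%Y-%m-%d %H:%M} to {settled_end:%Y-%m-%d %H:%M}")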


Best regards,
Andras




On Wed, Jun 26, 2019 at 11:40 AM Xiaoxiang Yu <hi...@126.com> wrote:

Hi Andras, Shaofeng,
  I will update this information asap. 
  About the segment overlap problem: I ran a test in my environment and everything seems to work well. Since the segment ranges created by Kylin's streaming coordinator look like "201906290000_201906290100", if you want to build a segment, I think you should use an exactly matching segment range (such as "201906290000_201906290100"), or a range that merges several existing segments (such as "201906290100_201906290300").




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-26 12:00:38, "ShaoFeng Shi" <sh...@apache.org> wrote:

Hi Xiaoxiang,


Thank you for the detailed information. Could you please record these limitations as JIRA issues (if not yet)? Thanks.


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Xiaoxiang Yu <hi...@126.com> wrote on Tue, Jun 25, 2019 at 11:42 PM:



Hi, Andras
    I am glad to see that you already have a strong understanding of Kylin's real-time OLAP. Most of your conclusions are correct; the following is my understanding:
    1)  Currently there is no documentation that talks about how to use lambda mode; we will publish some after the 3.0.0-beta release (maybe this weekend or a week later?).
    2)  The Hive table must have the same name as the streaming table, and should be located in the "default" database of Hive. The column names should match exactly and the data types should be compatible.
    3)  If you want to build a segment whose data comes from Hive, you have to build it via the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the moment.




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I need :)

Is there perhaps documentation on this? For now, I was trying to get this working 'empirically' and finally succeeded, but some of my conclusions may be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the data source)
- cube can't be built from UI (to build the historic segments from the data in hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got exception "Cannot adapt to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, the streaming data coming from kafka did not show up in the output, I guess this is what you meant by "the segments from Hive will overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras



On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org> wrote:

Hello Andras,


Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which means, you can define a fact table whose data can be from both Kafka and Hive. The only requirement is that all the cube columns appear in both Kafka data and Hive data. I think maybe that can fit your need. The cube can be built from Kafka, in the meanwhile, it can also be built from Hive, the segments from Hive will overwrite the segments from Kafka (as usually Hive data is more accurate). When querying the cube, Kylin will firstly query historical segments, and then real-time segments (adding the max-time of historical segments as the condition).
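
That query behaviour can be pictured with a purely conceptual sketch (my own illustration, not Kylin internals): historical, Hive-built segments answer first, and real-time segments are only consulted for event times at or beyond the historical watermark, which keeps the two sides from double counting.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    start: int            # segment range start, epoch millis
    end: int              # segment range end, epoch millis
    rows: List[Dict]      # toy stand-in for the segment's stored data


def query(historical: List[Segment], realtime: List[Segment],
          pred: Callable[[Dict], bool]) -> List[Dict]:
    # The max end time of the historical segments acts as the watermark.
    watermark = max((s.end for s in historical), default=0)
    out = [r for s in historical for r in s.rows if pred(r)]
    out += [r for s in realtime for r in s.rows
            if pred(r) and r["event_time"] >= watermark]
    return out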




Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:

Dear Ma,


Thanks for your reply.


Slightly related to my original question on the hybrid model, I was wondering if it's possible to combine a batch and a streaming cube. I realized this is not possible, as a hybrid model can only be created from cubes of the same model (and a model points to either a batch or a streaming datasource).


The usecase would be this:
- we have a large amount of streaming data in Kafka that we would like to process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in the cubes (e.g. introduce a new metric or dimension which has been present in the events, but not in the cube definition), we can only reprocess a few days worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in the cubes, we can recalculate them for the historic data as well


Is there a way to achieve this with current Kylin? (Without implementing a custom query layer that combines the two cubes.)


Best regards,
Andras




















On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:

Hi Andras,


Currently Kylin doesn't support consuming from specified offsets; it only supports consuming from the start offset or the latest offset. If you want to consume from the start offset, you need to set the configuration kylin.stream.consume.offsets.latest to false on the cube's overrides page.


If you do need to start from specified offsets, please create a JIRA request, but I think it is hard for a user to know what offsets should be set for all partitions.
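
For reference, the override would look roughly like this on the cube's "Configuration Overrides" page; only the property key and value come from this thread, while the surrounding map name mirrors the cube descriptor and is an assumption.

# Hypothetical view of the cube-level override map; add the key/value pair under
# the cube's "Configuration Overrides" in the designer.
override_kylin_properties = {
    # consume from the earliest available Kafka offsets instead of the latest
    "kylin.stream.consume.offsets.latest": "false",
}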


At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com> wrote:

Dear Ma,


Thank you very much!


>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please point me where to do this in practice, or point me to what I should read? (I haven't found it on the cube designer UI - perhaps this is something that's only available on the API?)


Many thanks,
Andras






On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

Hi Andras,
1) Yes, you can specify a configuration on the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is better to use that version or later; I'm sure versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x works.




Ma Gang
Email: mg4work@163.com

Signature customized by NetEase Mail Master

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras




 

Re: Kylin streaming questions

Posted by Xiaoxiang Yu <hi...@126.com>.
Hi,
   As far as I know, the "best practice" (in my mind) for lambda mode looks like this. Here I use "batch segment" to refer to a segment whose source is Hive, and "streaming segment" for one whose source is Kafka.
   1. To deal with (most) late messages, you should set "kylin.stream.cube.duration" to a reasonable value; if 99.5% of messages arrive no more than 2 hours late,
        you may set kylin.stream.cube.duration to 7200.
   2. For cases such as the following, you may need to use Kylin's REST API to build a "batch segment" that replaces a "streaming segment":
        1. You want to normalize some values, e.g. some "price" values are in dollars ($) and others in euros (€).
        2. You want to correct mistakes in the Kafka messages, e.g. some values are upper case while others are capitalized.
        3. Some messages were discarded by the Streaming Receiver because they fell outside the scope of "kylin.stream.cube.duration", but you really need them.
   So I think building a "batch segment" is only necessary when you find something wrong and want to overwrite a "streaming segment". When you do overwrite, overwrite the whole streaming segment and use an exactly matching segment range; done this way, I think the segment overlap problem will never happen.
    If I have misunderstood anything, please let me know. Thank you.


-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-26 19:36:44, "Andras Nagy" <an...@gmail.com> wrote:

Hi Xiaoxiang, ShaoFeng,


Thank you for your answers!


Regarding the segment overlap between batch and streaming, my point was that it seems to be different to how I understand segment overlap to work in streaming OLAP.


That is, assuming I build a "batch" segment from 2019-06-25 00:00:00.0 to 2019-06-26 00:00:00.0 (1 day).
Then if a late event comes in for the same period (e.g. event timestamp field contains 2019-06-25 12:34:56), but after this batch segment has already been built, it will not show up in the query result unless I set up a mechanism to detect the late event and trigger the rebuilding of the batch segment. This is because the results from the batch segments overwrite the results from the streaming segments.
On the other hand, for segments built by the streaming engine, my understanding is that they can have overlapping time periods and the query engine will merge the results.


I understand this behaviour is actually useful in optimizing the query path in case there were many overlapping segments created by the streaming cube build, since with the batch-built segment, the results can be served from a single segment and don't need to be merged from multiple overlapping segments.



I guess the solution here is to ensure that the batch segment is always built for a time period from which we practically don't expect late events anymore.


Best regards,
Andras




On Wed, Jun 26, 2019 at 11:40 AM Xiaoxiang Yu <hi...@126.com> wrote:

Hi Andras, Shaofeng,
  I will update this information asap. 
  About the segment overlap problem: I ran a test in my environment and everything seems to work well. Since the segment ranges created by Kylin's streaming coordinator look like "201906290000_201906290100", if you want to build a segment, I think you should use an exactly matching segment range (such as "201906290000_201906290100"), or a range that merges several existing segments (such as "201906290100_201906290300").




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-26 12:00:38, "ShaoFeng Shi" <sh...@apache.org> wrote:

Hi Xiaoxiang,


Thank you for the detailed information. Could you please record these limitations as JIRA issues (if not yet)? Thanks.


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Xiaoxiang Yu <hi...@126.com> wrote on Tue, Jun 25, 2019 at 11:42 PM:



Hi, Andras
    I am glad to see that you already have a strong understanding of Kylin's real-time OLAP. Most of your conclusions are correct; the following is my understanding:
    1)  Currently there is no documentation that talks about how to use lambda mode; we will publish some after the 3.0.0-beta release (maybe this weekend or a week later?).
    2)  The Hive table must have the same name as the streaming table, and should be located in the "default" database of Hive. The column names should match exactly and the data types should be compatible.
    3)  If you want to build a segment whose data comes from Hive, you have to build it via the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the moment.




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I need :)

Is there perhaps documentation on this? For now, I was trying to get this working 'empirically' and finally succeeded, but some of my conclusions may be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the data source)
- cube can't be built from UI (to build the historic segments from the data in hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got exception "Cannot adapt to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, the streaming data coming from kafka did not show up in the output, I guess this is what you meant by "the segments from Hive will overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras



On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org> wrote:

Hello Andras,


Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which means, you can define a fact table whose data can be from both Kafka and Hive. The only requirement is that all the cube columns appear in both Kafka data and Hive data. I think maybe that can fit your need. The cube can be built from Kafka, in the meanwhile, it can also be built from Hive, the segments from Hive will overwrite the segments from Kafka (as usually Hive data is more accurate). When querying the cube, Kylin will firstly query historical segments, and then real-time segments (adding the max-time of historical segments as the condition).




Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:

Dear Ma,


Thanks for your reply.


Slightly related to my original question on the hybrid model, I was wondering if it's possible to combine a batch and a streaming cube. I realized this is not possible, as a hybrid model can only be created from cubes of the same model (and a model points to either a batch or a streaming datasource).


The usecase would be this:
- we have a large amount of streaming data in Kafka that we would like to process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in the cubes (e.g. introduce a new metric or dimension which has been present in the events, but not in the cube definition), we can only reprocess a few days worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in the cubes, we can recalculate them for the historic data as well


Is there a way to achieve this with current Kylin? (Without implementing a custom query layer that combines the two cubes.)


Best regards,
Andras




















On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:

Hi Andras,


Currently Kylin doesn't support consuming from specified offsets; it only supports consuming from the start offset or the latest offset. If you want to consume from the start offset, you need to set the configuration kylin.stream.consume.offsets.latest to false on the cube's overrides page.


If you do need to start from specified offsets, please create a JIRA request, but I think it is hard for a user to know what offsets should be set for all partitions.


At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com> wrote:

Dear Ma,


Thank you very much!


>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please point me where to do this in practice, or point me to what I should read? (I haven't found it on the cube designer UI - perhaps this is something that's only available on the API?)


Many thanks,
Andras






On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

Hi Andras,
1) Yes, you can specify a configuration on the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is better to use that version or later; I'm sure versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x works.




Ma Gang
Email: mg4work@163.com

Signature customized by NetEase Mail Master

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras




 

Re: Kylin streaming questions

Posted by Andras Nagy <an...@gmail.com>.
Hi Xiaoxiang, ShaoFeng,

Thank you for your answers!

Regarding the segment overlap between batch and streaming, my point was
that it seems to be different to how I understand segment overlap to work
in streaming OLAP.

That is, assuming I build a "batch" segment from 2019-06-25 00:00:00.0 to
2019-06-26 00:00:00.0 (1 day).
Then if a late event comes in for the same period (e.g. event timestamp
field contains 2019-06-25 12:34:56), but after this batch segment has
already been built, it will not show up in the query result unless I set up
a mechanism to detect the late event and trigger the rebuilding of the
batch segment. This is because the results from the batch segments
overwrite the results from the streaming segments.
On the other hand, for segments built by the streaming engine, my
understanding is that they can have overlapping time periods and the query
engine will merge the results.

I understand this behaviour is actually useful in optimizing the query path
in case there were many overlapping segments created by the streaming cube
build, since with the batch-built segment, the results can be served from a
single segment and don't need to be merged from multiple overlapping
segments.

I guess the solution here is to ensure that the batch segment is always
built for a time period from which we practically don't expect late events
anymore.

Best regards,
Andras


On Wed, Jun 26, 2019 at 11:40 AM Xiaoxiang Yu <hi...@126.com> wrote:

> Hi Andras, Shaofeng,
>   I will update this information asap.
>   About segment overlaping problem, I have a test in my env, looks like
> everything works well. Since the segment range created by kylin’s streaming
> coordinator is something like "201906290000_201906290100" , if you want to
> build a segment, I think you should use the exact match segment range (such
> as "201906290000_201906290100"), or merge multi exist segments range (such
> as "201906290100_201906290300") .
>
>
> *-----------------*
> *-----------------*
> *Best wishes to you ! *
> *From :**Xiaoxiang Yu*
>
> At 2019-06-26 12:00:38, "ShaoFeng Shi" <sh...@apache.org> wrote:
>
> Hi Xiaoxiang,
>
> Thank you for the detailed information. Could you please record these
> limitations as JIRA issues (if not yet)? Thanks.
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>
>
> Xiaoxiang Yu <hi...@126.com> wrote on Tue, Jun 25, 2019 at 11:42 PM:
>
>>
>> Hi, Andras
>>     I am glad to see that you have have a strong understanding with
>> Kylin's Realtime OLAP. Most of them are correct, the following is my
>> understanding:
>>     1)  Currently, there is no such documentation which talk about how to
>> use lambda mode, we will publish one after 3.0.0-beta release (maybe this
>> wekend or after a week?).
>>     2)  Hive table must have the same name as the streaming table , and
>> should be locate at "default" namespace of hive. The column name should
>> match exactly and data type should be compatible.
>>     3)  If you want to build segment which data from hive,  you have to
>> built by rest api.
>>     4)  Cube build engine must be mapreduce, spark is not supported at
>> the moment.
>>
>>
>> *-----------------*
>> *-----------------*
>> *Best wishes to you ! *
>> *From :**Xiaoxiang Yu*
>>
>> At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com>
>> wrote:
>>
>> Hi ShaoFeng,
>>
>> Thanks a lot for the pointer on the lambda mode, yes, that's exactly what
>> I need :)
>>
>> Is there perhaps documentation on this? For now, I was trying to get this
>> working 'empirically' and finally succeeded, but some of my conclusions may
>> be wrong. This is what I concluded:
>>
>> - hive table must have the same name as the streaming table (name given
>> to the data source)
>> - cube can't be built from UI (to build the historic segments from the
>> data in hive), but it can be built using the REST API
>> - cube build engine must be mapreduce. For Spark as build engine I got
>> exception "Cannot adapt to interface
>> org.apache.kylin.engine.spark.ISparkOutput"
>> - endTime must be non-overlapping with the streaming data. When I had
>> overlap, the streaming data coming from kafka did not show up in the
>> output, I guess this is what you meant by "the segments from Hive will
>> overwrite the segments from Kafka".
>>
>> Are these correct conclusions? Is there anything else I should be aware
>> of?
>>
>> Many thanks,
>> Andras
>>
>> On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org>
>> wrote:
>>
>>> Hello Andras,
>>>
>>> Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
>>> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
>>> means, you can define a fact table whose data can be from both Kafka and
>>> Hive. The only requirement is that all the cube columns appear in both
>>> Kafka data and Hive data. I think maybe that can fit your need. The cube
>>> can be built from Kafka, in the meanwhile, it can also be built from Hive,
>>> the segments from Hive will overwrite the segments from Kafka (as usually
>>> Hive data is more accurate). When querying the cube, Kylin will firstly
>>> query historical segments, and then real-time segments (adding the max-time
>>> of historical segments as the condition).
>>>
>>>
>>> Best regards,
>>>
>>> Shaofeng Shi 史少锋
>>> Apache Kylin PMC
>>> Email: shaofengshi@apache.org
>>>
>>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>>> Join Kylin user mail group: user-subscribe@kylin.apache.org
>>> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>>>
>>>
>>>
>>>
>>> Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>>>
>>>> Dear Ma,
>>>>
>>>> Thanks for your reply.
>>>>
>>>> Slightly related to my original question on the hybrid model, I was
>>>> wondering if it's possible to combine a batch and a streaming cube. I
>>>> realized this is not possible, as a hybrid model can only be created from
>>>> cubes of the same model (and a model points to either a batch or a
>>>> streaming datasource).
>>>>
>>>> The usecase would be this:
>>>> - we have a large amount of streaming data in Kafka that we would like
>>>> to process with Kylin streaming
>>>> - Kafka retention is only a few days, so if we need to change anything
>>>> in the cubes (e.g. introduce a new metric or dimension which has been
>>>> present in the events, but not in the cube definition), we can only
>>>> reprocess a few days worth of data in the streaming model
>>>> - the raw events are also written to a data lake for long-term storage
>>>> - the data written to the data lake could be used to feed the historic
>>>> data into a batch kylin model (and cubes)
>>>> - I'm looking for a way to combine these, so if we want to change
>>>> anything in the cubes, we can recalculate them for the historic data as well
>>>>
>>>> Is there a way to achieve this with current Kylin? (Without
>>>> implementing a custom query layer that combines the two cubes.)
>>>>
>>>> Best regards,
>>>> Andras
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:
>>>>
>>>>> Hi Andras,
>>>>>
>>>>> Currently it doesn't support consume from specified offsets, only
>>>>> support consume from startOffset or latestOffset, if you want to consume
>>>>> from startOffset, you need to set the
>>>>> configuration: kylin.stream.consume.offsets.latest to false in the cube's
>>>>> overrides page.
>>>>>
>>>>> If you do need to start from specified offsets, please create a jira
>>>>> request, but I think it is hard for user to know what's the offsets should
>>>>> be set for all partitions.
>>>>>
>>>>> At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Dear Ma,
>>>>>
>>>>> Thank you very much!
>>>>>
>>>>> >1)yes, you can specify a configuration in the new cube, to consume
>>>>> data from start offset
>>>>> That is, an offset value for each partition of the topic? That would
>>>>> be good - could you please point me where to do this in practice, or point
>>>>> me to what I should read? (I haven't found it on the cube designer UI -
>>>>> perhaps this is something that's only available on the API?)
>>>>>
>>>>> Many thanks,
>>>>> Andras
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:
>>>>>
>>>>>> Hi Andras,
>>>>>> 1)yes, you can specify a configuration in the new cube, to consume
>>>>>> data from start offset
>>>>>>
>>>>>> 2)It should work, but I haven't tested it yet
>>>>>>
>>>>>> 3)as I remember, currently we use Kafka 1.0 client library, so it is
>>>>>> better to use the version later, I'm sure that the version before 0.9.0
>>>>>> cannot work, but not sure 0.9.x can work or not
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ma Gang
>>>>>> Email: mg4work@163.com
>>>>>>
>>>>>> <https://maas.mail.163.com/dashi-web-extend/html/proSignature.html?ftlId=1&name=Ma+Gang&uid=mg4work%40163.com&iconUrl=https%3A%2F%2Fmail-online.nosdn.127.net%2Fqiyelogo%2FdefaultAvatar.png&items=%5B%22%E9%82%AE%E7%AE%B1%EF%BC%9Amg4work%40163.com%22%5D>
>>>>>>
>>>>>> Signature customized by NetEase Mail Master <https://mail.163.com/dashi/dlpro.html?from=mail88>
>>>>>>
>>>>>> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com>
>>>>>> wrote:
>>>>>> Greetings,
>>>>>>
>>>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>>>> implementation.
>>>>>>
>>>>>> 1) Is there a way to have data reprocessed from kafka? E.g. I change
>>>>>> a cube definition and drop the cube (or add a new cube definition) and want
>>>>>> to have data that is still available on kafka to be reprocessed to build
>>>>>> the changed cube (or new cube)? Is this possible?
>>>>>>
>>>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>>>> cubes)?
>>>>>>
>>>>>> 3) What is minimum kafka version required? The tutorial asks to
>>>>>> install Kafka 1.0, is this the minimum required version?
>>>>>>
>>>>>> Thank you very much,
>>>>>> Andras
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>

Re: Kylin streaming questions

Posted by Xiaoxiang Yu <hi...@126.com>.
Hi Andras, Shaofeng,
  I will update this information asap. 
  About the segment overlap problem: I ran a test in my environment and everything seems to work well. Since the segment ranges created by Kylin's streaming coordinator look like "201906290000_201906290100", if you want to build a segment, I think you should use an exactly matching segment range (such as "201906290000_201906290100"), or a range that merges several existing segments (such as "201906290100_201906290300").




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-26 12:00:38, "ShaoFeng Shi" <sh...@apache.org> wrote:

Hi Xiaoxiang,


Thank you for the detailed information. Could you please record these limitations as JIRA issues (if not yet)? Thanks.


Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Xiaoxiang Yu <hi...@126.com> wrote on Tue, Jun 25, 2019 at 11:42 PM:



Hi, Andras
    I am glad to see that you already have a strong understanding of Kylin's real-time OLAP. Most of your conclusions are correct; the following is my understanding:
    1)  Currently there is no documentation that talks about how to use lambda mode; we will publish some after the 3.0.0-beta release (maybe this weekend or a week later?).
    2)  The Hive table must have the same name as the streaming table, and should be located in the "default" database of Hive. The column names should match exactly and the data types should be compatible.
    3)  If you want to build a segment whose data comes from Hive, you have to build it via the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the moment.




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I need :)

Is there perhaps documentation on this? For now, I was trying to get this working 'empirically' and finally succeeded, but some of my conclusions may be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the data source)
- cube can't be built from UI (to build the historic segments from the data in hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got exception "Cannot adapt to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, the streaming data coming from kafka did not show up in the output, I guess this is what you meant by "the segments from Hive will overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras



On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org> wrote:

Hello Andras,


Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which means, you can define a fact table whose data can be from both Kafka and Hive. The only requirement is that all the cube columns appear in both Kafka data and Hive data. I think maybe that can fit your need. The cube can be built from Kafka, in the meanwhile, it can also be built from Hive, the segments from Hive will overwrite the segments from Kafka (as usually Hive data is more accurate). When querying the cube, Kylin will firstly query historical segments, and then real-time segments (adding the max-time of historical segments as the condition).




Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:

Dear Ma,


Thanks for your reply.


Slightly related to my original question on the hybrid model, I was wondering if it's possible to combine a batch and a streaming cube. I realized this is not possible, as a hybrid model can only be created from cubes of the same model (and a model points to either a batch or a streaming datasource).


The usecase would be this:
- we have a large amount of streaming data in Kafka that we would like to process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in the cubes (e.g. introduce a new metric or dimension which has been present in the events, but not in the cube definition), we can only reprocess a few days worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in the cubes, we can recalculate them for the historic data as well


Is there a way to achieve this with current Kylin? (Without implementing a custom query layer that combines the two cubes.)


Best regards,
Andras




















On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:

Hi Andras,


Currently Kylin doesn't support consuming from specified offsets; it only supports consuming from the start offset or the latest offset. If you want to consume from the start offset, you need to set the configuration kylin.stream.consume.offsets.latest to false on the cube's overrides page.


If you do need to start from specified offsets, please create a JIRA request, but I think it is hard for a user to know what offsets should be set for all partitions.


At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com> wrote:

Dear Ma,


Thank you very much!


>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please point me where to do this in practice, or point me to what I should read? (I haven't found it on the cube designer UI - perhaps this is something that's only available on the API?)


Many thanks,
Andras






On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

Hi Andras,
1) Yes, you can specify a configuration on the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is better to use that version or later; I'm sure versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x works.




Ma Gang
Email: mg4work@163.com

Signature customized by NetEase Mail Master

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras




 

Re: Re: Kylin streaming questions

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Xiaoxiang,

Thank you for the detailed information. Could you please record these
limitations as JIRA issues (if not yet)? Thanks.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Xiaoxiang Yu <hi...@126.com> wrote on Tue, Jun 25, 2019 at 11:42 PM:

>
> Hi, Andras
>     I am glad to see that you have have a strong understanding with
> Kylin's Realtime OLAP. Most of them are correct, the following is my
> understanding:
>     1)  Currently, there is no such documentation which talk about how to
> use lambda mode, we will publish one after 3.0.0-beta release (maybe this
> wekend or after a week?).
>     2)  Hive table must have the same name as the streaming table , and
> should be locate at "default" namespace of hive. The column name should
> match exactly and data type should be compatible.
>     3)  If you want to build segment which data from hive,  you have to
> built by rest api.
>     4)  Cube build engine must be mapreduce, spark is not supported at the
> moment.
>
>
> *-----------------*
> *-----------------*
> *Best wishes to you ! *
> *From :**Xiaoxiang Yu*
>
> At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com>
> wrote:
>
> Hi ShaoFeng,
>
> Thanks a lot for the pointer on the lambda mode, yes, that's exactly what
> I need :)
>
> Is there perhaps documentation on this? For now, I was trying to get this
> working 'empirically' and finally succeeded, but some of my conclusions may
> be wrong. This is what I concluded:
>
> - hive table must have the same name as the streaming table (name given to
> the data source)
> - cube can't be built from UI (to build the historic segments from the
> data in hive), but it can be built using the REST API
> - cube build engine must be mapreduce. For Spark as build engine I got
> exception "Cannot adapt to interface
> org.apache.kylin.engine.spark.ISparkOutput"
> - endTime must be non-overlapping with the streaming data. When I had
> overlap, the streaming data coming from kafka did not show up in the
> output, I guess this is what you meant by "the segments from Hive will
> overwrite the segments from Kafka".
>
> Are these correct conclusions? Is there anything else I should be aware of?
>
> Many thanks,
> Andras
>
> On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org>
> wrote:
>
>> Hello Andras,
>>
>> Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
>> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
>> means, you can define a fact table whose data can be from both Kafka and
>> Hive. The only requirement is that all the cube columns appear in both
>> Kafka data and Hive data. I think maybe that can fit your need. The cube
>> can be built from Kafka, in the meanwhile, it can also be built from Hive,
>> the segments from Hive will overwrite the segments from Kafka (as usually
>> Hive data is more accurate). When querying the cube, Kylin will firstly
>> query historical segments, and then real-time segments (adding the max-time
>> of historical segments as the condition).
>>
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Email: shaofengshi@apache.org
>>
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: user-subscribe@kylin.apache.org
>> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>>
>>
>>
>>
>> Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>>
>>> Dear Ma,
>>>
>>> Thanks for your reply.
>>>
>>> Slightly related to my original question on the hybrid model, I was
>>> wondering if it's possible to combine a batch and a streaming cube. I
>>> realized this is not possible, as a hybrid model can only be created from
>>> cubes of the same model (and a model points to either a batch or a
>>> streaming datasource).
>>>
>>> The usecase would be this:
>>> - we have a large amount of streaming data in Kafka that we would like
>>> to process with Kylin streaming
>>> - Kafka retention is only a few days, so if we need to change anything
>>> in the cubes (e.g. introduce a new metric or dimension which has been
>>> present in the events, but not in the cube definition), we can only
>>> reprocess a few days worth of data in the streaming model
>>> - the raw events are also written to a data lake for long-term storage
>>> - the data written to the data lake could be used to feed the historic
>>> data into a batch kylin model (and cubes)
>>> - I'm looking for a way to combine these, so if we want to change
>>> anything in the cubes, we can recalculate them for the historic data as well
>>>
>>> Is there a way to achieve this with current Kylin? (Without implementing
>>> a custom query layer that combines the two cubes.)
>>>
>>> Best regards,
>>> Andras
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:
>>>
>>>> Hi Andras,
>>>>
>>>> Currently it doesn't support consume from specified offsets, only
>>>> support consume from startOffset or latestOffset, if you want to consume
>>>> from startOffset, you need to set the
>>>> configuration: kylin.stream.consume.offsets.latest to false in the cube's
>>>> overrides page.
>>>>
>>>> If you do need to start from specified offsets, please create a jira
>>>> request, but I think it is hard for user to know what's the offsets should
>>>> be set for all partitions.
>>>>
>>>> At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com>
>>>> wrote:
>>>>
>>>> Dear Ma,
>>>>
>>>> Thank you very much!
>>>>
>>>> >1)yes, you can specify a configuration in the new cube, to consume
>>>> data from start offset
>>>> That is, an offset value for each partition of the topic? That would be
>>>> good - could you please point me where to do this in practice, or point me
>>>> to what I should read? (I haven't found it on the cube designer UI -
>>>> perhaps this is something that's only available on the API?)
>>>>
>>>> Many thanks,
>>>> Andras
>>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:
>>>>
>>>>> Hi Andras,
>>>>> 1)yes, you can specify a configuration in the new cube, to consume
>>>>> data from start offset
>>>>>
>>>>> 2)It should work, but I haven't tested it yet
>>>>>
>>>>> 3)as I remember, currently we use Kafka 1.0 client library, so it is
>>>>> better to use the version later, I'm sure that the version before 0.9.0
>>>>> cannot work, but not sure 0.9.x can work or not
>>>>>
>>>>>
>>>>>
>>>>> Ma Gang
>>>>> Email: mg4work@163.com
>>>>>
>>>>> <https://maas.mail.163.com/dashi-web-extend/html/proSignature.html?ftlId=1&name=Ma+Gang&uid=mg4work%40163.com&iconUrl=https%3A%2F%2Fmail-online.nosdn.127.net%2Fqiyelogo%2FdefaultAvatar.png&items=%5B%22%E9%82%AE%E7%AE%B1%EF%BC%9Amg4work%40163.com%22%5D>
>>>>>
>>>>> Signature customized by NetEase Mail Master <https://mail.163.com/dashi/dlpro.html?from=mail88>
>>>>>
>>>>> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com> wrote:
>>>>> Greetings,
>>>>>
>>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>>> implementation.
>>>>>
>>>>> 1) Is there a way to have data reprocessed from kafka? E.g. I change a
>>>>> cube definition and drop the cube (or add a new cube definition) and want
>>>>> to have data that is still available on kafka to be reprocessed to build
>>>>> the changed cube (or new cube)? Is this possible?
>>>>>
>>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>>> cubes)?
>>>>>
>>>>> 3) What is minimum kafka version required? The tutorial asks to
>>>>> install Kafka 1.0, is this the minimum required version?
>>>>>
>>>>> Thank you very much,
>>>>> Andras
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>

Re: Re: Kylin streaming questions

Posted by Xiaoxiang Yu <hi...@126.com>.

Hi, Andras
    I am glad to see that you already have a strong understanding of Kylin's real-time OLAP. Most of your conclusions are correct; the following is my understanding:
    1)  Currently there is no documentation that talks about how to use lambda mode; we will publish some after the 3.0.0-beta release (maybe this weekend or a week later?).
    2)  The Hive table must have the same name as the streaming table, and should be located in the "default" database of Hive. The column names should match exactly and the data types should be compatible.
    3)  If you want to build a segment whose data comes from Hive, you have to build it via the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the moment.




-----------------
-----------------
Best wishes to you ! 
From: Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <an...@gmail.com> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I need :)

Is there perhaps documentation on this? For now, I was trying to get this working 'empirically' and finally succeeded, but some of my conclusions may be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the data source)
- cube can't be built from UI (to build the historic segments from the data in hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got exception "Cannot adapt to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, the streaming data coming from kafka did not show up in the output, I guess this is what you meant by "the segments from Hive will overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras



On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org> wrote:

Hello Andras,


Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which means, you can define a fact table whose data can be from both Kafka and Hive. The only requirement is that all the cube columns appear in both Kafka data and Hive data. I think maybe that can fit your need. The cube can be built from Kafka, in the meanwhile, it can also be built from Hive, the segments from Hive will overwrite the segments from Kafka (as usually Hive data is more accurate). When querying the cube, Kylin will firstly query historical segments, and then real-time segments (adding the max-time of historical segments as the condition).




Best regards,


Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org


Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org









Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:

Dear Ma,


Thanks for your reply.


Slightly related to my original question on the hybrid model, I was wondering if it's possible to combine a batch and a streaming cube. I realized this is not possible, as a hybrid model can only be created from cubes of the same model (and a model points to either a batch or a streaming datasource).


The usecase would be this:
- we have a large amount of streaming data in Kafka that we would like to process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in the cubes (e.g. introduce a new metric or dimension which has been present in the events, but not in the cube definition), we can only reprocess a few days worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in the cubes, we can recalculate them for the historic data as well


Is there a way to achieve this with current Kylin? (Without implementing a custom query layer that combines the two cubes.)


Best regards,
Andras




















On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:

Hi Andras,


Currently Kylin doesn't support consuming from specified offsets; it only supports consuming from the start offset or the latest offset. If you want to consume from the start offset, you need to set the configuration kylin.stream.consume.offsets.latest to false on the cube's overrides page.


If you do need to start from specified offsets, please create a JIRA request, but I think it is hard for a user to know what offsets should be set for all partitions.


At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com> wrote:

Dear Ma,


Thank you very much!


>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please point me where to do this in practice, or point me to what I should read? (I haven't found it on the cube designer UI - perhaps this is something that's only available on the API?)


Many thanks,
Andras






On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

Hi Andras,
1) Yes, you can specify a configuration on the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is better to use that version or later; I'm sure versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x works.




Ma Gang
Email: mg4work@163.com

Signature customized by NetEase Mail Master

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras




 

Re: Re: Kylin streaming questions

Posted by Andras Nagy <an...@gmail.com>.
Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I
need :)

Is there perhaps documentation on this? For now, I was trying to get this
working 'empirically' and finally succeeded, but some of my conclusions may
be wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to
the data source)
- cube can't be built from UI (to build the historic segments from the data
in hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got
exception "Cannot adapt to interface
org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had
overlap, the streaming data coming from kafka did not show up in the
output, I guess this is what you meant by "the segments from Hive will
overwrite the segments from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras

On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <sh...@apache.org> wrote:

> Hello Andras,
>
> Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
> means, you can define a fact table whose data can be from both Kafka and
> Hive. The only requirement is that all the cube columns appear in both
> Kafka data and Hive data. I think maybe that can fit your need. The cube
> can be built from Kafka, in the meanwhile, it can also be built from Hive,
> the segments from Hive will overwrite the segments from Kafka (as usually
> Hive data is more accurate). When querying the cube, Kylin will firstly
> query historical segments, and then real-time segments (adding the max-time
> of historical segments as the condition).
>
>
> Best regards,
>
> Shaofeng Shi 史少锋
> Apache Kylin PMC
> Email: shaofengshi@apache.org
>
> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
> Join Kylin user mail group: user-subscribe@kylin.apache.org
> Join Kylin dev mail group: dev-subscribe@kylin.apache.org
>
>
>
>
> Andras Nagy <an...@gmail.com> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>
>> Dear Ma,
>>
>> Thanks for your reply.
>>
>> Slightly related to my original question on the hybrid model, I was
>> wondering if it's possible to combine a batch and a streaming cube. I
>> realized this is not possible, as a hybrid model can only be created from
>> cubes of the same model (and a model points to either a batch or a
>> streaming datasource).
>>
>> The usecase would be this:
>> - we have a large amount of streaming data in Kafka that we would like to
>> process with Kylin streaming
>> - Kafka retention is only a few days, so if we need to change anything in
>> the cubes (e.g. introduce a new metric or dimension which has been present
>> in the events, but not in the cube definition), we can only reprocess a few
>> days worth of data in the streaming model
>> - the raw events are also written to a data lake for long-term storage
>> - the data written to the data lake could be used to feed the historic
>> data into a batch kylin model (and cubes)
>> - I'm looking for a way to combine these, so if we want to change
>> anything in the cubes, we can recalculate them for the historic data as well
>>
>> Is there a way to achieve this with current Kylin? (Without implementing
>> a custom query layer that combines the two cubes.)
>>
>> Best regards,
>> Andras
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:
>>
>>> Hi Andras,
>>>
>>> Currently it doesn't support consume from specified offsets, only
>>> support consume from startOffset or latestOffset, if you want to consume
>>> from startOffset, you need to set the
>>> configuration: kylin.stream.consume.offsets.latest to false in the cube's
>>> overrides page.
>>>
>>> If you do need to start from specified offsets, please create a jira
>>> request, but I think it is hard for user to know what's the offsets should
>>> be set for all partitions.
>>>
>>> At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com>
>>> wrote:
>>>
>>> Dear Ma,
>>>
>>> Thank you very much!
>>>
>>> >1)yes, you can specify a configuration in the new cube, to consume
>>> data from start offset
>>> That is, an offset value for each partition of the topic? That would be
>>> good - could you please point me where to do this in practice, or point me
>>> to what I should read? (I haven't found it on the cube designer UI -
>>> perhaps this is something that's only available on the API?)
>>>
>>> Many thanks,
>>> Andras
>>>
>>>
>>>
>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:
>>>
>>>> Hi Andras,
>>>> 1)yes, you can specify a configuration in the new cube, to consume data
>>>> from start offset
>>>>
>>>> 2)It should work, but I haven't tested it yet
>>>>
>>>> 3)as I remember, currently we use Kafka 1.0 client library, so it is
>>>> better to use the version later, I'm sure that the version before 0.9.0
>>>> cannot work, but not sure 0.9.x can work or not
>>>>
>>>>
>>>>
>>>> Ma Gang
>>>> Email: mg4work@163.com
>>>>
>>>>
>>>> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com> wrote:
>>>> Greetings,
>>>>
>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>> implementation.
>>>>
>>>> 1) Is there a way to have data reprocessed from kafka? E.g. I change a
>>>> cube definition and drop the cube (or add a new cube definition) and want
>>>> to have data that is still available on kafka to be reprocessed to build
>>>> the changed cube (or new cube)? Is this possible?
>>>>
>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>> cubes)?
>>>>
>>>> 3) What is minimum kafka version required? The tutorial asks to install
>>>> Kafka 1.0, is this the minimum required version?
>>>>
>>>> Thank you very much,
>>>> Andras
>>>>
>>>>
>>>
>>>
>>>
>>

Re: Re: Kylin streaming questions

Posted by ShaoFeng Shi <sh...@apache.org>.
Hello Andras,

Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in
https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
means you can define a fact table whose data comes from both Kafka and
Hive. The only requirement is that all the cube columns appear in both
the Kafka data and the Hive data. I think that may fit your need. The
cube can be built from Kafka and, in the meantime, it can also be built
from Hive; the segments from Hive will overwrite the segments from Kafka
(as Hive data is usually more accurate). When querying the cube, Kylin
will first query the historical segments and then the real-time segments
(adding the max time of the historical segments as the condition).
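
For example, the query side needs nothing special: you issue one SQL
statement against the fact table and Kylin routes it across the historical
and real-time segments internally. A minimal sketch using the query REST
API (host, project, table and column names are placeholders, assuming the
standard /kylin/api/query endpoint and default credentials):

    import requests

    KYLIN = "http://localhost:7070"   # placeholder Kylin host/port

    # One SQL statement over the lambda fact table; Kylin answers it from
    # the historical (Hive-built) segments and then the real-time segments,
    # using the max time of the historical segments as the boundary.
    query = {
        "sql": "select minute_start, count(*) "
               "from streaming_sales group by minute_start",
        "project": "my_project",      # placeholder project name
        "acceptPartial": False,
        "limit": 500,
    }

    resp = requests.post(
        f"{KYLIN}/kylin/api/query",
        json=query,
        auth=("ADMIN", "KYLIN"),
    )
    resp.raise_for_status()
    print(resp.json())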


Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: shaofengshi@apache.org

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: user-subscribe@kylin.apache.org
Join Kylin dev mail group: dev-subscribe@kylin.apache.org




Andras Nagy <an...@gmail.com> 于2019年6月24日周一 下午11:29写道:

> Dear Ma,
>
> Thanks for your reply.
>
> Slightly related to my original question on the hybrid model, I was
> wondering if it's possible to combine a batch and a streaming cube. I
> realized this is not possible, as a hybrid model can only be created from
> cubes of the same model (and a model points to either a batch or a
> streaming datasource).
>
> The usecase would be this:
> - we have a large amount of streaming data in Kafka that we would like to
> process with Kylin streaming
> - Kafka retention is only a few days, so if we need to change anything in
> the cubes (e.g. introduce a new metric or dimension which has been present
> in the events, but not in the cube definition), we can only reprocess a few
> days worth of data in the streaming model
> - the raw events are also written to a data lake for long-term storage
> - the data written to the data lake could be used to feed the historic
> data into a batch kylin model (and cubes)
> - I'm looking for a way to combine these, so if we want to change anything
> in the cubes, we can recalculate them for the historic data as well
>
> Is there a way to achieve this with current Kylin? (Without implementing a
> custom query layer that combines the two cubes.)
>
> Best regards,
> Andras
>
>
>
>
>
>
>
>
>
>
> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:
>
>> Hi Andras,
>>
>> Currently it doesn't support consume from specified offsets, only support
>> consume from startOffset or latestOffset, if you want to consume from
>> startOffset, you need to set the
>> configuration: kylin.stream.consume.offsets.latest to false in the cube's
>> overrides page.
>>
>> If you do need to start from specified offsets, please create a jira
>> request, but I think it is hard for user to know what's the offsets should
>> be set for all partitions.
>>
>> At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com>
>> wrote:
>>
>> Dear Ma,
>>
>> Thank you very much!
>>
>> >1)yes, you can specify a configuration in the new cube, to consume data
>> from start offset
>> That is, an offset value for each partition of the topic? That would be
>> good - could you please point me where to do this in practice, or point me
>> to what I should read? (I haven't found it on the cube designer UI -
>> perhaps this is something that's only available on the API?)
>>
>> Many thanks,
>> Andras
>>
>>
>>
>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:
>>
>>> Hi Andras,
>>> 1)yes, you can specify a configuration in the new cube, to consume data
>>> from start offset
>>>
>>> 2)It should work, but I haven't tested it yet
>>>
>>> 3)as I remember, currently we use Kafka 1.0 client library, so it is
>>> better to use the version later, I'm sure that the version before 0.9.0
>>> cannot work, but not sure 0.9.x can work or not
>>>
>>>
>>>
>>> Ma Gang
>>> Email: mg4work@163.com
>>>
>>>
>>> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com> wrote:
>>> Greetings,
>>>
>>> I have a few questions related to the new streaming (real-time OLAP)
>>> implementation.
>>>
>>> 1) Is there a way to have data reprocessed from kafka? E.g. I change a
>>> cube definition and drop the cube (or add a new cube definition) and want
>>> to have data that is still available on kafka to be reprocessed to build
>>> the changed cube (or new cube)? Is this possible?
>>>
>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>> cubes)?
>>>
>>> 3) What is minimum kafka version required? The tutorial asks to install
>>> Kafka 1.0, is this the minimum required version?
>>>
>>> Thank you very much,
>>> Andras
>>>
>>>
>>
>>
>>
>

Re: Re: Kylin streaming questions

Posted by Andras Nagy <an...@gmail.com>.
Dear Ma,

Thanks for your reply.

Slightly related to my original question on the hybrid model, I was
wondering if it's possible to combine a batch and a streaming cube. I
realized this is not possible, as a hybrid model can only be created from
cubes of the same model (and a model points to either a batch or a
streaming datasource).

The usecase would be this:
- we have a large amount of streaming data in Kafka that we would like to
process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in
the cubes (e.g. introduce a new metric or dimension which has been present
in the events, but not in the cube definition), we can only reprocess a few
days worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data
into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything
in the cubes, we can recalculate them for the historic data as well

Is there a way to achieve this with current Kylin? (Without implementing a
custom query layer that combines the two cubes.)

Best regards,
Andras










On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <mg...@163.com> wrote:

> Hi Andras,
>
> Currently it doesn't support consume from specified offsets, only support
> consume from startOffset or latestOffset, if you want to consume from
> startOffset, you need to set the
> configuration: kylin.stream.consume.offsets.latest to false in the cube's
> overrides page.
>
> If you do need to start from specified offsets, please create a jira
> request, but I think it is hard for user to know what's the offsets should
> be set for all partitions.
>
> At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com>
> wrote:
>
> Dear Ma,
>
> Thank you very much!
>
> >1)yes, you can specify a configuration in the new cube, to consume data
> from start offset
> That is, an offset value for each partition of the topic? That would be
> good - could you please point me where to do this in practice, or point me
> to what I should read? (I haven't found it on the cube designer UI -
> perhaps this is something that's only available on the API?)
>
> Many thanks,
> Andras
>
>
>
> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:
>
>> Hi Andras,
>> 1)yes, you can specify a configuration in the new cube, to consume data
>> from start offset
>>
>> 2)It should work, but I haven't tested it yet
>>
>> 3)as I remember, currently we use Kafka 1.0 client library, so it is
>> better to use the version later, I'm sure that the version before 0.9.0
>> cannot work, but not sure 0.9.x can work or not
>>
>>
>>
>> Ma Gang
>> Email: mg4work@163.com
>>
>>
>> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com> wrote:
>> Greetings,
>>
>> I have a few questions related to the new streaming (real-time OLAP)
>> implementation.
>>
>> 1) Is there a way to have data reprocessed from kafka? E.g. I change a
>> cube definition and drop the cube (or add a new cube definition) and want
>> to have data that is still available on kafka to be reprocessed to build
>> the changed cube (or new cube)? Is this possible?
>>
>> 2) Does the hybrid model work with streaming cubes (to combine two cubes)?
>>
>> 3) What is minimum kafka version required? The tutorial asks to install
>> Kafka 1.0, is this the minimum required version?
>>
>> Thank you very much,
>> Andras
>>
>>
>
>
>

Re:Re: Kylin streaming questions

Posted by Ma Gang <mg...@163.com>.
Hi Andras,


Currently it doesn't support consuming from specified offsets; it only supports consuming from the start offset or the latest offset. If you want to consume from the start offset, you need to set the configuration kylin.stream.consume.offsets.latest to false on the cube's overrides page.
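
For example, on the cube's "Configuration Overrides" page it is just a key/value property:

    kylin.stream.consume.offsets.latest=false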


If you do need to start from specified offsets, please create a JIRA request, but I think it is hard for a user to know what offsets should be set for all partitions.


At 2019-06-13 22:34:59, "Andras Nagy" <an...@gmail.com> wrote:

Dear Ma,


Thank you very much!


>1)yes, you can specify a configuration in the new cube, to consume data from start offset
That is, an offset value for each partition of the topic? That would be good - could you please point me where to do this in practice, or point me to what I should read? (I haven't found it on the cube designer UI - perhaps this is something that's only available on the API?)


Many thanks,
Andras






On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

Hi Andras,
1)yes, you can specify a configuration in the new cube, to consume data from start offset

2)It should work, but I haven't tested it yet

3)as I remember, currently we use Kafka 1.0 client library, so it is better to use the version later, I'm sure that the version before 0.9.0 cannot work, but not sure 0.9.x can work or not




Ma Gang
Email: mg4work@163.com

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras

Re: Kylin streaming questions

Posted by Andras Nagy <an...@gmail.com>.
Dear Ma,

Thank you very much!

>1)yes, you can specify a configuration in the new cube, to consume data
from start offset
That is, an offset value for each partition of the topic? That would be
good - could you please point me where to do this in practice, or point me
to what I should read? (I haven't found it on the cube designer UI -
perhaps this is something that's only available on the API?)

Many thanks,
Andras



On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <mg...@163.com> wrote:

> Hi Andras,
> 1)yes, you can specify a configuration in the new cube, to consume data
> from start offset
>
> 2)It should work, but I haven't tested it yet
>
> 3)as I remember, currently we use Kafka 1.0 client library, so it is
> better to use the version later, I'm sure that the version before 0.9.0
> cannot work, but not sure 0.9.x can work or not
>
>
>
> Ma Gang
> Email: mg4work@163.com
>
>
> On 06/13/2019 18:01, Andras Nagy <an...@gmail.com> wrote:
> Greetings,
>
> I have a few questions related to the new streaming (real-time OLAP)
> implementation.
>
> 1) Is there a way to have data reprocessed from kafka? E.g. I change a
> cube definition and drop the cube (or add a new cube definition) and want
> to have data that is still available on kafka to be reprocessed to build
> the changed cube (or new cube)? Is this possible?
>
> 2) Does the hybrid model work with streaming cubes (to combine two cubes)?
>
> 3) What is minimum kafka version required? The tutorial asks to install
> Kafka 1.0, is this the minimum required version?
>
> Thank you very much,
> Andras
>
>

Re: Kylin streaming questions

Posted by Ma Gang <mg...@163.com>.
Hi Andras,
1) Yes, you can specify a configuration in the new cube to consume data from the start offset.

2) It should work, but I haven't tested it yet.

3) As far as I remember, we currently use the Kafka 1.0 client library, so it is better to use that version or later. I'm sure that versions before 0.9.0 cannot work, but I'm not sure whether 0.9.x works or not.




Ma Gang
Email: mg4work@163.com

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,


I have a few questions related to the new streaming (real-time OLAP) implementation.


1) Is there a way to have data reprocessed from kafka? E.g. I change a cube definition and drop the cube (or add a new cube definition) and want to have data that is still available on kafka to be reprocessed to build the changed cube (or new cube)? Is this possible?


2) Does the hybrid model work with streaming cubes (to combine two cubes)?


3) What is minimum kafka version required? The tutorial asks to install Kafka 1.0, is this the minimum required version?


Thank you very much,
Andras