Posted to issues@kylin.apache.org by "liyang (JIRA)" <ji...@apache.org> on 2016/05/23 09:06:13 UTC

[jira] [Commented] (KYLIN-1726) Scalable streaming cubing

    [ https://issues.apache.org/jira/browse/KYLIN-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296124#comment-15296124 ] 

liyang commented on KYLIN-1726:
-------------------------------

The strategy is to have something quick and runnable, then evolve from there.

Making Kafka a source for the current MR engine may be the shortest path. Users can trigger a new micro batch every few minutes. Hadoop will provide computation resources that scale with the Kafka input. Fast cubing (in-mem) can be forced at the cube level to ensure a one-round MR build. There are still two more steps that involve MR -- extracting distinct values for the dictionary and converting to HFile -- which can slow down the overall build. The design goal is to finish each micro batch within 10 minutes.
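To make the micro-batch idea concrete, here is a rough sketch (not Kylin code) of how one batch could be planned: take everything between the offsets where the previous batch ended and the latest offsets in Kafka, then cut that into input splits so the number of mappers grows with the input. The names OffsetRange, plan_micro_batch, split_for_mappers and max_records_per_split are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    start: int  # inclusive
    end: int    # exclusive


def plan_micro_batch(last_end: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Return one range per partition covering the records not yet built."""
    ranges = []
    for partition in sorted(latest):
        # Simplification: a partition never seen before is read from offset 0.
        # A real reader would ask Kafka for the earliest available offset.
        start = last_end.get(partition, 0)
        end = latest[partition]
        if end > start:
            ranges.append(OffsetRange(partition, start, end))
    return ranges


def split_for_mappers(ranges: List[OffsetRange], max_records_per_split: int) -> List[OffsetRange]:
    """Cut each range into pieces of at most max_records_per_split records."""
    splits = []
    for r in ranges:
        pos = r.start
        while pos < r.end:
            stop = min(pos + max_records_per_split, r.end)
            splits.append(OffsetRange(r.partition, pos, stop))
            pos = stop
    return splits


if __name__ == "__main__":
    batch = plan_micro_batch({0: 100, 1: 50}, {0: 350, 1: 50, 2: 20})
    for split in split_for_mappers(batch, max_records_per_split=100):
        print(split)
```

With a fixed cap per split, a larger backlog simply produces more splits, which is the sense in which the computation can scale with the Kafka input.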

Other related design points that need consideration:

- Currently the assumption is that there is no time overlap between cube segments. However, late-arriving records are common in streaming, and there is no guarantee of strict time ordering. We either have to drop the late records or accept that data time can overlap across segments.
- Allow cube segmentation by Kafka offset. Offsets can be used (instead of time ranges) to cut the Kafka stream into segments.
- Allow a merge job to run in parallel with multiple cubing jobs.
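
As a rough illustration of the first two points (a sketch under simplifying assumptions, not Kylin code): if segments are cut by offset, each segment can record the minimum and maximum event time it actually saw. The offset ranges never overlap, but the time ranges may, so a query on a time window has to consider every segment whose time range intersects it. The sketch assumes a single Kafka partition and integer timestamps; Segment, build_segment and segments_for_query are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    start_offset: int  # inclusive
    end_offset: int    # exclusive
    min_time: int      # earliest event time seen in this segment
    max_time: int      # latest event time seen in this segment


def build_segment(records: List[Tuple[int, int]]) -> Segment:
    """Build a segment from a non-empty list of (offset, event_time) pairs."""
    offsets = [offset for offset, _ in records]
    times = [event_time for _, event_time in records]
    return Segment(min(offsets), max(offsets) + 1, min(times), max(times))


def segments_for_query(segments: List[Segment], t_from: int, t_to: int) -> List[Segment]:
    """Return segments whose observed time range intersects [t_from, t_to)."""
    return [s for s in segments if s.min_time < t_to and s.max_time >= t_from]


if __name__ == "__main__":
    seg1 = build_segment([(0, 1000), (1, 1010), (2, 1020)])
    # Offset 4 carries a late record (time 1015), so seg2 overlaps seg1 in time.
    seg2 = build_segment([(3, 1030), (4, 1015), (5, 1040)])
    print(segments_for_query([seg1, seg2], 1010, 1020))
```

No record is dropped in this scheme; the cost is that a time-range query may touch more than one segment.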



> Scalable streaming cubing
> -------------------------
>
>                 Key: KYLIN-1726
>                 URL: https://issues.apache.org/jira/browse/KYLIN-1726
>             Project: Kylin
>          Issue Type: New Feature
>            Reporter: liyang
>            Assignee: liyang
>
> We try to achieve:
> 1. Scale streaming cubing workload on a computation cluster, e.g. YARN
> 2. Support Kafka as a formal data source
> 3. Guarantee no data loss reading from Kafka, even if records are not strictly ordered by time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)