Posted to dev@kylin.apache.org by zeLiu <li...@wanda.cn> on 2016/03/25 09:29:07 UTC

update hbase data realtime and query it

Hi Kylin team,
I am using Kylin 1.2.
I first built a cube in which no field uses a dictionary.
Then I pre-calculate every cuboid in Storm and update the data in the HBase
table in real time.

I checked that the rowkeys in HBase are correct, but when I query the data,
only some of the results are correct.

For example:
select dt,code,sum(money) as money from mytable a inner join dimtable b on
a.id=b.id  group by dt,code;
2016-03-01  001   100
2016-03-01  002   100
2016-03-02  001   200
select dt,sum(money) as money from mytable a inner join dimtable b on
a.id=b.id group by dt;
2016-03-01  100
2016-03-02  200

The money for 2016-03-01 should be 200, but the query returns 100.
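
The expected value can be checked by rolling up the (dt, code) result by
hand. A minimal sketch in Python, using only the three rows shown in this
message (the function name roll_up_by_dt is made up for illustration):

```python
from collections import defaultdict

# Rows returned by the first query: (dt, code, money).
rows_dt_code = [
    ("2016-03-01", "001", 100),
    ("2016-03-01", "002", 100),
    ("2016-03-02", "001", 200),
]


def roll_up_by_dt(rows):
    """Sum money per dt, dropping the code dimension."""
    totals = defaultdict(int)
    for dt, _code, money in rows:
        totals[dt] += money
    return dict(totals)


expected = roll_up_by_dt(rows_dt_code)
print(expected)
```

Rolled up this way, 2016-03-01 comes to 200, which is the figure the second
query is expected to return.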

Why? Is there any additional processing at query time?

Thanks



--
View this message in context: http://apache-kylin.74782.x6.nabble.com/update-hbase-data-realtime-and-query-it-tp3959.html
Sent from the Apache Kylin mailing list archive at Nabble.com.

Re: update hbase data realtime and query it

Posted by zeLiu <li...@wanda.cn>.

Hi Li Yang,
Thank you for your reply.
I'm looking forward to Kylin's real-time support.
I also hope it will provide an API so that we can write data to Kylin from
a Storm process.

Thanks


Re: update hbase data realtime and query it

Posted by Li Yang <li...@apache.org>.

Real-time support is on Kylin's roadmap. We can collaborate on this if you
are interested.

The idea is simple. Say Kylin can currently do 5-minute micro-batches; then
we only need a real-time storage that holds up to the latest 5 minutes of
data (whatever arrives after the last batch). A query will hit both the
real-time storage and the cube storage. In your attempt, the real-time
storage is still HBase, but I would prefer it to be a new, separate table.
The real-time storage needs to expose an interface for queries, which is
straightforward given that we've done it once.
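
The query path described above could be sketched roughly as follows. This
is only an illustration in Python, not Kylin code: query_hybrid, the
dict-based cube storage, the (timestamp, key, value) realtime rows and the
last_batch_end cutoff are all assumptions made for the example.

```python
from collections import defaultdict


def query_hybrid(cube_rows, realtime_rows, last_batch_end):
    """Merge cube aggregates with realtime rows newer than the last batch.

    cube_rows: {row_key: aggregated_value} from the batch-built cube.
    realtime_rows: iterable of (timestamp, row_key, value) raw events.
    last_batch_end: timestamp up to which the cube already covers the data.
    """
    totals = defaultdict(int)
    for key, value in cube_rows.items():
        totals[key] += value
    for ts, key, value in realtime_rows:
        # Rows at or before the cutoff are already in the cube; skip them
        # so nothing is counted twice.
        if ts > last_batch_end:
            totals[key] += value
    return dict(totals)


cube = {("2016-03-01", "001"): 100}
realtime = [
    (290, ("2016-03-01", "001"), 30),  # already covered by the last batch
    (310, ("2016-03-01", "001"), 50),
    (320, ("2016-03-01", "002"), 20),
]
print(query_hybrid(cube, realtime, last_batch_end=300))
```

The cutoff is what keeps the two storages from overlapping: the cube answers
for everything up to the last batch, and the realtime storage only for what
came after it.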

On Thu, Mar 31, 2016 at 11:54 AM, zeLiu <li...@wanda.cn> wrote:

> Thanks, Hongbin.
>
> It is true that, as you say, the data must be pre-aggregated, or rows with
> the same key will overwrite each other.
>
> I simply followed Kylin's MapReduce code to add data to HBase in real
> time, which I admit is not a very good idea.
>
> The reason is that we maintain two products, a real-time one and an
> offline one.
> They share the same front-end UI. In the past we wrote real-time data into
> MySQL and stored offline data in Kylin, which meant developing separate
> interfaces for MySQL.
> If both real-time and offline data are written to Kylin, we only need to
> develop one interface for the UI.
>
>
> I ran a test with the same data: the latency is much lower than with
> MySQL; the average is about 6 ms.
> I didn't use a dictionary because I'm worried that a new value not found
> in the dictionary would affect the accuracy of the data.
>
> Is there a better solution?
>
> The plug-in code: https://github.com/zeliu/kylin-storm-plugin
>
> Thanks

Re: update hbase data realtime and query it

Posted by zeLiu <li...@wanda.cn>.

Thanks, Hongbin.

It is true that, as you say, the data must be pre-aggregated, or rows with
the same key will overwrite each other.

I simply followed Kylin's MapReduce code to add data to HBase in real
time, which I admit is not a very good idea.

The reason is that we maintain two products, a real-time one and an
offline one.
They share the same front-end UI. In the past we wrote real-time data into
MySQL and stored offline data in Kylin, which meant developing separate
interfaces for MySQL.
If both real-time and offline data are written to Kylin, we only need to
develop one interface for the UI.


I ran a test with the same data: the latency is much lower than with
MySQL; the average is about 6 ms.
I didn't use a dictionary because I'm worried that a new value not found
in the dictionary would affect the accuracy of the data.
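
The concern can be illustrated with a toy model in Python. This is not
Kylin's dictionary implementation; the Dictionary class and its encode
method are hypothetical, and only show what happens when a value arrives
that was not present when the dictionary was built.

```python
class Dictionary:
    """Toy dictionary encoder: maps each known value to an integer ID."""

    def __init__(self, values):
        # IDs are assigned once, from the values known at build time.
        self._ids = {v: i for i, v in enumerate(sorted(set(values)))}

    def encode(self, value):
        if value not in self._ids:
            raise KeyError(f"value {value!r} is not in the dictionary")
        return self._ids[value]


codes = Dictionary(["001", "002"])
print(codes.encode("002"))

try:
    codes.encode("003")  # a code that first appears in the realtime stream
except KeyError as err:
    print("cannot encode:", err)
```

A value seen only in the stream has no ID, so a row carrying it cannot be
encoded into a rowkey with this fixed dictionary.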

Is there a better solution?

The plug-in code: https://github.com/zeliu/kylin-storm-plugin

Thanks


Re: update hbase data realtime and query it

Posted by hongbin ma <ma...@apache.org>.

Hi zeLiu,

I just want to make sure how you are updating the cuboid values in HBase.
Are you merely adding new rows to the cuboid? That might not work.

Take the cuboid (dt, code) as an example. It might already contain a row
like "2016-03-01  001" : "100". If your Storm job outputs another row
"2016-03-01  001" : "50", simply adding this row to HBase might not work,
because Kylin currently assumes that all row keys are distinct within a
given cuboid. A safer way to reach your goal is to aggregate the two rows
and overwrite the original "2016-03-01  001" : "100" with
"2016-03-01  001" : "150".

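The aggregate-and-overwrite step could look roughly like the sketch below.
It is written in Python with a plain dict standing in for the HBase table of
one cuboid; upsert_sum is a made-up helper name, and a real implementation
would also need to make the read-modify-write atomic.

```python
def upsert_sum(cuboid, row_key, delta):
    """Add delta to the existing value for row_key and write back one row.

    The row key stays unique: the old value is overwritten with the
    aggregated value instead of a second row being added.
    """
    cuboid[row_key] = cuboid.get(row_key, 0) + delta
    return cuboid[row_key]


# Cuboid (dt, code) as it exists after the batch build.
cuboid_dt_code = {("2016-03-01", "001"): 100}

# Storm emits another 50 for the same row key.
upsert_sum(cuboid_dt_code, ("2016-03-01", "001"), 50)
print(cuboid_dt_code)
```

After the call the cuboid still holds a single row for "2016-03-01  001",
now with the value 150.
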
It seems to me that you are inventing a new way of "streaming data on the
Kylin platform". I suggest you discuss it further with the community so that
we can make a joint effort to promote it as a fundamental feature of Kylin.
Otherwise, any design change in Kylin might break your add-ons, making it
impossible for you to upgrade.

On Fri, Mar 25, 2016 at 4:29 PM, zeLiu <li...@wanda.cn> wrote:

> Hi Kylin team,
> I am using Kylin 1.2.
> I first built a cube in which no field uses a dictionary.
> Then I pre-calculate every cuboid in Storm and update the data in the HBase
> table in real time.
>
> I checked that the rowkeys in HBase are correct, but when I query the data,
> only some of the results are correct.
>
> For example:
> select dt,code,sum(money) as money from mytable a inner join dimtable b on
> a.id=b.id  group by dt,code;
> 2016-03-01  001   100
> 2016-03-01  002   100
> 2016-03-02  001   200
> select dt,sum(money) as money from mytable a inner join dimtable b on
> a.id=b.id group by dt;
> 2016-03-01  100
> 2016-03-02  200
>
> The money for 2016-03-01 should be 200, but the query returns 100.
>
> Why? Is there any additional processing at query time?
>
> Thanks

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone