
Posted to dev@kylin.apache.org by 蒋旭 <ji...@qq.com> on 2015/03/01 14:23:06 UTC

hbase rowkey design of inverted index

Basically, we have 2 ways to design the hbase rowkey for the inverted index:
1. "time + keyword":
It splits the index by time, which avoids hbase region merges. But one query may have to scan many scattered rows that are not sequential.
2. "keyword + time":
It guarantees a sequential scan for each keyword. But it may trigger hbase region merges, since one keyword may be scattered across many regions.


So, we can merge these 2 solutions like this: "coarse granularity time + keyword + fine granularity time". For example, "20150215 + abc + 1130". In this way, we use the "coarse granularity time" to avoid hbase region merges and the "fine granularity time" to guarantee the sequential scan.


Users can define different "coarse granularity time" & "fine granularity time" values for different cases. If the inverted index is only used in the real-time case, we can define a small "coarse granularity time" (e.g. 1 day). If the inverted index will cover the full data set, we can define a big "coarse granularity time" (e.g. 1 month).
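
A minimal sketch of how such a rowkey could be assembled, for illustration only. The function name `build_rowkey`, the `LAYOUTS` table, and the two timestamp layouts are assumptions and not an existing Kylin or HBase API. The "day" layout reproduces the "20150215 + abc + 1130" example; the "month" layout simply moves the day into the fine part.

```python
from datetime import datetime

# Hypothetical layouts: (coarse granularity format, fine granularity format).
LAYOUTS = {
    "day": ("%Y%m%d", "%H%M"),      # coarse = 1 day, fine = hour and minute
    "month": ("%Y%m", "%d%H%M"),    # coarse = 1 month, fine = day, hour, minute
}

def build_rowkey(ts: datetime, keyword: str, coarse: str = "day") -> bytes:
    """Concatenate coarse time + keyword + fine time into one rowkey."""
    coarse_fmt, fine_fmt = LAYOUTS[coarse]
    # Plain concatenation, as in the example above. A real key would need
    # fixed-width or delimited keywords to stay unambiguous (assumption).
    key = ts.strftime(coarse_fmt) + keyword + ts.strftime(fine_fmt)
    return key.encode("utf-8")

print(build_rowkey(datetime(2015, 2, 15, 11, 30), "abc"))           # b'20150215abc1130'
print(build_rowkey(datetime(2015, 2, 15, 11, 30), "abc", "month"))  # b'201502abc151130'
```

Within one coarse bucket, all rows of a keyword sort next to each other in fine-time order, which is what makes the scan sequential.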


Thanks
Jiang Xu

Re: hbase rowkey design of inverted index

Posted by 蒋旭 <ji...@qq.com>.

Sure. :)
Just a reminder, we still need the "fine granularity time frame" as a suffix to split the long inverted index.

------------------ Original Message ------------------
From: Li Yang <li...@apache.org>
Sent: 2015-03-03 16:47
To: dev <de...@kylin.incubator.apache.org>
Subject: Re: hbase rowkey design of inverted index



Hi Xu, I understand the "coarse granularity time" concept. My "time frame"
means exactly the same: not the most granular time, but a bigger range of
time. I just think it's shorter and looks better. :-)



Re: hbase rowkey design of inverted index

Posted by Li Yang <li...@apache.org>.

Hi Xu, I understand the "coarse granularity time" concept. My "time frame"
means exactly the same: not the most granular time, but a bigger range of
time. I just think it's shorter and looks better. :-)


On Tue, Mar 3, 2015 at 3:52 PM, 蒋旭 <ji...@qq.com> wrote:

> The key point is that we can't just put the time frame as a prefix before
> the keyword.
> Since the time frame is normally small for high-volume data, if the scan
> time range is big, the hbase scan range becomes too big and we have to skip
> the other keywords in this range. So the scan performance is bad.
> So, I suggest using "coarse granularity time frame + keyword + fine
> granularity time frame". In this schema, you can call hbase several times,
> once per coarse granularity time frame, each with a small scan range on the
> fine granularity time frame.

Re: hbase rowkey design of inverted index

Posted by 蒋旭 <ji...@qq.com>.

The key point is that we can't just put the time frame as a prefix before the keyword.
Since the time frame is normally small for high-volume data, if the scan time range is big, the hbase scan range becomes too big and we have to skip the other keywords in this range. So the scan performance is bad.
So, I suggest using "coarse granularity time frame + keyword + fine granularity time frame". In this schema, you can call hbase several times, once per coarse granularity time frame, each with a small scan range on the fine granularity time frame.
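
The "several calls, one per coarse time frame" idea can be sketched roughly as follows. This is only an illustration: `scan_ranges`, the daily coarse bucket, the minute-level fine part, and the 0x00 separator after the keyword are all assumptions of this sketch, not something fixed in this thread. Each returned pair is a half-open [start row, stop row) range that could be handed to one hbase scan.

```python
from datetime import datetime, timedelta

SEP = b"\x00"  # assumed separator so "abc" does not also match "abcd"

def scan_ranges(keyword: str, start: datetime, end: datetime):
    """Return one (start_row, stop_row) pair per day bucket covering [start, end)."""
    kw = keyword.encode("utf-8")
    ranges = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        next_day = day + timedelta(days=1)
        lo = max(start, day)      # first instant wanted inside this bucket
        hi = min(end, next_day)   # first instant no longer wanted
        prefix = day.strftime("%Y%m%d").encode("ascii") + kw + SEP
        start_row = prefix + lo.strftime("%H%M").encode("ascii")
        # The stop row is exclusive; "2400" sorts after every valid HHMM value.
        stop_fine = b"2400" if hi == next_day else hi.strftime("%H%M").encode("ascii")
        ranges.append((start_row, prefix + stop_fine))
        day = next_day
    return ranges

for r in scan_ranges("abc", datetime(2015, 2, 15, 22, 0), datetime(2015, 2, 16, 1, 30)):
    print(r)
```

Each range touches only the rows of one keyword inside one coarse bucket, so no other keywords have to be skipped.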

------------------ Original Message ------------------
From: Li Yang <li...@apache.org>
Sent: 2015-03-03 15:17
To: dev <de...@kylin.incubator.apache.org>
Subject: Re: hbase rowkey design of inverted index

Agree on the scan pattern of the inverted index: it is typically a keyword
scan within a time range. In addition, data shall be sharded so that parallel
scans on multiple regions can cut down response time. The final rowkey may
look like "shard + time frame + keyword".

These ideas will be put into the next version of the inverted index storage.


Re: hbase rowkey design of inverted index

Posted by Li Yang <li...@apache.org>.

Agree on the scan pattern of the inverted index: it is typically a keyword
scan within a time range. In addition, data shall be sharded so that parallel
scans on multiple regions can cut down response time. The final rowkey may
look like "shard + time frame + keyword".

These ideas will be put into the next version of the inverted index storage.
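
A rough sketch of what a "shard + time frame + keyword" key could look like. How the shard is chosen is not specified here, so everything below is an assumption: the shard count, the two-digit shard prefix, and deriving the shard from a hash of a hypothetical record id so that one keyword is spread over all shards. A reader then issues one scan per shard prefix, and those scans can run in parallel.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def shard_of(record_id: str) -> int:
    """Pick a shard from a stable hash of the (hypothetical) record id."""
    return zlib.crc32(record_id.encode("utf-8")) % NUM_SHARDS

def write_key(record_id: str, time_frame: str, keyword: str) -> bytes:
    """Rowkey used when writing one index entry: shard + time frame + keyword."""
    return f"{shard_of(record_id):02d}{time_frame}{keyword}".encode("utf-8")

def read_prefixes(time_frame: str, keyword: str) -> list:
    """Rowkey prefixes to scan, one per shard, for a keyword in a time frame."""
    return [f"{s:02d}{time_frame}{keyword}".encode("utf-8") for s in range(NUM_SHARDS)]

print(read_prefixes("20150215", "abc"))
```

The shard prefix spreads writes and reads over several regions, at the cost of one extra scan per shard for every query.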

On Sun, Mar 1, 2015 at 9:23 PM, 蒋旭 <ji...@qq.com> wrote:

> Basically, we have 2 ways to design the hbase rowkey for the inverted index:
> 1. "time + keyword":
> It splits the index by time, which avoids hbase region merges. But one
> query may have to scan many scattered rows that are not sequential.
> 2. "keyword + time":
> It guarantees a sequential scan for each keyword. But it may trigger
> hbase region merges, since one keyword may be scattered across many regions.
>
>
> So, we can merge these 2 solutions like this: "coarse granularity time +
> keyword + fine granularity time". For example, "20150215 + abc + 1130". In
> this way, we use the "coarse granularity time" to avoid hbase region merges
> and the "fine granularity time" to guarantee the sequential scan.
>
>
> Users can define different "coarse granularity time" & "fine granularity
> time" values for different cases. If the inverted index is only used in the
> real-time case, we can define a small "coarse granularity time" (e.g. 1 day).
> If the inverted index will cover the full data set, we can define a big
> "coarse granularity time" (e.g. 1 month).
>
>
> Thanks
> Jiang Xu