Posted to user@hbase.apache.org by Software Dev <st...@gmail.com> on 2014/04/30 00:34:23 UTC

Help with row and column design

Hey all. I have some questions regarding row key and column design.

We want to calculate some metrics based on our page views broken down
by hour, day, month and year. We also want this broken down by country,
and we need the ability to filter by some other attributes, such as the
sex of the user or whether or not the user is logged in. Note that
these will all be increments.

So we have the initial row key design as

YYYY - Row key for yearly totals
YYYYMM - Row key for monthly totals
YYYYMMDD - Row key for daily totals
YYYYMMDDHH - Row key for hourly totals

I think this may make sense as it will be easy to do a range scan over
a time period.

Now for my column design. We were thinking along these lines.

daily:US  - Daily counts for the US
hourly:CA - Hourly counts for Canada
... and so on
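
For concreteness, here is a rough sketch of what we imagine the write
path looking like with the Java client (hypothetical table name,
0.94-era API; one Increment per time granularity per page view, with
each granularity as a column family that would have to exist in the
table schema):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

public class PageViewCounter {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(HBaseConfiguration.create(), "pageviews");
    // one row per time granularity, one counter column per country
    String[][] rows = {
        {"2014", "yearly"}, {"201404", "monthly"},
        {"20140429", "daily"}, {"2014042901", "hourly"}};
    for (String[] r : rows) {
      Increment inc = new Increment(Bytes.toBytes(r[0]));
      inc.addColumn(Bytes.toBytes(r[1]), Bytes.toBytes("US"), 1L);
      table.increment(inc);
    }
    table.close();
  }
}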

Now this seems like it would work, but it fails when we add in the
requirement of filtering results based on some other attributes. Say we
wanted to be able to filter based on sex (M or F), and/or filter based
on logged-in status (Online or Offline), and/or filter based on some
other attribute, or perform no filtering at all. How would I go about
accomplishing this?

Thanks for any input/pointers.

Re: Help with row and column design

Posted by Liam Slusser <ls...@gmail.com>.
Here are some links that helped me design my keys...

http://www.appfirst.com/blog/best-practices-for-managing-hbase-in-a-high-write-environment/
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
http://hbase.apache.org/book/rowkey.design.html
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html

Some fun bedtime reading... :)

cheers,
liam



On Tue, Apr 29, 2014 at 3:51 PM, Software Dev <st...@gmail.com>wrote:

> Someone mentioned in another post about hotspotting. I guess I could
> reverse the row keys to prevent this?
>
> On Tue, Apr 29, 2014 at 3:34 PM, Software Dev <st...@gmail.com>
> wrote:
> > Hey all. I have some questions regarding row key and column design.
> >
> > We want to calculate some metrics based on our page views broken down
> > by hour, day, month and year. We also want this broken down by country
> > and have the ability to filter by some other attributes such as the
> > sex of the user or whether or not the user is logged in..... Note
> > these will all be increments.
> >
> > So we have the initial row key design as
> >
> > YYYY - Row key for yearly totals
> > YYYYMM - Row key for monthly totals
> > YYYYMMDD - Row key for daily totals
> > YYYYMMDDHH - Row key for hourly totals
> >
> > I think this may make sense as it will be easy to do a range scan over
> > a time period.
> >
> > Now for my column design. We were thinking along these lines.
> >
> > daily:US  - Daily counts for the US
> > hourly:CA - Hourly counts for Canada
> > ... and so on
> >
> > Now this seems like it would work but fails when we add in the
> > requirement of filtering results based on some other attributes. Say we
> > wanted to be able to filter based on sex (M or F) and/or filter based
> > on logged in status (Online or Offline) OR and/or filter based on some
> > other attribute OR perform no filtering at all. How would I go about
> > accomplishing this?
> >
> > Thanks for any input/pointers.
>

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
Yes, I'll be storing at multiple levels of aggregation.



On Wed, Apr 30, 2014 at 9:21 AM, Rendon, Carlos (KBB) <CR...@kbb.com> wrote:
>> Ok didnt know if the sheer number of gets would be a limiting factor. Thanks
>
> Yes retrieving and summing thousands of rows is much slower and requires more network, memory, cpu, than doing that for a hundred or <10.
> Perhaps day-level, week-level, or month-level granularity would be a better fit for a 6 month aggregation?
> You did say you were going to store data at multiple levels of time aggregation right?
>
>
> -----Original Message-----
> From: Software Dev [mailto:static.void.dev@gmail.com]
> Sent: Tuesday, April 29, 2014 8:05 PM
> To: user@hbase.apache.org
> Subject: Re: Help with row and column design
>
> Ok didnt know if the sheer number of gets would be a limiting factor. Thanks
>
> On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
>> As I said this afternoon:
>> See the following API in HTable for batching Get's :
>>
>>   public Result[] get(List<Get> gets) throws IOException {
>>
>> Cheers
>>
>>
>> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <st...@gmail.com>wrote:
>>
>>> Nothing against your code. I just meant that if we are doing a scan
>>> say for hourly metrics across a 6 month period we are talking about
>>> 4K+ gets. Is that something that can easily be handled?
>>>
>>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB)
>>> <CR...@kbb.com>
>>> wrote:
>>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>>> >
>>> > If you by "hairy" you mean the code is ugly, it was written for
>>> > maximal
>>> clarity.
>>> > I think you'll find a few sensible loops makes it fairly clean.
>>> > Otherwise I'm not sure what you mean.
>>> >
>>> > -----Original Message-----
>>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> > To: user@hbase.apache.org
>>> > Subject: Re: Help with row and column design
>>> >
>>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >> to
>>> pre-store every level of aggregation you care about.
>>> >
>>> > Ok I think this makes sense. Gets a bit hairy when doing say a
>>> > shitload
>>> of gets thought.. no?
>>> >
>>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB)
>>> > <CR...@kbb.com>
>>> wrote:
>>> >> You don't do a scan, you do a series of gets, which I believe you
>>> >> can
>>> batch into one call.
>>> >>
>>> >> last 5 days query in pseudocode
>>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>>> >>
>>> >> For each result you look for the particular column or columns you
>>> >> are interested in Total_usa = res1.get("c:usa") +
>>> >> res2.get("c:usa") +
>>> res3.get("c:usa") + ...
>>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>> >>
>>> >> "What happens when we add more fields? Do we just keep adding in
>>> >> more
>>> column qualifiers? If so, how would we filter across columns to get
>>> an aggregate total?"
>>> >>
>>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>>> >> to
>>> pre-store every level of aggregation you care about.
>>> >>
>>> >> -----Original Message-----
>>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >> To: user@hbase.apache.org
>>> >> Subject: Re: Help with row and column design
>>> >>
>>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> reading a range of time it does not
>>> >>
>>> >> How can you do a scan query between dates when you hash the date?
>>> >>
>>> >>> Column qualifiers are just the collection of items you are
>>> >>> aggregating on. Values are increments. In your case qualifiers
>>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> c:italy:sex:f, c:italy,
>>> >>
>>> >> What happens when we add more fields? Do we just keep adding in
>>> >> more
>>> column qualifiers? If so, how would we filter across columns to get
>>> an aggregate total?
>>>

RE: Help with row and column design

Posted by "Rendon, Carlos (KBB)" <CR...@kbb.com>.
> Ok didnt know if the sheer number of gets would be a limiting factor. Thanks

Yes, retrieving and summing thousands of rows is much slower and needs more network, memory, and CPU than doing the same for a hundred rows, or fewer than 10.
Perhaps day-level, week-level, or month-level granularity would be a better fit for a 6-month aggregation?
You did say you were going to store data at multiple levels of time aggregation, right?


-----Original Message-----
From: Software Dev [mailto:static.void.dev@gmail.com] 
Sent: Tuesday, April 29, 2014 8:05 PM
To: user@hbase.apache.org
Subject: Re: Help with row and column design

Ok didnt know if the sheer number of gets would be a limiting factor. Thanks

On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
> As I said this afternoon:
> See the following API in HTable for batching Get's :
>
>   public Result[] get(List<Get> gets) throws IOException {
>
> Cheers
>
>
> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <st...@gmail.com>wrote:
>
>> Nothing against your code. I just meant that if we are doing a scan 
>> say for hourly metrics across a 6 month period we are talking about
>> 4K+ gets. Is that something that can easily be handled?
>>
>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) 
>> <CR...@kbb.com>
>> wrote:
>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>> >
>> > If you by "hairy" you mean the code is ugly, it was written for 
>> > maximal
>> clarity.
>> > I think you'll find a few sensible loops makes it fairly clean.
>> > Otherwise I'm not sure what you mean.
>> >
>> > -----Original Message-----
>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>> > Sent: Tuesday, April 29, 2014 5:02 PM
>> > To: user@hbase.apache.org
>> > Subject: Re: Help with row and column design
>> >
>> >> Yes. See total_usa vs. total_female_usa above. Basically you have 
>> >> to
>> pre-store every level of aggregation you care about.
>> >
>> > Ok I think this makes sense. Gets a bit hairy when doing say a 
>> > shitload
>> of gets thought.. no?
>> >
>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) 
>> > <CR...@kbb.com>
>> wrote:
>> >> You don't do a scan, you do a series of gets, which I believe you 
>> >> can
>> batch into one call.
>> >>
>> >> last 5 days query in pseudocode
>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>> >>
>> >> For each result you look for the particular column or columns you 
>> >> are interested in Total_usa = res1.get("c:usa") + 
>> >> res2.get("c:usa") +
>> res3.get("c:usa") + ...
>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>> >>
>> >> "What happens when we add more fields? Do we just keep adding in 
>> >> more
>> column qualifiers? If so, how would we filter across columns to get 
>> an aggregate total?"
>> >>
>> >> Yes. See total_usa vs. total_female_usa above. Basically you have 
>> >> to
>> pre-store every level of aggregation you care about.
>> >>
>> >> -----Original Message-----
>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Help with row and column design
>> >>
>> >>> The downside is it still has a hotspot when inserting, but when 
>> >>> reading a range of time it does not
>> >>
>> >> How can you do a scan query between dates when you hash the date?
>> >>
>> >>> Column qualifiers are just the collection of items you are 
>> >>> aggregating on. Values are increments. In your case qualifiers 
>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, 
>> >>> c:italy:sex:f, c:italy,
>> >>
>> >> What happens when we add more fields? Do we just keep adding in 
>> >> more
>> column qualifiers? If so, how would we filter across columns to get 
>> an aggregate total?
>>

Re: Help with row and column design

Posted by Ted Yu <yu...@gmail.com>.
Please see this post:
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/


On Wed, Apr 30, 2014 at 10:28 AM, Software Dev <st...@gmail.com>wrote:

> I did not know of the FuzzyRowFilter.. that looks like it may be my best
> bet.
>
> Anyone know what Sematexts HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <ls...@gmail.com> wrote:
> > I would recommend pre-splitting the tables and then hashing your key and
> > putting that in the front.  ie
> >
> > [hash(20140429:Country:US)][2014042901:Country:US]  #notice you're not
> > hashing the sequence number
> >
> > some pseudo python code
> >
> >>>> import hashlib
> >>>> key = "2014042901:Country:US"
> >>>> ckey = "20140429:Country:US"
> >>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5],key)
> >>>> hbase_key
> > '887d82014042901:Country:US'
> >
> > Now when you want to find something, you can just create the hash
> ('887d8')
> > and use FuzzyRowFilter to find it!
> >
> > cheers,
> > liam
> >
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <static.void.dev@gmail.com
> >wrote:
> >
> >> Any improvements in the row key design?
> >>
> >> If i always know we will be querying by country could/should I prefix
> >> the row key with the country to help with hotspotting?
> >>
> >> FR/2014042901
> >> FR/2014042902
> >> ....
> >> US/2014042901
> >> US/2014042902
> >> ...
> >>
> >> Is this preferred over adding it in a column... ie 2014042901
> :Country:US
> >>
> >> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <
> static.void.dev@gmail.com>
> >> wrote:
> >> > Ok didnt know if the sheer number of gets would be a limiting factor.
> >> Thanks
> >> >
> >> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
> >> >> As I said this afternoon:
> >> >> See the following API in HTable for batching Get's :
> >> >>
> >> >>   public Result[] get(List<Get> gets) throws IOException {
> >> >>
> >> >> Cheers
> >> >>
> >> >>
> >> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <
> >> static.void.dev@gmail.com>wrote:
> >> >>
> >> >>> Nothing against your code. I just meant that if we are doing a scan
> >> >>> say for hourly metrics across a 6 month period we are talking about
> >> >>> 4K+ gets. Is that something that can easily be handled?
> >> >>>
> >> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <
> CRendon@kbb.com
> >> >
> >> >>> wrote:
> >> >>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
> >> >>> >
> >> >>> > If you by "hairy" you mean the code is ugly, it was written for
> >> maximal
> >> >>> clarity.
> >> >>> > I think you'll find a few sensible loops makes it fairly clean.
> >> >>> > Otherwise I'm not sure what you mean.
> >> >>> >
> >> >>> > -----Original Message-----
> >> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
> >> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
> >> >>> > To: user@hbase.apache.org
> >> >>> > Subject: Re: Help with row and column design
> >> >>> >
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you
> have to
> >> >>> pre-store every level of aggregation you care about.
> >> >>> >
> >> >>> > Ok I think this makes sense. Gets a bit hairy when doing say a
> >> shitload
> >> >>> of gets thought.. no?
> >> >>> >
> >> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <
> >> CRendon@kbb.com>
> >> >>> wrote:
> >> >>> >> You don't do a scan, you do a series of gets, which I believe you
> >> can
> >> >>> batch into one call.
> >> >>> >>
> >> >>> >> last 5 days query in pseudocode
> >> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
> >> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
> >> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
> >> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
> >> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
> >> >>> >>
> >> >>> >> For each result you look for the particular column or columns you
> >> are
> >> >>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
> >> >>> res3.get("c:usa") + ...
> >> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
> >> >>> >>
> >> >>> >> "What happens when we add more fields? Do we just keep adding in
> >> more
> >> >>> column qualifiers? If so, how would we filter across columns to get
> an
> >> >>> aggregate total?"
> >> >>> >>
> >> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you
> have to
> >> >>> pre-store every level of aggregation you care about.
> >> >>> >>
> >> >>> >> -----Original Message-----
> >> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
> >> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >> >>> >> To: user@hbase.apache.org
> >> >>> >> Subject: Re: Help with row and column design
> >> >>> >>
> >> >>> >>> The downside is it still has a hotspot when inserting, but when
> >> >>> >>> reading a range of time it does not
> >> >>> >>
> >> >>> >> How can you do a scan query between dates when you hash the date?
> >> >>> >>
> >> >>> >>> Column qualifiers are just the collection of items you are
> >> >>> >>> aggregating on. Values are increments. In your case qualifiers
> >> might
> >> >>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >> >>> >>> c:italy:sex:f, c:italy,
> >> >>> >>
> >> >>> >> What happens when we add more fields? Do we just keep adding in
> more
> >> >>> column qualifiers? If so, how would we filter across columns to get
> an
> >> >>> aggregate total?
> >> >>>
> >>
>

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
FuzzyRowFilter is not part of the REST client, so this may not be an
option for us. Any alternatives?

On Wed, Apr 30, 2014 at 10:28 AM, Software Dev
<st...@gmail.com> wrote:
> I did not know of the FuzzyRowFilter.. that looks like it may be my best bet.
>
> Anyone know what Sematexts HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <ls...@gmail.com> wrote:
>> I would recommend pre-splitting the tables and then hashing your key and
>> putting that in the front.  ie
>>
>> [hash(20140429:Country:US)][2014042901:Country:US]  #notice you're not
>> hashing the sequence number
>>
>> some pseudo python code
>>
>>>>> import hashlib
>>>>> key = "2014042901:Country:US"
>>>>> ckey = "20140429:Country:US"
>>>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5],key)
>>>>> hbase_key
>> '887d82014042901:Country:US'
>>
>> Now when you want to find something, you can just create the hash ('887d8')
>> and use FuzzyRowFilter to find it!
>>
>> cheers,
>> liam
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <st...@gmail.com>wrote:
>>
>>> Any improvements in the row key design?
>>>
>>> If i always know we will be querying by country could/should I prefix
>>> the row key with the country to help with hotspotting?
>>>
>>> FR/2014042901
>>> FR/2014042902
>>> ....
>>> US/2014042901
>>> US/2014042902
>>> ...
>>>
>>> Is this preferred over adding it in a column... ie 2014042901:Country:US
>>>
>>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com>
>>> wrote:
>>> > Ok didnt know if the sheer number of gets would be a limiting factor.
>>> Thanks
>>> >
>>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
>>> >> As I said this afternoon:
>>> >> See the following API in HTable for batching Get's :
>>> >>
>>> >>   public Result[] get(List<Get> gets) throws IOException {
>>> >>
>>> >> Cheers
>>> >>
>>> >>
>>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <
>>> static.void.dev@gmail.com>wrote:
>>> >>
>>> >>> Nothing against your code. I just meant that if we are doing a scan
>>> >>> say for hourly metrics across a 6 month period we are talking about
>>> >>> 4K+ gets. Is that something that can easily be handled?
>>> >>>
>>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CRendon@kbb.com
>>> >
>>> >>> wrote:
>>> >>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>>> >>> >
>>> >>> > If you by "hairy" you mean the code is ugly, it was written for
>>> maximal
>>> >>> clarity.
>>> >>> > I think you'll find a few sensible loops makes it fairly clean.
>>> >>> > Otherwise I'm not sure what you mean.
>>> >>> >
>>> >>> > -----Original Message-----
>>> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> >>> > To: user@hbase.apache.org
>>> >>> > Subject: Re: Help with row and column design
>>> >>> >
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> >>> pre-store every level of aggregation you care about.
>>> >>> >
>>> >>> > Ok I think this makes sense. Gets a bit hairy when doing say a
>>> shitload
>>> >>> of gets thought.. no?
>>> >>> >
>>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <
>>> CRendon@kbb.com>
>>> >>> wrote:
>>> >>> >> You don't do a scan, you do a series of gets, which I believe you
>>> can
>>> >>> batch into one call.
>>> >>> >>
>>> >>> >> last 5 days query in pseudocode
>>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>>> >>> >>
>>> >>> >> For each result you look for the particular column or columns you
>>> are
>>> >>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
>>> >>> res3.get("c:usa") + ...
>>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>> >>> >>
>>> >>> >> "What happens when we add more fields? Do we just keep adding in
>>> more
>>> >>> column qualifiers? If so, how would we filter across columns to get an
>>> >>> aggregate total?"
>>> >>> >>
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> >>> pre-store every level of aggregation you care about.
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >>> >> To: user@hbase.apache.org
>>> >>> >> Subject: Re: Help with row and column design
>>> >>> >>
>>> >>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> >>> reading a range of time it does not
>>> >>> >>
>>> >>> >> How can you do a scan query between dates when you hash the date?
>>> >>> >>
>>> >>> >>> Column qualifiers are just the collection of items you are
>>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>>> might
>>> >>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> >>> c:italy:sex:f, c:italy,
>>> >>> >>
>>> >>> >> What happens when we add more fields? Do we just keep adding in more
>>> >>> column qualifiers? If so, how would we filter across columns to get an
>>> >>> aggregate total?
>>> >>>
>>>

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
I did not know of the FuzzyRowFilter... that looks like it may be my best bet.

Anyone know what Sematext's HBaseWD uses to perform efficient scanning?

On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <ls...@gmail.com> wrote:
> I would recommend pre-splitting the tables and then hashing your key and
> putting that in the front.  ie
>
> [hash(20140429:Country:US)][2014042901:Country:US]  #notice you're not
> hashing the sequence number
>
> some pseudo python code
>
>>>> import hashlib
>>>> key = "2014042901:Country:US"
>>>> ckey = "20140429:Country:US"
>>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5],key)
>>>> hbase_key
> '887d82014042901:Country:US'
>
> Now when you want to find something, you can just create the hash ('887d8')
> and use FuzzyRowFilter to find it!
>
> cheers,
> liam
>
>
>
>
>
>
>
>
> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <st...@gmail.com>wrote:
>
>> Any improvements in the row key design?
>>
>> If i always know we will be querying by country could/should I prefix
>> the row key with the country to help with hotspotting?
>>
>> FR/2014042901
>> FR/2014042902
>> ....
>> US/2014042901
>> US/2014042902
>> ...
>>
>> Is this preferred over adding it in a column... ie 2014042901:Country:US
>>
>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com>
>> wrote:
>> > Ok didnt know if the sheer number of gets would be a limiting factor.
>> Thanks
>> >
>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> As I said this afternoon:
>> >> See the following API in HTable for batching Get's :
>> >>
>> >>   public Result[] get(List<Get> gets) throws IOException {
>> >>
>> >> Cheers
>> >>
>> >>
>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <
>> static.void.dev@gmail.com>wrote:
>> >>
>> >>> Nothing against your code. I just meant that if we are doing a scan
>> >>> say for hourly metrics across a 6 month period we are talking about
>> >>> 4K+ gets. Is that something that can easily be handled?
>> >>>
>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CRendon@kbb.com
>> >
>> >>> wrote:
>> >>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>> >>> >
>> >>> > If you by "hairy" you mean the code is ugly, it was written for
>> maximal
>> >>> clarity.
>> >>> > I think you'll find a few sensible loops makes it fairly clean.
>> >>> > Otherwise I'm not sure what you mean.
>> >>> >
>> >>> > -----Original Message-----
>> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>> >>> > To: user@hbase.apache.org
>> >>> > Subject: Re: Help with row and column design
>> >>> >
>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>> >>> pre-store every level of aggregation you care about.
>> >>> >
>> >>> > Ok I think this makes sense. Gets a bit hairy when doing say a
>> shitload
>> >>> of gets thought.. no?
>> >>> >
>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <
>> CRendon@kbb.com>
>> >>> wrote:
>> >>> >> You don't do a scan, you do a series of gets, which I believe you
>> can
>> >>> batch into one call.
>> >>> >>
>> >>> >> last 5 days query in pseudocode
>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>> >>> >>
>> >>> >> For each result you look for the particular column or columns you
>> are
>> >>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
>> >>> res3.get("c:usa") + ...
>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>> >>> >>
>> >>> >> "What happens when we add more fields? Do we just keep adding in
>> more
>> >>> column qualifiers? If so, how would we filter across columns to get an
>> >>> aggregate total?"
>> >>> >>
>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>> >>> pre-store every level of aggregation you care about.
>> >>> >>
>> >>> >> -----Original Message-----
>> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>> >>> >> To: user@hbase.apache.org
>> >>> >> Subject: Re: Help with row and column design
>> >>> >>
>> >>> >>> The downside is it still has a hotspot when inserting, but when
>> >>> >>> reading a range of time it does not
>> >>> >>
>> >>> >> How can you do a scan query between dates when you hash the date?
>> >>> >>
>> >>> >>> Column qualifiers are just the collection of items you are
>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>> might
>> >>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>> >>> >>> c:italy:sex:f, c:italy,
>> >>> >>
>> >>> >> What happens when we add more fields? Do we just keep adding in more
>> >>> column qualifiers? If so, how would we filter across columns to get an
>> >>> aggregate total?
>> >>>
>>

Re: Help with row and column design

Posted by Liam Slusser <ls...@gmail.com>.
I would recommend pre-splitting the tables and then hashing your key and
putting that in the front, i.e.:

[hash(20140429:Country:US)][2014042901:Country:US]  #notice you're not
hashing the sequence number

some pseudo python code

>>> import hashlib
>>> key = "2014042901:Country:US"
>>> ckey = "20140429:Country:US"
>>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5],key)
>>> hbase_key
'887d82014042901:Country:US'

Now when you want to find something, you can just create the hash ('887d8')
and use FuzzyRowFilter to find it!
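
For what it's worth, a rough Java sketch of that FuzzyRowFilter scan
(key layout taken from the example above, everything else illustrative):
the mask marks the 5 hash bytes as "don't care" and pins the rest of
the key.

import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// template matching <5 fuzzy hash chars> + "2014042901:Country:US"
byte[] template = Bytes.toBytes("?????2014042901:Country:US");
byte[] mask = new byte[template.length];   // 0 = byte must match
Arrays.fill(mask, (byte) 0);
for (int i = 0; i < 5; i++) {
  mask[i] = 1;                             // 1 = byte can be anything (the hash)
}
Scan scan = new Scan();
scan.setFilter(new FuzzyRowFilter(
    Arrays.asList(new Pair<byte[], byte[]>(template, mask))));

(Since the hash here is derivable from the key fields, a plain prefix
scan also works when you know them all; the fuzzy filter pays off when
the leading bytes genuinely vary, e.g. one hour across all countries.)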

cheers,
liam








On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <st...@gmail.com>wrote:

> Any improvements in the row key design?
>
> If i always know we will be querying by country could/should I prefix
> the row key with the country to help with hotspotting?
>
> FR/2014042901
> FR/2014042902
> ....
> US/2014042901
> US/2014042902
> ...
>
> Is this preferred over adding it in a column... ie 2014042901:Country:US
>
> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com>
> wrote:
> > Ok didnt know if the sheer number of gets would be a limiting factor.
> Thanks
> >
> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
> >> As I said this afternoon:
> >> See the following API in HTable for batching Get's :
> >>
> >>   public Result[] get(List<Get> gets) throws IOException {
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <
> static.void.dev@gmail.com>wrote:
> >>
> >>> Nothing against your code. I just meant that if we are doing a scan
> >>> say for hourly metrics across a 6 month period we are talking about
> >>> 4K+ gets. Is that something that can easily be handled?
> >>>
> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CRendon@kbb.com
> >
> >>> wrote:
> >>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
> >>> >
> >>> > If you by "hairy" you mean the code is ugly, it was written for
> maximal
> >>> clarity.
> >>> > I think you'll find a few sensible loops makes it fairly clean.
> >>> > Otherwise I'm not sure what you mean.
> >>> >
> >>> > -----Original Message-----
> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
> >>> > To: user@hbase.apache.org
> >>> > Subject: Re: Help with row and column design
> >>> >
> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> >>> pre-store every level of aggregation you care about.
> >>> >
> >>> > Ok I think this makes sense. Gets a bit hairy when doing say a
> shitload
> >>> of gets thought.. no?
> >>> >
> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <
> CRendon@kbb.com>
> >>> wrote:
> >>> >> You don't do a scan, you do a series of gets, which I believe you
> can
> >>> batch into one call.
> >>> >>
> >>> >> last 5 days query in pseudocode
> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
> >>> >>
> >>> >> For each result you look for the particular column or columns you
> are
> >>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
> >>> res3.get("c:usa") + ...
> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
> >>> >>
> >>> >> "What happens when we add more fields? Do we just keep adding in
> more
> >>> column qualifiers? If so, how would we filter across columns to get an
> >>> aggregate total?"
> >>> >>
> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> >>> pre-store every level of aggregation you care about.
> >>> >>
> >>> >> -----Original Message-----
> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >>> >> To: user@hbase.apache.org
> >>> >> Subject: Re: Help with row and column design
> >>> >>
> >>> >>> The downside is it still has a hotspot when inserting, but when
> >>> >>> reading a range of time it does not
> >>> >>
> >>> >> How can you do a scan query between dates when you hash the date?
> >>> >>
> >>> >>> Column qualifiers are just the collection of items you are
> >>> >>> aggregating on. Values are increments. In your case qualifiers
> might
> >>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >>> >>> c:italy:sex:f, c:italy,
> >>> >>
> >>> >> What happens when we add more fields? Do we just keep adding in more
> >>> column qualifiers? If so, how would we filter across columns to get an
> >>> aggregate total?
> >>>
>

Re: Help with row and column design

Posted by Sreepathi <pr...@gmail.com>.
I guess you can pre-split tables manually, which avoids hotspotting.
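
Something like this would create a pre-split table (a sketch only:
0.96-era admin API, placeholder table/family names, splitting on the
first hex digit of a hash prefix like the one suggested earlier in the
thread):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("pageviews"));
desc.addFamily(new HColumnDescriptor("c"));
// 16 regions keyed on the first hex digit of the salted row key,
// so sequential dates spread across regionservers from day one
byte[][] splits = new byte[15][];
for (int i = 1; i <= 15; i++) {
  splits[i - 1] = Bytes.toBytes(Integer.toHexString(i));
}
admin.createTable(desc, splits);
admin.close();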


On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <st...@gmail.com>wrote:

> Any improvements in the row key design?
>
> If i always know we will be querying by country could/should I prefix
> the row key with the country to help with hotspotting?
>
> FR/2014042901
> FR/2014042902
> ....
> US/2014042901
> US/2014042902
> ...
>
> Is this preferred over adding it in a column... ie 2014042901:Country:US
>
> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com>
> wrote:
> > Ok didnt know if the sheer number of gets would be a limiting factor.
> Thanks
> >
> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
> >> As I said this afternoon:
> >> See the following API in HTable for batching Get's :
> >>
> >>   public Result[] get(List<Get> gets) throws IOException {
> >>
> >> Cheers
> >>
> >>
> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <
> static.void.dev@gmail.com>wrote:
> >>
> >>> Nothing against your code. I just meant that if we are doing a scan
> >>> say for hourly metrics across a 6 month period we are talking about
> >>> 4K+ gets. Is that something that can easily be handled?
> >>>
> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CRendon@kbb.com
> >
> >>> wrote:
> >>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
> >>> >
> >>> > If you by "hairy" you mean the code is ugly, it was written for
> maximal
> >>> clarity.
> >>> > I think you'll find a few sensible loops makes it fairly clean.
> >>> > Otherwise I'm not sure what you mean.
> >>> >
> >>> > -----Original Message-----
> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
> >>> > To: user@hbase.apache.org
> >>> > Subject: Re: Help with row and column design
> >>> >
> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> >>> pre-store every level of aggregation you care about.
> >>> >
> >>> > Ok I think this makes sense. Gets a bit hairy when doing say a
> shitload
> >>> of gets thought.. no?
> >>> >
> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <
> CRendon@kbb.com>
> >>> wrote:
> >>> >> You don't do a scan, you do a series of gets, which I believe you
> can
> >>> batch into one call.
> >>> >>
> >>> >> last 5 days query in pseudocode
> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
> >>> >>
> >>> >> For each result you look for the particular column or columns you
> are
> >>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
> >>> res3.get("c:usa") + ...
> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
> >>> >>
> >>> >> "What happens when we add more fields? Do we just keep adding in
> more
> >>> column qualifiers? If so, how would we filter across columns to get an
> >>> aggregate total?"
> >>> >>
> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> >>> pre-store every level of aggregation you care about.
> >>> >>
> >>> >> -----Original Message-----
> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >>> >> To: user@hbase.apache.org
> >>> >> Subject: Re: Help with row and column design
> >>> >>
> >>> >>> The downside is it still has a hotspot when inserting, but when
> >>> >>> reading a range of time it does not
> >>> >>
> >>> >> How can you do a scan query between dates when you hash the date?
> >>> >>
> >>> >>> Column qualifiers are just the collection of items you are
> >>> >>> aggregating on. Values are increments. In your case qualifiers
> might
> >>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >>> >>> c:italy:sex:f, c:italy,
> >>> >>
> >>> >> What happens when we add more fields? Do we just keep adding in more
> >>> column qualifiers? If so, how would we filter across columns to get an
> >>> aggregate total?
> >>>
>



-- 
Regards,
Sreepathi

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
Any improvements in the row key design?

If I always know we will be querying by country, could/should I prefix
the row key with the country to help with hotspotting?

FR/2014042901
FR/2014042902
....
US/2014042901
US/2014042902
...

Is this preferred over adding it in a column... i.e. 2014042901:Country:US

On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com> wrote:
> Ok didnt know if the sheer number of gets would be a limiting factor. Thanks
>
> On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
>> As I said this afternoon:
>> See the following API in HTable for batching Get's :
>>
>>   public Result[] get(List<Get> gets) throws IOException {
>>
>> Cheers
>>
>>
>> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <st...@gmail.com>wrote:
>>
>>> Nothing against your code. I just meant that if we are doing a scan
>>> say for hourly metrics across a 6 month period we are talking about
>>> 4K+ gets. Is that something that can easily be handled?
>>>
>>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
>>> wrote:
>>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>>> >
>>> > If you by "hairy" you mean the code is ugly, it was written for maximal
>>> clarity.
>>> > I think you'll find a few sensible loops makes it fairly clean.
>>> > Otherwise I'm not sure what you mean.
>>> >
>>> > -----Original Message-----
>>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> > To: user@hbase.apache.org
>>> > Subject: Re: Help with row and column design
>>> >
>>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> pre-store every level of aggregation you care about.
>>> >
>>> > Ok I think this makes sense. Gets a bit hairy when doing say a shitload
>>> of gets thought.. no?
>>> >
>>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
>>> wrote:
>>> >> You don't do a scan, you do a series of gets, which I believe you can
>>> batch into one call.
>>> >>
>>> >> last 5 days query in pseudocode
>>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>>> >>
>>> >> For each result you look for the particular column or columns you are
>>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
>>> res3.get("c:usa") + ...
>>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>> >>
>>> >> "What happens when we add more fields? Do we just keep adding in more
>>> column qualifiers? If so, how would we filter across columns to get an
>>> aggregate total?"
>>> >>
>>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> pre-store every level of aggregation you care about.
>>> >>
>>> >> -----Original Message-----
>>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >> To: user@hbase.apache.org
>>> >> Subject: Re: Help with row and column design
>>> >>
>>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> reading a range of time it does not
>>> >>
>>> >> How can you do a scan query between dates when you hash the date?
>>> >>
>>> >>> Column qualifiers are just the collection of items you are
>>> >>> aggregating on. Values are increments. In your case qualifiers might
>>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> c:italy:sex:f, c:italy,
>>> >>
>>> >> What happens when we add more fields? Do we just keep adding in more
>>> column qualifiers? If so, how would we filter across columns to get an
>>> aggregate total?
>>>

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
Ok, didn't know if the sheer number of gets would be a limiting factor. Thanks

On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
> As I said this afternoon:
> See the following API in HTable for batching Get's :
>
>   public Result[] get(List<Get> gets) throws IOException {
>
> Cheers
>
>
> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <st...@gmail.com>wrote:
>
>> Nothing against your code. I just meant that if we are doing a scan
>> say for hourly metrics across a 6 month period we are talking about
>> 4K+ gets. Is that something that can easily be handled?
>>
>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
>> wrote:
>> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
>> >
>> > If you by "hairy" you mean the code is ugly, it was written for maximal
>> clarity.
>> > I think you'll find a few sensible loops makes it fairly clean.
>> > Otherwise I'm not sure what you mean.
>> >
>> > -----Original Message-----
>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>> > Sent: Tuesday, April 29, 2014 5:02 PM
>> > To: user@hbase.apache.org
>> > Subject: Re: Help with row and column design
>> >
>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>> pre-store every level of aggregation you care about.
>> >
>> > Ok I think this makes sense. Gets a bit hairy when doing say a shitload
>> of gets thought.. no?
>> >
>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
>> wrote:
>> >> You don't do a scan, you do a series of gets, which I believe you can
>> batch into one call.
>> >>
>> >> last 5 days query in pseudocode
>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>> >>
>> >> For each result you look for the particular column or columns you are
>> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
>> res3.get("c:usa") + ...
>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>> >>
>> >> "What happens when we add more fields? Do we just keep adding in more
>> column qualifiers? If so, how would we filter across columns to get an
>> aggregate total?"
>> >>
>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>> pre-store every level of aggregation you care about.
>> >>
>> >> -----Original Message-----
>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>> >> To: user@hbase.apache.org
>> >> Subject: Re: Help with row and column design
>> >>
>> >>> The downside is it still has a hotspot when inserting, but when
>> >>> reading a range of time it does not
>> >>
>> >> How can you do a scan query between dates when you hash the date?
>> >>
>> >>> Column qualifiers are just the collection of items you are
>> >>> aggregating on. Values are increments. In your case qualifiers might
>> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>> >>> c:italy:sex:f, c:italy,
>> >>
>> >> What happens when we add more fields? Do we just keep adding in more
>> column qualifiers? If so, how would we filter across columns to get an
>> aggregate total?
>>

Re: Help with row and column design

Posted by Ted Yu <yu...@gmail.com>.
As I said this afternoon:
See the following API in HTable for batching Get's :

  public Result[] get(List<Get> gets) throws IOException {

Cheers


On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <st...@gmail.com>wrote:

> Nothing against your code. I just meant that if we are doing a scan
> say for hourly metrics across a 6 month period we are talking about
> 4K+ gets. Is that something that can easily be handled?
>
> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
> wrote:
> >> Gets a bit hairy when doing say a shitload of gets thought.. no?
> >
> > If you by "hairy" you mean the code is ugly, it was written for maximal
> clarity.
> > I think you'll find a few sensible loops makes it fairly clean.
> > Otherwise I'm not sure what you mean.
> >
> > -----Original Message-----
> > From: Software Dev [mailto:static.void.dev@gmail.com]
> > Sent: Tuesday, April 29, 2014 5:02 PM
> > To: user@hbase.apache.org
> > Subject: Re: Help with row and column design
> >
> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> pre-store every level of aggregation you care about.
> >
> > Ok I think this makes sense. Gets a bit hairy when doing say a shitload
> of gets thought.. no?
> >
> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com>
> wrote:
> >> You don't do a scan, you do a series of gets, which I believe you can
> batch into one call.
> >>
> >> last 5 days query in pseudocode
> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
> >>
> >> For each result you look for the particular column or columns you are
> >> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") +
> res3.get("c:usa") + ...
> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
> >>
> >> "What happens when we add more fields? Do we just keep adding in more
> column qualifiers? If so, how would we filter across columns to get an
> aggregate total?"
> >>
> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
> pre-store every level of aggregation you care about.
> >>
> >> -----Original Message-----
> >> From: Software Dev [mailto:static.void.dev@gmail.com]
> >> Sent: Tuesday, April 29, 2014 4:36 PM
> >> To: user@hbase.apache.org
> >> Subject: Re: Help with row and column design
> >>
> >>> The downside is it still has a hotspot when inserting, but when
> >>> reading a range of time it does not
> >>
> >> How can you do a scan query between dates when you hash the date?
> >>
> >>> Column qualifiers are just the collection of items you are
> >>> aggregating on. Values are increments. In your case qualifiers might
> >>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
> >>> c:italy:sex:f, c:italy,
> >>
> >> What happens when we add more fields? Do we just keep adding in more
> column qualifiers? If so, how would we filter across columns to get an
> aggregate total?
>

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
Nothing against your code. I just meant that if we are doing a scan,
say for hourly metrics across a 6-month period, we are talking about
4K+ gets. Is that something that can easily be handled?

On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CR...@kbb.com> wrote:
>> Gets a bit hairy when doing, say, a shitload of gets though... no?
>
> If by "hairy" you mean the code is ugly, it was written for maximal clarity.
> I think you'll find a few sensible loops makes it fairly clean.
> Otherwise I'm not sure what you mean.
>
> -----Original Message-----
> From: Software Dev [mailto:static.void.dev@gmail.com]
> Sent: Tuesday, April 29, 2014 5:02 PM
> To: user@hbase.apache.org
> Subject: Re: Help with row and column design
>
>> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.
>
> Ok, I think this makes sense. Gets a bit hairy when doing, say, a shitload of gets though... no?
>
> On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com> wrote:
>> You don't do a scan, you do a series of gets, which I believe you can batch into one call.
>>
>> last 5 days query in pseudocode
>> res1 = Get( hash("2014-04-29") + "2014-04-29")
>> res2 = Get( hash("2014-04-28") + "2014-04-28")
>> res3 = Get( hash("2014-04-27") + "2014-04-27")
>> res4 = Get( hash("2014-04-26") + "2014-04-26")
>> res5 = Get( hash("2014-04-25") + "2014-04-25")
>>
>> For each result you look for the particular column or columns you are
>> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
>> Total_female_usa = res1.get("c:usa:sex:f") + ...
>>
>> "What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?"
>>
>> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.
>>
>> -----Original Message-----
>> From: Software Dev [mailto:static.void.dev@gmail.com]
>> Sent: Tuesday, April 29, 2014 4:36 PM
>> To: user@hbase.apache.org
>> Subject: Re: Help with row and column design
>>
>>> The downside is it still has a hotspot when inserting, but when
>>> reading a range of time it does not
>>
>> How can you do a scan query between dates when you hash the date?
>>
>>> Column qualifiers are just the collection of items you are
>>> aggregating on. Values are increments. In your case qualifiers might
>>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> c:italy:sex:f, c:italy,
>>
>> What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?

RE: Help with row and column design

Posted by "Rendon, Carlos (KBB)" <CR...@kbb.com>.
> Gets a bit hairy when doing, say, a shitload of gets though... no?

If by "hairy" you mean the code is ugly, it was written for maximal clarity.
I think you'll find a few sensible loops make it fairly clean.
Otherwise I'm not sure what you mean.

-----Original Message-----
From: Software Dev [mailto:static.void.dev@gmail.com] 
Sent: Tuesday, April 29, 2014 5:02 PM
To: user@hbase.apache.org
Subject: Re: Help with row and column design

> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.

Ok I think this makes sense. Gets a bit hairy when doing say a shitload of gets thought.. no?

On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com> wrote:
> You don't do a scan, you do a series of gets, which I believe you can batch into one call.
>
> last 5 days query in pseudocode
> res1 = Get( hash("2014-04-29") + "2014-04-29")
> res2 = Get( hash("2014-04-28") + "2014-04-28")
> res3 = Get( hash("2014-04-27") + "2014-04-27")
> res4 = Get( hash("2014-04-26") + "2014-04-26")
> res5 = Get( hash("2014-04-25") + "2014-04-25")
>
> For each result you look for the particular column or columns you are 
> interested in Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
> Total_female_usa = res1.get("c:usa:sex:f") + ...
>
> "What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?"
>
> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.
>
> -----Original Message-----
> From: Software Dev [mailto:static.void.dev@gmail.com]
> Sent: Tuesday, April 29, 2014 4:36 PM
> To: user@hbase.apache.org
> Subject: Re: Help with row and column design
>
>> The downside is it still has a hotspot when inserting, but when 
>> reading a range of time it does not
>
> How can you do a scan query between dates when you hash the date?
>
>> Column qualifiers are just the collection of items you are 
>> aggregating on. Values are increments. In your case qualifiers might 
>> look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, 
>> c:italy:sex:f, c:italy,
>
> What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.

Ok, I think this makes sense. Gets a bit hairy when doing, say, a
shitload of gets though... no?

On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com> wrote:
> You don't do a scan, you do a series of gets, which I believe you can batch into one call.
>
> last 5 days query in pseudocode
> res1 = Get( hash("2014-04-29") + "2014-04-29")
> res2 = Get( hash("2014-04-28") + "2014-04-28")
> res3 = Get( hash("2014-04-27") + "2014-04-27")
> res4 = Get( hash("2014-04-26") + "2014-04-26")
> res5 = Get( hash("2014-04-25") + "2014-04-25")
>
> For each result you look for the particular column or columns you are interested in
> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
> Total_female_usa = res1.get("c:usa:sex:f") + ...
>
> "What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?"
>
> Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.
>
> -----Original Message-----
> From: Software Dev [mailto:static.void.dev@gmail.com]
> Sent: Tuesday, April 29, 2014 4:36 PM
> To: user@hbase.apache.org
> Subject: Re: Help with row and column design
>
>> The downside is it still has a hotspot when inserting, but when
>> reading a range of time it does not
>
> How can you do a scan query between dates when you hash the date?
>
>> Column qualifiers are just the collection of items you are aggregating
>> on. Values are increments. In your case qualifiers might look like
>> c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, c:italy:sex:f,
>> c:italy,
>
> What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?

Re: Help with row and column design

Posted by Ted Yu <yu...@gmail.com>.
bq. I believe you can batch into one call.

See the following API in HTable for batching Get's :

  public Result[] get(List<Get> gets) throws IOException {
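
A minimal, hypothetical use of that call (assumes an open HTable named
table, plus the hash() salt function from Carlos's pseudocode):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

List<Get> gets = new ArrayList<Get>();
for (String day : new String[] {"2014-04-29", "2014-04-28", "2014-04-27"}) {
  gets.add(new Get(Bytes.toBytes(hash(day) + day)));
}
// one client call; the Gets are resolved against their regions internally
Result[] results = table.get(gets);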


On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CR...@kbb.com>wrote:

> You don't do a scan, you do a series of gets, which I believe you can
> batch into one call.
>
> last 5 days query in pseudocode
> res1 = Get( hash("2014-04-29") + "2014-04-29")
> res2 = Get( hash("2014-04-28") + "2014-04-28")
> res3 = Get( hash("2014-04-27") + "2014-04-27")
> res4 = Get( hash("2014-04-26") + "2014-04-26")
> res5 = Get( hash("2014-04-25") + "2014-04-25")
>
> For each result you look for the particular column or columns you are
> interested in
> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
> Total_female_usa = res1.get("c:usa:sex:f") + ...
>
> "What happens when we add more fields? Do we just keep adding in more
> column qualifiers? If so, how would we filter across columns to get an
> aggregate total?"
>
> Yes. See total_usa vs. total_female_usa above. Basically you have to
> pre-store every level of aggregation you care about.
>
> -----Original Message-----
> From: Software Dev [mailto:static.void.dev@gmail.com]
> Sent: Tuesday, April 29, 2014 4:36 PM
> To: user@hbase.apache.org
> Subject: Re: Help with row and column design
>
> > The downside is it still has a hotspot when inserting, but when
> > reading a range of time it does not
>
> How can you do a scan query between dates when you hash the date?
>
> > Column qualifiers are just the collection of items you are aggregating
> > on. Values are increments. In your case qualifiers might look like
> > c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, c:italy:sex:f,
> > c:italy,
>
> What happens when we add more fields? Do we just keep adding in more
> column qualifiers? If so, how would we filter across columns to get an
> aggregate total?
>

RE: Help with row and column design

Posted by "Rendon, Carlos (KBB)" <CR...@kbb.com>.
You don't do a scan, you do a series of gets, which I believe you can batch into one call.

last 5 days query in pseudocode
res1 = Get( hash("2014-04-29") + "2014-04-29")
res2 = Get( hash("2014-04-28") + "2014-04-28")
res3 = Get( hash("2014-04-27") + "2014-04-27")
res4 = Get( hash("2014-04-26") + "2014-04-26")
res5 = Get( hash("2014-04-25") + "2014-04-25")

For each result you look for the particular column or columns you are interested in
Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ... 
Total_female_usa = res1.get("c:usa:sex:f") + ...

"What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?"

Yes. See total_usa vs. total_female_usa above. Basically you have to pre-store every level of aggregation you care about.
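
Rendered as Java against the same client (hypothetical: hash() stands
in for the salt function, table is an open HTable, family "c" as
above), the pseudocode becomes roughly:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

byte[] fam = Bytes.toBytes("c");
List<Get> gets = new ArrayList<Get>();
for (String day : new String[] {"2014-04-29", "2014-04-28", "2014-04-27",
                                "2014-04-26", "2014-04-25"}) {
  gets.add(new Get(Bytes.toBytes(hash(day) + day)));
}
Result[] results = table.get(gets);      // the batched call

long totalUsa = 0, totalFemaleUsa = 0;
for (Result res : results) {
  byte[] v = res.getValue(fam, Bytes.toBytes("usa"));
  if (v != null) totalUsa += Bytes.toLong(v);        // counters are 8-byte longs
  v = res.getValue(fam, Bytes.toBytes("usa:sex:f"));
  if (v != null) totalFemaleUsa += Bytes.toLong(v);
}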

-----Original Message-----
From: Software Dev [mailto:static.void.dev@gmail.com] 
Sent: Tuesday, April 29, 2014 4:36 PM
To: user@hbase.apache.org
Subject: Re: Help with row and column design

> The downside is it still has a hotspot when inserting, but when 
> reading a range of time it does not

How can you do a scan query between dates when you hash the date?

> Column qualifiers are just the collection of items you are aggregating 
> on. Values are increments. In your case qualifiers might look like 
> c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, c:italy:sex:f, 
> c:italy,

What happens when we add more fields? Do we just keep adding in more column qualifiers? If so, how would we filter across columns to get an aggregate total?

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
> The downside is it still has a hotspot when inserting, but when reading a range of time it does not

How can you do a scan query between dates when you hash the date?

> Column qualifiers are just the collection of items you are aggregating on. Values are increments. In your case qualifiers might look like
> c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, c:italy:sex:f, c:italy,

What happens when we add more fields? Do we just keep adding in more
column qualifiers? If so, how would we filter across columns to get an
aggregate total?

RE: Help with row and column design

Posted by "Rendon, Carlos (KBB)" <CR...@kbb.com>.
I've created a similar system using a rowkey like: (hash of date) - date
The downside is it still has a hotspot when inserting, but when reading a range of time it does not. My use case was geared towards speeding up lots of reads.

Column qualifiers are just the collection of items you are aggregating on. Values are increments. In your case qualifiers might look like

c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m, c:italy:sex:f, c:italy, 

Basically any combination of things you care about. This has the downside that you have to determine what filters are available up front and not after the fact. The upside is querying should be fast.

Computing counts over time is a batch of gets, which you can build from the list of dates/times you care about. For each qualifier, you sum across all of the row results.
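
On the write side, each page view bumps every pre-aggregated counter it
belongs to. A minimal sketch under the same assumptions (a 5-char
MD5-derived salt, like Liam's Python example elsewhere in the thread;
the logged-in qualifier naming is made up here):

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

class SaltedCounters {
  // salt the date so consecutive days spread across regions
  static String hash(String date) throws NoSuchAlgorithmException {
    byte[] md5 = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(date));
    StringBuilder hex = new StringBuilder();
    for (byte b : md5) hex.append(String.format("%02x", b));
    return hex.substring(0, 5);
  }

  // e.g. a page view from a female, logged-in viewer in the USA
  static void recordPageView(HTable table, String date)
      throws IOException, NoSuchAlgorithmException {
    Increment inc = new Increment(Bytes.toBytes(hash(date) + date));
    byte[] fam = Bytes.toBytes("c");
    for (String q : new String[] {"usa", "usa:sex:f", "usa:login:on"}) {
      inc.addColumn(fam, Bytes.toBytes(q), 1L);
    }
    table.increment(inc);
  }
}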

Hope this gives you some ideas,

Carlos

-----Original Message-----
From: Software Dev [mailto:static.void.dev@gmail.com] 
Sent: Tuesday, April 29, 2014 3:51 PM
To: user@hbase.apache.org
Subject: Re: Help with row and column design

Someone mentioned in another post about hotspotting. I guess I could reverse the row keys to prevent this?

On Tue, Apr 29, 2014 at 3:34 PM, Software Dev <st...@gmail.com> wrote:
> Hey all. I have some questions regarding row key and column design.
>
> We want to calculate some metrics based on our page views broken down 
> by hour, day, month and year. We also want this broken down by country
> and have the ability to filter by some other attributes such as the 
> sex of the user or whether or not the user is logged in..... Note 
> these will all be increments.
>
> So we have the initial row key design as
>
> YYYY - Row key for yearly totals
> YYYYMM - Row key for monthly totals
> YYYYMMDD - Row key for daily totals
> YYYYMMDDHH - Row key for hourly totals
>
> I think this may make sense as it will be easy to do a range scan over 
> a time period.
>
> Now for my column design. We were thinking along these lines.
>
> daily:US  - Daily counts for the US
> hourly:CA - Hourly counts for Canada
> ... and so on
>
> Now this seems like it would work but fails when we add in the 
> requirement of filtering results based on some other attributes. Say we
> wanted to be able to filter based on sex (M or F) and/or filter based 
> on logged in status (Online or Offline) OR and/or filter based on some 
> other attribute OR perform no filtering at all. How would I go about 
> accomplishing this?
>
> Thanks for any input/pointers.

Re: Help with row and column design

Posted by Software Dev <st...@gmail.com>.
Someone mentioned hotspotting in another post. I guess I could
reverse the row keys to prevent this?

On Tue, Apr 29, 2014 at 3:34 PM, Software Dev <st...@gmail.com> wrote:
> Hey all. I have some questions regarding row key and column design.
>
> We want to calculate some metrics based on our page views broken down
> by hour, day, month and year. We also want this broken down by country
> and have the ability to filter by some other attributes such as the
> sex of the user or whether or not the user is logged in..... Note
> these will all be increments.
>
> So we have the initial row key design as
>
> YYYY - Row key for yearly totals
> YYYYMM - Row key for monthly totals
> YYYYMMDD - Row key for daily totals
> YYYYMMDDHH - Row key for hourly totals
>
> I think this may make sense as it will be easy to do a range scan over
> a time period.
>
> Now for my column design. We were thinking along these lines.
>
> daily:US  - Daily counts for the US
> hourly:CA - Hourly counts for Canada
> ... and so on
>
> Now this seems like it would work but fails when we add in the
> requirement of filtering results based on some other attributes. Say we
> wanted to be able to filter based on sex (M or F) and/or filter based
> on logged in status (Online or Offline) OR and/or filter based on some
> other attribute OR perform no filtering at all. How would I go about
> accomplishing this?
>
> Thanks for any input/pointers.