Posted to user@hbase.apache.org by Software Dev <st...@gmail.com> on 2014/05/02 02:18:14 UTC

Re: Help with row and column design

FuzzyRowFilter is not part of the REST client, so this may not be an
option for us. Any alternatives?
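[Editor's note: one REST-friendly alternative is the bucket-salting approach HBaseWD uses (mentioned below): salt keys into a small fixed number of buckets, then fan out one plain start/stop-row scan per bucket and merge the results, since plain scans do work over the REST client. A sketch; the bucket count and key layout here are assumptions, not from the thread.]

```python
import hashlib

N_BUCKETS = 8  # assumed bucket count; tune to your cluster/region count

def salted_key(key):
    """Write path: prefix the key with a bucket id from a stable hash."""
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % N_BUCKETS
    return "%d:%s" % (bucket, key)

def scan_ranges(start_key, stop_key):
    """Read path: one (startRow, stopRow) pair per bucket.
    Issue N plain scans (these work over REST) and merge the results."""
    return [("%d:%s" % (b, start_key), "%d:%s" % (b, stop_key))
            for b in range(N_BUCKETS)]
```

The trade-off is N scans per query instead of one filtered scan, which is exactly the price HBaseWD pays for even write distribution.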

On Wed, Apr 30, 2014 at 10:28 AM, Software Dev
<st...@gmail.com> wrote:
> I did not know of FuzzyRowFilter; that looks like it may be my best bet.
>
> Anyone know what Sematext's HBaseWD uses to perform efficient scanning?
>
> On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser <ls...@gmail.com> wrote:
>> I would recommend pre-splitting the table and then hashing your key and
>> putting the hash at the front, i.e.:
>>
>> [hash(20140429:Country:US)][2014042901:Country:US]  # notice you're not
>> hashing the sequence number
>>
>> Some pseudo-Python code:
>>
>>>>> import hashlib
>>>>> key = "2014042901:Country:US"
>>>>> ckey = "20140429:Country:US"  # the key minus the sequence number
>>>>> # .encode() is needed on Python 3; md5 wants bytes
>>>>> hbase_key = "%s%s" % (hashlib.md5(ckey.encode()).hexdigest()[:5], key)
>>>>> hbase_key
>> '887d82014042901:Country:US'
>>
>> Now when you want to find something, you can just recompute the hash
>> ('887d8') and use FuzzyRowFilter to find it!
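[Editor's note: in this scheme the salt is fully derivable from the date and country, so the read side does not even need FuzzyRowFilter; it can recompute the exact prefix and do a plain range scan. A sketch assuming the 5-hex-char md5 salt from the example above; the helper names are mine.]

```python
import hashlib

def salt(date, country):
    """Recompute the 5-char salt from the non-sequence part of the key."""
    ckey = "%s:Country:%s" % (date, country)
    return hashlib.md5(ckey.encode()).hexdigest()[:5]

def day_scan_range(date, country):
    """Start/stop rows covering every sequence number for one day.
    The stop row is the next calendar-style integer, exclusive; even an
    'invalid' date like 20140431 sorts correctly as an upper bound."""
    prefix = salt(date, country)
    return prefix + date, prefix + str(int(date) + 1)
```

FuzzyRowFilter only becomes necessary when the salt is *not* derivable from the query, e.g. when it covers parts of the key you do not know at read time.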
>>
>> cheers,
>> liam
>>
>> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev <st...@gmail.com> wrote:
>>
>>> Any improvements in the row key design?
>>>
>>> If I know we will always be querying by country, could/should I prefix
>>> the row key with the country to help avoid hotspotting?
>>>
>>> FR/2014042901
>>> FR/2014042902
>>> ....
>>> US/2014042901
>>> US/2014042902
>>> ...
>>>
>>> Is this preferred over adding it in a column, i.e. 2014042901:Country:US?
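[Editor's note: one way to weigh this trade-off: with the country in front, all rows for one country are contiguous, so a per-country date range becomes a single scan, at the cost of concentrating writes for high-traffic countries. A sketch; the helper names are mine.]

```python
def row_key(country, date_seq):
    """Country-prefixed key, e.g. 'US/2014042901'."""
    return "%s/%s" % (country, date_seq)

def country_range(country, start_date, stop_date):
    """A single contiguous scan per country; stop row is exclusive."""
    return row_key(country, start_date), row_key(country, stop_date)
```

With country in a column qualifier instead, one scan covers all countries for a date range, but per-country reads must filter client-side or server-side.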
>>>
>>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev <st...@gmail.com>
>>> wrote:
>>> > Ok, didn't know if the sheer number of gets would be a limiting
>>> > factor. Thanks
>>> >
>>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu <yu...@gmail.com> wrote:
>>> >> As I said this afternoon:
>>> >> See the following API in HTable for batching Get's :
>>> >>
>>> >>   public Result[] get(List<Get> gets) throws IOException {
>>> >>
>>> >> Cheers
>>> >>
>>> >>
>>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <static.void.dev@gmail.com> wrote:
>>> >>
>>> >>> Nothing against your code. I just meant that if we are doing a scan,
>>> >>> say for hourly metrics across a 6-month period, we are talking about
>>> >>> 4K+ gets. Is that something that can easily be handled?
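[Editor's note: the 4K+ figure checks out: six months of hourly rows is roughly 181 days x 24 hours, about 4,300 keys, and since the batched get() takes a list you can chunk it into a handful of round trips. A sketch; the YYYYMMDDHH key shape and 500-key batch size are assumptions.]

```python
from datetime import datetime, timedelta

def hourly_keys(start, stop):
    """All hourly row keys (YYYYMMDDHH) in [start, stop)."""
    keys, t = [], start
    while t < stop:
        keys.append(t.strftime("%Y%m%d%H"))
        t += timedelta(hours=1)
    return keys

def batches(keys, size=500):
    """Chunk the key list so each multi-get stays a reasonable size."""
    return [keys[i:i + size] for i in range(0, len(keys), size)]

keys = hourly_keys(datetime(2014, 1, 1), datetime(2014, 7, 1))
```

At 500 keys per batch that is only nine multi-get calls for the whole six months.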
>>> >>>
>>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) <CRendon@kbb.com> wrote:
>>> >>> >> Gets a bit hairy when doing, say, a shitload of gets though... no?
>>> >>> >
>>> >>> > If by "hairy" you mean the code is ugly, it was written for
>>> >>> > maximal clarity.
>>> >>> > I think you'll find a few sensible loops make it fairly clean.
>>> >>> > Otherwise I'm not sure what you mean.
>>> >>> >
>>> >>> > -----Original Message-----
>>> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>>> >>> > To: user@hbase.apache.org
>>> >>> > Subject: Re: Help with row and column design
>>> >>> >
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> >>> >> pre-store every level of aggregation you care about.
>>> >>> >
>>> >>> > Ok, I think this makes sense. Gets a bit hairy when doing, say, a
>>> >>> > shitload of gets though... no?
>>> >>> >
>>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CRendon@kbb.com> wrote:
>>> >>> >> You don't do a scan; you do a series of gets, which I believe you
>>> >>> >> can batch into one call.
>>> >>> >>
>>> >>> >> Last-5-days query in pseudocode:
>>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>>> >>> >>
>>> >>> >> For each result you look for the particular column or columns you
>>> >>> >> are interested in:
>>> >>> >> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
>>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
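[Editor's note: the pseudocode above can be made concrete. A sketch assuming the md5-style salt from the earlier example and results shaped as qualifier-to-count dicts; the helper names are mine.]

```python
import hashlib
from datetime import date, timedelta

def salted(day):
    """Hash-prefixed daily key, as in hash("2014-04-29") + "2014-04-29"."""
    return hashlib.md5(day.encode()).hexdigest()[:5] + day

def last_n_day_keys(n, today):
    """Keys for the batched Get, newest day first."""
    return [salted((today - timedelta(days=i)).isoformat()) for i in range(n)]

def total(results, qualifier):
    """Sum one counter column across the per-day Get results."""
    return sum(r.get(qualifier, 0) for r in results)
```

With five results in hand, `total(results, "c:usa")` is Total_usa and `total(results, "c:usa:sex:f")` is Total_female_usa.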
>>> >>> >>
>>> >>> >> "What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?"
>>> >>> >>
>>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have to
>>> >>> >> pre-store every level of aggregation you care about.
>>> >>> >>
>>> >>> >> -----Original Message-----
>>> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>>> >>> >> To: user@hbase.apache.org
>>> >>> >> Subject: Re: Help with row and column design
>>> >>> >>
>>> >>> >>> The downside is it still has a hotspot when inserting, but when
>>> >>> >>> reading a range of time it does not
>>> >>> >>
>>> >>> >> How can you do a scan query between dates when you hash the date?
>>> >>> >>
>>> >>> >>> Column qualifiers are just the collection of items you are
>>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>>> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>>> >>> >>> c:italy:sex:f, c:italy.
>>> >>> >>
>>> >>> >> What happens when we add more fields? Do we just keep adding in
>>> >>> >> more column qualifiers? If so, how would we filter across columns
>>> >>> >> to get an aggregate total?
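[Editor's note: on the write side, "pre-store every level" means each event increments one counter per aggregation level it belongs to, so adding a field just adds qualifiers. A sketch; the "age" field is an illustrative addition, not from the thread.]

```python
def qualifiers(country, sex, age=None):
    """Every pre-aggregated counter one event should increment.
    Each qualifier is one level of aggregation; new fields add more."""
    quals = ["c:%s" % country, "c:%s:sex:%s" % (country, sex)]
    if age is not None:  # "age" is a hypothetical extra field
        quals.append("c:%s:age:%s" % (country, age))
    return quals
```

There is no cross-column filtering at read time: every aggregate you will ever ask for must already exist as its own counter, which is why the qualifier list grows with each new field.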
>>> >>>
>>>