Posted to user@hbase.apache.org by Alex Baranau <ba...@gmail.com> on 2015/03/10 21:40:32 UTC

Re: Regarding a doubt I am having for HBase

CCing HBase's user ML.

Could you give an example of the row key and example of two different
queries you are making to better understand your case?

Thank you,

Alex Baranau
--
http://cdap.io - open source framework to build and run data applications
on Hadoop & HBase


On Mon, Mar 9, 2015 at 9:00 AM, Jaspreet Singh <Ja...@clarte.co> wrote:

>  Hi Alex,
>
>
>  Thanks a lot for the response!!! The data I have is in the form of
> hashes and ids; every id is related to a cookie's hashed data. For the id
> part the solution you suggested would work well, but for the hashes it
> would not be possible to specify a stop row based on incrementing the
> prefix. Also, I have millions of rows, so duplicating each row means I
> end up storing double what I have right now. And of course you can share
> this thread with the HBase mailing list. Let me know if you have any idea
> what to do with this hash-based data.
>
>
>  Jaspreet Singh
>  ------------------------------
> *From:* Alex Baranau <ba...@gmail.com>
> *Sent:* Thursday, March 5, 2015 2:56 PM
> *To:* Jaspreet Singh
> *Subject:* Re: Regarding a doubt I am having for HBase
>
>   Hi Jaspreet,
>
>  Do you see this time when you fetch by the first field or by the second?
> How do you construct your scan? In particular, what start and stop keys
> and filters are used?
>
>  For a scan by the first field, a simple prefix scan should work. For a
> scan by the second field, you will have to do filtering, unless you can
> denormalize your data and create a separate index to scan.
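>
>  A minimal sketch of a prefix scan with the Java client (HBase 1.0-style
> API; the table name "mytable" and the key layout "first_second" are
> assumptions for illustration, not your actual schema):
>
>   import java.io.IOException;
>   import java.util.Arrays;
>
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.TableName;
>   import org.apache.hadoop.hbase.client.Connection;
>   import org.apache.hadoop.hbase.client.ConnectionFactory;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.client.ResultScanner;
>   import org.apache.hadoop.hbase.client.Scan;
>   import org.apache.hadoop.hbase.client.Table;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class PrefixScanExample {
>     public static void main(String[] args) throws IOException {
>       // assumed row key format: "<first>_<second>"
>       byte[] prefix = Bytes.toBytes("firstValue_");
>
>       // stop row = prefix with its last byte incremented, so the scan
>       // covers exactly the rows starting with the prefix (simplified:
>       // assumes the last prefix byte is not 0xFF)
>       byte[] stopRow = Arrays.copyOf(prefix, prefix.length);
>       stopRow[stopRow.length - 1]++;
>
>       Scan scan = new Scan();
>       scan.setStartRow(prefix);
>       scan.setStopRow(stopRow);
>
>       try (Connection conn =
>                ConnectionFactory.createConnection(HBaseConfiguration.create());
>            Table table = conn.getTable(TableName.valueOf("mytable"));
>            ResultScanner scanner = table.getScanner(scan)) {
>         for (Result r : scanner) {
>           System.out.println(Bytes.toString(r.getRow()));
>         }
>       }
>     }
>   }
>
>  Because both start and stop rows are set, the scan only touches the
> region(s) holding matching keys instead of sweeping the whole table.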
>
>  In the filtering case you may be able to use fast-forwarding in the
> filter while scanning, e.g. a fuzzy row filter if the first field is of
> fixed size. Depending on your case this may help speed up scanning.
> Otherwise, you may consider implementing a custom fast-forwarding filter.
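>
>  A sketch of the fuzzy-row approach for scanning by the second field,
> assuming the first field is exactly 8 bytes followed by a "_" separator
> (the field sizes and values here are made-up assumptions):
>
>   import java.io.IOException;
>   import java.util.Arrays;
>
>   import org.apache.hadoop.hbase.HBaseConfiguration;
>   import org.apache.hadoop.hbase.TableName;
>   import org.apache.hadoop.hbase.client.Connection;
>   import org.apache.hadoop.hbase.client.ConnectionFactory;
>   import org.apache.hadoop.hbase.client.Result;
>   import org.apache.hadoop.hbase.client.ResultScanner;
>   import org.apache.hadoop.hbase.client.Scan;
>   import org.apache.hadoop.hbase.client.Table;
>   import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
>   import org.apache.hadoop.hbase.util.Bytes;
>   import org.apache.hadoop.hbase.util.Pair;
>
>   public class FuzzyScanExample {
>     public static void main(String[] args) throws IOException {
>       final int FIRST_LEN = 8;                  // fixed size of first field
>       byte[] second = Bytes.toBytes("second");  // value we scan for
>
>       // fuzzy key: zero bytes where any value is allowed, real bytes
>       // where the key must match
>       byte[] fuzzyKey = new byte[FIRST_LEN + 1 + second.length];
>       fuzzyKey[FIRST_LEN] = '_';
>       System.arraycopy(second, 0, fuzzyKey, FIRST_LEN + 1, second.length);
>
>       // mask: 1 = position may hold any byte (the first field),
>       //       0 = position must match the fuzzy key
>       byte[] fuzzyMask = new byte[fuzzyKey.length];
>       Arrays.fill(fuzzyMask, 0, FIRST_LEN, (byte) 1);
>
>       Scan scan = new Scan();
>       scan.setFilter(new FuzzyRowFilter(
>           Arrays.asList(new Pair<byte[], byte[]>(fuzzyKey, fuzzyMask))));
>
>       try (Connection conn =
>                ConnectionFactory.createConnection(HBaseConfiguration.create());
>            Table table = conn.getTable(TableName.valueOf("mytable"));
>            ResultScanner scanner = table.getScanner(scan)) {
>         for (Result r : scanner) {
>           System.out.println(Bytes.toString(r.getRow()));
>         }
>       }
>     }
>   }
>
>  The filter can fast-forward the scanner past whole key ranges whose
> fixed positions cannot match, which is where the speedup over a plain
> RowFilter comes from.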
>
>  If denormalizing is an option, you could store each record twice, with a
> second_first key format in addition, and use prefix scanning again.
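>
>  And a sketch of the dual-write part (the column family "d", qualifier
> "v", and the helper name are hypothetical):
>
>   import java.io.IOException;
>
>   import org.apache.hadoop.hbase.client.Put;
>   import org.apache.hadoop.hbase.client.Table;
>   import org.apache.hadoop.hbase.util.Bytes;
>
>   public class DualWrite {
>     static void storeBothWays(Table table, String first, String second,
>         byte[] value) throws IOException {
>       // primary row, first_second: serves prefix scans by the first field
>       Put byFirst = new Put(Bytes.toBytes(first + "_" + second));
>       byFirst.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
>
>       // index row, second_first: serves prefix scans by the second field
>       Put bySecond = new Put(Bytes.toBytes(second + "_" + first));
>       bySecond.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
>
>       table.put(byFirst);
>       table.put(bySecond);
>     }
>   }
>
>  Keep in mind the two puts are not atomic across rows, and storage
> roughly doubles; storing only a pointer to the primary row in the index
> row keeps the overhead smaller.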
>
>  Which one works best for you?
>
>  Also: can I CC the HBase mailing list on this thread? People are amazing
> there and will be happy to help too :)
>
>  Alex
>
> On Tue, Mar 3, 2015 at 12:21 PM, Jaspreet Singh <Ja...@clarte.co>
> wrote:
>
>>
>>  Hi Alex,
>>
>>
>>  I was trying to look up some questions about composite keys and
>> noticed that you are a pro when it comes to HBase-related questions :) I
>> read many of your blogs, but I am still confused about one thing. I have
>> a composite row key of the form <first field>_<second field>. I want to
>> scan my table given the value of either the first field or the second
>> field and get the results. I used a RowFilter for this, but the time to
>> fetch the rows is too high, approximately 27 seconds (the number of rows
>> is in the millions). I want to get to roughly 2 seconds or less; can you
>> suggest what I should do?
>>
>> Thank You
>>
>>
>>  Jaspreet Singh
>>
>> Clarte.co
>>
>
>

Re: Regarding a doubt I am having for HBase

Posted by Wilm Schumacher <wi...@gmail.com>.
Hi,

I would like to add a question: why do you need the ID in the first
place? The hash seems to be generated by another source and is thus
immutable. But is this true for the ID, too? If not, why not use only
the hash?

Best wishes,

Wilm
