Posted to user@hbase.apache.org by Narayanan K <kn...@gmail.com> on 2012/10/10 21:13:26 UTC

HBase Key Design : Doubt

Hi all,

I have a use case where I need to count unique occurrences of certain values
in HBase across dates.

Say, on 1st Oct, A-B-C-D appeared, so I insert a row with rowkey A-B-C-D.
On 2nd Oct, I get the same value A-B-C-D, and I don't want to store the row
redundantly under a new rowkey for 2nd Oct,
i.e. I do not want to have 20121001-A-B-C-D and 20121002-A-B-C-D as 2
rowkeys in the table.

E.g., if I have 1st Oct and 2nd Oct as 2 column families and the number of
versions is set to 1, only 1 row with rowkey A-B-C-D will be present for
both dates.
Hence, if I need the unique count for A-B-C-D across Oct 1 and Oct 2, I just
need to take the rowcount of the row A-B-C-D by filtering over the 2 column
families.
Similarly, if we have 10 date column families and I need to scan only for
2 dates, then the scan reads only the store files of the specified column
families. This will make scanning faster.

But the design problem here is that I can't add more column families to the
table each day.

I would need to store data every day, and I read that HBase doesn't work well
with more than 3 column families.

The other option is to have a single column family and store dates as
qualifiers: date:d1, date:d2, ... But if there are 30 date qualifiers
under the date column family, then scanning a single date qualifier, or maybe
a range of 2-3 dates, will have to scan through the entire data of all d1 to
d30 qualifiers in the date column family, which would be slower than
having a separate column family for each date.
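
A minimal sketch of the single-family layout described above, written as a toy
in-memory model in Python rather than against the HBase client API. The table
is a plain dict, the family name "date" and the qualifiers d1..d3 follow the
example, and record() and count_unique() are hypothetical helpers. It only
illustrates that a key seen on several dates is still one row, so counting rows
that carry at least one of the requested qualifiers gives the unique count for
that date range. It says nothing about how much data a real scan would read.

```python
# Toy in-memory model (an assumption for illustration, not the HBase API).
# table maps rowkey -> {"family:qualifier": value}
table = {}

def record(rowkey, day):
    """Mark that rowkey appeared on the given day (one qualifier per day)."""
    table.setdefault(rowkey, {})["date:" + day] = b"1"

def count_unique(days):
    """Count rows that have at least one of the requested date qualifiers."""
    wanted = ["date:" + d for d in days]
    return sum(1 for cols in table.values() if any(q in cols for q in wanted))

record("A-B-C-D", "d1")
record("A-B-C-D", "d2")  # same rowkey, new qualifier, no new row
record("E-F-G-H", "d2")
record("I-J-K-L", "d3")

print(count_unique(["d1", "d2"]))  # 2
print(count_unique(["d3"]))        # 1
```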

Please share your thoughts on this, along with any alternate design
suggestions you might have.

Regards,
Narayanan

Re: HBase Key Design : Doubt

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

No, you're right.

But if you just want to keep "500" as the value, you just have to set
the number of versions to 1 for your table...

If you just want to keep 100, then you can insert with a reversed
timestamp, so the last cell inserted will be hidden by the previous
one.
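
The two options can be illustrated with a toy in-memory model in Python (an
assumption for illustration, not the HBase API). A default read returns the
version with the highest timestamp, so with normal timestamps the last write
(500) is visible. With a reversed timestamp, computed here as a hypothetical
MAX_TS - t, the first write (100) keeps the highest timestamp and hides the
later one.

```python
# Toy model of one cell's versions (illustrative only, not the HBase API).
MAX_TS = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps

def read_latest(versions):
    """Return the value whose timestamp is highest, as a default read would."""
    return max(versions, key=lambda tv: tv[0])[1]

# Normal timestamps: the write at t=2 is newer, so it is the visible value.
normal = [(1, 100), (2, 500)]

# Reversed timestamps: the earlier write gets the larger timestamp.
reverted = [(MAX_TS - 1, 100), (MAX_TS - 2, 500)]

print(read_latest(normal))    # 500
print(read_latest(reverted))  # 100
```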

JM

2012/10/11, Narayanan K <kn...@gmail.com>:
> Hi,
>
> I have 2 column families A and B in table T1.
>
> put 'T1', 'R1', 'A:qualf1', 100
> put 'T1', 'R1', 'B:qualf2', 200
>
> As per my understanding the above is one row and one single version each
> for the 2 column families.
>
> If I do a put 'T1', 'R1', 'A:qualf1', 500, then there is another version
> for the rowkey pertaining to the combination {R1, A, qualf1}
>
> Please correct me if I am wrong.
>
> Regards,
> Narayanan
>

Re: HBase Key Design : Doubt

Posted by Narayanan K <kn...@gmail.com>.

Hi,

I have 2 column families A and B in table T1.

put 'T1', 'R1', 'A:qualf1', 100
put 'T1', 'R1', 'B:qualf2', 200

As per my understanding the above is one row and one single version each
for the 2 column families.

If I do a put 'T1', 'R1', 'A:qualf1', 500, then there is another version
for the rowkey pertaining to the combination {R1, A, qualf1}

Please correct me if I am wrong.

Regards,
Narayanan


Re: HBase Key Design : Doubt

Posted by Doug Meil <do...@explorysmedical.com>.

Correct.

If you do 2 Puts for row key A-B-C-D on different days, the second Put
logically replaces the first and the earlier Put becomes a previous
version.  Unless you specifically want older versions, you won't get them
in either Gets or Scans.

You'll definitely want to read this...

http://hbase.apache.org/book.html#datamodel

See this for more information about the internal KeyValue structure.

http://hbase.apache.org/book.html#regions.arch
9.7.5.4. KeyValue


Older versions are kept around as long as the table descriptor says so
(e.g., max versions).  See the StoreFile and Compactions entries in the
RefGuide for more information on the internals.
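
As a rough illustration of the retention behaviour described above, here is a
toy Python model (not HBase code). Cell, put, and get are hypothetical names,
and the model trims old versions eagerly on every put purely for simplicity.
See the Compactions entry mentioned above for how HBase actually cleans up.

```python
import itertools

class Cell:
    """Toy model of one cell that retains up to max_versions values."""
    _clock = itertools.count(1)  # stand-in for write timestamps

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.versions = []  # (timestamp, value) pairs, newest first

    def put(self, value):
        self.versions.insert(0, (next(Cell._clock), value))
        del self.versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, versions=1):
        """Return up to `versions` values, newest first."""
        return [v for _, v in self.versions[:versions]]

cell = Cell(max_versions=2)
for value in ("day1", "day2", "day3"):
    cell.put(value)

print(cell.get())            # ['day3']
print(cell.get(versions=5))  # ['day3', 'day2']
```

A plain get returns only the newest value. Older values come back only when
more versions are requested, and never more than the configured maximum.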




On 10/10/12 3:24 PM, "Jerry Lam" <ch...@gmail.com> wrote:

>Correct me if I'm wrong: the version applies to the individual cell (i.e.,
>row key, column family, and column qualifier), not to (row key, column family).

Re: HBase Key Design : Doubt

Posted by Jerry Lam <ch...@gmail.com>.

Correct me if I'm wrong: the version applies to the individual cell (i.e.,
row key, column family, and column qualifier), not to (row key, column family).
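
A small Python sketch of this point, using a dict as a stand-in for the table
(an assumption for illustration, not the HBase API). Versions are tracked per
(row key, column family, column qualifier), so a second put to A:qualf1 adds a
version to that cell only and leaves B:qualf2 with a single version.

```python
from collections import defaultdict

# Toy model: versions are kept per cell coordinate (row, family, qualifier).
cells = defaultdict(list)

def put(row, column, value):
    """Append a new version to the cell addressed by row and 'family:qualifier'."""
    family, qualifier = column.split(":", 1)
    cells[(row, family, qualifier)].append(value)

put("R1", "A:qualf1", 100)
put("R1", "B:qualf2", 200)
put("R1", "A:qualf1", 500)  # new version of this one cell, not of family A as a whole

for coordinate, versions in sorted(cells.items()):
    print(coordinate, len(versions))
```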

