Posted to user@hbase.apache.org by Andrey Kolyadenko <cr...@mailx.ru> on 2010/03/20 21:09:24 UTC

IHBase indexes persistence

Hi,

I am trying to load data into an IHBase table with a few 
indexes, and my region server is crashing. I analyzed a 
heap dump and now I am under the impression that the 
IHBase index data structures eat all available heap.

So my question is: is there any way to set up persistence 
for IHBase indexes instead of caching everything in 
memory?

Thanks.



Re: IHBase indexes persistence

Posted by Ryan Rawson <ry...@gmail.com>.
So it will be difficult to make generic index joins that are
efficient.  Since the data between the index and the main table reside
on different machines, the RPC calls involved can quickly
destroy any notion of doing these kinds of things fast.
Denormalization, ie: copying data into other tables for faster access,
is the likely candidate.
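The index-join idea under discussion can be sketched as a merge-intersection of the sorted row-key lists that two per-column indexes would produce. This is a toy illustration only — none of it is IHBase code, and the row keys are made up:

```python
def intersect_sorted(a, b):
    """Merge-intersect two sorted lists of row keys (one list per index)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical index results: row keys matching customer=42, and row
# keys whose transaction date falls in the requested range.
by_customer = ["r003", "r010", "r017", "r025", "r031"]
by_date     = ["r001", "r010", "r017", "r020", "r031", "r040"]

matches = intersect_sorted(by_customer, by_date)
# The count-of-transactions answer is then just len(matches).
```

The merge is linear in the size of the posting lists, but the lists live on different machines in HBase, which is exactly where the RPC cost bites.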

Another thing to ask yourself is - should my highly relational data
belong in a non-relational data store?  If you have small amounts of
high relational data, and big stores of non-relational data, perhaps a
hybrid approach might be appropriate?

-ryan

On Sat, Mar 20, 2010 at 4:23 PM, Andrey Kolyadenko <cr...@mailx.ru> wrote:
> The problem comes when you try to filter based on a number of columns in
> OLAP-like queries (i.e. you want to retrieve the count of transactions for
> some customer and some date range). It's not so easy to implement such logic
> efficiently; some index join algorithm would have to be implemented there, and
> since HBase is supposed to deal with very large data sets, it could be tricky.
>
> On Sat, 20 Mar 2010 16:08:10 -0700
>  Ryan Rawson <ry...@gmail.com> wrote:
>>
>> Another way to think about it is that IHBase helps when the data is
>> not dense (ie: every row has the column you may be looking for), and
>> not sparse (where 1 column in millions or billions match) but
>> somewhere in between.  That sweet spot where you might return anywhere
>> between 10-30% of the rows from a region.
>>
>> Of course these are just suggestions and recommendations not hard and
>> fast rules.
>>
>> You might also want to look at THBase - it uses a transactional add-on
>> to maintain a secondary index (ie: another table that is an index of a
>> primary table).  It has different performance characteristics (one
>> write is translated into many writes and involves an RPC), but an
>> option to consider.
>>
>> Finally, you can always maintain secondary indexes by yourself in your
>> app.  Write and update 2 tables (the primary, the index).  This is
>> obviously less integrated and simple but also works.
>>
>> -ryan
>>
>> On Sat, Mar 20, 2010 at 4:00 PM, Dan Washusen <da...@reactive.org> wrote:
>>>
>>> Hey Tux,
>>> I've put some comments inline...
>>>
>>> On 21 March 2010 09:13, TuX RaceR <tu...@gmail.com> wrote:
>>>>
>>>> Hello Hbase user List!
>>>>
>>>> The feature provided by IHbase is very appealing. It seems to correspond
>>>> to
>>>> a use case very common in applications (at least in mine ;) )
>>>
>>> The functionality of IHBase might not be as useful as you think.  Take
>>> the following very basic user table layout:
>>>
>>> username (key) | email | name | password
>>>
>>> That table layout works great when you want to find a user by
>>> username, for example, when the user logs in.  You can simply do a get
>>> on the table with the username.  Now you need to add functionality to
>>> enable a user to retrieve their forgotten password.  The seemingly
>>> obvious solution with IHBase would be to add a secondary index to the
>>> email column.  You could then perform a scan on the table with the
>>> appropriate index hint to fetch the user by their email address.  That
>>> solution would work while your dataset is small (one or two regions)
>>> but as your dataset grows and spans many hundreds of regions it's no
>>> longer a viable option.  The reason it's not a viable option is that
>>> IHbase maintains an index on the email column per region.  In order to
>>> find the row that has the email address you are looking for, the scan
>>> must contact every region.  The scan would still return reasonably
>>> quickly (say each region responded in a few milliseconds) but it's
>>> still far too resource intensive...
>>>
>>> The way to make scans fast in HBase is to provide a start row and stop
>>> row and the same rule applies to IHBase.  It's just that with IHBase
>>> the scan will return much faster if the start and stop rows span a
>>> large range...
>>>
>>>>
>>>> Dan Washusen wrote:
>>>>>
>>>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>>>> value and each row key in memory...
>>>>>
>>>>
>>>> Is there a more robust indexing on the roadmap?
>>>
>>> In IHbase yes, but probably not soon.
>>>
>>>> HBase if I understand well proposes an opensource version of Google
>>>> Bigtable.
>>>> To me the most striking difference between Hbase and Bigtable is for
>>>> narrowing searches; the example below shows what I mean by narrowing:
>>>>
>>>> If in Google you search for the word
>>>>
>>>> hbase:
>>>>
>>>> (i.e using:
>>>> http://www.google.com/search?q=hbase
>>>> )
>>>> you get a fast answer
>>>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>>>> seconds))
>>>>
>>>> Now if you search all pages coming from the hadoop.apache.org host name
>>>> (or
>>>> base URL), that is with the query:
>>>>
>>>> hbase +site:hadoop.apache.org
>>>>
>>>> (i.e using the URL:
>>>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>>>> )
>>>> you get a pretty fast answer to:
>>>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org*
>>>> for
>>>> *hbase*. (*0.12* seconds) )
>>>>
>>>> It seems to me that the second search uses a secondary index on a column
>>>> named 'site' to scan the 'hbase' based keys. Obviously Google found a
>>>> good
>>>> way to implement this (good= fast and scalable)
>>>> Is this Google second indexing documented somewhere? Is that implemented
>>>> using something like IHbase or more something like THbase, or something
>>>> else?
>>>
>>> What Ryan said.
>>>
>>>> Also, why IHbase stays in the 'contrib' tree? Is that because the code
>>>> is
>>>> not at the same level as the main hbase code (not as tested, not as
>>>> robust,
>>>> etc...)?
>>>
>>> IHBase is still very young (it was first released along with 0.20.3).
>>> As you can see from this email thread it's not as robust as it should
>>> be... :)
>>>
>>>>
>>>> Thanks
>>>> TuX
>>>>
>>>>
>>>
>
>
>

Re: IHBase indexes persistence

Posted by Andrey Kolyadenko <cr...@mailx.ru>.
The problem comes when you try to filter based on a 
number of columns in OLAP-like queries (i.e. you want to 
retrieve the count of transactions for some customer and 
some date range). It's not so easy to implement such logic 
efficiently; some index join algorithm would have to be 
implemented there, and since HBase is supposed to deal 
with very large data sets, it could be tricky.

On Sat, 20 Mar 2010 16:08:10 -0700
  Ryan Rawson <ry...@gmail.com> wrote:
> Another way to think about it is that IHBase helps when the data is
> not dense (ie: every row has the column you may be looking for), and
> not sparse (where 1 column in millions or billions match) but
> somewhere in between.  That sweet spot where you might return anywhere
> between 10-30% of the rows from a region.
>
> Of course these are just suggestions and recommendations, not hard and
> fast rules.
>
> You might also want to look at THBase - it uses a transactional add-on
> to maintain a secondary index (ie: another table that is an index of a
> primary table).  It has different performance characteristics (one
> write is translated into many writes and involves an RPC), but it is an
> option to consider.
>
> Finally, you can always maintain secondary indexes by yourself in your
> app.  Write and update 2 tables (the primary, the index).  This is
> obviously less integrated and simple, but it also works.
>
> -ryan
>
> On Sat, Mar 20, 2010 at 4:00 PM, Dan Washusen <da...@reactive.org> wrote:
>> Hey Tux,
>> I've put some comments inline...
>>
>> On 21 March 2010 09:13, TuX RaceR <tu...@gmail.com> wrote:
>>> Hello Hbase user List!
>>>
>>> The feature provided by IHbase is very appealing. It seems to correspond to
>>> a use case very common in applications (at least in mine ;) )
>>
>> The functionality of IHBase might not be as useful as you think.  Take
>> the following very basic user table layout:
>>
>> username (key) | email | name | password
>>
>> That table layout works great when you want to find a user by
>> username, for example, when the user logs in.  You can simply do a get
>> on the table with the username.  Now you need to add functionality to
>> enable a user to retrieve their forgotten password.  The seemingly
>> obvious solution with IHBase would be to add a secondary index to the
>> email column.  You could then perform a scan on the table with the
>> appropriate index hint to fetch the user by their email address.  That
>> solution would work while your dataset is small (one or two regions)
>> but as your dataset grows and spans many hundreds of regions it's no
>> longer a viable option.  The reason it's not a viable option is that
>> IHbase maintains an index on the email column per region.  In order to
>> find the row that has the email address you are looking for, the scan
>> must contact every region.  The scan would still return reasonably
>> quickly (say each region responded in a few milliseconds) but it's
>> still far too resource intensive...
>>
>> The way to make scans fast in HBase is to provide a start row and stop
>> row and the same rule applies to IHBase.  It's just that with IHBase
>> the scan will return much faster if the start and stop rows span a
>> large range...
>>
>>> Dan Washusen wrote:
>>>>
>>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>>> value and each row key in memory...
>>>>
>>>
>>> Is there a more robust indexing on the roadmap?
>>
>> In IHbase yes, but probably not soon.
>>
>>> HBase if I understand well proposes an opensource version of Google
>>> Bigtable.
>>> To me the most striking difference between Hbase and Bigtable is for
>>> narrowing searches; the example below shows what I mean by narrowing:
>>>
>>> If in Google you search for the word
>>>
>>> hbase
>>>
>>> (i.e. using:
>>> http://www.google.com/search?q=hbase
>>> )
>>> you get a fast answer
>>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>>> seconds))
>>>
>>> Now if you search all pages coming from the hadoop.apache.org host name (or
>>> base URL), that is with the query:
>>>
>>> hbase +site:hadoop.apache.org
>>>
>>> (i.e. using the URL:
>>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>>> )
>>> you get a pretty fast answer:
>>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>>> *hbase*. (*0.12* seconds) )
>>>
>>> It seems to me that the second search uses a secondary index on a column
>>> named 'site' to scan the 'hbase'-based keys. Obviously Google found a good
>>> way to implement this (good = fast and scalable)
>>> Is this Google secondary indexing documented somewhere? Is that implemented
>>> using something like IHbase or more something like THbase, or something
>>> else?
>>
>> What Ryan said.
>>
>>> Also, why does IHbase stay in the 'contrib' tree? Is that because the code is
>>> not at the same level as the main hbase code (not as tested, not as robust,
>>> etc...)?
>>
>> IHBase is still very young (it was first released along with 0.20.3).
>> As you can see from this email thread it's not as robust as it should
>> be... :)
>>
>>> Thanks
>>> TuX




Re: IHBase indexes persistence

Posted by TuX RaceR <tu...@gmail.com>.
Thank you Ryan for this detailed answer. As my indexed column will be in 
100% of the keys (dense case), I think I'll need to forget about IHbase ;)
Cheers
TuX

Ryan Rawson wrote:
> Another way to think about it is that IHBase helps when the data is
> not dense (ie: every row has the column you may be looking for), and
> not sparse (where 1 column in millions or billions match) but
> somewhere in between.  That sweet spot where you might return anywhere
> between 10-30% of the rows from a region.
>
> Of course these are just suggestions and recommendations not hard and
> fast rules.
>
> You might also want to look at THBase - it uses a transactional add-on
> to maintain a secondary index (ie: another table that is an index of a
> primary table).  It has different performance characteristics (one
> write is translated into many writes and involves an RPC), but an
> option to consider.
>
> Finally, you can always maintain secondary indexes by yourself in your
> app.  Write and update 2 tables (the primary, the index).  This is
> obviously less integrated and simple but also works.
>
> -ryan
>   


Re: IHBase indexes persistence

Posted by Ryan Rawson <ry...@gmail.com>.
Another way to think about it is that IHBase helps when the data is
not dense (ie: every row has the column you may be looking for), and
not sparse (where 1 column in millions or billions match) but
somewhere in between: that sweet spot where you might return anywhere
between 10-30% of the rows from a region.
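A toy cost model makes that sweet spot visible: compare a plain full scan, a per-region in-memory index (every region consulted, matches read locally), and a global index table (one lookup, then remote point gets). Every constant below is invented for illustration only; none of this is measured HBase behaviour:

```python
# Invented cost constants (arbitrary units).
ROWS, REGIONS = 1_000_000, 100
SEQ, LOCAL_SEEK, REMOTE_GET, RPC = 1.0, 3.0, 20.0, 1_000.0

def full_scan(sel):
    # A plain scan reads every row regardless of how many match.
    return ROWS * SEQ

def region_index(sel):
    # Every region pays an RPC; only matching rows are read, locally.
    return REGIONS * RPC + ROWS * sel * LOCAL_SEEK

def index_table(sel):
    # One RPC to a global index table, then a remote get per match.
    return RPC + ROWS * sel * REMOTE_GET

best = {}
for sel in (1e-6, 0.2, 0.95):  # very sparse, mid, very dense
    costs = {"scan": full_scan(sel),
             "region-index": region_index(sel),
             "index-table": index_table(sel)}
    best[sel] = min(costs, key=costs.get)
```

With these made-up costs, the global index table wins for very sparse matches, the per-region index wins in the middle, and a plain scan wins when almost everything matches — the same shape as the suggestion above.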

Of course these are just suggestions and recommendations not hard and
fast rules.

You might also want to look at THBase - it uses a transactional add-on
to maintain a secondary index (ie: another table that is an index of a
primary table).  It has different performance characteristics (one
write is translated into many writes and involves an RPC), but it is an
option to consider.

Finally, you can always maintain secondary indexes by yourself in your
app.  Write and update 2 tables (the primary, the index).  This is
obviously less integrated and simple, but it also works.
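The do-it-yourself pattern can be sketched with dictionaries standing in for the two tables; the names and helpers here are hypothetical, not an HBase API:

```python
# Two "tables": primary keyed by username, index keyed by email.
users = {}        # primary: username -> row
email_index = {}  # index:   email -> username

def put_user(username, email, name):
    # The app writes both tables.  HBase offers no cross-table
    # atomicity, so a crash between the two puts can leave them
    # inconsistent -- the price of the hand-rolled approach.
    users[username] = {"email": email, "name": name}
    email_index[email] = username

def get_by_email(email):
    # Index lookup first, then a point get on the primary table.
    username = email_index.get(email)
    return users.get(username) if username else None

put_user("tux", "tux@example.org", "TuX RaceR")
row = get_by_email("tux@example.org")
```

Against a real cluster the two dictionaries would be two HTable writes, and the read would be a get on the index table followed by a get on the primary.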

-ryan

On Sat, Mar 20, 2010 at 4:00 PM, Dan Washusen <da...@reactive.org> wrote:
> Hey Tux,
> I've put some comments inline...
>
> On 21 March 2010 09:13, TuX RaceR <tu...@gmail.com> wrote:
>> Hello Hbase user List!
>>
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
>
> The functionality of IHBase might not be as useful as you think.  Take
> the following very basic user table layout:
>
> username (key) | email | name | password
>
> That table layout works great when you want to find a user by
> username, for example, when the user logs in.  You can simply do a get
> on the table with the username.  Now you need to add functionality to
> enable a user to retrieve their forgotten password.  The seemingly
> obvious solution with IHBase would be to add a secondary index to the
> email column.  You could then perform a scan on the table with the
> appropriate index hint to fetch the user by their email address.  That
> solution would work while your dataset is small (one or two regions)
> but as your dataset grows and spans many hundreds of regions it's no
> longer a viable option.  The reason it's not a viable option is that
> IHbase maintains an index on the email column per region.  In order to
> find the row that has the email address you are looking for, the scan
> must contact every region.  The scan would still return reasonably
> quickly (say each region responded in a few milliseconds) but it's
> still far too resource intensive...
>
> The way to make scans fast in HBase is to provide a start row and stop
> row and the same rule applies to IHBase.  It's just that with IHBase
> the scan will return much faster if the start and stop rows span a
> large range...
>
>>
>> Dan Washusen wrote:
>>>
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>>>
>>
>> Is there a more robust indexing on the roadmap?
>
> In IHbase yes, but probably not soon.
>
>> HBase if I understand well proposes an opensource version of Google
>> Bigtable.
>> To me the most striking difference between Hbase and Bigtable is for
>> narrowing searches; the example below shows what I mean by narrowing:
>>
>> If in Google you search for the word
>>
>> hbase:
>>
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>>
>> Now if you search all pages coming from the hadoop.apache.org host name (or
>> base URL), that is with the query:
>>
>> hbase +site:hadoop.apache.org
>>
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>>
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
>> way to implement this (good= fast and scalable)
>> Is this Google second indexing documented somewhere? Is that implemented
>> using something like IHbase or more something like THbase, or something
>> else?
>
> What Ryan said.
>
>> Also, why IHbase stays in the 'contrib' tree? Is that because the code is
>> not at the same level as the main hbase code (not as tested, not as robust,
>> etc...)?
>
> IHBase is still very young (it was first released along with 0.20.3).
> As you can see from this email thread it's not as robust as it should
> be... :)
>
>>
>> Thanks
>> TuX
>>
>>
>

Re: IHBase indexes persistence

Posted by TuX RaceR <tu...@gmail.com>.
Thanks Dan for the detailed explanation ;) yes, I was suspecting that as 
the data gets bigger, having to query all regions would cause 
scalability issues.

Dan Washusen wrote:
>>
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
>>     
>
> The functionality of IHBase might not be as useful as you think.  Take
> the following very basic user table layout:
>
> username (key) | email | name | password
>
> That table layout works great when you want to find a user by
> username, for example, when the user logs in.  You can simply do a get
> on the table with the username.  Now you need to add functionality to
> enable a user to retrieve their forgotten password.  The seemingly
> obvious solution with IHBase would be to add a secondary index to the
> email column.  You could then perform a scan on the table with the
> appropriate index hint to fetch the user by their email address.  That
> solution would work while your dataset is small (one or two regions)
> but as your dataset grows and spans many hundreds of regions it's no
> longer a viable option.  The reason it's not a viable option is that
> IHbase maintains an index on the email column per region.  In order to
> find the row that has the email address you are looking for, the scan
> must contact every region.  The scan would still return reasonably
> quickly (say each region responded in a few milliseconds) but it's
> still far too resource intensive...
>
> The way to make scans fast in HBase is to provide a start row and stop
> row and the same rule applies to IHBase.  It's just that with IHBase
> the scan will return much faster if the start and stop rows span a
> large range...
>   


Re: IHBase indexes persistence

Posted by Dan Washusen <da...@reactive.org>.
Hey Tux,
I've put some comments inline...

On 21 March 2010 09:13, TuX RaceR <tu...@gmail.com> wrote:
> Hello Hbase user List!
>
> The feature provided by IHbase is very appealing. It seems to correspond to
> a use case very common in applications (at least in mine ;) )

The functionality of IHBase might not be as useful as you think.  Take
the following very basic user table layout:

username (key) | email | name | password

That table layout works great when you want to find a user by
username, for example, when the user logs in.  You can simply do a get
on the table with the username.  Now you need to add functionality to
enable a user to retrieve their forgotten password.  The seemingly
obvious solution with IHBase would be to add a secondary index to the
email column.  You could then perform a scan on the table with the
appropriate index hint to fetch the user by their email address.  That
solution would work while your dataset is small (one or two regions)
but as your dataset grows and spans many hundreds of regions it's no
longer a viable option.  The reason it's not a viable option is that
IHbase maintains an index on the email column per region.  In order to
find the row that has the email address you are looking for, the scan
must contact every region.  The scan would still return reasonably
quickly (say each region responded in a few milliseconds) but it's
still far too resource intensive...

The way to make scans fast in HBase is to provide a start row and stop
row and the same rule applies to IHBase.  It's just that with IHBase
the scan will return much faster if the start and stop rows span a
large range...
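The fan-out being described can be modelled with plain Python objects standing in for regions: a row-key get is routed to exactly one region, while an indexed-column lookup must ask every region, because each region only indexes its own rows. This is a sketch of the routing behaviour, not IHBase code:

```python
import bisect

class Region:
    def __init__(self, rows):
        self.rows = rows  # row_key -> {"email": ...}
        # Each region builds an index over ITS OWN rows only.
        self.email_index = {v["email"]: k for k, v in rows.items()}

# Three regions, split by row-key range.
regions = [
    Region({"alice": {"email": "a@x.org"}}),
    Region({"mike":  {"email": "m@x.org"}}),
    Region({"zoe":   {"email": "z@x.org"}}),
]
splits = ["mike", "zoe"]  # start keys of regions 1 and 2

def get_by_rowkey(key):
    # One region is consulted: routing is by key range.
    return regions[bisect.bisect_right(splits, key)].rows.get(key)

def get_by_email(email):
    # Every region is consulted: each index covers only local rows.
    consulted, hit = 0, None
    for r in regions:
        consulted += 1
        if email in r.email_index:
            hit = r.rows[r.email_index[email]]
    return hit, consulted
```

With three regions the difference is invisible; with hundreds of regions, `get_by_email` turns one logical lookup into hundreds of region consultations.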

>
> Dan Washusen wrote:
>>
>> Not at the moment.  It currently keeps a copy of each unique indexed
>> value and each row key in memory...
>>
>
> Is there a more robust indexing on the roadmap?

In IHbase yes, but probably not soon.

> HBase if I understand well proposes an opensource version of Google
> Bigtable.
> To me the most striking difference between Hbase and Bigtable is for
> narrowing searches; the example below shows what I mean by narrowing:
>
> If in Google you search for the word
>
> hbase:
>
> (i.e using:
> http://www.google.com/search?q=hbase
> )
> you get a fast answer
> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
> seconds))
>
> Now if you search all pages coming from the hadoop.apache.org host name (or
> base URL), that is with the query:
>
> hbase +site:hadoop.apache.org
>
> (i.e using the URL:
> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
> )
> you get a pretty fast answer to:
> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
> *hbase*. (*0.12* seconds) )
>
> It seems to me that the second search uses a secondary index on a column
> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
> way to implement this (good= fast and scalable)
> Is this Google second indexing documented somewhere? Is that implemented
> using something like IHbase or more something like THbase, or something
> else?

What Ryan said.

> Also, why IHbase stays in the 'contrib' tree? Is that because the code is
> not at the same level as the main hbase code (not as tested, not as robust,
> etc...)?

IHBase is still very young (it was first released along with 0.20.3).
As you can see from this email thread it's not as robust as it should
be... :)

>
> Thanks
> TuX
>
>

Re: IHBase indexes persistence

Posted by TuX RaceR <tu...@gmail.com>.
Thank you Ryan for your answer.
I really like Solr, but to me it does not scale in the same way HBase 
scales. Solr 1.4 ships with index replication: that is very nice and 
easy to use, but from the scaling point of view you are, for instance, 
limited by the disk size. Then you have shards: I'll have another look 
at Katta, but the Katta-Solr integration Jira 
http://issues.apache.org/jira/browse/SOLR-1395 mentions rather long 
search times: "The
KattaClientTest test case shows a Katta cluster being created locally, a 
couple of cores/shards being placed into the cluster, then a query being 
executed that returns the correct number of results. It takes about 30s 
- 1.5min to run".
And yes, Google seems (http://infolab.stanford.edu/~backrub/google.html) 
to have a dedicated index structure. I looked at Nutch, which sounds like 
a direct open-source implementation of Google search, but I do not yet 
understand how to extract the distributed indexing part of the whole 
project (this is the part that I am really interested in, as I do not 
have to crawl the web)

Thanks
TuX


Ryan Rawson wrote:
> Hey guys,
>
> I hate to ruin it for you, but Google search does not use Bigtable at
> query time.  If you would like an example of a good, robust search
> and indexing system, you could have a look at the Lucene library, the Solr
> system built on Lucene, and Katta, which is another system built on
> Lucene.
>
> -ryan
>
> On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR <tu...@gmail.com> wrote:
>   
>> Hello Hbase user List!
>>
>> The feature provided by IHbase is very appealing. It seems to correspond to
>> a use case very common in applications (at least in mine ;) )
>>
>> Dan Washusen wrote:
>>     
>>> Not at the moment.  It currently keeps a copy of each unique indexed
>>> value and each row key in memory...
>>>
>>>       
>> Is there a more robust indexing on the roadmap?
>> HBase if I understand well proposes an opensource version of Google
>> Bigtable.
>> To me the most striking difference between Hbase and Bigtable is for
>> narrowing searches; the example below shows what I mean by narrowing:
>>
>> If in Google you search for the word
>>
>> hbase:
>>
>> (i.e using:
>> http://www.google.com/search?q=hbase
>> )
>> you get a fast answer
>> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
>> seconds))
>>
>> Now if you search all pages coming from the hadoop.apache.org host name (or
>> base URL), that is with the query:
>>
>> hbase +site:hadoop.apache.org
>>
>> (i.e using the URL:
>> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
>> )
>> you get a pretty fast answer to:
>> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
>> *hbase*. (*0.12* seconds) )
>>
>> It seems to me that the second search uses a secondary index on a column
>> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
>> way to implement this (good= fast and scalable)
>> Is this Google second indexing documented somewhere? Is that implemented
>> using something like IHbase or more something like THbase, or something
>> else?
>> Also, why IHbase stays in the 'contrib' tree? Is that because the code is
>> not at the same level as the main hbase code (not as tested, not as robust,
>> etc...)?
>>
>> Thanks
>> TuX
>>
>>
>>     


Re: IHBase indexes persistence

Posted by Ryan Rawson <ry...@gmail.com>.
Hey guys,

I hate to ruin it for you, but Google search does not use Bigtable at
query time.  If you would like an example of a good, robust search
and indexing system, you could have a look at the Lucene library, the Solr
system built on Lucene, and Katta, which is another system built on
Lucene.

-ryan

On Sat, Mar 20, 2010 at 3:13 PM, TuX RaceR <tu...@gmail.com> wrote:
> Hello Hbase user List!
>
> The feature provided by IHbase is very appealing. It seems to correspond to
> a use case very common in applications (at least in mine ;) )
>
> Dan Washusen wrote:
>>
>> Not at the moment.  It currently keeps a copy of each unique indexed
>> value and each row key in memory...
>>
>
> Is there a more robust indexing on the roadmap?
> HBase if I understand well proposes an opensource version of Google
> Bigtable.
> To me the most striking difference between Hbase and Bigtable is for
> narrowing searches; the example below shows what I mean by narrowing:
>
> If in Google you search for the word
>
> hbase:
>
> (i.e using:
> http://www.google.com/search?q=hbase
> )
> you get a fast answer
> (typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17*
> seconds))
>
> Now if you search all pages coming from the hadoop.apache.org host name (or
> base URL), that is with the query:
>
> hbase +site:hadoop.apache.org
>
> (i.e using the URL:
> http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
> )
> you get a pretty fast answer to:
> (typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* for
> *hbase*. (*0.12* seconds) )
>
> It seems to me that the second search uses a secondary index on a column
> named 'site' to scan the 'hbase' based keys. Obviously Google found a good
> way to implement this (good= fast and scalable)
> Is this Google second indexing documented somewhere? Is that implemented
> using something like IHbase or more something like THbase, or something
> else?
> Also, why IHbase stays in the 'contrib' tree? Is that because the code is
> not at the same level as the main hbase code (not as tested, not as robust,
> etc...)?
>
> Thanks
> TuX
>
>

Re: IHBase indexes persistence

Posted by TuX RaceR <tu...@gmail.com>.
Hello Hbase user List!

The feature provided by IHbase is very appealing. It seems to correspond 
to a use case very common in applications (at least in mine ;) )

Dan Washusen wrote:
> Not at the moment.  It currently keeps a copy of each unique indexed
> value and each row key in memory...
>   
Is there a more robust indexing on the roadmap?
HBase, if I understand well, proposes an open-source version of Google 
Bigtable.
To me the most striking difference between HBase and Bigtable is for 
narrowing searches; the example below shows what I mean by narrowing:

If in Google you search for the word

hbase:

(i.e using:
http://www.google.com/search?q=hbase
)
you get a fast answer
(typically: Results *1* - *10* of about *249,000* for *hbase*. (*0.17* 
seconds))

Now if you search all pages coming from the hadoop.apache.org host name 
(or base URL), that is with the query:

hbase +site:hadoop.apache.org

(i.e using the URL:
http://www.google.com/search?q=hbase+%2Bsite%3Ahadoop.apache.org
)
you get a pretty fast answer:
(typically: Results *1* - *10* of about *2,510* from *hadoop.apache.org* 
for *hbase*. (*0.12* seconds) )

It seems to me that the second search uses a secondary index on a column 
named 'site' to scan the 'hbase'-based keys. Obviously Google found a 
good way to implement this (good = fast and scalable).
Is this Google secondary indexing documented somewhere? Is it implemented 
using something like IHBase, or more something like THBase, or something 
else?
Also, why does IHBase stay in the 'contrib' tree? Is that because the code 
is not at the same level as the main HBase code (not as tested, not as 
robust, etc...)?

Thanks
TuX


Re: IHBase indexes persistence

Posted by Dan Washusen <da...@reactive.org>.
Hi Andrey,
Not at the moment.  It currently keeps a copy of each unique indexed
value and each row key in memory...

Depending on the type of indices you are creating, you may be able to
take advantage of HBASE-2207 when 0.20.4 is released.  Also, HBASE-2206
fixes a heap fragmentation issue (in the indexed contrib) that was
causing out-of-memory errors even when heap was still available...
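For a rough sense of why the heap fills up when every unique indexed value and every row key is held in memory, here is a back-of-envelope estimate; the per-entry byte counts are guesses for illustration, not measured IHBase overheads:

```python
def index_heap_bytes(rows, unique_values,
                     key_bytes=32, value_bytes=16, overhead=48):
    # Each row key plus each distinct indexed value lives on the heap,
    # with some assumed per-entry object/array overhead.
    return rows * (key_bytes + overhead) + unique_values * (value_bytes + overhead)

# e.g. 50M rows on a region server, 10M distinct values in the
# indexed column -- several GB of heap for a single index:
gb = index_heap_bytes(50_000_000, 10_000_000) / 2**30
```

With these assumed sizes the single index already needs over 4 GB of heap, which matches the out-of-memory crashes described at the top of the thread.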

Hope that helps...

Cheers,
Dan

On 21 March 2010 07:09, Andrey Kolyadenko <cr...@mailx.ru> wrote:
>
> Hi,
>
> I am trying to load data into an IHBase table with a few indexes, and my region server is crashing. I analyzed a heap dump and now I am under the impression that the IHBase index data structures eat all available heap.
>
> So my question is: is there any way to set up persistence for IHBase indexes instead of caching everything in memory?
>
> Thanks.
>