Posted to user@hbase.apache.org by Wilm Schumacher <wi...@gmail.com> on 2015/03/16 19:37:14 UTC

Re: Status of Huawei's 2' Indexing?

Hi,

This is a cross-post from the dev list. Perhaps more people here have valuable
hints or ideas.

On 16.03.2015 at 18:46, Rose, Joseph wrote:
> Alright, let’s see if I can get this discussion back on track.
>
> I have a sensibly defined table for patient data; its rowkey is simply
> lastname:firstname, since it’s convenient for the bulk of my lookups.
> Unfortunately I also need to efficiently find patients using an ID string,
> whose literal value is buried in a value field. I’m sure this situation is
> not foreign to the people on this list.
>
> It’s been suggested that I implement 2’ indexes myself — fine. All the
> research I’ve done seems to end with that suggestion, with the exception
> of Phoenix (I don’t want the RDBMS layer) and Huawei’s stuff (which seems
> to incite some discussion here). I’m happy to put this together but I’d
> rather go with something that has been vetted and has a larger developer
> community than one (i.e., ME). Besides, I have a full enough plate at the
> moment that I’d rather not have to do this, too.
>
> Are there constructive suggestions regarding how I can proceed with HBase?
> Right now even a well-vetted local index would be a godsend.

Well, first I have a question. Is "lastname:firstname" a good idea for a
row key? Is a name that specific? I think your row key should be the
ID rather than the name, as it can be made unique (a UUID or whatever).
However, the problem still stands, as just the roles are
switched: you need an index for either the IDs or the names.

The following assumes the ID as row key and lastname:firstname as the
index data.

I can imagine 3 solutions:

* First ... MacGyver your own index.

That's not as complicated as it sounds. A very easy approach would be to
update the index within the CRUD operations on your data. Alongside a

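// "d" is a placeholder column family; the original snippet left it out.
// addColumn is the HBase 1.0+ name; older clients use add with the same arguments.
byte[] cf = Bytes.toBytes( "d" );
Put put = new Put( Bytes.toBytes( id ) );
put.addColumn( cf , Bytes.toBytes( "firstname" ) , Bytes.toBytes( firstname ) );
put.addColumn( cf , Bytes.toBytes( "lastname" ) , Bytes.toBytes( lastname ) );

make an additional
// index row: the key is the name, each matching id becomes a column qualifier
Put indexPut = new Put( Bytes.toBytes( lastname + ":" + firstname ) );
indexPut.addColumn( cf , Bytes.toBytes( id ) , new byte[0] );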

...
<put to tables>

Deleting is practically the same. Just fetch the row for the ID, read the
lastname/firstname combination and kick the ID out of that index row.

This way you can just fetch the row "lastname:firstname" and get all
possible IDs as column qualifiers. And that's it ... almost. The "risk"
here is that HBase throws some error partway through, so that the data is
added but the index is not updated. Thus you have to write a little
more code to catch the "ID not existing" errors.
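
To make the bookkeeping above concrete, here is a minimal sketch in plain Java. It uses in-memory maps as stand-ins for the two HBase tables, so it is not the HBase client API; the class name NameIndexSketch and its methods are made up for illustration. It shows the double write on put, the index cleanup on delete, and a lookup that skips index entries whose ID no longer exists.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory stand-in for the data table and the index table (assumption:
// plain maps instead of HBase tables, to keep the sketch self-contained).
class NameIndexSketch {
    // data table: row key = id, value = {lastname, firstname}
    final Map<String, String[]> data = new TreeMap<>();
    // index table: row key = "lastname:firstname", "qualifiers" = ids
    final Map<String, Set<String>> index = new TreeMap<>();

    static String indexKey(String lastname, String firstname) {
        return lastname + ":" + firstname;
    }

    // Write the data row, then the matching index entry.
    void put(String id, String lastname, String firstname) {
        String[] old = data.put(id, new String[] { lastname, firstname });
        if (old != null) {
            // the id existed before, possibly under another name
            removeFromIndex(indexKey(old[0], old[1]), id);
        }
        index.computeIfAbsent(indexKey(lastname, firstname), k -> new TreeSet<>()).add(id);
    }

    // Fetch the row to learn its name, then drop the row and its index entry.
    void delete(String id) {
        String[] old = data.remove(id);
        if (old != null) {
            removeFromIndex(indexKey(old[0], old[1]), id);
        }
    }

    // Look up ids by name, skipping stale index entries ("ID not existing").
    Set<String> findIds(String lastname, String firstname) {
        String key = indexKey(lastname, firstname);
        Set<String> result = new TreeSet<>();
        for (String id : index.getOrDefault(key, Collections.<String>emptySet())) {
            String[] row = data.get(id);
            if (row != null && indexKey(row[0], row[1]).equals(key)) {
                result.add(id);
            }
        }
        return result;
    }

    private void removeFromIndex(String key, String id) {
        Set<String> ids = index.get(key);
        if (ids == null) return;
        ids.remove(id);
        if (ids.isEmpty()) index.remove(key);
    }
}
```

In a real setup the two writes go to two tables and either one can fail, which is why findIds double-checks against the data instead of trusting the index.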

Furthermore, you would have to run a small MapReduce job now and then
(perhaps every night or so) which runs through the data rows and refreshes
the index, and runs through the index and kicks out entries whose ID is not
present anymore, in case you missed something above.
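
The logic of such a repair job can be sketched as two passes. This is again plain Java over in-memory maps, not a MapReduce job and not the HBase API; the class name IndexRepairSketch and the map layout (id to "lastname:firstname", and "lastname:firstname" to a set of ids) are assumptions for illustration only.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IndexRepairSketch {
    // data:  id -> "lastname:firstname"
    // index: "lastname:firstname" -> ids
    // Returns the number of index entries that were added or removed.
    static int repair(Map<String, String> data, Map<String, Set<String>> index) {
        int fixes = 0;

        // Pass 1: run through the data rows and add missing index entries.
        for (Map.Entry<String, String> row : data.entrySet()) {
            Set<String> ids = index.computeIfAbsent(row.getValue(), k -> new TreeSet<>());
            if (ids.add(row.getKey())) {
                fixes++;
            }
        }

        // Pass 2: run through the index and drop entries whose id is gone
        // or whose data row now carries a different name.
        Iterator<Map.Entry<String, Set<String>>> it = index.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            int before = entry.getValue().size();
            entry.getValue().removeIf(id -> !entry.getKey().equals(data.get(id)));
            fixes += before - entry.getValue().size();
            if (entry.getValue().isEmpty()) {
                it.remove(); // no ids left under this name
            }
        }
        return fixes;
    }
}
```

On real tables the two passes would be scans over the data table and the index table respectively; the return value is only there so that a nightly run can log how far the index had drifted.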

If you are new to HBase this perhaps sounds a little complicated, but
it's actually simple. If you are interested I could send you some small
snippets directly.

* Second ... Lucene

Lucene is a ready-made indexing system. As I wrote some days ago: with
HBase comes all the fancy Apache/Hadoop stuff. With Lucene you can
implement a search method for your data, e.g. on the drugs the patients
have already had. A fancy feature for your application. And of course you
could search for "firstname=<firstname> AND lastname=<lastname>", which
would fit your need.

However, by this you introduce a new system which you have to maintain :/.

* Third ... other index systems, e.g. Elasticsearch

Like the second idea, but fancier and more complicated: more points
of failure etc.

If your application does not need a search method, I would go with 1. If
you have to build a search anyway, I would go with 2 or 3, as you can use
the search facility for your indexing problem right away.

Best wishes,

Wilm

Re: Status of Huawei's 2' Indexing?

Posted by Wilm Schumacher <wi...@gmail.com>.
Dammit. Sorry for the double post. I forgot something.

On 16.03.2015 at 19:37, Wilm Schumacher wrote:
> * First ... MacGyver your own index.
>
> That's not as complicated as it sounds. A very easy approach would be to
> update the index within the CRUD operations on your data. Alongside a
>
> // "d" is a placeholder column family; the original snippet left it out.
> byte[] cf = Bytes.toBytes( "d" );
> Put put = new Put( Bytes.toBytes( id ) );
> put.addColumn( cf , Bytes.toBytes( "firstname" ) , Bytes.toBytes( firstname ) );
> put.addColumn( cf , Bytes.toBytes( "lastname" ) , Bytes.toBytes( lastname ) );
>
> make an additional
> Put indexPut = new Put( Bytes.toBytes( lastname + ":" + firstname ) );
> indexPut.addColumn( cf , Bytes.toBytes( id ) , new byte[0] );
>
> ...
> <put to tables>
Something like 1.b): you could implement your index on the client side, as
pictured above, OR you could go with coprocessors. This way your client
code doesn't have to deal with the index: on every put or delete the
above operations are triggered on the server side. This would make the
indexing system more robust, as clients couldn't corrupt your index by
being killed or whatever else could happen on the client side.
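
The idea can be pictured with a plain-Java analogy. This is not the HBase coprocessor API: the Hook interface, HookedTableSketch and IndexHook below are hypothetical names, and the "table" is just a map. The point is only that the index update hangs on the table itself, so every writer triggers it without knowing about the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java analogy for "the server runs a hook on every put/delete".
class HookedTableSketch {
    // Hypothetical hook interface, invoked by the table after each change.
    interface Hook {
        void afterPut(String id, String name);
        void afterDelete(String id, String name);
    }

    // Keeps a name -> ids index in sync; clients never call this directly.
    static class IndexHook implements Hook {
        final Map<String, Set<String>> index = new TreeMap<>();

        public void afterPut(String id, String name) {
            index.computeIfAbsent(name, k -> new TreeSet<>()).add(id);
        }

        public void afterDelete(String id, String name) {
            Set<String> ids = index.get(name);
            if (ids == null) return;
            ids.remove(id);
            if (ids.isEmpty()) index.remove(name);
        }
    }

    private final Map<String, String> rows = new TreeMap<>(); // id -> name
    private final List<Hook> hooks = new ArrayList<>();

    void register(Hook hook) {
        hooks.add(hook);
    }

    void put(String id, String name) {
        String old = rows.put(id, name);
        for (Hook h : hooks) {
            if (old != null) h.afterDelete(id, old); // overwrite: drop old name first
            h.afterPut(id, name);
        }
    }

    void delete(String id) {
        String old = rows.remove(id);
        if (old == null) return;
        for (Hook h : hooks) h.afterDelete(id, old);
    }
}
```

A client here only ever calls put and delete on the table; whether it gets killed afterwards does not matter for the index, which is the robustness argument made above.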

Best wishes

Wilm