Posted to user@hbase.apache.org by Lukáš Drbal <lu...@gmail.com> on 2012/08/12 14:15:53 UTC

Secondary indexes suggestions

Hi all,

I am a new user of HBase and I need help with secondary indexes.

For example, I have messages and users. Each user has many messages.
The data structure will look like this:

Message:
- String id
- Long sender_id
- Long recipient_id
- String text
- Timestamp created_at
[...]

User:
- Long id
- String username
[...]

I need to create secondary indexes for reading all messages:
a) inbox (by recipient_id) in a time range
b) outbox (by sender_id) in a time range

Can someone give me suggestions for these indexes and for the column
family attributes?
I expect 500M messages and 50M users here.

Thanks a lot for any response.


P.S. Sorry for my bad English, it isn't my primary language.


Lukas Drbal

Re: Secondary indexes suggestions

Posted by Michael Segel <mi...@hotmail.com>.

Ah... schema design...

Yes, you have identified both options... but just to add a twist: in the column name, prepend (epoch - timestamp) to the message id. This will put the messages in reverse chronological order.
The only drawback to this is that it's theoretically possible to create a row which exceeds your region's size...

You could also do this if you use a composite key: hash the user_id, then (epoch - timestamp), and then the message_id.

You are correct that you have to scan many rows. However, you can bound the scan by using the user_id as the start key and, as the end key, the user_id + the first character after the separator.

The only reason I would say to hash the key is so that you get a more even distribution of data across the cluster, but that's not really that important.
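
A minimal sketch of the composite key and the bounded scan described above, in plain Python rather than the HBase client. Several things here are assumptions, not taken from the thread: "(epoch - timestamp)" is read as a fixed maximum (MAX_TS) minus the creation time in milliseconds, "hash the user_id" is read as a short MD5 prefix placed in front of the user id, ids are non-negative, and a sorted list stands in for the index table. The names user_prefix, index_key and scan_range are made up for illustration.

```python
import hashlib
import struct

MAX_TS = 2**63 - 1   # assumed stand-in for "epoch" in "(epoch - timestamp)"
PREFIX_LEN = 2 + 8   # 2 hash bytes + 8 user_id bytes
TS_LEN = 8

def user_prefix(user_id):
    """Hash bytes followed by the user id, so one user's rows stay contiguous."""
    raw = struct.pack(">Q", user_id)
    return hashlib.md5(raw).digest()[:2] + raw

def index_key(user_id, created_ms, message_id):
    """<hash><user_id><MAX_TS - created_ms><message_id>: newer messages sort first."""
    reverse_ts = struct.pack(">Q", MAX_TS - created_ms)
    return user_prefix(user_id) + reverse_ts + message_id.encode("utf-8")

def scan_range(user_id, from_ms, to_ms):
    """Start key (inclusive) and stop key (exclusive) for from_ms <= created_ms <= to_ms."""
    prefix = user_prefix(user_id)
    start = prefix + struct.pack(">Q", MAX_TS - to_ms)
    stop = prefix + struct.pack(">Q", MAX_TS - from_ms + 1)
    return start, stop

def message_id_of(key):
    """Strip the fixed-width prefix and reverse timestamp from a key."""
    return key[PREFIX_LEN + TS_LEN:].decode("utf-8")

# A sorted list stands in for the index table.
table = sorted([
    index_key(88, 1000, "m1"),
    index_key(88, 2000, "m8"),
    index_key(88, 3000, "m10"),
    index_key(99, 2500, "m255"),
])
start, stop = scan_range(88, 1500, 3000)
inbox = [message_id_of(k) for k in table if start <= k < stop]
print(inbox)
```

Because every field before the message id has a fixed width, the byte order of the keys matches (user, newest first), and the scan only touches the rows of one user inside the requested time range.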

On Aug 14, 2012, at 6:44 AM, Lukáš Drbal <lu...@gmail.com> wrote:

> [...]

Re: Secondary indexes suggestions

Posted by Lukáš Drbal <lu...@gmail.com>.

Hi,

Thanks a lot for all the responses.

Otis: the filter from your link looks great, I'll check it in my tests.

Michael: I understand what secondary indexes are, but I still have no
idea what an effective rowkey format looks like. I'm OK with a delay in
creating the secondary index and with the lack of atomicity; we don't
need "realtime" data.


Say I have 10 messages with ids 1, 8, 10, 255, ... from one user with
id 88. I see only 2 options for the rowkey in the secondary index:

1) composite rowkey like <userId><SEPARATOR><messageId>
2) use userId as rowkey and put messageId into cells
Are there any others?

With the first method, I must scan over many rows. What about the
startRow for the scanner? Can this scan be efficient?

The second method needs very many cells and I don't need them all at
once, so IMHO this is a bad idea.
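
For option 1, a small Python sketch of how the row keys sort and how a start/stop row pair can bound the scan to one user. The separator byte, the text encoding of the ids, and the helper names (text_key, fixed_key, prefix_range) are assumptions made for illustration only. It also shows one thing worth checking early: ids written as text sort byte by byte, not numerically, while fixed-width big-endian ids keep numeric order.

```python
import struct

SEP = b"\x00"  # hypothetical separator byte, assumed never to occur inside an id

def text_key(user_id, message_id):
    """<userId><SEPARATOR><messageId> with both ids written as decimal text."""
    return str(user_id).encode() + SEP + str(message_id).encode()

def fixed_key(user_id, message_id):
    """Same two fields as fixed-width big-endian integers (no separator needed)."""
    return struct.pack(">QQ", user_id, message_id)

def prefix_range(user_id):
    """Start row = userId + separator, stop row = userId + the byte after the separator."""
    user = str(user_id).encode()
    return user + SEP, user + bytes([SEP[0] + 1])

ids = [1, 8, 10, 255]
text_order = sorted(ids, key=lambda m: text_key(88, m))
fixed_order = sorted(ids, key=lambda m: fixed_key(88, m))

# Rows of users 88, 880 and 9 mixed together, as they would sit in the table.
rows = sorted(text_key(u, m) for u in (88, 880, 9) for m in ids)
start, stop = prefix_range(88)
hits = [r for r in rows if start <= r < stop]
print(text_order, fixed_order, len(hits))
```

The range scan returns only the four rows of user 88 and skips user 880, because the separator byte sorts below every digit. With text ids the messages come back as 1, 10, 255, 8, so a fixed-width encoding is the safer choice if order matters.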


-- 
Save The World - http://www.worldcommunitygrid.org/
http://www.worldcommunitygrid.org/stat/viewMemberInfo.do?userName=LesTR

Lukas Drbal

Re: Secondary indexes suggestions

Posted by lars hofhansl <lh...@yahoo.com>.

Maybe we know one thing or the other about this :)  There are pros and cons to both approaches. Nothing was dismissed.
For global, table-level indexes we need some kind of distributed commit protocol. For "index transactions" it is slightly simpler, because
they are known ahead of time to be idempotent; maybe we can come up with something less strict than 2PC/Paxos.

Naively updating another table and saying "now we have secondary indexes" *is* going to bring surprises,
as it will work until it breaks because of concurrency issues. If you want that, write to two tables from your client.

I think this is a good discussion.

-- Lars

________________________________
 From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org 
Cc: "apurtell@apache.org Purtell" <ap...@apache.org>; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, August 14, 2012 8:01 PM
Subject: Re: Secondary indexes suggestions

[...]

Re: Secondary indexes suggestions

Posted by Michael Segel <mi...@hotmail.com>.

Perhaps not dismissive but more focused on indexing at the region. 
And it wasn't just you, but also Lars. 

Also, don't read what I am saying as an argument. It's not. ;-P

I think the issue is how to approach the problem. 


On Aug 14, 2012, at 9:49 PM, Andrew Purtell <ap...@apache.org> wrote:

> [...]

Re: Secondary indexes suggestions

Posted by Andrew Purtell <ap...@apache.org>.

On Tue, Aug 14, 2012 at 7:38 PM, Michael Segel
<mi...@hotmail.com> wrote:
> I think you need to think outside of the box...
> But I think you've been too dismissive of looking at the index at the table level and not at the region level.

I'd be interested if you can point out exactly where I dismissed
something, as in "this is not a good idea..." or "this is wrong..." or
any other explicit statement. Otherwise, you are reading in something
as implicit that isn't there. I contributed a few thoughts on the
subject as opposed to writing a treatise. Why does this have to be an
argument instead of a discussion?

But if you don't mind I'm not going to look at this thread further.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Secondary indexes suggestions

Posted by Michael Segel <mi...@hotmail.com>.

I think you need to think outside of the box...

I've thought about it a little more and while there's validity to indexing at the RS, there's a bit more of a headache. 

But I think you've been too dismissive of looking at the index at the table level and not at the region level. 

 
On Aug 14, 2012, at 8:59 PM, Andrew Purtell <ap...@apache.org> wrote:

> [...]

Re: Secondary indexes suggestions

Posted by Andrew Purtell <ap...@apache.org>.

Hey Lars,

On Tue, Aug 14, 2012 at 5:08 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Yep. It's not simple if (and only if) your data is changing a lot. Michael is right, though, that it is a simple problem if your data is static.

Yeah, a good option for that is MR processes that emit, in one shot,
HFiles for bulk import of infrequent updates into the primary table
and all projections/materializations/indices. We have an application
that does this in production.

> Todd Lipcon and I were talking last week, and he mentioned primitives like logged updates, operations that will eventually complete; as long as log replay can be forced before a read operation, they can be used for consistent indexes.

Back to HBASE-3340 again. :-)

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Secondary indexes suggestions

Posted by lars hofhansl <lh...@yahoo.com>.

Thanks Andy.

Yep. It's not simple if (and only if) your data is changing a lot. Michael is right, though, that it is a simple problem if your data is static.

In my mind we should think about providing the building blocks (like the limited cross-row transaction stuff I did a while back),
rather than forcing a particular implementation.

o some folks cannot tolerate cross-region-server lookups/updates
o others will want index-covered queries, i.e. denormalization
o others will want the equivalent of materialized views
o some have natural shards (or tenants, maybe), and shards should be co-located with their indexes
o etc.

Todd Lipcon and I were talking last week, and he mentioned primitives like logged updates, operations that will eventually complete;
as long as log replay can be forced before a read operation, they can be used for consistent indexes.

-- Lars

----- Original Message -----
From: Andrew Purtell <ap...@apache.org>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Monday, August 13, 2012 8:11 PM
Subject: Re: Secondary indexes suggestions

[...]

Re: Secondary indexes suggestions

Posted by Andrew Purtell <ap...@apache.org>.

Please pardon me while I ramble; this started off as a short response and
is now... lengthy.

I've also seen Megastore-inspired secondary index implementations that
clone the data from the primary table into the secondary table, by
sort order of the attribute that is indexed. In Megastore this was
configurable on a per index table basis: "Accessing entity data
through indexes is normally a two-step process: first the index is
read to find matching primary keys, then these keys are used to fetch
entities. We provide a way to denormalize portions of entity data
directly into index entries. By adding the STORING clause to an index
[...]" A naive implementation of this for HBase will require
consistency checking of the index table(s) because it is easy for the
denormalized data to become stale in some places if a client (or
coprocessor) fails mid-write or if the index update is significantly
delayed from the primary table update. A non-naive implementation will
need some Paxos-ish commit protocol that is difficult to implement
correctly and doubly difficult to make perform well. Without that extra
layer, it is assured that the index is always slightly out of date. The
lag can increase
substantially if index region(s) are in transition when the primary
table write happens, and then the client (or coprocessor) has to wait
to update the index. You could also do this in reverse, update the
index table first. Either the client would have to do this as Lars
says, or a background MapReduce based process might be employed, or
both.

Without denormalization you have the possibility of dangling
pointers in the index tables, or data in the primary table that is not
fully indexed. These cases would also have to be found and fixed; the
secondary index could potentially always be in some slight state of
disrepair.

CCIndex (https://github.com/Jia-Liu/CCIndex) was a scheme such as the
above that also reduced the replication factor of denormalized index
tables to soften the storage impact of the data duplication, and
patched HBase core to regenerate the index table from the primary
table if one of the index HFiles became corrupt. This is a
questionable idea in my opinion, but it does lead to the interesting
question of whether HBase should support trapping HFile IOEs to enable
this sort of thing to be built as a coprocessor.

A secondary indexing coprocessor could force the colocation of regions
of a primary table with the regions of index tables that map back to
them. Cross-region transactions are possible within a single
RegionServer. A MasterObserver could control region placement. A
RegionObserver on the region of the primary table could transact with
those on the regions of the index tables. A WALObserver could group
the update to the primary table and indexes into a single WAL entry.
Should the RegionServer crash mid transaction, all updates would be
replayed from the WAL, maintaining at all times the consistency of the
index(es) with respect to the primary table. But I see a number of
challenges with this. Foremost, your availability concerns are no longer
limited to the regions of the primary table possibly being in
transition; updates to the primary table would now need to block until
all relevant index regions are migrated over to where the primary
regions are resident. It may be worth trying to do something like this,
but evicting regions to make room for colocation of index table
regions with primary table regions could get out of hand. After a
couple of RegionServers fail, perhaps quickly after each other, would
the cluster converge to full availability? This would have to be
extensively tested.

The above is a fair amount of (over)engineering for the case where the
client should be oblivious to how secondary indexing is done on the cluster.
If that is not a design constraint, then HBase 0.94+ has limited cross-row
atomicity within a single region. So if you are able to construct
primary record keys and index record keys such that they all fall
within the keyspace of a single region, then this can be done today:
the client can send them up as a group packed into a single RPC and be
assured of server-side atomic commit. However, doing such keyspace
engineering while also aiming for efficient queries could be a big
challenge.
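
A rough Python sketch of what such keyspace engineering could look like, under loud assumptions: the key layout (owner id first, then a one-byte record type), the helper names, and the split points are all hypothetical, and the sketch assumes regions only ever split on owner boundaries, which HBase does not guarantee by default. It only checks that a primary row and its index row land in the same region; it does not perform the atomic multi-row write itself.

```python
import bisect
import struct

MAX_TS = 2**63 - 1  # assumed constant used to build a reverse timestamp

def owner(user_id):
    """Fixed-width owner prefix shared by a user's primary rows and index rows."""
    return struct.pack(">Q", user_id)

def primary_key(user_id, message_id):
    """<owner>m<message_id>: the message record itself."""
    return owner(user_id) + b"m" + message_id.encode("utf-8")

def inbox_index_key(user_id, created_ms, message_id):
    """<owner>i<reverse timestamp><message_id>: the index record for the same message."""
    reverse_ts = struct.pack(">Q", MAX_TS - created_ms)
    return owner(user_id) + b"i" + reverse_ts + message_id.encode("utf-8")

def region_of(key, split_points):
    """Index of the region [split[i-1], split[i]) that contains key."""
    return bisect.bisect_right(split_points, key)

# Hypothetical split points that fall exactly on owner boundaries.
split_points = [owner(100), owner(200)]

for user_id in (88, 100, 150):
    p = region_of(primary_key(user_id, "m1"), split_points)
    i = region_of(inbox_index_key(user_id, 1000, "m1"), split_points)
    print(user_id, p, i)
```

The catch is visible in the message example from this thread: a message has both a sender and a recipient, so it can share a prefix with at most one of the two indexes, which is one reason this approach is hard to combine with efficient queries.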

On Mon, Aug 13, 2012 at 5:42 PM, lars hofhansl <lh...@yahoo.com> wrote:
> [...]

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: Secondary indexes suggestions

Posted by lars hofhansl <lh...@yahoo.com>.

Secondary indexes are only simple when you ignore concurrent updates and failing clients.
A client could manage to write the index first and then fail on the main row (that can be handled by always rechecking the main row and always scanning all versions of the index rows, which is hard/expensive in a scan).
You can also have a WAL, which you check upon each read and reapply all outstanding changes. (2ndary index updates are nice in that they are idempotent).
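
To make the failure case and the idempotence point concrete, here is a toy Python model in which two dicts stand in for the main table and the index table. Everything in it (put_index, put_main, inbox, the recipient_id column) is a hypothetical illustration, not HBase API, and it ignores versions and concurrency entirely.

```python
def put_index(index, recipient_id, message_id):
    """Index write: the cell is fully determined by its inputs, so replays are harmless."""
    index[(recipient_id, message_id)] = b""

def put_main(main, message_id, recipient_id, text):
    """Main-table write for one message."""
    main[message_id] = {"recipient_id": recipient_id, "text": text}

def inbox(main, index, recipient_id):
    """Follow the index, but trust only entries confirmed by the main row."""
    hits = []
    for (rid, message_id) in sorted(index):
        row = main.get(message_id)
        if rid == recipient_id and row is not None and row["recipient_id"] == recipient_id:
            hits.append(message_id)
    return hits

main, index = {}, {}

# The client writes the index entry, then dies before writing the main row.
put_index(index, 88, "m1")
dangling_view = inbox(main, index, 88)   # the recheck hides the dangling entry

# Recovery replays the whole operation; repeating the index write changes nothing.
put_index(index, 88, "m1")
put_main(main, "m1", 88, "hello")
snapshot = dict(index)
put_index(index, 88, "m1")
repaired_view = inbox(main, index, 88)
print(dangling_view, repaired_view, index == snapshot)
```

The recheck costs one main-table lookup per index hit, which is the expense mentioned above.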

Similarly, there are other scenarios that make this hard, which is the reason why HBase doesn't have them.
We've been thinking about primitives to add to HBase to make building/using 2ndary indexes easier/feasible.

Should indexes be global (i.e. it is up to a client or coprocessor to gather the matches and requery the actual rows)? Or local (which means a query needs to farm out many queries in parallel to all index sites)?
Both have pros and cons.

I think the key point of the fuzzy filter is that it can actually seek ahead (using the HBase Filter seek hints), which has the potential to be far more efficient than a full scan.
In fact, local indexes would probably be implemented that way: you always scan the main table and use the index information to seek ahead.
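
For readers who have not seen fuzzy row matching, a small Python sketch of the matching rule only. The key layout (4-byte sender id followed by 4-byte recipient id) and the mask convention used here (1 = any byte, 0 = must match) are assumptions of this sketch. It tests every key, so it shows what is matched, not the seek-ahead that makes the real filter cheaper than a full scan.

```python
import struct

def fuzzy_match(row, pattern, mask):
    """True if row agrees with pattern at every position where mask is 0."""
    if len(row) < len(pattern):
        return False
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))

def key(sender_id, recipient_id):
    """Hypothetical fixed-width row key: <sender_id><recipient_id>."""
    return struct.pack(">II", sender_id, recipient_id)

# "Any sender, recipient 88": the first four bytes are wildcards.
pattern = key(0, 88)
mask = bytes([1, 1, 1, 1, 0, 0, 0, 0])

rows = sorted([key(1, 88), key(2, 77), key(3, 88), key(4, 99)])
matches = [struct.unpack(">II", r) for r in rows if fuzzy_match(r, pattern, mask)]
print(matches)
```

This only works because every field has a fixed width, so the recipient id always sits at the same byte offsets.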

Just my $0.02, though. :)

-- Lars

----- Original Message -----
From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org; Otis Gospodnetic <ot...@yahoo.com>
Cc: 
Sent: Monday, August 13, 2012 5:28 PM
Subject: Re: Secondary indexes suggestions

[...]

Re: Secondary indexes suggestions

Posted by Michael Segel <mi...@hotmail.com>.

Not really a good idea or anything new. 
Essentially a full table scan where you're doing a closer inspection on the key to see if it matches your search regex, before actually fetching the entire row and returning it. 

Secondary indexes are pretty straightforward.
You have your primary key and then your value.
A secondary index is a table where the key is one of your values from the main base table, and the value is the key from the base table.

So if your main key is 12345, and you store {'Fred', 'Cleveland', 'Ohio'}  == {Name, City, State}

You could create an index on State where you store 'Ohio' as the key, and a column value of 12345.

Then if you search the second table for the row with the key 'Ohio', you'll get the keys of all the matching rows in the base table. In this example, a row with the key '12345' ...
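
The same example as a toy Python model, with dicts standing in for the two tables. The second and third base rows are made up for illustration, and storing each base key as its own column in the index row is one possible reading of "a column value of 12345", chosen here so that several base rows can share one index row.

```python
base = {
    "12345": {"Name": "Fred", "City": "Cleveland", "State": "Ohio"},
    "23456": {"Name": "Wilma", "City": "Columbus", "State": "Ohio"},   # made-up row
    "34567": {"Name": "Barney", "City": "Austin", "State": "Texas"},   # made-up row
}

def build_index(base, column):
    """Index table: row key = indexed value, one column per matching base key."""
    index = {}
    for row_key, columns in base.items():
        index.setdefault(columns[column], {})[row_key] = b""
    return index

def lookup(base, index, value):
    """Two steps: read the index row, then fetch each base row it points to."""
    return [(k, base[k]) for k in sorted(index.get(value, {}))]

state_index = build_index(base, "State")
ohio = [k for k, _ in lookup(base, state_index, "Ohio")]
print(ohio)
```

Note that a low-cardinality value such as a state produces very wide index rows, which is the same row-size concern raised elsewhere in this thread.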

HTH

On Aug 13, 2012, at 4:49 PM, Otis Gospodnetic <ot...@yahoo.com> wrote:

> [...]

Re: Secondary indexes suggestions

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Lukáš, have a look at this recent post on this topic:


http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/ 


Otis 
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm 



>________________________________
> From: Lukáš Drbal <lu...@gmail.com>
>To: user@hbase.apache.org 
>Sent: Sunday, August 12, 2012 8:15 AM
>Subject: Secondary indexes suggestions
> 
>[...]