Posted to dev@hbase.apache.org by Stack <st...@duboce.net> on 2012/09/01 00:59:45 UTC

Re: HBase Developer's Pow-wow.

On Thu, Aug 30, 2012 at 3:42 PM, Stack <st...@duboce.net> wrote:
> On Thu, Aug 30, 2012 at 3:36 PM, Devaraj Das <dd...@hortonworks.com> wrote:
>> Should we move it to that week to accommodate Lars?
>>
>
> We could.  We had it set for the week of 10th so Andrew could come.
> Andrew could you come the following week?

An off-list exchange has it that Andrew can't make the following week
so I'd say, because LarsG showed up on the thread later, let's stick w/
the original proposal of 9/11.

What time would suit? 6pm?  Max 20?  30?

I'll put a post up on meetup.com for bay area hbase.
St.Ack

RE: HBase Developer's Pow-wow.

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Stack, I may not be able to join, seeing the time is 2pm, which is 2AM over here.
Anyway I can share my thoughts after the discussions are drafted in a
writeup.

Regards
Ram
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> Sent: Monday, September 03, 2012 9:11 PM
> To: dev@hbase.apache.org
> Subject: Re: HBase Developer's Pow-wow.
> 
> On Fri, Aug 31, 2012 at 3:59 PM, Stack <st...@duboce.net> wrote:
> > I'll put a post up on meetup.com for bay area hbase.
> 
> I put the meetup up here:
> http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX).
> Let me know if any of the details are off (Thanks to Jon for the bulk
> of the text).
> St.Ack


Re: HBase Developer's Pow-wow.

Posted by lars hofhansl <lh...@yahoo.com>.
I'm back from the woods (and yes, I'm already reading the dev list, sigh) :)

I'll be back at work tomorrow, but I might have to tie some other knots first. Let's see.

I'd also be interested to join the talk about 2ndary indexing.


In addition I can talk a bit about
- the profiling I did, and maybe mention some (just 1 or 2 really) gotchas to avoid in the future
- the additions to the coprocessor framework I added
- thoughts about backups (?)
- using iterator trees instead of scanners (although the relational DB world apparently has become a bit skeptical) (?)


Let me know.

I won't have time to prepare much for this, though. So it would be an ad hoc discussion, maybe with some white boarding.

-- Lars


----- Original Message -----
From: Jesse Yates <je...@gmail.com>
To: dev@hbase.apache.org
Cc: 
Sent: Sunday, September 9, 2012 3:11 PM
Subject: Re: HBase Developer's Pow-wow.

>
> We are missing fellas to lead a chat on process change ideas (How to
> have it so Jenkins is more blue than red; How do we enforce more rigor
> around what gets committed, etc.).  Anyone want to volunteer?  I'd
> volunteer LarsH since he was last to float these eternally recurring
> notions but I believe he will be up on Half Dome looking down on us
> when the meeting goes off.  Anyone else want to lead the discussion
> (Jon?  Andrew?)?
>


I thought Lars would be back by the meetup, but let's get a second talker
on it too :)

> Anyone want to lead a discussion on what's next?  Post 0.96?
>
> Anything else that folks want to talk about?
>

I think we talked about wanting to do secondary indexing as well, at least
what that means for HBase (and maybe some of the _how_ it would work too).

-Jesse

-------------------
Jesse Yates
@jesse_yates
jyates.github.com


Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
On Mon, Sep 10, 2012 at 6:20 PM, Matt Corgan <mc...@hotpads.com> wrote:

> ... snipping lots of helpful use cases...


It seems like portions of what you discussed would probably be nominally
impacted by indexes while others would be heavily impacted.  Also seems like
compound-qualifier indexing would potentially be of interest to you...
(although I'm not sure how much it would buy you). Are you going to be at
the powwow tomorrow?


> Seems like there are 3 categories of sparseness:
> 1) sparse indexes (like ipAddress) where a per-table approach is more
> efficient for reads
> 2) dense indexes (like eventType) where there are likely values of every
> index key on each region
> 3) very dense indexes (like male/female) where you should just be doing a
> table scan anyway
>


Yes.  I probably shouldn't have used the male/female example since you're
right that a table scan is probably the best option in that case.  For
category one, I was imagining a situation of more extreme sparseness, such
as one target row in a large number of regions.  This is where the
check-every-region cost of the region-based approach is most egregious.
I'd probably put anything present in a small percentage of regions in the
second case. (I also wonder if, in the single row scenario, a judicious
use of bloomfilters might provide satisfactory performance even if you do
need to hit all regions-- one of the things we've used as a guiding
principle for our search stuff is that if you're trying to hit realtime,
you can actually eat the most latency on the smallest scan since you have
so little data to move around...depends on allowable memory usage I
suppose.)


> Why is the per-region
> approach more beneficial than the per-table?  Is it because it's easier to
> plug into hbase's existing per-region MapReduce splitter?
>

Part of it has to do with a bunch of non-HBase work I've been doing over
the past few years.  That's why I really hope people share as many use
cases as possible... so that the conclusions that come out of our work are
representative of everyone's needs (as much as possible).  What makes me
lean towards region-level for a lot of use cases are the following:  (I
hadn't even really thought about the existing MR splitter.)
- How to maintain consistency (maybe this is unimportant?)
- How to avoid network bottleneck as the cluster expands (in the case of a
per-table approach, you're going to have to pass primary keys around
constantly except in the case that the only value you want is the indexed
value and you saved that entire value in the index table.)
- How to maximize scale.  (In the per table case, a particular set of
indexed values will probably be colocated among a fraction of all nodes.
 Any kind of parallel/MR job will then be constrained by these nodes.)
- How to minimize long term storage cost of indexes.  (If we have
region-level relationships, we can get more tightly coupled over time and
use more efficient compact approaches like the store file position approach
I tossed out in one of my other emails.)

I spent some time in the Cassandra community doing a review of various
indexing use cases.  I should go take another look to see what they do and
how it works for them...


>> Thanks for starting the important discussion.

Lots to talk about. Lots to potentially do.  It will be interesting to see
who has time to put against this as that will probably substantially
constrain all of our great ideas :)

Jacques

Re: HBase Developer's Pow-wow.

Posted by Matt Corgan <mc...@hotpads.com>.
One sparse use case for us is rate limit detection.  We store user events
in an Event table whose primary key is a unique timestamp (sharded to avoid
hotspotting) and which has eventType and ipAddress columns.  We manually
keep a separate table (the index, also sharded) called EventByDateIpType
with row format [year/month/date/ipAddress/eventType/eventId].  Background
jobs are constantly scanning the index to count combinations of
ipAddress+eventType to hunt down the people that are doing things like
adding spam to the site.  Then we might dig up all the events for a suspect
ipAddress, where the absolute busiest ipAddress might account for .1% of
the events in a day, so pretty sparse.  A per-table index is a must-have
here.
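
The index row format above can be put in code; this is an illustrative sketch, not HotPads' actual implementation. The class name, the '/' delimiter, and the zero-padding are assumptions (a real encoding would likely be fixed-width binary), but the padding shows why lexicographic key order lines up with chronological order:

```java
// Illustrative sketch of composing a row key for the EventByDateIpType index
// table in the [year/month/date/ipAddress/eventType/eventId] layout described
// above. The '/' delimiter and zero-padding are assumptions; the padding keeps
// lexicographic order aligned with chronological order.
public class EventIndexKey {

    public static String indexKey(int year, int month, int day,
                                  String ipAddress, String eventType, long eventId) {
        return String.format("%04d/%02d/%02d/%s/%s/%d",
                year, month, day, ipAddress, eventType, eventId);
    }

    // Prefix covering every entry for one ip+type on one day -- the slice a
    // background rate-limit job would scan and count.
    public static String dayIpTypePrefix(int year, int month, int day,
                                         String ipAddress, String eventType) {
        return String.format("%04d/%02d/%02d/%s/%s/",
                year, month, day, ipAddress, eventType);
    }
}
```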

For this same Event table, there are also dense indexes like
EventByDateType whose row key is [year/month/date/eventType/eventId].
 There are only about 200 eventTypes.  If we have 1 million of a certain
eventType on a given day where we need to access the primary rows, we do a
scan on the EventByDateType index table and pull the rows out of the Event
table in batches.  One nice aspect of this is that we are getting the rows
in globally sorted order.  Either per-table or per-region indexes would
work here, but i guess i'm failing to see the read-time benefit of the
per-region index.

Seems like there are 3 categories of sparseness:
1) sparse indexes (like ipAddress) where a per-table approach is more
efficient for reads
2) dense indexes (like eventType) where there are likely values of every
index key on each region
3) very dense indexes (like male/female) where you should just be doing a
table scan anyway

Jacques, you say "If we're talking about a gender column on a user profile
table, you really want that
to be spread out among all regions".  Can you expand on that more?  I guess
i don't understand your read pattern.  If you have 5 million of each user,
you are probably not doing a single select of all males.  You will probably
have to iterate through them in small batches.  Why is the per-region
approach more beneficial than the per-table?  Is it because it's easier to
plug into hbase's existing per-region MapReduce splitter?  If so, could you
just as easily feed the separate per-table index into MapReduce?

Thanks for starting the important discussion.


On Mon, Sep 10, 2012 at 4:40 PM, Jacques <wh...@gmail.com> wrote:

> >
> > All of my use-cases would require Per-table indexes.  Per-region is
> easier
> > to keep consistent at write-time, but it seems useless to me for the
> large
> > tables that hbase is designed for (because you have to hit every region
> for
> > each read).
> >
>
> Can you expound on use cases?  The pros and cons are heavily dependent on
> the sparseness of the indexed values and the particular use case.  If we're
> talking about a gender column on a user profile table, you really want that
> to be spread out among all regions.  If we're talking about an email
> address... not so much.
>

Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
>
> All of my use-cases would require Per-table indexes.  Per-region is easier
> to keep consistent at write-time, but it seems useless to me for the large
> tables that hbase is designed for (because you have to hit every region for
> each read).
>

Can you expound on use cases?  The pros and cons are heavily dependent on
the sparseness of the indexed values and the particular use case.  If we're
talking about a gender column on a user profile table, you really want that
to be spread out among all regions.  If we're talking about an email
address... not so much.

RE: HBase Developer's Pow-wow.

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.
Hi

Yes, a separate index table along with the main table and the master should
ensure that the regions of both tables are collocated during assignments.

The regions in the index table can be the same as those of the main table in
the sense that both should have the same start and end keys.

Different indices can be grouped within these regions.  

In the case of sparse data the index creation is definitely going to be
beneficial.
In the case of dense data the indices may be an overhead in some cases.

In one of the Cassandra wiki pages I also read that they suggest having at
least one EQUALS condition in any query that tries to use indices. This
helps confine the results to a specific set over which the range queries
can be applied.  So maybe at the first level we can see what gain we get
when we use an EQUALS condition, but anyway the framework can be generic
enough to handle both range queries and EQUALS condition queries.
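
A rough sketch of how an EQUALS condition confines a scan: the equals component becomes the scan's inclusive start row, and the exclusive stop row is that prefix with its last non-0xFF byte incremented (the standard prefix-stop trick). The class and key layout here are illustrative assumptions, not an existing HBase API:

```java
import java.util.Arrays;

// Illustrative, standalone sketch: given an equals prefix over the leading
// index component, compute scan bounds [startRow, stopRow) that cover exactly
// the rows beginning with that prefix.
public class PrefixScanBounds {

    // Inclusive start: every index row for this exact value begins with it.
    public static byte[] startRow(byte[] equalsPrefix) {
        return equalsPrefix.clone();
    }

    // Exclusive stop: bump the last byte that won't overflow, truncate after it.
    public static byte[] stopRow(byte[] equalsPrefix) {
        byte[] stop = equalsPrefix.clone();
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        return new byte[0]; // all 0xFF: scan to the end of the table
    }
}
```

Range conditions on the components after the equals prefix then apply within this single contiguous slice of the index.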

After the meet up is over, I can go through the discussion topics and
provide our experiences also.  

Regards
Ram


> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Tuesday, September 11, 2012 9:52 AM
> To: dev@hbase.apache.org
> Subject: Re: HBase Developer's Pow-wow.
> 
> Regarding this:
> 
> On Mon, Sep 10, 2012 at 12:13 PM, Matt Corgan <mc...@hotpads.com>
> wrote:
> > 1) Per-region or Per-table
> [...]
> > 1)
> > - Per-region: the index entries are stored on the same machine as the
> > primary rows
> > - Per-table: each index is stored in a separate table, requiring
> > cross-server consistency
> 
> LarsH and I were discussing this a bit. This doesn't have to be a
> choice, it could be possible to have both, a separate table for index
> storage, and colocation of the index table regions and primary table
> regions on the same regionserver so cross-region consistency issues
> can be dealt with through low latency in-memory channels. (With
> fallback to cross-server consistency mechanism when placement can't be
> ideal when the cluster is out of steady state due to failure/churn.)
> The master might assign primary and index regions out together as a
> group.
> 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)


Re: HBase Developer's Pow-wow.

Posted by Matt Corgan <mc...@hotpads.com>.
Jacques - i'll be there tomorrow.  Look forward to talking.  Some comments
before then:

- How to maintain consistency (maybe this is unimportant?)

Not unimportant at all.  In fact, I picture the whole secondary index
conversation as a lower level goal of supporting consistent cross-region
updates.  I'm hesitant on some of the region co-location ideas because they
look like optimizations on the end goal of consistency across servers.  All
of the optimizations are nice, but the real meat of the problem is how to
bake cross-region consistency in at the ground level as opposed to patching
it on as the failure case where an index region gets separated from its
parent.  It would be better to get the cross-server stuff working first,
and then optimize the same-server scenario.  That is, like you say, if
anybody has time =)

- How to avoid network bottleneck as the cluster expands (in the case of
> a per-table approach, you're going to have pass primary keys
> around constantly except in the case that the only value you want is the
> indexed value and you saved that entire value in the index table.)

In my use cases, i typically scan batches of ~1000 index entries from the
index table (~1 RPC / ~1 data block), and then i issue a multiGet to fetch
the primary rows.  Because the index is sorted by the primary rows, they
all go to the first region in the table which again equates to ~1 RPC.  So
maybe it's 2 RPC's instead of 1 which doesn't seem too bad.

- How to maximize scale.  (In the per table case, a particular set of indexed
> values will probably be colocated among a fraction of all nodes.

Writes will definitely be slightly faster in the per-region case, but at
the huge expense of reads having to go to multiple servers.  In terms of
number of regions (R), the additional write expense is O(1) whereas the
read expense is on average O(R/2).  If you have 100 regions of users and
want to look up a userId by email, you have to jump through 50 regions on
average to find the user.
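
The cost argument above can be put in back-of-envelope form; the numbers are illustrative (real costs depend on caching, region sizes, and parallelism), not measurements:

```java
// Back-of-envelope model of the read-cost comparison above: with R regions, a
// per-table index lookup costs a roughly constant number of RPCs, while a
// per-region index averages R/2 region probes to find a single matching row.
public class IndexReadCost {

    // Per-table index: ~1 RPC to scan the index plus ~1 multiGet for the
    // primary rows, independent of the number of regions.
    public static double perTableRpcs() {
        return 2.0;
    }

    // Per-region index: expected regions probed before a single hit among R.
    public static double perRegionProbes(int regions) {
        return regions / 2.0;
    }
}
```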

I spent some time in the Cassandra community doing a review of various indexing
> use cases.  I should go take another look to see what they do and how it
> works for them...

HBase has a lot of similarities to Cassandra but i would say it is a
different beast when it comes to indexing.  The biggest difference (even
bigger than the tunable consistency) is the fact that hbase stores all rows
in a sorted order that automatically split into regions and evenly
distributed.  Cassandra is not designed to host unpredictably growing
sorted tables (like secondary index tables tend to be), so it makes some
concessions in index design.  Instead of storing each index entry as a
separate row in a rapidly growing table, which hbase deals with nicely
because it can split/balance the index table, cassandra stores all of the
index entries for an index value as columns (qualifiers) in the same row.
 For low cardinality indexes this can create several huge rows which become
hotspots.  Said differently, cassandra is forced to create indexes using
wide tables, where hbase has the luxury of using tall tables.  My cassandra
knowledge is dated, so please correct me if that's wrong.


On Mon, Sep 10, 2012 at 9:47 PM, Ramkrishna.S.Vasudevan <
ramkrishna.vasudevan@huawei.com> wrote:

> Hi
>
> Yes, a separate index table along with the main table and the master should
> ensure that the regions of both tables are collocated during assignments.
>
> The regions in the index table can be the same as those of the main table in
> the sense that both should have the same start and end keys.
>
> Different indices can be grouped within these regions.
>
> In the case of sparse data the index creation is definitely going to be
> beneficial.
> In the case of dense data the indices may be an overhead in some cases.
>
> In one of the Cassandra wiki pages I also read that they suggest having at
> least one EQUALS condition in any query that tries to use indices. This
> helps confine the results to a specific set over which the range queries
> can be applied.  So maybe at the first level we can see what gain we get
> when we use an EQUALS condition, but anyway the framework can be generic
> enough to handle both range queries and EQUALS condition queries.
>
> After the meet up is over, I can go through the discussion topics and
> provide our experiences also.
>
> Regards
> Ram
>
>
> > -----Original Message-----
> > From: Andrew Purtell [mailto:apurtell@apache.org]
> > Sent: Tuesday, September 11, 2012 9:52 AM
> > To: dev@hbase.apache.org
> > Subject: Re: HBase Developer's Pow-wow.
> >
> > Regarding this:
> >
> > On Mon, Sep 10, 2012 at 12:13 PM, Matt Corgan <mc...@hotpads.com>
> > wrote:
> > > 1) Per-region or Per-table
> > [...]
> > > 1)
> > > - Per-region: the index entries are stored on the same machine as the
> > > primary rows
> > > - Per-table: each index is stored in a separate table, requiring
> > > cross-server consistency
> >
> > LarsH and I were discussing this a bit. This doesn't have to be a
> > choice, it could be possible to have both, a separate table for index
> > storage, and colocation of the index table regions and primary table
> > regions on the same regionserver so cross-region consistency issues
> > can be dealt with through low latency in-memory channels. (With
> > fallback to cross-server consistency mechanism when placement can't be
> > ideal when the cluster is out of steady state due to failure/churn.)
> > The master might assign primary and index regions out together as a
> > group.
> >
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet
> > Hein (via Tom White)
>
>

Re: HBase Developer's Pow-wow.

Posted by Andrew Purtell <ap...@apache.org>.
Regarding this:

On Mon, Sep 10, 2012 at 12:13 PM, Matt Corgan <mc...@hotpads.com> wrote:
> 1) Per-region or Per-table
[...]
> 1)
> - Per-region: the index entries are stored on the same machine as the
> primary rows
> - Per-table: each index is stored in a separate table, requiring
> cross-server consistency

LarsH and I were discussing this a bit. This doesn't have to be a
choice, it could be possible to have both, a separate table for index
storage, and colocation of the index table regions and primary table
regions on the same regionserver so cross-region consistency issues
can be dealt with through low latency in-memory channels. (With
fallback to cross-server consistency mechanism when placement can't be
ideal when the cluster is out of steady state due to failure/churn.)
The master might assign primary and index regions out together as a
group.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: HBase Developer's Pow-wow.

Posted by Matt Corgan <mc...@hotpads.com>.
Can indexing be boiled down to these questions to start?

1) Per-region or Per-table
2) Sync or Async
3) Client-managed or Server-managed
4) Schema or Schema-less

Definitions:

1)
- Per-region: the index entries are stored on the same machine as the
primary rows
- Per-table: each index is stored in a separate table, requiring
cross-server consistency

2)
- Sync: the client blocks until all index entries exist
- Async: the client returns when the primary row has been inserted, but
indexes are guaranteed to be created eventually

3)
- Client-managed: client pushes index entries directly to regions, possibly
utilizing some server-side locks or id generators
- Server-managed: client pushes index entries to the same server as the
primary row, letting the server push the index entries on to the
destination regions

4)
- Schema: (complex to even define) client and/or server have info about
column names, value formats, etc.  (Taking this route opens a world of
follow-on questions)
- Schema-less: client provides the index entries which are rows with opaque
row/family/qualifier/timestamp like in normal hbase

Personal opinions:

All of my use-cases would require Per-table indexes.  Per-region is easier
to keep consistent at write-time, but it seems useless to me for the large
tables that hbase is designed for (because you have to hit every region for
each read).

I think Synchronous writes are important for high-consistency (OLTP style)
use cases while Async is important for high-throughput (OLAP style).  I'd
say sync is a more desirable feature because it's easier to roll your own
async.  I would love to see the difference reduced to a per-index-entry
flag on the Put object.
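
A hypothetical shape for that per-index-entry flag; none of these classes exist in HBase, and this only illustrates the proposed API surface, not an implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a "per-index-entry flag on the Put object". The
// IndexedPut and IndexEntry types are invented for illustration; they show how
// a client might mix sync and async index entries on one write.
public class IndexedPutSketch {

    public enum IndexMode { SYNC, ASYNC }

    public static class IndexEntry {
        final byte[] indexRow;
        final IndexMode mode;
        IndexEntry(byte[] indexRow, IndexMode mode) {
            this.indexRow = indexRow;
            this.mode = mode;
        }
    }

    public static class IndexedPut {
        private final byte[] row;
        private final List<IndexEntry> indexEntries = new ArrayList<>();

        public IndexedPut(byte[] row) { this.row = row; }

        // The client attaches opaque index entries; the server would block on
        // SYNC entries and queue ASYNC ones for eventual creation.
        public IndexedPut addIndexEntry(byte[] indexRow, IndexMode mode) {
            indexEntries.add(new IndexEntry(indexRow, mode));
            return this;
        }

        public long syncEntryCount() {
            return indexEntries.stream()
                    .filter(e -> e.mode == IndexMode.SYNC).count();
        }
    }
}
```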

Client-managed vs Server-managed isn't tremendously important.
 Client-managed seems admirable for the sync case, but server-managed is
better for async.  Therefore, probably better to keep the api simple and do
server-managed for both cases with a flag for sync/async.

The notion of adding a schema to hbase for secondary indexing scares me a
little.  Many of us already have ORM-type layers above hbase that do all
sorts of custom serializations.  It would be more flexible to let the
client generate arbitrary index entries and ship them to the server inside
the Put object.

Anyway - my abbreviated 2 cents on a big topic.
Matt

On Mon, Sep 10, 2012 at 11:09 AM, Andrew Purtell <ap...@apache.org>wrote:

> On Mon, Sep 10, 2012 at 12:03 AM, Jacques <wh...@gmail.com> wrote:
> >    - How important is indexing column qualifiers themselves (similar to
> >    Cassandra where people frequently utilize column qualifiers as
> "values"
> >    with no actual values stored)?
>
> It would be good to have a secondary indexing option that can build an
> index from some transform of family+qualifier.
>
> >    - In general it seems like there is tension between the main low level
> >    approaches of (1) leverage as much HBase infrastructure as possible
> (e.g.
> >    secondary tables) and (2) leverage an efficient indexing library e.g.
> >    Lucene.
>
> Regarding option #2, Jason Rutherglen's experiences may be of
> interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
> Codec and CodecProvider classes of Lucene 4 could conceivably support
> storage of postings in HBase proper now
> (http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
> for bringing indexes local for mmapping may not be necessary, though
> this is a huge hand-wave.
>
> The remainder of your mail is focused on option #1, I have no comment
> to add there, lots of food for thought.
>
> > *
> > *
> > *Approach Thoughts*
> > Trying to leverage HBase as much as possible is hard if we want to
> utilize
> > the approach above and have consistent indexing.  However, I think we can
> > do it if we add support for what I will call a "local shadow family".
> >  These are additional, internally managed families for a table.  However,
> > they have the special characteristic that they belong to the region
> despite
> > their primary keys being outside the range of the region's.  Otherwise
> they
> > look like a typical family.  On splits, they are regenerated (somehow).
>  If
> > we take advantage of Lars'
> > HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
> > we then have the opportunity to consistently insert one or more rows into
> > these local shadow families for the purpose of secondary indexing. The
> > structure of these secondary families could use row keys as the indexed
> > values, qualifiers for specific store files and the value of each being a
> > list of originating keys (using read-append or
> > HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
> >  By leveraging the existing family infrastructure, we get things like
> > optional in-memory indexes and basic scanners for free and don't have to
> > swallow a big chunk of external indexing code.
> >
> > The simplest approach for integration of these for queries would
> > internally be a ScannerBasedFilter (a filter that is based on a scanner)
> > and a GroupingScanner (a Scanner that does intersection and/or union of
> > scanners for multi criteria queries).  Implementation of these scanners
> > could happen at one of two levels:
> >
> >    - StoreScanner level: A more efficient approach using the store file
> >    qualifier approach above (this allows easier maintenance of index
> >    deletions)
> >    - RegionScanner level: A simpler implementation with less violation of
> >    existing encapsulation.  We'd store row keys in qualifiers instead of
> >    values to ensure ordering that works iteratively with RegionScanner.
>  The
> >    weaknesses of this approach are less efficient scanning and figuring
> out
> >    how to manage primary value deletes.
> >
> > In general, the best way to deal with deletes is probably to age them out
> > per storefile and just filter "near misses" as a secondary filter that
> > works with ScannerBasedFilter.  The client side would be TBD but would
> > probably offer some kind of criteria filters that on server side had all
> > the lower level ramifications.
> >
> > *Future Optimizations*
> > In a perfect world, we'd actually use StoreFile block start locations as
> > the index pointer values in the secondary families.  This would make
> things
> > much more compact and efficient.  Especially if we used a smarter block
> > codec that took advantage of this nature.  However, this requires quite a
> > bit more work since we'd need to actually use the primary keys in the
> > secondary memstore and then "patch" the values to block locations as we
> > flushed the primary family that we were indexing (ugh).
> >
> > Assuming that the primary limiter of peak write throughput for HBase is
> > typically WAL writing and since indexes have no "real" data, we could
> > consider disabling WAL for local shadow families and simply regenerate
> this
> > data upon primary WAL playback.  I haven't spent enough time in that code
> > to know what kind of consistency pain this would cause  (my intuition is
> it
> > would be fine as long as we didn't fix
> > HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>).
> > If consistency isn't a problem, this would be a nice option since it
> means
> > that indexing would have minimal impact on peak write throughput.
> >
> > *I haven't thought at all about...*
> >
> >    - How/whether this makes sense to be implemented as a coprocessor.
> >    - Weird timestamp impacts/considerations here.
> >    - Version handling/impacts.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein (via Tom White)
>

Re: HBase Developer's Pow-wow.

Posted by Andrew Purtell <ap...@apache.org>.
On Mon, Sep 10, 2012 at 12:03 AM, Jacques <wh...@gmail.com> wrote:
>    - How important is indexing column qualifiers themselves (similar to
>    Cassandra where people frequently utilize column qualifiers as "values"
>    with no actual values stored)?

It would be good to have a secondary indexing option that can build an
index from some transform of family+qualifier.

>    - In general it seems like there is tension between the main low level
>    approaches of (1) leverage as much HBase infrastructure as possible (e.g.
>    secondary tables) and (2) leverage an efficient indexing library e.g.
>    Lucene.

Regarding option #2, Jason Rutherglen's experiences may be of
interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
Codec and CodecProvider classes of Lucene 4 could conceivably support
storage of postings in HBase proper now
(http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
for bringing indexes local for mmapping may not be necessary, though
this is a huge hand-wave.

The remainder of your mail is focused on option #1, I have no comment
to add there, lots of food for thought.

> *
> *
> *Approach Thoughts*
> Trying to leverage HBase as much as possible is hard if we want to utilize
> the approach above and have consistent indexing.  However, I think we can
> do it if we add support for what I will call a "local shadow family".
>  These are additional, internally managed families for a table.  However,
> they have the special characteristic that they belong to the region despite
> their primary keys being outside the range of the region's.  Otherwise they
> look like a typical family.  On splits, they are regenerated (somehow).  If
> we take advantage of Lars'
> HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
> we then have the opportunity to consistently insert one or more rows into
> these local shadow families for the purpose of secondary indexing. The
> structure of these secondary families could use row keys as the indexed
> values, qualifiers for specific store files and the value of each being a
> list of originating keys (using read-append or
> HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
>  By leveraging the existing family infrastructure, we get things like
> optional in-memory indexes and basic scanners for free and don't have to
> swallow a big chunk of external indexing code.
>
> The simplest approach for integration of these for queries would
> internally be a ScannerBasedFilter (a filter that is based on a scanner)
> and a GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries).  Implementation of these scanners
> could happen at one of two levels:
>
>    - StoreScanner level: A more efficient approach using the store file
>    qualifier approach above (this allows easier maintenance of index
>    deletions)
>    - RegionScanner level: A simpler implementation with less violation of
>    existing encapsulation.  We'd store row keys in qualifiers instead of
>    values to ensure ordering that works iteratively with RegionScanner.  The
>    weaknesses of this approach are less efficient scanning and figuring out
>    how to manage primary value deletes.
>
> In general, the best way to deal with deletes is probably to age them out
> per storefile and just filter "near misses" as a secondary filter that
> works with ScannerBasedFilter.  The client side would be TBD but would
> probably offer some kind of criteria filters that on server side had all
> the lower level ramifications.
>
> *Future Optimizations*
> In a perfect world, we'd actually use StoreFile block start locations as
> the index pointer values in the secondary families.  This would make things
> much more compact and efficient.  Especially if we used a smarter block
> codec that took advantage of this nature.  However, this requires quite a
> bit more work since we'd need to actually use the primary keys in the
> secondary memstore and then "patch" the values to block locations as we
> flushed the primary family that we were indexing (ugh).
>
> Assuming that the primary limiter of peak write throughput for HBase is
> typically WAL writing and since indexes have no "real" data, we could
> consider disabling WAL for local shadow families and simply regenerate this
> data upon primary WAL playback.  I haven't spent enough time in that code
> to know what kind of consistency pain this would cause  (my intuition is it
> would be fine as long as we didn't fix
> HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>).
> If consistency isn't a problem, this would be a nice option since it means
> that indexing would have minimal impact on peak write throughput.
>
> *I haven't thought at all about...*
>
>    - How/whether this makes sense to be implemented as a coprocessor.
>    - Weird timestamp impacts/considerations here.
>    - Version handling/impacts.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
See below

On Mon, Sep 10, 2012 at 10:51 AM, Ted Yu <yu...@gmail.com> wrote:

> Jacques:
> Thanks for your sharing.
>
> bq. row-level sharding as opposed to term
>
> Please elaborate on the above a little more: what is term sharding ?
>

If an index is basically a value (or term) pointing back to a row, there
are two main ways that you can slice up the data to scale it.  Let's say you
have ten nodes and you want to index a column that stores values between 1
and 100.  This column's values are likely distributed throughout all the
regions.  The two options would look like:

Option 1 (term sharding): Each node/region holds all pointers for a single
value.  E.g. Node A holds 1-10, B 11-20, C:21-30, etc.  (A variation of
this is hashing the values to avoid distribution problems.)  The strength
of this approach is that if you know you only want values 1-5, you don't
have to have all the nodes evaluate their index.  The downsides are:  you
have to have some kind of cross node/region data approach and consistency
is hard.  You also have problems as your data scales: on a massive scale,
an index can takes a while to iterate through once it gets large you'll
bottleneck this problem to a single machine.

Option 2 (row-sharding): Each node/region holds all pointers for all the
rows that are on that node.  In this case, you have to consult all the
nodes before you get all the values.  More complicated at query time, but
limitless scale and simpler consistency problems.
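To make the contrast concrete, here is a tiny Python sketch of the two layouts (purely illustrative; the ten nodes and the 1-100 value range come from the example above, and none of this is HBase code):

```python
# Toy routing functions for the two index-sharding layouts described above.

def term_shard(value, num_nodes=10, max_value=100):
    """Term sharding: each node owns a contiguous slice of the value space,
    so node 0 holds values 1-10, node 1 holds 11-20, and so on."""
    return (value - 1) * num_nodes // max_value

def row_shard(row_key, num_nodes=10):
    """Row sharding: an index entry lives on whichever node hosts the row."""
    return hash(row_key) % num_nodes

# With term sharding, a query for values 1-5 touches only node 0...
assert {term_shard(v) for v in range(1, 6)} == {0}
# ...but with row sharding the same query must consult every node,
# since matching rows may live anywhere.
```

The assert shows the upside of term sharding (query pruning); the downsides are exactly the ones above: cross-node index maintenance, consistency, and a hot node once one value's pointer list grows large.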




>
> bq. for what I will call a "local shadow family"
>
> I like this idea. User may request more than one index. Currently HBase is
> not so good at serving a high number of families. So we may need to watch
> out.
>
Yeah.  A simple approach could utilize two families, one in-memory and one
not.  No reason a family can't hold multiple indexes.  Just need to get a
little more tricky about how we use things like qualifiers.  Also makes
index dropping more convoluted.



> bq. GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries)
>
> Do you think the following enhancement is related to your proposal above ?
> HBASE-5416 Improve performance of scans with some kind of filters
>

On first glance, I don't think this is really related.  A grouping scanner
would be used to take the secondary index scanners and merge them into a
single filter scanner to then be used when the primary scan is done.


>
> bq. and then "patch" the values to block locations as we
> flushed the primary family that we were indexing (ugh).
>
> Yeah. We also need to consider the effect of compaction.
>
Yeah... painful...


>
> bq. my intuition is it would be fine as long as we didn't fix HBASE-3149
>
> I was actually expecting someone to pick up the work of HBASE-3149 :-)
>

:P

Re: HBase Developer's Pow-wow.

Posted by Ted Yu <yu...@gmail.com>.
Jacques:
Thanks for your sharing.

bq. row-level sharding as opposed to term

Please elaborate on the above a little more: what is term sharding ?

bq. for what I will call a "local shadow family"

I like this idea. User may request more than one index. Currently HBase is
not so good at serving a high number of families. So we may need to watch out.

bq. GroupingScanner (a Scanner that does intersection and/or union of
scanners for multi criteria queries)

Do you think the following enhancement is related to your proposal above ?
HBASE-5416 Improve performance of scans with some kind of filters

bq. and then "patch" the values to block locations as we
flushed the primary family that we were indexing (ugh).

Yeah. We also need to consider the effect of compaction.

bq. my intuition is it would be fine as long as we didn't fix HBASE-3149

I was actually expecting someone to pick up the work of HBASE-3149 :-)

Cheers

On Mon, Sep 10, 2012 at 12:03 AM, Jacques <wh...@gmail.com> wrote:

> more food for thought on secondary indexing...
>
> *Additional questions*:
>
>    - How important is indexing column qualifiers themselves (similar to
>    Cassandra where people frequently utilize column qualifiers as "values"
>    with no actual values stored)?
>    - How important is indexing cell timestamps?
>
>
> *More thoughts/my answers on some of the questions I posed:*
>
>    - From my experience, indexes should be at the region level (e.g.
>    row-level sharding as opposed to term).  Other sharding approaches will
>    likely have scale and consistency problems.
>    - In general it seems like there is tension between the main low level
>    approaches of (1) leverage as much HBase infrastructure as possible (e.g.
>    secondary tables) and (2) leverage an efficient indexing library e.g.
>    Lucene.
>
> *Approach Thoughts*
> Trying to leverage HBase as much as possible is hard if we want to utilize
> the approach above and have consistent indexing.  However, I think we can
> do it if we add support for what I will call a "local shadow family".
>  These are additional, internally managed families for a table.  However,
> they have the special characteristic that they belong to the region despite
> their primary keys falling outside the region's key range.  Otherwise they
> look like a typical family.  On splits, they are regenerated (somehow).  If
> we take advantage of Lars'
> HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
> we then have the opportunity to consistently insert one or more rows into
> these local shadow families for the purpose of secondary indexing. The
> structure of these secondary families could use row keys as the indexed
> values, qualifiers for specific store files and the value of each being a
> list of originating keys (using read-append or
> HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
>  By leveraging the existing family infrastructure, we get things like
> optional in-memory indexes and basic scanners for free and don't have to
> swallow a big chunk of external indexing code.
>
> The simplest approach for integrating these into queries would
> internally be a ScannerBasedFilter (a filter that is based on a scanner)
> and a GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries).  Implementation of these scanners
> could happen at one of two levels:
>
>    - StoreScanner level: A more efficient approach using the store file
>    qualifier approach above (this allows easier maintenance of index
>    deletions)
>    - RegionScanner level: A simpler implementation with less violation of
>    existing encapsulation.  We'd store row keys in qualifiers instead of
>    values to ensure ordering that works iteratively with RegionScanner.  The
>    weaknesses of this approach are less efficient scanning and figuring out
>    how to manage primary value deletes.
>
> In general, the best way to deal with deletes is probably to age them out
> per storefile and just filter "near misses" as a secondary filter that
> works with ScannerBasedFilter.  The client side would be TBD but would
> probably offer some kind of criteria filters that carry all the
> lower-level ramifications on the server side.
>
> *Future Optimizations*
> In a perfect world, we'd actually use StoreFile block start locations as
> the index pointer values in the secondary families.  This would make things
> much more compact and efficient.  Especially if we used a smarter block
> codec that took advantage of this nature.  However, this requires quite a
> bit more work since we'd need to actually use the primary keys in the
> secondary memstore and then "patch" the values to block locations as we
> flushed the primary family that we were indexing (ugh).
>
> Assuming that the primary limiter of peak write throughput for HBase is
> typically WAL writing and since indexes have no "real" data, we could
> consider disabling WAL for local shadow families and simply regenerate this
> data upon primary WAL playback.  I haven't spent enough time in that code
> to know what kind of consistency pain this would cause  (my intuition is it
> would be fine as long as we didn't fix
> HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>).
> If consistency isn't a problem, this would be a nice option since it means
> that indexing would have minimal impact on peak write throughput.
>
>
> *I haven't thought at all about...*
>
>    - How/whether this makes sense to be implemented as a coprocessor.
>    - Weird timestamp impacts/considerations here.
>    - Version handling/impacts.
>
>
>
>
>
> On Sun, Sep 9, 2012 at 8:03 PM, Jacques <wh...@gmail.com> wrote:
>
> > Some random thoughts/questions bubbling around in my mind regarding
> > secondary indexes/indices.
> >
> > What are the top 5 use cases people are trying to solve?
> > What solves more of these needs: synchronous 'transactional' or
> > asynchronous best-effort (or delayed durable) index commit?
> > Does family level indexing make sense or is the real need for qualifier
> > level indexing?
> > What are ideas for a client interface and how transparent is index usage?
> >  (E.g. if you set a filter on a qualifier... )
> > How important is supporting multiple simultaneous criteria or would 90% of
> > use cases be captured with single criteria support?
> > How important is value multi-parsing (e.g. a single value can be indexed
> > to multiple index values: e.g. free text indexing)?
> > What were the challenges and issues with the proof of concept TrendMicro
> > approach that ultimately made it untenable? (was an eventually consistent
> > approach)
> > What are people's thoughts regarding region-level alternative structure,
> > secondary table structure, etc?
> > Is it important to colocate/duplicate indexed values and/or additional
> > portions of data in secondary indices to minimize disk seeks (almost making
> > HBase optionally more columnar in nature)?
> > How important are multi-qualifier indexes? (e.g. when you want to do a
> > query for all users who are male engineers that have kids)
> > How important is partial index matching/ range matching (e.g. startswith
> > and/or between)?
> > How important is ordering of returned values? (e.g. if you support
> > startswith or range matching and you do indexing at the region-level,
> > you'll be able to get back two rows with the same value that are
> > interspersed with rows of different values)
> >
> > These were partially in response to:
> > http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing
> >
> >
> http://apache-hbase.679495.n3.nabble.com/what-s-the-roadmap-of-secondary-index-of-hbase-td2573618.html
> > https://issues.apache.org/jira/browse/HBASE-3529
> > https://issues.apache.org/jira/browse/HBASE-2038
> > https://issues.apache.org/jira/browse/HBASE-3340
> > https://github.com/jyates/culvert
> >
> >
> >
> >
> > On Sun, Sep 9, 2012 at 3:44 PM, Stack <st...@duboce.net> wrote:
> >
> >> On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <je...@gmail.com>
> >> wrote:
> >> > On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote:
> >> >
> >> >> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <jesse.k.yates@gmail.com>
> >> >> wrote:
> >> >> > I think we talked about wanting to do secondary indexing as well, at
> >> >> > least what that means for HBase (and maybe some of the _how_ it would
> >> >> > work too).
> >> >> >
> >> >>
> >> >> Mind leading it Jesse?  You have the necessary qualifications (smile).
> >> >>  Would suggest you include a rehearsal of points made by Andrew
> >> >> Purtell and LarsH in the most recent thread on 2ndary indexes.
> >> >>
> >> >>
> >> > ....ok, I can do that :)
> >>
> >> Adding you to the list... Thanks J,
> >> St.Ack
> >>
> >
> >
>

Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
more food for thought on secondary indexing...

*Additional questions*:

   - How important is indexing column qualifiers themselves (similar to
   Cassandra where people frequently utilize column qualifiers as "values"
   with no actual values stored)?
   - How important is indexing cell timestamps?


*More thoughts/my answers on some of the questions I posed:*

   - From my experience, indexes should be at the region level (e.g.
   row-level sharding as opposed to term).  Other sharding approaches will
   likely have scale and consistency problems.
   - In general it seems like there is tension between the main low level
   approaches of (1) leverage as much HBase infrastructure as possible (e.g.
   secondary tables) and (2) leverage an efficient indexing library e.g.
   Lucene.

*Approach Thoughts*
Trying to leverage HBase as much as possible is hard if we want to utilize
the approach above and have consistent indexing.  However, I think we can
do it if we add support for what I will call a "local shadow family".
 These are additional, internally managed families for a table.  However,
they have the special characteristic that they belong to the region despite
their primary keys falling outside the region's key range.  Otherwise they
look like a typical family.  On splits, they are regenerated (somehow).  If
we take advantage of Lars'
HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
we then have the opportunity to consistently insert one or more rows into
these local shadow families for the purpose of secondary indexing. The
structure of these secondary families could use row keys as the indexed
values, qualifiers for specific store files and the value of each being a
list of originating keys (using read-append or
HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
 By leveraging the existing family infrastructure, we get things like
optional in-memory indexes and basic scanners for free and don't have to
swallow a big chunk of external indexing code.
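A minimal in-memory model of that layout might look like the following (Python; `ShadowFamilyIndex` and all names below are hypothetical illustrations of the row-key/qualifier/value scheme just described, not real HBase API):

```python
from collections import defaultdict

class ShadowFamilyIndex:
    """Toy model of a local shadow family: the index row key is the indexed
    value, the qualifier names the store file, and the cell value is the
    list of originating primary row keys."""

    def __init__(self):
        # indexed_value -> store_file -> [primary row keys]
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, indexed_value, store_file, primary_key):
        # Mirrors the read-append / HBASE-5993 style list-of-keys value.
        self.rows[indexed_value][store_file].append(primary_key)

    def lookup(self, indexed_value):
        """Union of pointers across all store-file qualifiers for a value."""
        hits = []
        for keys in self.rows[indexed_value].values():
            hits.extend(keys)
        return sorted(hits)

idx = ShadowFamilyIndex()
idx.put("engineer", "storefile-1", "row-007")
idx.put("engineer", "storefile-2", "row-003")
assert idx.lookup("engineer") == ["row-003", "row-007"]
```

Keeping one qualifier per store file is what would make aging index deletions out per storefile (the deletes paragraph further down) tractable.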

The simplest approach for integrating these into queries would
internally be a ScannerBasedFilter (a filter that is based on a scanner)
and a GroupingScanner (a Scanner that does intersection and/or union of
scanners for multi criteria queries).  Implementation of these scanners
could happen at one of two levels:

   - StoreScanner level: A more efficient approach using the store file
   qualifier approach above (this allows easier maintenance of index
   deletions)
   - RegionScanner level: A simpler implementation with less violation of
   existing encapsulation.  We'd store row keys in qualifiers instead of
   values to ensure ordering that works iteratively with RegionScanner.  The
   weaknesses of this approach are less efficient scanning and figuring out
   how to manage primary value deletes.
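The AND/OR merging a GroupingScanner would do can be sketched over plain sorted key streams (Python; a hedged illustration only, assuming each index scanner yields primary row keys in sorted order, which the qualifier-as-row-key trick above is meant to guarantee):

```python
import heapq

def union_scan(*scanners):
    """OR across sorted key streams, de-duplicated (multi-criteria union)."""
    last = object()  # sentinel that equals nothing
    for key in heapq.merge(*scanners):
        if key != last:
            last = key
            yield key

def intersect_scan(*scanners):
    """AND across sorted key streams (multi-criteria intersection)."""
    iters = [iter(s) for s in scanners]
    heads = [next(it, None) for it in iters]
    while all(h is not None for h in heads):
        hi = max(heads)
        if all(h == hi for h in heads):
            yield hi
            heads = [next(it, None) for it in iters]
        else:
            # advance only the streams that are behind the current maximum
            heads = [next(it, None) if h < hi else h
                     for h, it in zip(heads, iters)]

males = ["row-1", "row-3", "row-5"]
engineers = ["row-2", "row-3", "row-5"]
assert list(intersect_scan(males, engineers)) == ["row-3", "row-5"]
assert list(union_scan(males, engineers)) == ["row-1", "row-2",
                                              "row-3", "row-5"]
```

The merged stream would then feed a ScannerBasedFilter when the primary scan runs.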

In general, the best way to deal with deletes is probably to age them out
per storefile and just filter "near misses" as a secondary filter that
works with ScannerBasedFilter.  The client side would be TBD but would
probably offer some kind of criteria filters that carry all the lower-level
ramifications on the server side.

*Future Optimizations*
In a perfect world, we'd actually use StoreFile block start locations as
the index pointer values in the secondary families.  This would make things
much more compact and efficient.  Especially if we used a smarter block
codec that took advantage of this nature.  However, this requires quite a
bit more work since we'd need to actually use the primary keys in the
secondary memstore and then "patch" the values to block locations as we
flushed the primary family that we were indexing (ugh).

Assuming that the primary limiter of peak write throughput for HBase is
typically WAL writing and since indexes have no "real" data, we could
consider disabling WAL for local shadow families and simply regenerate this
data upon primary WAL playback.  I haven't spent enough time in that code
to know what kind of consistency pain this would cause  (my intuition is it
would be fine as long as we didn't fix
HBASE-3149<https://issues.apache.org/jira/browse/HBASE-3149>).
If consistency isn't a problem, this would be a nice option since it means
that indexing would have minimal impact on peak write throughput.


*I haven't thought at all about...*

   - How/whether this makes sense to be implemented as a coprocessor.
   - Weird timestamp impacts/considerations here.
   - Version handling/impacts.





On Sun, Sep 9, 2012 at 8:03 PM, Jacques <wh...@gmail.com> wrote:

> Some random thoughts/questions bubbling around in my mind regarding
> secondary indexes/indices.
>
> What are the top 5 use cases people are trying to solve?
> What solves more of these needs: synchronous 'transactional' or
> asynchronous best-effort (or delayed durable) index commit?
> Does family level indexing make sense or is the real need for qualifier
> level indexing?
> What are ideas for a client interface and how transparent is index usage?
>  (E.g. if you set a filter on a qualifier... )
> How important is supporting multiple simultaneous criteria or would 90% of
> use cases be captured with single criteria support?
> How important is value multi-parsing (e.g. a single value can be indexed
> to multiple index values: e.g. free text indexing)?
> What were the challenges and issues with the proof of concept TrendMicro
> approach that ultimately made it untenable? (was an eventually consistent
> approach)
> What are people's thoughts regarding region-level alternative structure,
> secondary table structure, etc?
> Is it important to colocate/duplicate indexed values and/or additional
> portions of data in secondary indices to minimize disk seeks (almost making
> HBase optionally more columnar in nature)?
> How important are multi-qualifier indexes? (e.g. when you want to do a
> query for all users who are male engineers that have kids)
> How important is partial index matching/ range matching (e.g. startswith
> and/or between)?
> How important is ordering of returned values? (e.g. if you support
> startswith or range matching and you do indexing at the region-level,
> you'll be able to get back two rows with the same value that are
> interspersed with rows of different values)
>
> These were partially in response to:
> http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing
>
> http://apache-hbase.679495.n3.nabble.com/what-s-the-roadmap-of-secondary-index-of-hbase-td2573618.html
> https://issues.apache.org/jira/browse/HBASE-3529
> https://issues.apache.org/jira/browse/HBASE-2038
> https://issues.apache.org/jira/browse/HBASE-3340
> https://github.com/jyates/culvert
>
>
>
>
> On Sun, Sep 9, 2012 at 3:44 PM, Stack <st...@duboce.net> wrote:
>
>> On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <je...@gmail.com>
>> wrote:
>> > On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote:
>> >
>> >> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <je...@gmail.com>
>> >> wrote:
>> >> > I think we talked about wanting to do secondary indexing as well, at
>> >> least
>> >> > what that means for HBase (and maybe some of the _how_ it would work
>> >> too).
>> >> >
>> >>
>> >> Mind leading it Jesse?  You have the necessary qualifications (smile).
>> >>  Would suggest you include a rehearsal of points made by Andrew
>> >> Purtell and LarsH in the most recent thread on 2ndary indexes.
>> >>
>> >>
>> > ....ok, I can do that :)
>>
>> Adding you to the list... Thanks J,
>> St.Ack
>>
>
>

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Sun, Sep 9, 2012 at 8:03 PM, Jacques <wh...@gmail.com> wrote:
> Some random thoughts/questions bubbling around in my mind regarding
> secondary indexes/indices.
>

Nice list Jacques.

(Jesse, here is your chance to look real good.  You are getting the
questions in advance!  When Jacques stands up to start asking Tuesday,
you can look real intelligent as you bang out the answers)

St.Ack

Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
>
>
> The use cases considered, at least over here at TM, all come down to
> range scanning over values (e.g. WHERE INTEGER($value) < 50). So we
> need a mapping such that a scan over the index returns either lists of
> pointers to row:family:qualifier, or the value itself embedded in the
> index, following the natural order of values in the primary table as
> given by a comparator. And a number of projections like this.


I was thinking that exact criteria queries were higher priority than range
queries.  Interesting that you have a lot of needs for range queries.
 Performant range queries definitely favor storing values next to the index,
and in general a more compact storage format than is easily achievable
with the shadow family idea.


> A set of
> default comparators for interpreting values as integers, longs,
> floating point, and complex JSON or AVRO records, would be useful.
>

Agreed.  Once a framework is in place, I see these being fairly
straightforward.

Re: HBase Developer's Pow-wow.

Posted by Andrew Purtell <ap...@apache.org>.
Hi Jacques,

> Does family level indexing make sense or is the real need for qualifier
> level indexing?

The use cases considered, at least over here at TM, all come down to
range scanning over values (e.g. WHERE INTEGER($value) < 50). So we
need a mapping such that a scan over the index returns either lists of
pointers to row:family:qualifier, or the value itself embedded in the
index, following the natural order of values in the primary table as
given by a comparator. And a number of projections like this. A set of
default comparators for interpreting values as integers, longs,
floating point, and complex JSON or AVRO records, would be useful.
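One standard way such comparators can work, shown here as a hedged sketch rather than anything HBase ships: encode each typed value so that unsigned byte-wise order matches numeric order, letting a plain index scan answer WHERE INTEGER($value) < 50 with a stop row.

```python
import struct

def encode_int(v):
    """Order-preserving 4-byte encoding for signed 32-bit ints:
    big-endian with the sign bit flipped, so byte order == numeric order."""
    return struct.pack(">I", (v + 2**31) & 0xFFFFFFFF)

# A scan over index row keys strictly below encode_int(50) now implements
# INTEGER($value) < 50, negatives included.
assert encode_int(-1) < encode_int(0) < encode_int(49) < encode_int(50)
```

Similar bit twiddles cover longs and floating point; complex JSON or Avro records would need the pluggable comparators suggested above.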

> What are ideas for a client interface and how transparent is index usage?
>  (E.g. if you set a filter on a qualifier... )

It would be nice if the existing client API can handle it somehow.
Get, Put, Increment, Scan, all of these API objects can transmit
arbitrary attributes from the client to the server. It would be low
friction for a user to modify their use of these existing API objects,
rather than using a completely different interface like coprocessor
Endpoint invocations. (Or, at least a client library should hide that,
in that case.)

> What were the challenges and issues with the proof of concept TrendMicro
> approach that ultimately made it untenable? (was an eventually consistent
> approach)

This was simply a prototype implementation quality issue, nothing
wrong about an eventually consistent approach per se.

> Is it important to colocate/duplicate indexed values and/or additional
> portions of data in secondary indices to minimize disk seeks (almost making
> HBase optionally more columnar in nature)?

I do think we want to offer the Megastore-like option for storing
value data into indexes, and also not. Then we can manage this
tradeoff of minimizing seeks and round trips versus increased storage
utilization on a per-index basis according to the needs of the use
case.

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)

Re: HBase Developer's Pow-wow.

Posted by Jacques <wh...@gmail.com>.
Some random thoughts/questions bubbling around in my mind regarding
secondary indexes/indices.

What are the top 5 use cases people are trying to solve?
What solves more of these needs: synchronous 'transactional' or
asynchronous best-effort (or delayed durable) index commit?
Does family level indexing make sense or is the real need for qualifier
level indexing?
What are ideas for a client interface and how transparent is index usage?
 (E.g. if you set a filter on a qualifier... )
How important is supporting multiple simultaneous criteria or would 90% of
use cases be captured with single criteria support?
How important is value multi-parsing (e.g. a single value can be indexed to
multiple index values: e.g. free text indexing)?
What were the challenges and issues with the proof of concept TrendMicro
approach that ultimately made it untenable? (was an eventually consistent
approach)
What are people's thoughts regarding region-level alternative structure,
secondary table structure, etc?
Is it important to colocate/duplicate indexed values and/or additional
portions of data in secondary indices to minimize disk seeks (almost making
HBase optionally more columnar in nature)?
How important are multi-qualifier indexes? (e.g. when you want to do a
query for all users who are male engineers that have kids)
How important is partial index matching/ range matching (e.g. startswith
and/or between)?
How important is ordering of returned values? (e.g. if you support
startswith or range matching and you do indexing at the region-level,
you'll be able to get back two rows with the same value that are
interspersed with rows of different values)

These were partially in response to:
http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing
http://apache-hbase.679495.n3.nabble.com/what-s-the-roadmap-of-secondary-index-of-hbase-td2573618.html
https://issues.apache.org/jira/browse/HBASE-3529
https://issues.apache.org/jira/browse/HBASE-2038
https://issues.apache.org/jira/browse/HBASE-3340
https://github.com/jyates/culvert




On Sun, Sep 9, 2012 at 3:44 PM, Stack <st...@duboce.net> wrote:

> On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <je...@gmail.com>
> wrote:
> > On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote:
> >
> >> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <je...@gmail.com>
> >> wrote:
> >> > I think we talked about wanting to do secondary indexing as well, at
> >> least
> >> > what that means for HBase (and maybe some of the _how_ it would work
> >> too).
> >> >
> >>
> >> Mind leading it Jesse?  You have the necessary qualifications (smile).
> >>  Would suggest you include a rehearsal of points made by Andrew
> >> Purtell and LarsH in the most recent thread on 2ndary indexes.
> >>
> >>
> > ....ok, I can do that :)
>
> Adding you to the list... Thanks J,
> St.Ack
>

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Sun, Sep 9, 2012 at 3:25 PM, Jesse Yates <je...@gmail.com> wrote:
> On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote:
>
>> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <je...@gmail.com>
>> wrote:
>> > I think we talked about wanting to do secondary indexing as well, at
>> least
>> > what that means for HBase (and maybe some of the _how_ it would work
>> too).
>> >
>>
>> Mind leading it Jesse?  You have the necessary qualifications (smile).
>>  Would suggest you include a rehearsal of points made by Andrew
>> Purtell and LarsH in the most recent thread on 2ndary indexes.
>>
>>
> ....ok, I can do that :)

Adding you to the list... Thanks J,
St.Ack

Re: HBase Developer's Pow-wow.

Posted by Jesse Yates <je...@gmail.com>.
On Sun, Sep 9, 2012 at 3:21 PM, Stack <st...@duboce.net> wrote:

> On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <je...@gmail.com>
> wrote:
> > I think we talked about wanting to do secondary indexing as well, at
> least
> > what that means for HBase (and maybe some of the _how_ it would work
> too).
> >
>
> Mind leading it Jesse?  You have the necessary qualifications (smile).
>  Would suggest you include a rehearsal of points made by Andrew
> Purtell and LarsH in the most recent thread on 2ndary indexes.
>
>
....ok, I can do that :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com


> (Hopefully LarsH is back by Tuesday.  Unless someone else volunteers
> meantime, lets volunteer him to lead the process section).
> St.Ack
>

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Sun, Sep 9, 2012 at 3:11 PM, Jesse Yates <je...@gmail.com> wrote:
> I think we talked about wanting to do secondary indexing as well, at least
> what that means for HBase (and maybe some of the _how_ it would work too).
>

Mind leading it Jesse?  You have the necessary qualifications (smile).
 Would suggest you include a rehearsal of points made by Andrew
Purtell and LarsH in the most recent thread on 2ndary indexes.

(Hopefully LarsH is back by Tuesday.  Unless someone else volunteers
meantime, lets volunteer him to lead the process section).
St.Ack

Re: HBase Developer's Pow-wow.

Posted by Jesse Yates <je...@gmail.com>.
>
> We are missing fellas to lead a chat on process change ideas (How to
> have it so Jenkins is more blue than red; How do we enforce more rigor
> around what gets committed, etc.).  Anyone want to volunteer?  I'd
> volunteer LarsH since he was last to float these eternally recurring
> notions but I believe he will be up on Half Dome looking down on us
> when the meeting goes off.  Anyone else want to lead the discussion
> (Jon?  Andrew?)?
>


I thought Lars would be back by the meetup, but let's get a second talker
on it too :)

Anyone want to lead a discussion on whats next?  Post 0.96?
>
> Anything else that folks want to talk about?
>

I think we talked about wanting to do secondary indexing as well, at least
what that means for HBase (and maybe some of the _how_ it would work too).

-Jesse

-------------------
Jesse Yates
@jesse_yates
jyates.github.com

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Mon, Sep 3, 2012 at 8:40 AM, Stack <st...@duboce.net> wrote:
> On Fri, Aug 31, 2012 at 3:59 PM, Stack <st...@duboce.net> wrote:
>> I'll put a post up on meetup.com for bay area hbase.
>
> I put the meetup up here:
> http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX).
> Let me know if any of the details are off (Thanks to Jon for the bulk
> of the text).

Regards Tuesdays' meetup:

+ We have our Jimmy Xiang to do an overview on recent
AssignmentManager changes and discussion of what we should do in
AM-land over the near future
+ Mighty Enis will talk up his fat Integration Tests addition +
ChaosMonkey messer that is about to be committed and how we can now
check in a new class of tests.

We are missing fellas to lead a chat on process change ideas (How to
have it so Jenkins is more blue than red; How do we enforce more rigor
around what gets committed, etc.).  Anyone want to volunteer?  I'd
volunteer LarsH since he was last to float these eternally recurring
notions but I believe he will be up on Half Dome looking down on us
when the meeting goes off.  Anyone else want to lead the discussion
(Jon?  Andrew?)?

Anyone want to lead a discussion on whats next?  Post 0.96?

Anything else that folks want to talk about?

(I'll post above on the meetup too).

St.Ack

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Tue, Sep 4, 2012 at 9:18 PM, Ramkrishna.S.Vasudevan
<ra...@huawei.com> wrote:
> Stack, I may not be able to join seeing the time 2pm which is 2AM over here.
> Anyway I can share my thoughts after the discussions are drafted in a
> writeup.
>

Understood (Pardon our insensitivity in arriving at a start time that is
2AM for you, Ram).
St.Ack

Re: HBase Developer's Pow-wow.

Posted by Stack <st...@duboce.net>.
On Fri, Aug 31, 2012 at 3:59 PM, Stack <st...@duboce.net> wrote:
> I'll put a post up on meetup.com for bay area hbase.

I put the meetup up here:
http://www.meetup.com/hbaseusergroup/events/80621872/ (2pm at HWX).
Let me know if any of the details are off (Thanks to Jon for the bulk
of the text).
St.Ack