Posted to user@hbase.apache.org by Adrien Mogenet <ad...@gmail.com> on 2013/01/06 21:30:33 UTC

Re: HBase - Secondary Index

Nice topic, perhaps one of the most important for 2013 :-)
I still don't get how you're ensuring consistency between the index table and
the main table without an external component (such as BookKeeper/ZooKeeper).
What's the exact write path in your situation when inserting data?
(WAL/RegionObserver, pre/post put/WALEdit...)

The underlying question is how you ensure that the WALEdits in the index
and main tables are perfectly in sync, and how you're able to roll back in
case of an issue in both WALs?
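
To make the question concrete, here is a minimal sketch of a naive "write main row, write index row, undo on failure" path. It is a toy in-memory model, not HBase code and not the design under discussion (whose details are not given in this thread); `ToyTable`, `IndexWriteError` and `put_with_index` are hypothetical names invented for illustration.

```python
class IndexWriteError(Exception):
    """Raised by the toy table to simulate a failed write."""


class ToyTable:
    """In-memory stand-in for a table: a dict of rowkey -> value."""

    def __init__(self, fail_on=None):
        self.rows = {}
        self.fail_on = fail_on  # rowkey that triggers a simulated failure

    def put(self, rowkey, value):
        if rowkey == self.fail_on:
            raise IndexWriteError(rowkey)
        self.rows[rowkey] = value

    def delete(self, rowkey):
        self.rows.pop(rowkey, None)


def put_with_index(main, index, rowkey, indexed_value, payload):
    """Write the main row, then the index row; undo the main write on failure."""
    previous = main.rows.get(rowkey)  # remembered so the write can be undone
    main.put(rowkey, payload)
    try:
        # Assumed index layout: index rowkey = (indexed value, main rowkey).
        index.put((indexed_value, rowkey), "")
    except IndexWriteError:
        # Compensating action: restore the main table to its prior state.
        if previous is None:
            main.delete(rowkey)
        else:
            main.put(rowkey, previous)
        return False
    return True
```

The sketch only covers a failure that is reported to the writer. A crash between the two writes would leave the tables out of sync, which is exactly the gap the question above is about.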


On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <ke...@gmail.com> wrote:

> > Yes, as you say, as the number of rows to be returned grows, the latency
> > grows too. Seeks within an HFile block are a somewhat expensive operation
> > now (not much, but still). The new prefix trie encoding will be a huge
> > bonus here; seeks will be flying with it. [Ted also presented this at
> > Hadoop China.] Thanks to Matt... :)  I am trying to measure the scan
> > performance with this new encoding, and trying to backport a simple patch
> > to the 0.94 version just for testing... Yes, as the number of results to
> > be returned grows, any index will perform less well, as per my study :)
>
> yes, you are right, I guess it's just a drawback of any index approach.
> Thanks for the explanation.
>
> Shengjie
>
> On 28 December 2012 04:14, Anoop Sam John <an...@huawei.com> wrote:
>
> > > Do you have link to that presentation?
> >
> > http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> >
> > -Anoop-
> >
> > ________________________________________
> > From: Mohit Anchlia [mohitanchlia@gmail.com]
> > Sent: Friday, December 28, 2012 9:12 AM
> > To: user@hbase.apache.org
> > Subject: Re: HBase - Secondary Index
> >
> > On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <an...@huawei.com>
> > wrote:
> >
> > > > Yes, as you say, as the number of rows to be returned grows, the latency
> > > > grows too. Seeks within an HFile block are a somewhat expensive operation
> > > > now (not much, but still). The new prefix trie encoding will be a huge
> > > > bonus here; seeks will be flying with it. [Ted also presented this at
> > > > Hadoop China.] Thanks to Matt... :)  I am trying to measure the scan
> > > > performance with this new encoding, and trying to backport a simple patch
> > > > to the 0.94 version just for testing... Yes, as the number of results to
> > > > be returned grows, any index will perform less well, as per my study :)
> > >
> > > Do you have link to that presentation?
> >
> >
> > > > > btw, quick question: in your presentation, is the scale there seconds or
> > > > > milliseconds? :)
> > > >
> > > > It is seconds. Don't consider the exact values. What matters is the %
> > > > increase in latency :) Those were not high-end machines.
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Shengjie Min [kelvin.msj@gmail.com]
> > > Sent: Thursday, December 27, 2012 9:59 PM
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase - Secondary Index
> > >
> > > > > Didn't follow you completely here. There won't be any get() happening.
> > > > > Since we get the exact rowkey in a region from the index table, we can
> > > > > seek to the exact position and return that row.
> > >
> > > > Sorry, I misused "get()" here; I meant seeking. Yes, if only a
> > > > small number of rows is returned, this works perfectly. As you said, you
> > > > get the exact rowkey positions per region and simply seek to them. I was
> > > > trying to work out the case where the number of result rows increases
> > > > massively. Like in Anil's case: he wants to do a scan query against the
> > > > secondary index (timestamp), "select all rows from timestamp1 to
> > > > timestamp2", with no customerId provided. During that time period, he might
> > > > have a big chunk of rows from different customerIds. The index table returns
> > > > a lot of rowkey positions for different customerIds (I believe they are
> > > > scattered across different regions), so you end up seeking to all those
> > > > positions in different regions and returning all the rows needed. According
> > > > to your presentation, page 14 - Performance Test Results (Scan), without an
> > > > index the time increases linearly as the number of result rows increases.
> > > > With an index, on the other hand, the time spent climbs much more quickly
> > > > than without an index.
> > > >
> > > > btw, quick question: in your presentation, is the scale there seconds or
> > > > milliseconds? :)
> > >
> > > - Shengjie
> > >
> > >
> > > On 27 December 2012 15:54, Anoop John <an...@gmail.com> wrote:
> > >
> > > > > > how the massive number of get() is going to
> > > > > > perform against the main table
> > > > >
> > > > > Didn't follow you completely here. There won't be any get() happening.
> > > > > Since we get the exact rowkey in a region from the index table, we can
> > > > > seek to the exact position and return that row.
> > > >
> > > > -Anoop-
> > > >
> > > > On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <ke...@gmail.com>
> > > > wrote:
> > > >
> > > > > how the massive number of get() is going to
> > > > > > perform against the main table
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > All the best,
> > > Shengjie Min
> > >
> >
>
>
>
> --
> All the best,
> Shengjie Min
>



-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: HBase - Secondary Index

Posted by Mohit Anchlia <mo...@gmail.com>.
It makes sense to use inverted indexes when your indexed columns hold unique
values. But if the values are evenly distributed across the table, then a
parallel search makes more sense. It just depends on the cardinality of your
indexed columns.
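
As a rough illustration of that cardinality argument, the sketch below compares the two lookup styles on a toy in-memory model. Regions are plain dicts, the "request" counts are a crude stand-in for round trips (real clients can batch), and the function names are hypothetical; none of this is HBase API.

```python
def build_global_index(regions, column):
    """Inverted table: column value -> sorted list of rowkeys, across all regions."""
    index = {}
    for region in regions:
        for rowkey, row in region.items():
            index.setdefault(row[column], []).append(rowkey)
    return {value: sorted(keys) for value, keys in index.items()}


def lookup_via_global_index(index, value):
    """One index read, then (unbatched) one fetch per matching rowkey."""
    rowkeys = index.get(value, [])
    return rowkeys, 1 + len(rowkeys)  # (rowkeys found, simulated requests)


def lookup_via_parallel_search(regions, column, value):
    """Ask every region to filter locally: one request per region."""
    hits = sorted(
        rowkey
        for region in regions
        for rowkey, row in region.items()
        if row[column] == value
    )
    return hits, len(regions)


regions = [
    {"a1": {"id": "x1", "make": "volvo"}, "a2": {"id": "x2", "make": "volvo"}},
    {"b1": {"id": "x3", "make": "volvo"}, "b2": {"id": "x4", "make": "saab"}},
    {"c1": {"id": "x5", "make": "volvo"}, "c2": {"id": "x6", "make": "volvo"}},
]
id_index = build_global_index(regions, "id")
make_index = build_global_index(regions, "make")
```

In this toy data, a unique value such as `id == "x3"` costs fewer simulated requests through the inverted index, while a value present in almost every row such as `make == "volvo"` costs fewer with the per-region search.
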
On Tue, Jan 8, 2013 at 5:28 PM, anil gupta <an...@gmail.com> wrote:

> +1 on Lars comment.
>
> Either the client gets the rowkey from the secondary table and then gets the
> real data from the primary table, ** OR ** it sends the request to all the
> RSs (or regions) hosting a region of the primary table.
>
> Anoop is using the latter mechanism. Both mechanisms have their pros and
> cons. IMO, there is no outright winner.
>
> ~Anil Gupta
>
> On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Different use cases.
> >
> >
> > For global point queries you want exactly what you said below.
> > For range scans across many rows you want Anoop's design. As usual, it
> > depends.
> >
> >
> > The tradeoff is bringing a lot of unnecessary data to the client vs. having
> > to contact each region (or at least each region server).
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Michael Segel <mi...@hotmail.com>
> > To: user@hbase.apache.org
> > Sent: Tuesday, January 8, 2013 6:33 AM
> > Subject: Re: HBase - Secondary Index
> >
> > So if you're using an inverted table / index, why on earth are you doing it
> > at the region level?
> >
> > I tried to explain this to others over 6 months ago, and it's not really
> > a good idea.
> >
> > You're overcomplicating this, and you will end up creating performance
> > bottlenecks when your secondary index is completely orthogonal to your row
> > key.
> >
> > To give you an example...
> >
> > Suppose you're CCCIS and you have a large database of auto insurance
> > claims that you've acquired over the years from your Pathways product.
> >
> > Your primary key would be a combination of the insurance company's ID and
> > their internal claim ID for the individual claim.
> > Your row would be all of the data associated with that claim.
> >
> > So now let's say you want to find the average cost to repair a front-end
> > collision of an S80 Volvo.
> > The make and model of the car would be orthogonal to the initial key. This
> > means that the result set containing insurance records for front-end
> > collisions of S80 Volvos would most likely be evenly distributed across the
> > cluster's regions.
> >
> > If you used a series of inverted tables, you would be able to use a series
> > of get()s to get the result set from each index and then find their
> > intersections. (Note that you could also put them in sort order so that the
> > intersections would be fairly straightforward to find.)
> >
> > Doing this at the region level isn't so simple.
> >
> > So I have to ask again: why go through all this and overcomplicate things?
> >
> > Just saying...
> >
> > On Jan 7, 2013, at 7:49 AM, Anoop Sam John <an...@huawei.com> wrote:
> >
> > > Hi,
> > > It is an inverted index based on column value(s).
> > > It will be region-wise indexing. It can work whether or not someone knows
> > > the rowkey range.
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: Mohit Anchlia [mohitanchlia@gmail.com]
> > > Sent: Monday, January 07, 2013 9:47 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase - Secondary Index
> > >
> > > Hi Anoop,
> > >
> > > Am I correct in understanding that this indexing mechanism is only
> > > applicable when you know the row key? It's not an inverted index truly
> > > based on the column value.
> > >
> > > Mohit
> > > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com>
> > wrote:
> > >
> > >> Hi Adrien
> > >> We are handling the consistency between the main table and the
> > >> index table, and the rollback mentioned below, using the CP (coprocessor)
> > >> hooks. The current hooks were not enough for that, though. I am in the
> > >> process of trying to contribute those new hooks, core changes, etc. now...
> > >> Once all are done I will be able to explain in detail.
> > >>
> > >> -Anoop-
>
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: HBase - Secondary Index

Posted by Michel Segel <mi...@hotmail.com>.
You haven't provided a use case...

You are defending a design without showing an example of how it is implemented.
Without a concrete example, it's impossible to determine whether there is a design flaw in the initial design.

Sorry, but I don't think enough thought has been given to this...
I'm trying to understand your reasoning, but without a concrete example... it's kind of hard.


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 9:22 PM, Anoop Sam John <an...@huawei.com> wrote:

> Totally agree with Lars. The design came out of our usage, data distribution style, etc.
> Also, we were not able to compromise on put performance. That is why we arrived at the region-based indexing design built on region collocation :)
> Also, since both the indexing and the index usage happen on the server side, no change is needed in the client, whatever type of client you use: Java code, REST APIs, or anything else. MR-based parallel scans should also be comparatively easy, I feel, as absolutely no changes are needed on the client side. :)
> 
> As Anil said, every approach has its pros and cons, and you need to adopt the one that suits your usage. :)
> 
> -Anoop-

Re: HBase - Secondary Index

Posted by Michel Segel <mi...@hotmail.com>.
I suggest you think more about the problem... Whenever I hear someone talk about unnecessary RPC calls in a distributed system... well, I get skeptical. Especially in light of 10GbE.

Sloppy code is one thing. Being myopic is another.

This is why I am asking for a more concrete use case. Lars makes the point that there is a wide variety of potential use cases. However, trying to solve a bad use case with something that could have a negative impact on overall performance for other use cases... well, I'd like to avoid that if possible.

I think I can give you a couple of examples...

First, if you had a use case where you really needed a distributed OLTP database, I'd say HBase wasn't the right tool and you should look at Informix's XPS engine, provided IBM still sells it... The point being

Second, HBase sits on top of HDFS. Because of design issues in HDFS, like not having R/W access to files, there are limitations in HBase where we have to deal with things like compactions.

I point this out because it's a design constraint that impacts the solutions we design using HBase.
In these two examples, we would want to question the use of HBase as part of the solution. This is why I'm asking for a use case where indexing at the region level makes sense.

It sounds like the idea is to use the secondary index as a filter on a range scan. In an inverted table, the columns are in sort order, so it would be easier and lighter to fetch only the columns within the range of the query. It's a simple extension of using a secondary inverted table for your index.

Just saying...
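
The range-fetch idea above can be sketched on a toy model, assuming the inverted table is flattened into a sorted list of (indexed value, main-table rowkey) pairs. The function name and data layout are hypothetical, chosen only to show why sorted order makes the range cheap to locate; this is not HBase code.

```python
import bisect


def index_range(sorted_entries, low, high):
    """Return main-table rowkeys whose indexed value v satisfies low <= v < high.

    sorted_entries must be a list of (value, rowkey) tuples in sorted order.
    A one-element tuple such as (low,) sorts before every (low, rowkey) pair,
    so bisect_left finds the first entry with value >= low.
    """
    start = bisect.bisect_left(sorted_entries, (low,))
    stop = bisect.bisect_left(sorted_entries, (high,))
    return [rowkey for _, rowkey in sorted_entries[start:stop]]


entries = sorted([(5, "r9"), (1, "r3"), (3, "r1"), (3, "r7"), (8, "r2")])
rowkeys_in_range = index_range(entries, 3, 8)
```

Only the slice between the two binary-search positions is read, so entries outside the queried range are never touched.
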



Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 10:11 PM, ramkrishna vasudevan <ra...@gmail.com> wrote:

> As far as I can see, it's more about using the coprocessor framework in
> this solution, which helps us a great deal in avoiding unnecessary RPC calls
> when we go with region-level indexing.
> 
> Regards
> Ram
> 
Re: HBase - Secondary Index

Posted by ramkrishna vasudevan <ra...@gmail.com>.
As far as I can see, it's more about using the coprocessor framework in
this solution, which helps us a great deal in avoiding unnecessary RPC calls
when we go with region-level indexing.

Regards
Ram

RE: HBase - Secondary Index

Posted by Anoop Sam John <an...@huawei.com>.
Totally agree with Lars. The design came out of our usage pattern, data distribution style, etc.
We were also not able to compromise on put performance. That is why we arrived at the region-level indexing design based on region collocation :)
Also, since both the indexing and the index usage happen on the server side, no change is needed in the client, whatever type of client you use: Java code, REST APIs, or anything else. MR-based parallel scans should also be comparatively easy, I feel, as absolutely no changes are needed on the client side. :)

As Anil said, there are pros and cons to every approach, and you need to adopt whichever one suits your usage. :)
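
To illustrate why the client needs no change when the index is maintained on the server side, here is a minimal sketch in plain Python. It is a toy in-process model and not the HBase coprocessor API: `Region`, `IndexObserver`, `post_put` and `rows_with_value` are hypothetical names, and it leaves out the WAL, consistency and rollback handling discussed elsewhere in this thread.

```python
class IndexObserver:
    """Hypothetical hook object: keeps a region-local index for one column."""

    def __init__(self, column):
        self.column = column
        self.index = {}  # column value -> set of rowkeys in this region

    def post_put(self, rowkey, old_row, new_row):
        # Remove the stale entry first so the index follows updates.
        if old_row is not None and self.column in old_row:
            stale = self.index.get(old_row[self.column], set())
            stale.discard(rowkey)
        if self.column in new_row:
            self.index.setdefault(new_row[self.column], set()).add(rowkey)


class Region:
    """Toy region: rows plus the observers that run next to them."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def put(self, rowkey, row):
        # The client only ever calls put(); index upkeep happens in here.
        old_row = self.rows.get(rowkey)
        self.rows[rowkey] = dict(row)  # simplification: a put replaces the row
        for observer in self.observers:
            observer.post_put(rowkey, old_row, self.rows[rowkey])

    def rows_with_value(self, observer, value):
        # Resolve the index and read the rows inside the same region.
        rowkeys = sorted(observer.index.get(value, ()))
        return [(rowkey, self.rows[rowkey]) for rowkey in rowkeys]


region = Region()
make_index = IndexObserver("make")
region.observers.append(make_index)

region.put("insurerA-claim1", {"make": "Volvo", "cost": 1200})
region.put("insurerA-claim2", {"make": "Saab", "cost": 900})
region.put("insurerA-claim2", {"make": "Volvo", "cost": 950})  # update

volvo_claims = region.rows_with_value(make_index, "Volvo")
```

The caller only uses `put()`, so in this model any kind of client would pick up the index without modification.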

-Anoop-
________________________________________
From: anil gupta [anilgupta84@gmail.com]
Sent: Wednesday, January 09, 2013 6:58 AM
To: user@hbase.apache.org; lars hofhansl
Subject: Re: HBase - Secondary Index

+1 on Lars comment.

Either the client gets the rowkey from secondary table and then gets the
real data from Primary Table. ** OR ** Send the request to all the RS(or
region) hosting a region of primary table.

Anoop is using the latter mechanism. Both the mechanism have their pros and
cons. IMO, there is no outright winner.

~Anil Gupta

On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <la...@apache.org> wrote:

> Different use cases.
>
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usually it
> depends.
>
>
> The tradeoff is bringing a lot of unnecessary data to the client vs having
> to contact each region (or at least each region server).
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Michael Segel <mi...@hotmail.com>
> To: user@hbase.apache.org
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
>
> So if you're using an inverted table / index why on earth are you doing it
> at the region level?
>
> I've tried to explain this to others over 6 months ago and its not really
> a good idea.
>
> You're over complicating this and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated to that claim.
>
> So now lets say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for Front End
> collisions of S80 Volvos would be most likely evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections. (Note that you could also put them in sort order so that the
> intersections would be fairly straight forward to find.
>
> Doing this at the region level isn't so simple.
>
> So I have to again ask why go through and over complicate things?
>
> Just saying...
>
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <an...@huawei.com> wrote:
>
> > Hi,
> > It is inverted index based on column(s) value(s)
> > It will be region wise indexing. Can work when some one knows the rowkey
> range or NOT.
> >
> > -Anoop-
> > ________________________________________
> > From: Mohit Anchlia [mohitanchlia@gmail.com]
> > Sent: Monday, January 07, 2013 9:47 AM
> > To: user@hbase.apache.org
> > Subject: Re: HBase - Secondary Index
> >
> > Hi Anoop,
> >
> > Am I correct in understanding that this indexing mechanism is only
> > applicable when you know the row key? It's not an inverted index truly
> > based on the column value.
> >
> > Mohit
> > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com>
> wrote:
> >
> >> Hi Adrien
> >>                 We are making the consistency btw the main table and
> >> index table and the roll back mentioned below etc using the CP hooks.
> The
> >> current hooks were not enough for those though..  I am in the process of
> >> trying to contribute those new hooks, core changes etc now...  Once all
> are
> >> done I will be able to explain in details..
> >>
> >> -Anoop-
> >> ________________________________________
> >> From: Adrien Mogenet [adrien.mogenet@gmail.com]
> >> Sent: Monday, January 07, 2013 2:00 AM
> >> To: user@hbase.apache.org
> >> Subject: Re: HBase - Secondary Index
> >>
> >> Nice topic, perhaps one of the most important for 2013 :-)
> >> I still don't get how you're ensuring consistency between index table
> and
> >> main table, without an external component (such as
> bookkeeper/zookeeper).
> >> What's the exact write path in your situation when inserting data ?
> >> (WAL/RegionObserver, pre/post put/WALedit...)
> >>
> >> The underlying question is about how you're ensuring that WALEdit in
> Index
> >> and Main tables are perfectly sync'ed, and how you 're able to rollback
> in
> >> case of issue in both WAL ?
> >>
> >>
> >> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <ke...@gmail.com>
> >> wrote:
> >>
> >>>> Yes as you say when the no of rows to be returned is becoming more and
> >>> more the latency will be becoming more.  seeks within an HFile block is
> >>> some what expensive op now. (Not much but still)  The new encoding
> >>> prefix
> >>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
> >> also
> >>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying
> to
> >>> measure the scan performance with this new encoding . Trying to back
> >> port
> >>> a simple patch for 94 version just for testing...   Yes when the no of
> >>> results to be returned is more and more any index will become less
> >>> performing as per my study  :)
> >>>
> >>> yes, you are right, I guess it's just a drawback of any index approach.
> >>> Thanks for the explanation.
> >>>
> >>> Shengjie
> >>>
> >>> On 28 December 2012 04:14, Anoop Sam John <an...@huawei.com> wrote:
> >>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> >>>>
> >>>> -Anoop-
> >>>>
> >>>> ________________________________________
> >>>> From: Mohit Anchlia [mohitanchlia@gmail.com]
> >>>> Sent: Friday, December 28, 2012 9:12 AM
> >>>> To: user@hbase.apache.org
> >>>> Subject: Re: HBase - Secondary Index
> >>>>
> >>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <an...@huawei.com>
> >>>> wrote:
> >>>>
> >>>>> Yes as you say when the no of rows to be returned is becoming more
> >> and
> >>>>> more the latency will be becoming more.  seeks within an HFile block
> >> is
> >>>>> some what expensive op now. (Not much but still)  The new encoding
> >>> prefix
> >>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
> >>>> also
> >>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am
> >> trying
> >>> to
> >>>>> measure the scan performance with this new encoding . Trying to back
> >>>> port a
> >>>>> simple patch for 94 version just for testing...   Yes when the no of
> >>>>> results to be returned is more and more any index will become less
> >>>>> performing as per my study  :)
> >>>>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>>
> >>>>>> btw, quick question- in your presentation, the scale there is
> >> seconds
> >>> or
> >>>>> mill-seconds:)
> >>>>>
> >>>>> It is seconds.  Dont consider the exact values. What is the % of
> >>> increase
> >>>>> in latency is important :) Those were not high end machines.
> >>>>>
> >>>>> -Anoop-
> >>>>> ________________________________________
> >>>>> From: Shengjie Min [kelvin.msj@gmail.com]
> >>>>> Sent: Thursday, December 27, 2012 9:59 PM
> >>>>> To: user@hbase.apache.org
> >>>>> Subject: Re: HBase - Secondary Index
> >>>>>
> >>>>>> Didnt follow u completely here. There wont be any get() happening..
> >>> As
> >>>>> the
> >>>>>> exact rowkey in a region we get from the index table, we can seek to
> >>> the
> >>>>>> exact position and return that row.
> >>>>>
> >>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's
> >> just
> >>>>> small number of rows returned, this works perfect. As you said you
> >> will
> >>>> get
> >>>>> the exact rowkey positions per region, and simply seek them. I was
> >>> trying
> >>>>> to work out the case that when the number of result rows increases
> >>>>> massively. Like in Anil's case, he wants to do a scan query against
> >> the
> >>>>> 2ndary index(timestamp): "select all rows from timestamp1 to
> >>> timestamp2"
> >>>>> given no customerId provided. During that time period, he might have
> >> a
> >>>> big
> >>>>> chunk of rows from different customerIds. The index table returns a
> >> lot
> >>>> of
> >>>>> rowkey positions for different customerIds (I believe they are
> >>> scattered
> >>>> in
> >>>>> different regions), then you end up seeking all different positions
> >> in
> >>>>> different regions and return all the rows needed. According to your
> >>>>> presentation page14 - Performance Test Results (Scan), without index,
> >>>> it's
> >>>>> a linear increase as result rows # increases. on the other hand, with
> >>>>> index, time spent climbs up way quicker than the case without index.
> >>>>>
> >>>>> btw, quick question- in your presentation, the scale there is seconds
> >>> or
> >>>>> mill-seconds:)
> >>>>>
> >>>>> - Shengjie
> >>>>>
> >>>>>
> >>>>> On 27 December 2012 15:54, Anoop John <an...@gmail.com> wrote:
> >>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>> perform againt the main table
> >>>>>>
> >>>>>> Didnt follow u completely here. There wont be any get() happening..
> >>> As
> >>>>> the
> >>>>>> exact rowkey in a region we get from the index table, we can seek
> >> to
> >>>> the
> >>>>>> exact position and return that row.
> >>>>>>
> >>>>>> -Anoop-
> >>>>>>
> >>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <
> >> kelvin.msj@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>>> perform againt the main table
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> All the best,
> >>>>> Shengjie Min
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> All the best,
> >>> Shengjie Min
> >>>
> >>
> >>
> >>
> >> --
> >> Adrien Mogenet
> >> 06.59.16.64.22
> >> http://www.mogenet.me
> >>




--
Thanks & Regards,
Anil Gupta

Re: HBase - Secondary Index

Posted by anil gupta <an...@gmail.com>.
+1 on Lars's comment.

Either the client gets the rowkey from the secondary table and then gets the
real data from the primary table, ** OR ** it sends the request to every RS
(or region) hosting a region of the primary table.

Anoop is using the latter mechanism. Both mechanisms have their pros and
cons. IMO, there is no outright winner.
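
The two mechanisms can be compared with a small sketch in plain Python. This is a toy model under stated assumptions: regions are in-memory dicts, `ts` is a hypothetical indexed column, and "calls" is a rough stand-in for round trips (a real client could batch its gets). None of the function names come from HBase.

```python
from collections import defaultdict

# Toy model: each "region" is a dict of rowkey -> row, standing in for a
# region of the primary table. All names here are hypothetical.
REGIONS = [
    {"cust1-001": {"ts": 10}, "cust1-002": {"ts": 25}},
    {"cust2-001": {"ts": 12}, "cust2-002": {"ts": 40}},
    {"cust3-001": {"ts": 11}, "cust3-002": {"ts": 50}},
]


def build_global_index(regions):
    """One separate index 'table': indexed value -> primary rowkeys."""
    index = defaultdict(list)
    for region in regions:
        for rowkey, row in region.items():
            index[row["ts"]].append(rowkey)
    return index


def build_local_indexes(regions):
    """One index per region, covering only that region's rows."""
    return [build_global_index([region]) for region in regions]


def query_via_global_index(regions, index, lo, hi):
    """Mechanism 1: read the index, then fetch each row from the primary table."""
    calls = 1  # one scan of the index table
    rowkeys = [rk for ts, rks in index.items() if lo <= ts <= hi for rk in rks]
    rows = {}
    for rowkey in rowkeys:
        calls += 1  # one get per matching row (no batching in this sketch)
        for region in regions:
            if rowkey in region:
                rows[rowkey] = region[rowkey]
    return rows, calls


def query_via_local_indexes(regions, local_indexes, lo, hi):
    """Mechanism 2: ask every region; each one resolves its own index."""
    calls = 0
    rows = {}
    for region, index in zip(regions, local_indexes):
        calls += 1  # one request per region, whether or not it has matches
        for ts, rowkeys in index.items():
            if lo <= ts <= hi:
                for rowkey in rowkeys:
                    rows[rowkey] = region[rowkey]
    return rows, calls
```

In this model a point query costs the first mechanism two calls but makes the second contact every region, while a wide range costs the first mechanism one call per matching row and the second still one call per region.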

~Anil Gupta

-- 
Thanks & Regards,
Anil Gupta

Re: HBase - Secondary Index

Posted by Michel Segel <mi...@hotmail.com>.
Can you provide a use case?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 6:30 PM, lars hofhansl <la...@apache.org> wrote:

> Different use cases.
> 
> 
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usually it depends.
> 
> 
> The tradeoff is bringing a lot of unnecessary data to the client vs having to contact each region (or at least each region server).
> 
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Michael Segel <mi...@hotmail.com>
> To: user@hbase.apache.org 
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
> 
> So if you're using an inverted table / index why on earth are you doing it at the region level? 
> 
> I've tried to explain this to others over 6 months ago and its not really a good idea. 
> 
> You're over complicating this and you will end up creating performance bottlenecks when your secondary index is completely orthogonal to your row key. 
> 
> To give you an example... 
> 
> Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired over the years from your Pathways product. 
> 
> Your primary key would be a combination of the Insurance Company's ID and their internal claim ID for the individual claim. 
> Your row would be all of the data associated to that claim.
> 
> So now lets say you want to find the average cost to repair a front end collision of an S80 Volvo. 
> The make and model of the car would be orthogonal to the initial key. This means that the result set containing insurance records for Front End collisions of S80 Volvos would be most likely evenly distributed across the cluster's regions. 
> 
> If you used a series of inverted tables, you would be able to use a series of get()s to get the result set from each index and then find their intersections. (Note that you could also put them in sort order so that the intersections would be fairly straight forward to find. 
> 
> Doing this at the region level isn't so simple. 
> 
> So I have to again ask why go through and over complicate things? 
> 
> Just saying... 
> 
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <an...@huawei.com> wrote:
> 
>> Hi,
>> It is inverted index based on column(s) value(s)
>> It will be region wise indexing. Can work when some one knows the rowkey range or NOT.
>> 
>> -Anoop-
>> ________________________________________
>> From: Mohit Anchlia [mohitanchlia@gmail.com]
>> Sent: Monday, January 07, 2013 9:47 AM
>> To: user@hbase.apache.org
>> Subject: Re: HBase - Secondary Index
>> 
>> Hi Anoop,
>> 
>> Am I correct in understanding that this indexing mechanism is only
>> applicable when you know the row key? It's not an inverted index truly
>> based on the column value.
>> 
>> Mohit
>> On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com> wrote:
>> 
>>> Hi Adrien
>>>                  We are making the consistency btw the main table and
>>> index table and the roll back mentioned below etc using the CP hooks. The
>>> current hooks were not enough for those though..  I am in the process of
>>> trying to contribute those new hooks, core changes etc now...  Once all are
>>> done I will be able to explain in details..
>>> 
>>> -Anoop-
>>> ________________________________________
>>> From: Adrien Mogenet [adrien.mogenet@gmail.com]
>>> Sent: Monday, January 07, 2013 2:00 AM
>>> To: user@hbase.apache.org
>>> Subject: Re: HBase - Secondary Index
>>> 
>>> Nice topic, perhaps one of the most important for 2013 :-)
>>> I still don't get how you're ensuring consistency between index table and
>>> main table, without an external component (such as bookkeeper/zookeeper).
>>> What's the exact write path in your situation when inserting data ?
>>> (WAL/RegionObserver, pre/post put/WALedit...)
>>> 
>>> The underlying question is about how you're ensuring that WALEdit in Index
>>> and Main tables are perfectly sync'ed, and how you 're able to rollback in
>>> case of issue in both WAL ?
>>> 
>>> 
>>> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <ke...@gmail.com>
>>> wrote:
>>> 
>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>> more the latency will be becoming more.  seeks within an HFile block is
>>>> some what expensive op now. (Not much but still)  The new encoding
>>>> prefix
>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
>>> also
>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>>> measure the scan performance with this new encoding . Trying to >back
>>> port
>>>> a simple patch for 94 version just for testing...   Yes when the no of
>>>> results to be returned is more and more any index will become less
>>>> performing as per my study  :)
>>>> 
>>>> yes, you are right, I guess it's just a drawback of any index approach.
>>>> Thanks for the explanation.
>>>> 
>>>> Shengjie
>>>> 
>>>> On 28 December 2012 04:14, Anoop Sam John <an...@huawei.com> wrote:
>>>> 
>>>>>> Do you have link to that presentation?
>>>>> 
>>>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
>>>>> 
>>>>> -Anoop-
>>>>> 
>>>>> ________________________________________
>>>>> From: Mohit Anchlia [mohitanchlia@gmail.com]
>>>>> Sent: Friday, December 28, 2012 9:12 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: HBase - Secondary Index
>>>>> 
>>>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <an...@huawei.com>
>>>>> wrote:
>>>>> 
>>>>>> Yes as you say when the no of rows to be returned is becoming more
>>> and
>>>>>> more the latency will be becoming more.  seeks within an HFile block
>>> is
>>>>>> some what expensive op now. (Not much but still)  The new encoding
>>>> prefix
>>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
>>>>> also
>>>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am
>>> trying
>>>> to
>>>>>> measure the scan performance with this new encoding . Trying to back
>>>>> port a
>>>>>> simple patch for 94 version just for testing...   Yes when the no of
>>>>>> results to be returned is more and more any index will become less
>>>>>> performing as per my study  :)
>>>>>> 
>>>>>> Do you have link to that presentation?
>>>>> 
>>>>> 
>>>>>>> btw, quick question- in your presentation, the scale there is
>>> seconds
>>>> or
>>>>>> mill-seconds:)
>>>>>> 
>>>>>> It is seconds.  Dont consider the exact values. What is the % of
>>>> increase
>>>>>> in latency is important :) Those were not high end machines.
>>>>>> 
>>>>>> -Anoop-
>>>>>> ________________________________________
>>>>>> From: Shengjie Min [kelvin.msj@gmail.com]
>>>>>> Sent: Thursday, December 27, 2012 9:59 PM
>>>>>> To: user@hbase.apache.org
>>>>>> Subject: Re: HBase - Secondary Index
>>>>>> 
>>>>>>> Didnt follow u completely here. There wont be any get() happening..
>>>> As
>>>>>> the
>>>>>>> exact rowkey in a region we get from the index table, we can seek to
>>>> the
>>>>>>> exact position and return that row.
>>>>>> 
>>>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's
>>> just
>>>>>> small number of rows returned, this works perfect. As you said you
>>> will
>>>>> get
>>>>>> the exact rowkey positions per region, and simply seek them. I was
>>>> trying
>>>>>> to work out the case that when the number of result rows increases
>>>>>> massively. Like in Anil's case, he wants to do a scan query against
>>> the
>>>>>> 2ndary index(timestamp): "select all rows from timestamp1 to
>>>> timestamp2"
>>>>>> given no customerId provided. During that time period, he might have
>>> a
>>>>> big
>>>>>> chunk of rows from different customerIds. The index table returns a
>>> lot
>>>>> of
>>>>>> rowkey positions for different customerIds (I believe they are
>>>> scattered
>>>>> in
>>>>>> different regions), then you end up seeking all different positions
>>> in
>>>>>> different regions and return all the rows needed. According to your
>>>>>> presentation page14 - Performance Test Results (Scan), without index,
>>>>> it's
>>>>>> a linear increase as result rows # increases. on the other hand, with
>>>>>> index, time spent climbs up way quicker than the case without index.
>>>>>> 
>>>>>> btw, quick question- in your presentation, the scale there is seconds
>>>> or
>>>>>> mill-seconds:)
>>>>>> 
>>>>>> - Shengjie
>>>>>> 
>>>>>> 
>>>>>> On 27 December 2012 15:54, Anoop John <an...@gmail.com> wrote:
>>>>>> 
>>>>>>>> how the massive number of get() is going to
>>>>>>> perform againt the main table
>>>>>>> 
>>>>>>> Didnt follow u completely here. There wont be any get() happening..
>>>> As
>>>>>> the
>>>>>>> exact rowkey in a region we get from the index table, we can seek
>>> to
>>>>> the
>>>>>>> exact position and return that row.
>>>>>>> 
>>>>>>> -Anoop-
>>>>>>> 
>>>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <
>>> kelvin.msj@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> how the massive number of get() is going to
>>>>>>>> perform againt the main table
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> All the best,
>>>>>> Shengjie Min
>>>> 
>>>> 
>>>> 
>>>> --
>>>> All the best,
>>>> Shengjie Min
>>> 
>>> 
>>> 
>>> --
>>> Adrien Mogenet
>>> 06.59.16.64.22
>>> http://www.mogenet.me

Re: HBase - Secondary Index

Posted by Michel Segel <mi...@hotmail.com>.
Sorry, this makes no sense...

You are doing a range scan, I get that... 

Consider that in an inverted table used as your index, each column would be a rowkey from the main table, and the columns are kept in sort order.

Extend get() to take a range pair as parameters and limit the result set returned to those columns which fall within your range...

Problem solved. Right?

The RPC and network traffic are kept to a minimum and you are still solving the underlying use case with cleaner code.

Just saying...
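
A minimal sketch of the range-restricted get() described above, written against plain Java collections rather than the HBase client. The class name InvertedIndexRangeGet, its methods, and the sample keys are hypothetical and only illustrate the idea: one index row per indexed value, with the main table's rowkeys stored as sorted columns, so a range pair selects a contiguous slice of them. HBase itself ships a ColumnRangeFilter that restricts a Get or Scan to a range of column qualifiers, which may cover part of what is being asked for here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory stand-in for an inverted index table; not HBase client code. */
final class InvertedIndexRangeGet {
    // Index row key (the indexed value) -> sorted column qualifiers,
    // where each qualifier is a rowkey of the main table.
    private final Map<String, NavigableSet<String>> rows = new TreeMap<>();

    /** Records that mainRowKey carries indexedValue. */
    void add(String indexedValue, String mainRowKey) {
        rows.computeIfAbsent(indexedValue, k -> new TreeSet<>()).add(mainRowKey);
    }

    /** Returns the main-table rowkeys under indexedValue with startKey <= key < stopKey. */
    List<String> rangeGet(String indexedValue, String startKey, String stopKey) {
        NavigableSet<String> columns = rows.get(indexedValue);
        if (columns == null) {
            return new ArrayList<>();
        }
        // subSet walks only the requested slice of the sorted columns.
        return new ArrayList<>(columns.subSet(startKey, true, stopKey, false));
    }

    public static void main(String[] args) {
        InvertedIndexRangeGet index = new InvertedIndexRangeGet();
        index.add("volvo-s80", "ins01|claim003");
        index.add("volvo-s80", "ins02|claim044");
        index.add("volvo-s80", "ins02|claim010");
        index.add("volvo-s80", "ins03|claim001");
        System.out.println(index.rangeGet("volvo-s80", "ins02", "ins03"));
    }
}
```

Only the two "ins02" keys come back in this run, so the client never sees the columns outside the requested range.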


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 6:30 PM, lars hofhansl <la...@apache.org> wrote:

> Different use cases.
> 
> 
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usually it depends.
> 
> 
> The tradeoff is bringing a lot of unnecessary data to the client vs having to contact each region (or at least each region server).
> 
> 
> -- Lars

Re: HBase - Secondary Index

Posted by lars hofhansl <la...@apache.org>.
Different use cases.


For global point queries you want exactly what you said below.
For range scans across many rows you want Anoop's design. As usual, it depends.


The tradeoff is bringing a lot of unnecessary data to the client vs having to contact each region (or at least each region server).
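
A toy cost model of that tradeoff, assuming a made-up cluster of four regions with three rows each. Everything here (IndexTradeoffSketch, Row, Cost, the timestamp and flagged columns) is hypothetical and only counts two things: how many regions the client has to contact, and how many items are shipped back to it. It is not HBase code and the numbers are not measurements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy cost model for global vs region-local indexes; not HBase code. */
final class IndexTradeoffSketch {
    static final class Row {
        final String key;
        final int timestamp;   // the indexed column in this sketch
        final boolean flagged; // a second, non-indexed column
        Row(String key, int timestamp, boolean flagged) {
            this.key = key;
            this.timestamp = timestamp;
            this.flagged = flagged;
        }
    }

    static final class Cost {
        final int regionsContacted;
        final int itemsShippedToClient;
        Cost(int regionsContacted, int itemsShippedToClient) {
            this.regionsContacted = regionsContacted;
            this.itemsShippedToClient = itemsShippedToClient;
        }
    }

    private static int count(List<Row> region, Predicate<Row> p) {
        int n = 0;
        for (Row r : region) {
            if (p.test(r)) n++;
        }
        return n;
    }

    /** Global index: one index read, every hit's rowkey goes to the client,
     *  and only regions that own a hit are visited afterwards. */
    static Cost globalIndex(List<List<Row>> regions, Predicate<Row> indexed) {
        int hits = 0;
        int regionsWithHits = 0;
        for (List<Row> region : regions) {
            int n = count(region, indexed);
            hits += n;
            if (n > 0) regionsWithHits++;
        }
        return new Cost(1 + regionsWithHits, hits);
    }

    /** Region-local index: every region is asked, but each one applies the
     *  extra filter itself and ships only the rows that survive it. */
    static Cost localIndex(List<List<Row>> regions, Predicate<Row> indexed, Predicate<Row> extra) {
        int shipped = 0;
        for (List<Row> region : regions) {
            shipped += count(region, indexed.and(extra));
        }
        return new Cost(regions.size(), shipped);
    }

    /** Four regions of three rows; timestamps 0..11 are spread across all regions. */
    static List<List<Row>> sampleCluster() {
        List<List<Row>> regions = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            List<Row> region = new ArrayList<>();
            for (int j = 0; j < 3; j++) {
                int ts = j * 4 + i;
                region.add(new Row("cust" + i + "-row" + j, ts, ts % 6 == 0));
            }
            regions.add(region);
        }
        return regions;
    }

    public static void main(String[] args) {
        List<List<Row>> cluster = sampleCluster();
        Cost point = globalIndex(cluster, r -> r.timestamp == 5);
        Cost range = localIndex(cluster, r -> r.timestamp >= 0, r -> r.flagged);
        System.out.println("global point: regions=" + point.regionsContacted
                + " shipped=" + point.itemsShippedToClient);
        System.out.println("local range:  regions=" + range.regionsContacted
                + " shipped=" + range.itemsShippedToClient);
    }
}
```

In this model the point query touches two regions through the global index, while the same query through region-local indexes would touch all four. The wide range query ships twelve rowkeys through the global index but only two rows through the local ones, at the price of contacting every region.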


-- Lars



________________________________
 From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org 
Sent: Tuesday, January 8, 2013 6:33 AM
Subject: Re: HBase - Secondary Index
 
So if you're using an inverted table / index, why on earth are you doing it at the region level?

I tried to explain this to others over 6 months ago, and it's not really a good idea.

You're overcomplicating this, and you will end up creating performance bottlenecks when your secondary index is completely orthogonal to your row key.

To give you an example... 

Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired over the years from your Pathways product. 

Your primary key would be a combination of the Insurance Company's ID and their internal claim ID for the individual claim. 
Your row would be all of the data associated to that claim.

So now let's say you want to find the average cost to repair a front-end collision of an S80 Volvo.
The make and model of the car would be orthogonal to the initial key. This means that the result set containing insurance records for front-end collisions of S80 Volvos would most likely be evenly distributed across the cluster's regions.

If you used a series of inverted tables, you would be able to use a series of get()s to get the result set from each index and then find their intersections. (Note that you could also put them in sort order so that the intersections would be fairly straightforward to find.)

Doing this at the region level isn't so simple. 

So I have to ask again: why go through all this and overcomplicate things?

Just saying... 
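
The intersection step mentioned above can be sketched as a single merge pass over two sorted rowkey lists, one per inverted table. This is an illustration under stated assumptions: SortedIntersection, intersect, and the sample keys are hypothetical, and the lists stand in for the sorted results of two index get()s (for example one index on make and model, one on collision type).

```java
import java.util.ArrayList;
import java.util.List;

/** Merge-style intersection of two sorted rowkey lists; names are illustrative. */
final class SortedIntersection {
    /** Intersects two ascending, duplicate-free lists in one linear pass. */
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int c = a.get(i).compareTo(b.get(j));
            if (c == 0) {
                // Present in both index results: keep it and advance both sides.
                out.add(a.get(i));
                i++;
                j++;
            } else if (c < 0) {
                i++; // a is behind, so move it forward
            } else {
                j++; // b is behind, so move it forward
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> volvoS80 = List.of("ins01|c1", "ins01|c3", "ins02|c2", "ins03|c9");
        List<String> frontEnd = List.of("ins01|c3", "ins02|c1", "ins02|c2");
        System.out.println(intersect(volvoS80, frontEnd));
    }
}
```

Because both inputs are already sorted, the pass is linear in the combined list length and needs no hashing or re-sorting on the client.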


Re: HBase - Secondary Index

Posted by Asaf Mesika <as...@gmail.com>.
I guess one reason is the amount of data traveling over the network. In your
design, you have to query a secondary index table, read all the matching row
keys of the original table, send them back to the client, and then issue a
special scan that retrieves only the values for those row keys. In his
example, he retrieved 2% of the data, which was around 10 million records,
which is around 1 GB according to his key size (800 bytes). That's a lot of
bytes being transferred, and it saturates your switches. In his design you
read the row keys locally, so you can apply the rest of the filters there and
may eventually return just 100 key values that match those extra filters.
That saves tons of bandwidth and lots of RPC calls.
In your example, using his design, you can treat each region as a mini
table, each indexing its own data.
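
A back-of-the-envelope version of that estimate, as a hedged sketch. TransferEstimate and bytesShipped are hypothetical names, and the inputs are simply the figures quoted above. Taken at face value, 10 million keys of 800 bytes each come to about 8 GB rather than 1 GB; the 1 GB figure would fit keys of roughly 100 bytes. Either way the gap to the region-local case is several orders of magnitude.

```java
/** Rough transfer estimate; the inputs are the figures quoted in this thread. */
final class TransferEstimate {
    /** Bytes moved to the client if every matching key is shipped back. */
    static long bytesShipped(long matchingKeys, long bytesPerKey) {
        return matchingKeys * bytesPerKey;
    }

    public static void main(String[] args) {
        // Index table lookup: all 10 million matching row keys travel to the client.
        long viaIndexTable = bytesShipped(10_000_000L, 800L);
        // Region-local filtering: only the ~100 survivors travel (same size assumed).
        long viaLocalIndex = bytesShipped(100L, 800L);
        System.out.println("index table:  " + viaIndexTable + " bytes");
        System.out.println("region-local: " + viaLocalIndex + " bytes");
        System.out.println("ratio:        " + (viaIndexTable / viaLocalIndex));
    }
}
```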

A secondary indexing solution that also supports joins, as any RDBMS does,
has yet to be found, since it's fairly complex.

On Tuesday, January 8, 2013, Michael Segel wrote:

> So if you're using an inverted table / index why on earth are you doing it
> at the region level?
>
> I've tried to explain this to others over 6 months ago and its not really
> a good idea.
>
> You're over complicating this and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated to that claim.
>
> So now lets say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for Front End
> collisions of S80 Volvos would be most likely evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections. (Note that you could also put them in sort order so that the
> intersections would be fairly straight forward to find.
>
> Doing this at the region level isn't so simple.
>
> So I have to again ask why go through and over complicate things?
>
> Just saying...

Re: HBase - Secondary Index

Posted by Michael Segel <mi...@hotmail.com>.
So if you're using an inverted table / index why on earth are you doing it at the region level? 

I tried to explain this to others over six months ago, and it's not really a good idea.

You're overcomplicating this, and you will end up creating performance bottlenecks when your secondary index is completely orthogonal to your row key.

To give you an example... 

Suppose you're CCCIS and you have a large database of auto insurance claims that you've acquired over the years from your Pathways product. 

Your primary key would be a combination of the Insurance Company's ID and their internal claim ID for the individual claim. 
Your row would be all of the data associated with that claim.

So now let's say you want to find the average cost to repair a front-end collision of an S80 Volvo.
The make and model of the car would be orthogonal to the initial key. This means that the result set containing insurance records for front-end collisions of S80 Volvos would most likely be evenly distributed across the cluster's regions.

If you used a series of inverted tables, you would be able to use a series of get()s to get the result set from each index and then find their intersection. (Note that you could also keep them in sort order so that the intersection would be fairly straightforward to find.)
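As a rough illustration of the inverted-table idea, here is a minimal sketch in Python. It assumes plain dictionaries standing in for the main table and for two inverted tables (one keyed by model, one by damage type); the row keys, column names, costs, and helper functions are all hypothetical and are not HBase API calls. Each index lookup plays the role of one get(), and because the row-key lists are kept sorted, the intersection is a single merge pass.

```python
# In-memory stand-ins for HBase tables. All names and values are made up.
# Main table: row key = insurer ID + claim ID, row = the claim data.
claims = {
    "ins01|c001": {"model": "Volvo S80", "damage": "front end", "cost": 4200.0},
    "ins01|c002": {"model": "Volvo S80", "damage": "rear end", "cost": 1800.0},
    "ins02|c001": {"model": "Volvo S80", "damage": "front end", "cost": 3800.0},
    "ins02|c007": {"model": "Honda Civic", "damage": "front end", "cost": 2500.0},
    "ins03|c004": {"model": "Volvo S80", "damage": "front end", "cost": 4000.0},
}


def build_inverted_table(table, column):
    """Map each value of `column` to the sorted list of row keys holding it."""
    index = {}
    for row_key in sorted(table):
        index.setdefault(table[row_key][column], []).append(row_key)
    return index


def intersect_sorted(a, b):
    """Intersect two sorted lists of row keys in one merge pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


by_model = build_inverted_table(claims, "model")
by_damage = build_inverted_table(claims, "damage")

# One lookup per index (the analogue of one get() per inverted table),
# then intersect the two sorted row-key lists.
matches = intersect_sorted(by_model["Volvo S80"], by_damage["front end"])
average_cost = sum(claims[key]["cost"] for key in matches) / len(matches)
print(matches, average_cost)
```

In a real cluster the final step would be a batch of get()s against the main table for the matching row keys; the sketch only shows why sorted index entries make the intersection cheap.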

Doing this at the region level isn't so simple. 

So I have to ask again: why go through all this and overcomplicate things?

Just saying... 

On Jan 7, 2013, at 7:49 AM, Anoop Sam John <an...@huawei.com> wrote:

> Hi,
> It is inverted index based on column(s) value(s)
> It will be region wise indexing. Can work when some one knows the rowkey range or NOT.
> 
> -Anoop-
> ________________________________________
> From: Mohit Anchlia [mohitanchlia@gmail.com]
> Sent: Monday, January 07, 2013 9:47 AM
> To: user@hbase.apache.org
> Subject: Re: HBase - Secondary Index
> 
> Hi Anoop,
> 
> Am I correct in understanding that this indexing mechanism is only
> applicable when you know the row key? It's not an inverted index truly
> based on the column value.
> 
> Mohit
> On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com> wrote:
> 
>> Hi Adrien
>>                 We are making the consistency btw the main table and
>> index table and the roll back mentioned below etc using the CP hooks. The
>> current hooks were not enough for those though..  I am in the process of
>> trying to contribute those new hooks, core changes etc now...  Once all are
>> done I will be able to explain in details..
>> 
>> -Anoop-
>> ________________________________________
>> From: Adrien Mogenet [adrien.mogenet@gmail.com]
>> Sent: Monday, January 07, 2013 2:00 AM
>> To: user@hbase.apache.org
>> Subject: Re: HBase - Secondary Index
>> 
>> Nice topic, perhaps one of the most important for 2013 :-)
>> I still don't get how you're ensuring consistency between index table and
>> main table, without an external component (such as bookkeeper/zookeeper).
>> What's the exact write path in your situation when inserting data ?
>> (WAL/RegionObserver, pre/post put/WALedit...)
>> 
>> The underlying question is about how you're ensuring that WALEdit in Index
>> and Main tables are perfectly sync'ed, and how you 're able to rollback in
>> case of issue in both WAL ?
>> 
>> 
>> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <ke...@gmail.com>
>> wrote:
>> 
>>>> Yes as you say when the no of rows to be returned is becoming more and
>>> more the latency will be becoming more.  seeks within an HFile block is
>>> some what expensive op now. (Not much but still)  The new encoding
>>> prefix
>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
>> also
>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>> measure the scan performance with this new encoding . Trying to >back
>> port
>>> a simple patch for 94 version just for testing...   Yes when the no of
>>> results to be returned is more and more any index will become less
>>> performing as per my study  :)
>>> 
>>> yes, you are right, I guess it's just a drawback of any index approach.
>>> Thanks for the explanation.
>>> 
>>> Shengjie
>>> 
>>> On 28 December 2012 04:14, Anoop Sam John <an...@huawei.com> wrote:
>>> 
>>>>> Do you have link to that presentation?
>>>> 
>>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
>>>> 
>>>> -Anoop-
>>>> 
>>>> ________________________________________
>>>> From: Mohit Anchlia [mohitanchlia@gmail.com]
>>>> Sent: Friday, December 28, 2012 9:12 AM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: HBase - Secondary Index
>>>> 
>>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <an...@huawei.com>
>>>> wrote:
>>>> 
>>>>> Yes as you say when the no of rows to be returned is becoming more
>> and
>>>>> more the latency will be becoming more.  seeks within an HFile block
>> is
>>>>> some what expensive op now. (Not much but still)  The new encoding
>>> prefix
>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted
>>>> also
>>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am
>> trying
>>> to
>>>>> measure the scan performance with this new encoding . Trying to back
>>>> port a
>>>>> simple patch for 94 version just for testing...   Yes when the no of
>>>>> results to be returned is more and more any index will become less
>>>>> performing as per my study  :)
>>>>> 
>>>>> Do you have link to that presentation?
>>>> 
>>>> 
>>>>>> btw, quick question- in your presentation, the scale there is
>> seconds
>>> or
>>>>> mill-seconds:)
>>>>> 
>>>>> It is seconds.  Dont consider the exact values. What is the % of
>>> increase
>>>>> in latency is important :) Those were not high end machines.
>>>>> 
>>>>> -Anoop-
>>>>> ________________________________________
>>>>> From: Shengjie Min [kelvin.msj@gmail.com]
>>>>> Sent: Thursday, December 27, 2012 9:59 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: HBase - Secondary Index
>>>>> 
>>>>>> Didnt follow u completely here. There wont be any get() happening..
>>> As
>>>>> the
>>>>>> exact rowkey in a region we get from the index table, we can seek to
>>> the
>>>>>> exact position and return that row.
>>>>> 
>>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's
>> just
>>>>> small number of rows returned, this works perfect. As you said you
>> will
>>>> get
>>>>> the exact rowkey positions per region, and simply seek them. I was
>>> trying
>>>>> to work out the case that when the number of result rows increases
>>>>> massively. Like in Anil's case, he wants to do a scan query against
>> the
>>>>> 2ndary index(timestamp): "select all rows from timestamp1 to
>>> timestamp2"
>>>>> given no customerId provided. During that time period, he might have
>> a
>>>> big
>>>>> chunk of rows from different customerIds. The index table returns a
>> lot
>>>> of
>>>>> rowkey positions for different customerIds (I believe they are
>>> scattered
>>>> in
>>>>> different regions), then you end up seeking all different positions
>> in
>>>>> different regions and return all the rows needed. According to your
>>>>> presentation page14 - Performance Test Results (Scan), without index,
>>>> it's
>>>>> a linear increase as result rows # increases. on the other hand, with
>>>>> index, time spent climbs up way quicker than the case without index.
>>>>> 
>>>>> btw, quick question- in your presentation, the scale there is seconds
>>> or
>>>>> mill-seconds:)
>>>>> 
>>>>> - Shengjie
>>>>> 
>>>>> 
>>>>> On 27 December 2012 15:54, Anoop John <an...@gmail.com> wrote:
>>>>> 
>>>>>>> how the massive number of get() is going to
>>>>>> perform againt the main table
>>>>>> 
>>>>>> Didnt follow u completely here. There wont be any get() happening..
>>> As
>>>>> the
>>>>>> exact rowkey in a region we get from the index table, we can seek
>> to
>>>> the
>>>>>> exact position and return that row.
>>>>>> 
>>>>>> -Anoop-
>>>>>> 
>>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <
>> kelvin.msj@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> how the massive number of get() is going to
>>>>>>> perform againt the main table
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> All the best,
>>>>> Shengjie Min
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> All the best,
>>> Shengjie Min
>>> 
>> 
>> 
>> 
>> --
>> Adrien Mogenet
>> 06.59.16.64.22
>> http://www.mogenet.me
>> 


RE: HBase - Secondary Index

Posted by Anoop Sam John <an...@huawei.com>.
Hi,
It is an inverted index based on column value(s).
The indexing is region-wise. It works whether or not the rowkey range is known.
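The implementation is not described in this thread, so the following is only a guess at what region-wise indexing could look like, written as a small Python model. The Region class, the key ranges, and the query function are all hypothetical. Each region keeps an index over its own rows only; a query by column value visits every region unless a rowkey range lets it skip the regions that cannot overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical region: a rowkey range plus an index over its own rows."""
    start: str  # inclusive start row key
    end: str    # exclusive end row key
    rows: dict = field(default_factory=dict)
    index: dict = field(default_factory=dict)  # column value -> sorted row keys

    def put(self, row_key, value):
        # The row and its index entry live in the same region.
        self.rows[row_key] = value
        keys = self.index.setdefault(value, [])
        keys.append(row_key)
        keys.sort()


def put(regions, row_key, value):
    """Route a write to the region whose key range contains row_key."""
    for region in regions:
        if region.start <= row_key < region.end:
            region.put(row_key, value)
            return
    raise KeyError(row_key)


def query(regions, value, start=None, stop=None):
    """Return (matching row keys, number of regions visited)."""
    hits, visited = [], 0
    for region in regions:
        # With a rowkey range, skip regions that cannot overlap it.
        if start is not None and region.end <= start:
            continue
        if stop is not None and region.start >= stop:
            continue
        visited += 1
        for key in region.index.get(value, []):
            if (start is None or key >= start) and (stop is None or key < stop):
                hits.append(key)
    return hits, visited


regions = [Region("", "ins02"), Region("ins02", "ins03"), Region("ins03", "~")]
for key, model in [
    ("ins01|c001", "Volvo S80"),
    ("ins01|c002", "Volvo S80"),
    ("ins02|c001", "Volvo S80"),
    ("ins02|c007", "Honda Civic"),
    ("ins03|c004", "Volvo S80"),
]:
    put(regions, key, model)

all_hits, all_visited = query(regions, "Volvo S80")
ranged_hits, ranged_visited = query(regions, "Volvo S80", start="ins02", stop="ins03")
print(all_hits, all_visited)
print(ranged_hits, ranged_visited)
```

Under these assumptions, the query without a rowkey range still works but has to touch all three regions, while the ranged query touches one.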

-Anoop-
________________________________________
From: Mohit Anchlia [mohitanchlia@gmail.com]
Sent: Monday, January 07, 2013 9:47 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Hi Anoop,

Am I correct in understanding that this indexing mechanism is only
applicable when you know the row key? It's not an inverted index truly
based on the column value.

Mohit
On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com> wrote:

> Hi Adrien
>                  We are making the consistency btw the main table and
> index table and the roll back mentioned below etc using the CP hooks. The
> current hooks were not enough for those though..  I am in the process of
> trying to contribute those new hooks, core changes etc now...  Once all are
> done I will be able to explain in details..
>
> -Anoop-
> ________________________________________
> From: Adrien Mogenet [adrien.mogenet@gmail.com]
> Sent: Monday, January 07, 2013 2:00 AM
>  To: user@hbase.apache.org
> Subject: Re: HBase - Secondary Index
>
> Nice topic, perhaps one of the most important for 2013 :-)
> I still don't get how you're ensuring consistency between index table and
> main table, without an external component (such as bookkeeper/zookeeper).
> What's the exact write path in your situation when inserting data ?
> (WAL/RegionObserver, pre/post put/WALedit...)
>
> The underlying question is about how you're ensuring that WALEdit in Index
> and Main tables are perfectly sync'ed, and how you 're able to rollback in
> case of issue in both WAL ?
>
>
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>

Re: HBase - Secondary Index

Posted by Mohit Anchlia <mo...@gmail.com>.
Hi Anoop,

Am I correct in understanding that this indexing mechanism is only
applicable when you know the row key? It's not an inverted index truly
based on the column value.

Mohit
On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <an...@huawei.com> wrote:

> Hi Adrien
>                  We are making the consistency btw the main table and
> index table and the roll back mentioned below etc using the CP hooks. The
> current hooks were not enough for those though..  I am in the process of
> trying to contribute those new hooks, core changes etc now...  Once all are
> done I will be able to explain in details..
>
> -Anoop-
> ________________________________________
> From: Adrien Mogenet [adrien.mogenet@gmail.com]
> Sent: Monday, January 07, 2013 2:00 AM
>  To: user@hbase.apache.org
> Subject: Re: HBase - Secondary Index
>
> Nice topic, perhaps one of the most important for 2013 :-)
> I still don't get how you're ensuring consistency between index table and
> main table, without an external component (such as bookkeeper/zookeeper).
> What's the exact write path in your situation when inserting data ?
> (WAL/RegionObserver, pre/post put/WALedit...)
>
> The underlying question is about how you're ensuring that WALEdit in Index
> and Main tables are perfectly sync'ed, and how you 're able to rollback in
> case of issue in both WAL ?
>
>
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>

RE: HBase - Secondary Index

Posted by Anoop Sam John <an...@huawei.com>.
Hi Adrien,
We handle the consistency between the main table and the index table, as well as the rollback you mention below, using coprocessor (CP) hooks. The current hooks were not enough for this, though. I am in the process of contributing the new hooks and the core changes now. Once all of that is done, I will be able to explain in detail.

-Anoop-
________________________________________
From: Adrien Mogenet [adrien.mogenet@gmail.com]
Sent: Monday, January 07, 2013 2:00 AM
To: user@hbase.apache.org
Subject: Re: HBase - Secondary Index

Nice topic, perhaps one of the most important for 2013 :-)
I still don't get how you're ensuring consistency between index table and
main table, without an external component (such as bookkeeper/zookeeper).
What's the exact write path in your situation when inserting data ?
(WAL/RegionObserver, pre/post put/WALedit...)

The underlying question is about how you're ensuring that WALEdit in Index
and Main tables are perfectly sync'ed, and how you 're able to rollback in
case of issue in both WAL ?





--
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: HBase - Secondary Index

Posted by anil gupta <an...@gmail.com>.
@Mohit: Here is the JIRA for the prefix compression discussed in this thread:
https://issues.apache.org/jira/browse/HBASE-4676

HTH,
Anil Gupta

On Sun, Jan 6, 2013 at 12:40 PM, Adrien Mogenet <ad...@gmail.com> wrote:

> Are your talking about Data block encoding of K/V ?
> https://issues.apache.org/jira/browse/HBASE-4218



-- 
Thanks & Regards,
Anil Gupta

Re: HBase - Secondary Index

Posted by Adrien Mogenet <ad...@gmail.com>.
Are you talking about data block encoding of K/V?
https://issues.apache.org/jira/browse/HBASE-4218
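
For readers unfamiliar with the idea, a minimal sketch of what prefix-style
encoding of sorted keys does may help. It is an illustration only, not HBase's
actual encoder: the class PrefixEncodingSketch, the Entry type and the string
row keys are all hypothetical, and the real encoders operate on KeyValue bytes
inside an HFile block. The prefix-trie encoding mentioned earlier in the thread
is a more elaborate structure and is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixEncodingSketch {

    /** One encoded key: number of leading chars shared with the previous key, plus the rest. */
    public static final class Entry {
        public final int sharedPrefixLength;
        public final String suffix;

        public Entry(int sharedPrefixLength, String suffix) {
            this.sharedPrefixLength = sharedPrefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes keys (assumed to be already sorted) as deltas against the previous key. */
    public static List<Entry> encode(List<String> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int max = Math.min(prev.length(), key.length());
            int shared = 0;
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, key.substring(shared)));
            prev = key;
        }
        return out;
    }

    /** Rebuilds the full keys by replaying the deltas in order. */
    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String key = prev.substring(0, e.sharedPrefixLength) + e.suffix;
            out.add(key);
            prev = key;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical row keys of the form <customerId>_<date>.
        List<String> keys = List.of(
                "customer001_20121227",
                "customer001_20121228",
                "customer002_20121227");
        for (Entry e : encode(keys)) {
            System.out.println(e.sharedPrefixLength + " + \"" + e.suffix + "\"");
        }
    }
}
```

Because each key is stored relative to the one before it, a reader has to
replay the deltas in order to rebuild a key in the middle of a block. That is
one reason seek cost inside a block depends on the encoding.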


On Sun, Jan 6, 2013 at 9:36 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> Does anyone have any links or information on the new prefix encoding feature
> in HBase that's being referred to in this mail?



-- 
Adrien Mogenet
06.59.16.64.22
http://www.mogenet.me

Re: HBase - Secondary Index

Posted by Mohit Anchlia <mo...@gmail.com>.
Does anyone have any links or information on the new prefix encoding feature
in HBase that's being referred to in this mail?

On Sun, Jan 6, 2013 at 12:30 PM, Adrien Mogenet <ad...@gmail.com> wrote:

> Nice topic, perhaps one of the most important for 2013 :-)
> I still don't get how you're ensuring consistency between the index table and
> the main table without an external component (such as BookKeeper/ZooKeeper).
> What's the exact write path in your situation when inserting data?
> (WAL/RegionObserver, pre/post put/WALEdit...)
>
> The underlying question is how you're ensuring that the WALEdits in the index
> and main tables are perfectly in sync, and how you're able to roll back in
> case of an issue in both WALs?