Posted to dev@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2013/02/09 02:38:03 UTC

Get on a row with multiple columns

Hi,

When I do a Get on a row with multiple column qualifiers, do we sort the
column qualifiers and make use of the sorted order when we get the results?

Thanks
Varun
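
The thread below hinges on the fact that cells within a row are stored sorted by qualifier bytes. As a minimal, standard-library-only sketch of that ordering (the comparator here is a simplified stand-in for HBase's Bytes.compareTo, not the real implementation):

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;

public class QualifierOrder {
    // Unsigned lexicographic byte comparison - a simplified stand-in for
    // the ordering HBase applies to column qualifiers within a row.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff; // compare as unsigned
            if (x != y) return x - y;
        }
        return a.length - b.length; // shorter array sorts first
    }

    public static void main(String[] args) {
        // Qualifiers written in arbitrary order come back sorted.
        TreeMap<byte[], String> row = new TreeMap<>(QualifierOrder::compare);
        for (String q : new String[] {"2", "10", "1"}) {
            row.put(q.getBytes(StandardCharsets.UTF_8), "value-" + q);
        }
        StringBuilder order = new StringBuilder();
        for (byte[] q : row.keySet()) {
            order.append(new String(q, StandardCharsets.UTF_8)).append(' ');
        }
        System.out.println(order.toString().trim()); // prints "1 10 2"
    }
}
```

Note that "10" sorts before "2": the order is byte-wise, not numeric, which matters when choosing how to encode qualifiers.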

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
https://issues.apache.org/jira/browse/HBASE-7813

On Mon, Feb 11, 2013 at 8:44 AM, Varun Sharma <va...@pinterest.com> wrote:

> I think I found a bug with the BulkDeleteEndpoint which is causing me to
> lose entire rows even with COLUMN deletes. I filed a JIRA for the same and
> can upload a patch.
>
>
> On Mon, Feb 11, 2013 at 7:36 AM, Varun Sharma <va...@pinterest.com> wrote:
>
>> No,
>>
>> The endpoint executes with normal QoS, but it initiates a scan which
>> seems to execute with high QoS, looking at the handlers. Though I am not
>> totally sure; maybe that region server was housing the .META. table and
>> those were actually scan.next operations for the .META. table, so I will
>> need to confirm this.
>>
>> Varun
>>
>>
>> On Mon, Feb 11, 2013 at 4:50 AM, Anoop Sam John <an...@huawei.com>wrote:
>>
>>> You mean the endpoint is getting executed with high QoS? You checked
>>> with some logs?
>>>
>>> -Anoop-
>>> ________________________________________
>>> From: Varun Sharma [varun@pinterest.com]
>>> Sent: Monday, February 11, 2013 4:05 AM
>>> To: user@hbase.apache.org; lars hofhansl
>>> Subject: Re: Get on a row with multiple columns
>>>
>>> Back to BulkDeleteEndpoint, I got it to work, but why are the
>>> scanner.next() calls executing on the Priority handler queue?
>>>
>>> Varun
>>>
>>> On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <la...@apache.org> wrote:
>>>
>>> > The answer is "probably" :)
>>> > It's disabled in 0.96 by default. Check out HBASE-7008 (
>>> > https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
>>> > there.
>>> >
>>> > Also check out the discussion in HBASE-5943 and HADOOP-8069 (
>>> > https://issues.apache.org/jira/browse/HADOOP-8069)
>>> >
>>> >
>>> > -- Lars
>>> >
>>> >
>>> >
>>> > ________________________________
>>> >  From: Jean-Marc Spaggiari <je...@spaggiari.org>
>>> > To: user@hbase.apache.org
>>> > Sent: Saturday, February 9, 2013 5:02 AM
>>> > Subject: Re: Get on a row with multiple columns
>>> >
>>> > Lars, should we always consider disabling Nagle? What's the down side?
>>> >
>>> > JM
>>> >
>>> > 2013/2/9, Varun Sharma <va...@pinterest.com>:
>>> > > Yeah, I meant true...
>>> > >
>>> > > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org>
>>> wrote:
>>> > >
>>> > >> Should be set to true. If tcpnodelay is set to true, Nagle's is
>>> > disabled.
>>> > >>
>>> > >> -- Lars
>>> > >>
>>> > >>
>>> > >>
>>> > >> ________________________________
>>> > >>  From: Varun Sharma <va...@pinterest.com>
>>> > >> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > >> Sent: Saturday, February 9, 2013 12:11 AM
>>> > >> Subject: Re: Get on a row with multiple columns
>>> > >>
>>> > >>
>>> > >> Okay I did my research - these need to be set to false. I agree.
>>> > >>
>>> > >>
>>> > >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
>>> > >> wrote:
>>> > >>
>>> > >> I have ipc.client.tcpnodelay and ipc.server.tcpnodelay set to
>>> > >> false, and the hbase one - hbase.ipc.client.tcpnodelay - set to
>>> > >> true. Do these induce network latency ?
>>> > >> >
>>> > >> >
>>> > >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org>
>>> > wrote:
>>> > >> >
>>> > >> >Sorry... I meant set these two config parameters to true (not
>>> > >> >false as I state below).
>>> > >> >>
>>> > >> >>
>>> > >> >>
>>> > >> >>
>>> > >> >>----- Original Message -----
>>> > >> >>From: lars hofhansl <la...@apache.org>
>>> > >> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>> > >> >>Cc:
>>> > >> >>Sent: Friday, February 8, 2013 11:41 PM
>>> > >> >>Subject: Re: Get on a row with multiple columns
>>> > >> >>
>>> > >> >>Only somewhat related: seeing the magic 40ms random read time
>>> > >> >>there. Did you disable Nagle's?
>>> > >> >>(Set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to
>>> > >> >>false in hbase-site.xml.)
>>> > >> >>
>>> > >> >>________________________________
>>> > >> >>From: Varun Sharma <va...@pinterest.com>
>>> > >> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>>> > >> >>Sent: Friday, February 8, 2013 10:45 PM
>>> > >> >>Subject: Re: Get on a row with multiple columns
>>> > >> >>
>>> > >> >>The use case is like your twitter feed: tweets from people you
>>> > >> >>follow. When someone unfollows, you need to delete a bunch of his
>>> > >> >>tweets from the following feed. So it's frequent, and we are
>>> > >> >>essentially running into some extreme corner cases like the one
>>> > >> >>above. We need high write throughput for this, since when someone
>>> > >> >>tweets, we need to fan out the tweet to all the followers. We need
>>> > >> >>the ability to do fast deletes (unfollow) and fast adds (follow)
>>> > >> >>and also be able to do fast random gets - when a real user loads
>>> > >> >>the feed. I doubt we will be able to play much with the schema
>>> > >> >>here, since we need to support a bunch of use cases.
>>> > >> >>
>>> > >> >>@lars: It does not take 30 seconds to place 300 delete markers. It
>>> > >> >>takes 30 seconds to first find which of those 300 pins are in the
>>> > >> >>set of columns present - this invokes 300 gets - and then place
>>> > >> >>the appropriate delete markers. Note that we can have tens of
>>> > >> >>thousands of columns in a single row, so a single get is not cheap.
>>> > >> >>
>>> > >> >>If we were to just place delete markers, that would be very fast.
>>> > >> >>But when we started doing that, our random read performance
>>> > >> >>suffered because of too many delete markers. The 90th percentile
>>> > >> >>on random reads shot up from 40 milliseconds to 150 milliseconds,
>>> > >> >>which is not acceptable for our use case.
>>> > >> >>
>>> > >> >>Thanks
>>> > >> >>Varun
>>> > >> >>
>>> > >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
>>> > >> >> wrote:
>>> > >> >>
>>> > >> >>> Can you organize your columns and then delete by column family?
>>> > >> >>>
>>> > >> >>> deleteColumn without specifying a TS is expensive, since HBase
>>> > >> >>> first has to figure out what the latest TS is.
>>> > >> >>>
>>> > >> >>> Should be better in 0.94.1 or later, since deletes are batched
>>> > >> >>> like Puts (still need to retrieve the latest version, though).
>>> > >> >>>
>>> > >> >>> In 0.94.3 or later you can also use the BulkDeleteEndPoint,
>>> > >> >>> which basically lets you specify a scan condition and then place
>>> > >> >>> a specific delete marker for all KVs encountered.
>>> > >> >>>
>>> > >> >>> If you wanted to get really fancy, you could hook up a
>>> > >> >>> coprocessor to the compaction process and simply filter out all
>>> > >> >>> KVs you no longer want (without ever placing any delete markers).
>>> > >> >>>
>>> > >> >>> Are you saying it takes 15 seconds to place 300 version delete
>>> > >> >>> markers?!
>>> > >> >>>
>>> > >> >>>
>>> > >> >>> -- Lars
>>> > >> >>>
>>> > >> >>>
>>> > >> >>>
>>> > >> >>> ________________________________
>>> > >> >>>  From: Varun Sharma <va...@pinterest.com>
>>> > >> >>> To: user@hbase.apache.org
>>> > >> >>> Sent: Friday, February 8, 2013 10:05 PM
>>> > >> >>> Subject: Re: Get on a row with multiple columns
>>> > >> >>>
>>> > >> >>> We are given a set of 300 columns to delete. I tested two cases:
>>> > >> >>>
>>> > >> >>> 1) deleteColumns() - with the 's'
>>> > >> >>>
>>> > >> >>> This function simply adds delete markers for 300 columns; in our
>>> > >> >>> case, typically only a fraction of these columns - around 10 -
>>> > >> >>> are actually present. After starting to use deleteColumns, we
>>> > >> >>> started seeing a drop in cluster-wide random read performance -
>>> > >> >>> 90th percentile latency worsened, and so did 99th, probably
>>> > >> >>> because of having to traverse delete markers. I attribute this
>>> > >> >>> to the profusion of delete markers in the cluster. Major
>>> > >> >>> compactions slowed down by almost 50 percent, probably because
>>> > >> >>> of having to clean out significantly more delete markers.
>>> > >> >>>
>>> > >> >>> 2) deleteColumn()
>>> > >> >>>
>>> > >> >>> Ended up with intolerable 15-second calls, which clogged all the
>>> > >> >>> handlers, making the cluster pretty much unresponsive.
>>> > >> >>>
>>> > >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com>
>>> wrote:
>>> > >> >>>
>>> > >> >>> > For the 300 column deletes, can you show us how the Delete(s)
>>> > >> >>> > are constructed ?
>>> > >> >>> >
>>> > >> >>> > Do you use this method ?
>>> > >> >>> >
>>> > >> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>>> > >> >>> >
>>> > >> >>> > Thanks
>>> > >> >>> >
>>> > >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma
>>> > >> >>> > <varun@pinterest.com> wrote:
>>> > >> >>> >
>>> > >> >>> > > So a Get call with multiple columns on a single row should
>>> > >> >>> > > be much faster than independent Get(s) on each of those
>>> > >> >>> > > columns for that row. I am basically seeing severely poor
>>> > >> >>> > > performance (~ 15 seconds) for certain deleteColumn() calls,
>>> > >> >>> > > and I see that there is a prepareDeleteTimestamps() function
>>> > >> >>> > > in HRegion.java which first tries to locate the column by
>>> > >> >>> > > doing individual gets on each column you want to delete (I
>>> > >> >>> > > am doing 300 column deletes). Now, I think this should
>>> > >> >>> > > ideally be 1 get call with the batch of 300 columns, so that
>>> > >> >>> > > one scan can retrieve the columns, and the columns that are
>>> > >> >>> > > found are indeed deleted.
>>> > >> >>> > >
>>> > >> >>> > > Before I try this fix, I wanted to get an opinion on whether
>>> > >> >>> > > batching the get() will make a difference, and it seems from
>>> > >> >>> > > your answer that it should.
>>> > >> >>> > >
>>> > >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl
>>> > >> >>> > > <larsh@apache.org> wrote:
>>> > >> >>> > >
>>> > >> >>> > > > Everything is stored as a KeyValue in HBase.
>>> > >> >>> > > > The Key part of a KeyValue contains the row key, column
>>> > >> >>> > > > family, column name, and timestamp, in that order.
>>> > >> >>> > > > Each column family has its own store and store files.
>>> > >> >>> > > >
>>> > >> >>> > > > So in a nutshell, a get is executed by starting a scan at
>>> > >> >>> > > > the row key (which is a prefix of the key) in each store
>>> > >> >>> > > > (CF) and then scanning forward in each store until the
>>> > >> >>> > > > next row key is reached. (In reality it is a bit more
>>> > >> >>> > > > complicated, due to multiple versions, skipping columns,
>>> > >> >>> > > > etc.)
>>> > >> >>> > > >
>>> > >> >>> > > >
>>> > >> >>> > > > -- Lars
>>> > >> >>> > > > ________________________________
>>> > >> >>> > > > From: Varun Sharma <va...@pinterest.com>
>>> > >> >>> > > > To: user@hbase.apache.org
>>> > >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
>>> > >> >>> > > > Subject: Re: Get on a row with multiple columns
>>> > >> >>> > > >
>>> > >> >>> > > > Sorry, I was a little unclear with my question.
>>> > >> >>> > > >
>>> > >> >>> > > > Let's say you have:
>>> > >> >>> > > >
>>> > >> >>> > > > Get get = new Get(row);
>>> > >> >>> > > > get.addColumn("1");
>>> > >> >>> > > > get.addColumn("2");
>>> > >> >>> > > > ...
>>> > >> >>> > > >
>>> > >> >>> > > > When hbase internally executes the batch get, it will seek
>>> > >> >>> > > > to column "1"; now, since data is lexicographically
>>> > >> >>> > > > sorted, it does not need to seek from the beginning to get
>>> > >> >>> > > > to "2" - it can continue seeking forward, since column "2"
>>> > >> >>> > > > will always be after column "1". I want to know whether
>>> > >> >>> > > > this is how a multicolumn get on a row works or not.
>>> > >> >>> > > >
>>> > >> >>> > > > Thanks
>>> > >> >>> > > > Varun
>>> > >> >>> > > >
>>> > >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz
>>> > >> >>> > > > <mlortiz@uci.cu> wrote:
>>> > >> >>> > > >
>>> > >> >>> > > > > Like Ishan said, a get gives you an instance of the
>>> > >> >>> > > > > Result class. The utility methods that you can use are:
>>> > >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>>> > >> >>> > > > >  byte[] value()
>>> > >> >>> > > > >  byte[] getRow()
>>> > >> >>> > > > >  int size()
>>> > >> >>> > > > >  boolean isEmpty()
>>> > >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>>> > >> >>> > > > >  List<KeyValue> list()
>>> > >> >>> > > > >
>>> > >> >>> > > > >
>>> > >> >>> > > > >
>>> > >> >>> > > > >
>>> > >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>>> > >> >>> > > > >
>>> > >> >>> > > > >> Based on what I read in Lars' book, a get will return a
>>> > >> >>> > > > >> Result, which is internally a KeyValue[]. This
>>> > >> >>> > > > >> KeyValue[] is sorted by the key, and you access this
>>> > >> >>> > > > >> array using the raw or list methods on the Result
>>> > >> >>> > > > >> object.
>>> > >> >>> > > > >>
>>> > >> >>> > > > >>
>>> > >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma
>>> > >> >>> > > > >> <varun@pinterest.com> wrote:
>>> > >> >>> > > > >>
>>> > >> >>> > > > >>  +user
>>> > >> >>> > > > >>>
>>> > >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma
>>> > >> >>> > > > >>> <varun@pinterest.com> wrote:
>>> > >> >>> > > > >>>
>>> > >> >>> > > > >>>  Hi,
>>> > >> >>> > > > >>>>
>>> > >> >>> > > > >>>> When I do a Get on a row with multiple column
>>> > >> >>> > > > >>>> qualifiers, do we sort the column qualifiers and make
>>> > >> >>> > > > >>>> use of the sorted order when we get the results?
>>> > >> >>> > > > >>>
>>> > >> >>> > > > >>>> Thanks
>>> > >> >>> > > > >>>> Varun
>>> > >> >>> > > > >>>>
>>> > >> >>> > > > >>>>
>>> > >> >>> > > > >>
>>> > >> >>> > > > >>
>>> > >> >>> > > > > --
>>> > >> >>> > > > > Marcos Ortiz Valmaseda,
>>> > >> >>> > > > > Product Manager && Data Scientist at UCI
>>> > >> >>> > > > > Blog: http://marcosluis2186.posterous.com
>>> > >> >>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
>>> > >> >>> > > > >
>>> > >> >>> > > >
>>> > >> >>> > >
>>> > >> >>> >
>>> > >> >>>
>>> > >> >>
>>> > >> >>
>>> > >> >
>>> > >>
>>> > >
>>> >
>>>
>>
>>
>
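
The seek-forward behavior discussed above - one pass over a sorted row serving a sorted batch of requested qualifiers, which is also the idea behind batching the gets in prepareDeleteTimestamps - can be sketched with a plain sorted map. This is illustrative standard-library Java, not HBase's actual scanner code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SeekForwardGet {
    // Serve a sorted batch of requested qualifiers with one forward pass
    // over a sorted row: the cursor only ever advances, never rewinds,
    // because both the stored and the requested qualifiers are sorted.
    static List<String> multiColumnGet(NavigableMap<String, String> row,
                                       List<String> wantedSorted) {
        List<String> found = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = row.entrySet().iterator();
        Map.Entry<String, String> cur = it.hasNext() ? it.next() : null;
        for (String q : wantedSorted) {
            // "Seek forward" to the first stored qualifier >= q.
            while (cur != null && cur.getKey().compareTo(q) < 0) {
                cur = it.hasNext() ? it.next() : null;
            }
            if (cur != null && cur.getKey().equals(q)) {
                found.add(cur.getValue()); // qualifier present in the row
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> row = new TreeMap<>();
        for (String q : new String[] {"a", "c", "e", "g"}) {
            row.put(q, "v-" + q);
        }
        // Only some requested qualifiers exist, as in the 300-column delete
        // case from the thread; missing ones are skipped in stride.
        System.out.println(multiColumnGet(row, List.of("a", "b", "e", "z")));
        // prints [v-a, v-e]
    }
}
```

Contrast this with issuing one independent lookup per qualifier, which restarts at the beginning of the row each time; for rows with tens of thousands of columns, that difference is what the 15-second deleteColumn() calls in the thread come down to.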

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
I think I found a bug with the BulkDeleteEndpoint which is causing me to
lose entire rows even with COLUMN deletes. I filed a JIRA for the same and
can upload a patch.

On Mon, Feb 11, 2013 at 7:36 AM, Varun Sharma <va...@pinterest.com> wrote:

> No,
>
> Endpoint executes with normal QoS but it initiates a scan which seems to
> be execute on High QoS looking at the handlers. Though, I am not totally
> sure, maybe that region server was housing the .META table and those were
> actually scan.next operations for the META table. So I will need to confirm
> this.
>
> Varun
>
>
> On Mon, Feb 11, 2013 at 4:50 AM, Anoop Sam John <an...@huawei.com>wrote:
>
>> You mean the end point is geetting executed with high QoS?  You checked
>> with some logs?
>>
>> -Anoop-
>> ________________________________________
>> From: Varun Sharma [varun@pinterest.com]
>> Sent: Monday, February 11, 2013 4:05 AM
>> To: user@hbase.apache.org; lars hofhansl
>> Subject: Re: Get on a row with multiple columns
>>
>> Back to BulkDeleteEndpoint, i got it to work but why are the
>> scanner.next()
>> calls executing on the Priority handler queue ?
>>
>> Varun
>>
>> On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <la...@apache.org> wrote:
>>
>> > The answer is "probably" :)
>> > It's disabled in 0.96 by default. Check out HBASE-7008 (
>> > https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
>> > there.
>> >
>> > Also check out the discussion in HBASE-5943 and HADOOP-8069 (
>> > https://issues.apache.org/jira/browse/HADOOP-8069)
>> >
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ________________________________
>> >  From: Jean-Marc Spaggiari <je...@spaggiari.org>
>> > To: user@hbase.apache.org
>> > Sent: Saturday, February 9, 2013 5:02 AM
>> > Subject: Re: Get on a row with multiple columns
>> >
>> > Lars, should we always consider disabling Nagle? What's the down side?
>> >
>> > JM
>> >
>> > 2013/2/9, Varun Sharma <va...@pinterest.com>:
>> > > Yeah, I meant true...
>> > >
>> > > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org>
>> wrote:
>> > >
>> > >> Should be set to true. If tcpnodelay is set to true, Nagle's is
>> > disabled.
>> > >>
>> > >> -- Lars
>> > >>
>> > >>
>> > >>
>> > >> ________________________________
>> > >>  From: Varun Sharma <va...@pinterest.com>
>> > >> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > >> Sent: Saturday, February 9, 2013 12:11 AM
>> > >> Subject: Re: Get on a row with multiple columns
>> > >>
>> > >>
>> > >> Okay I did my research - these need to be set to false. I agree.
>> > >>
>> > >>
>> > >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
>> > >> wrote:
>> > >>
>> > >> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and
>> the
>> > >> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these
>> induce
>> > >> network latency ?
>> > >> >
>> > >> >
>> > >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org>
>> > wrote:
>> > >> >
>> > >> >Sorry.. I meant set these two config parameters to true (not false
>> as I
>> > >> state below).
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>
>> > >> >>----- Original Message -----
>> > >> >>From: lars hofhansl <la...@apache.org>
>> > >> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> > >> >>Cc:
>> > >> >>Sent: Friday, February 8, 2013 11:41 PM
>> > >> >>Subject: Re: Get on a row with multiple columns
>> > >> >>
>> > >> >>Only somewhat related. Seeing the magic 40ms random read time
>> there.
>> > >> >> Did
>> > >> you disable Nagle's?
>> > >> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to
>> false in
>> > >> hbase-site.xml).
>> > >> >>
>> > >> >>________________________________
>> > >> >>From: Varun Sharma <va...@pinterest.com>
>> > >> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> > >> >>Sent: Friday, February 8, 2013 10:45 PM
>> > >> >>Subject: Re: Get on a row with multiple columns
>> > >> >>
>> > >> >>The use case is like your twitter feed. Tweets from people u
>> follow.
>> > >> >> When
>> > >> >>someone unfollows, you need to delete a bunch of his tweets from
>> the
>> > >> >>following feed. So, its frequent, and we are essentially running
>> into
>> > >> some
>> > >> >>extreme corner cases like the one above. We need high write
>> throughput
>> > >> for
>> > >> >>this, since when someone tweets, we need to fanout the tweet to all
>> > the
>> > >> >>followers. We need the ability to do fast deletes (unfollow) and
>> fast
>> > >> adds
>> > >> >>(follow) and also be able to do fast random gets - when a real user
>> > >> >> loads
>> > >> >>the feed. I doubt we will able to play much with the schema here
>> since
>> > >> >> we
>> > >> >>need to support a bunch of use cases.
>> > >> >>
>> > >> >>@lars: It does not take 30 seconds to place 300 delete markers. It
>> > >> >> takes
>> > >> 30
>> > >> >>seconds to first find which of those 300 pins are in the set of
>> > columns
>> > >> >>present - this invokes 300 gets and then place the appropriate
>> delete
>> > >> >>markers. Note that we can have tens of thousands of columns in a
>> > single
>> > >> row
>> > >> >>so a single get is not cheap.
>> > >> >>
>> > >> >>If we were to just place delete markers, that is very fast. But
>> when
>> > >> >>started doing that, our random read performance suffered because of
>> > too
>> > >> >>many delete markers. The 90th percentile on random reads shot up
>> from
>> > >> >> 40
>> > >> >>milliseconds to 150 milliseconds, which is not acceptable for our
>> > >> usecase.
>> > >> >>
>> > >> >>Thanks
>> > >> >>Varun
>> > >> >>
>> > >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
>> > >> >> wrote:
>> > >> >>
>> > >> >>> Can you organize your columns and then delete by column family?
>> > >> >>>
>> > >> >>> deleteColumn without specifying a TS is expensive, since HBase
>> first
>> > >> has
>> > >> >>> to figure out what the latest TS is.
>> > >> >>>
>> > >> >>> Should be better in 0.94.1 or later since deletes are batched
>> like
>> > >> >>> Puts
>> > >> >>> (still need to retrieve the latest version, though).
>> > >> >>>
>> > >> >>> In 0.94.3 or later you can also the BulkDeleteEndPoint, which
>> > >> >>> basically
>> > >> >>> let's specify a scan condition and then place specific delete
>> marker
>> > >> for
>> > >> >>> all KVs encountered.
>> > >> >>>
>> > >> >>>
>> > >> >>> If you wanted to get really
>> > >> >>> fancy, you could hook up a coprocessor to the compaction process
>> and
>> > >> >>> simply filter all KVs you no longer want (without ever placing
>> any
>> > >> >>> delete markers).
>> > >> >>>
>> > >> >>>
>> > >> >>> Are you saying it takes 15 seconds to place 300 version delete
>> > >> markers?!
>> > >> >>>
>> > >> >>>
>> > >> >>> -- Lars
>> > >> >>>
>> > >> >>>
>> > >> >>>
>> > >> >>> ________________________________
>> > >> >>>  From: Varun Sharma <va...@pinterest.com>
>> > >> >>> To: user@hbase.apache.org
>> > >> >>> Sent: Friday, February 8, 2013 10:05 PM
>> > >> >>> Subject: Re: Get on a row with multiple columns
>> > >> >>>
>> > >> >>> We are given a set of 300 columns to delete. I tested two cases:
>> > >> >>>
>> > >> >>> 1) deleteColumns() - with the 's'
>> > >> >>>
>> > >> >>> This function simply adds delete markers for 300 columns, in our
>> > >> >>> case,
>> > >> >>> typically only a fraction of these columns are actually present -
>> > 10.
>> > >> After
>> > >> >>> starting to use deleteColumns, we starting seeing a drop in
>> cluster
>> > >> wide
>> > >> >>> random read performance - 90th percentile latency worsened, so
>> did
>> > >> >>> 99th
>> > >> >>> probably because of having to traverse delete markers. I
>> attribute
>> > >> this to
>> > >> >>> profusion of delete markers in the cluster. Major compactions
>> slowed
>> > >> down
>> > >> >>> by almost 50 percent probably because of having to clean out
>> > >> significantly
>> > >> >>> more delete markers.
>> > >> >>>
>> > >> >>> 2) deleteColumn()
>> > >> >>>
>> > >> >>> Ended up with untolerable 15 second calls, which clogged all the
>> > >> handlers.
>> > >> >>> Making the cluster pretty much unresponsive.
>> > >> >>>
>> > >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>> > >> >>>
>> > >> >>> > For the 300 column deletes, can you show us how the Delete(s)
>> are
>> > >> >>> > constructed ?
>> > >> >>> >
>> > >> >>> > Do you use this method ?
>> > >> >>> >
>> > >> >>> >   public Delete deleteColumns(byte [] family, byte []
>> qualifier) {
>> > >> >>> > Thanks
>> > >> >>> >
>> > >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <
>> varun@pinterest.com
>> > >
>> > >> >>> wrote:
>> > >> >>> >
>> > >> >>> > > So a Get call with multiple columns on a single row should be
>> > >> >>> > > much
>> > >> >>> faster
>> > >> >>> > > than independent Get(s) on each of those columns for that
>> row. I
>> > >> >>> > > am
>> > >> >>> > > basically seeing severely poor performance (~ 15 seconds) for
>> > >> certain
>> > >> >>> > > deleteColumn() calls and I am seeing that there is a
>> > >> >>> > > prepareDeleteTimestamps() function in HRegion.java which
>> first
>> > >> tries to
>> > >> >>> > > locate the column by doing individual gets on each column you
>> > >> >>> > > want
>> > >> to
>> > >> >>> > > delete (I am doing 300 column deletes). Now, I think this
>> should
>> > >> ideall
>> > >> >>> > by
>> > >> >>> > > 1 get call with the batch of 300 columns so that one scan can
>> > >> retrieve
>> > >> >>> > the
>> > >> >>> > > columns and the columns that are found, are indeed deleted.
>> > >> >>> > >
>> > >> >>> > > Before I try this fix, I wanted to get an opinion if it will
>> > make
>> > >> >>> > > a
>> > >> >>> > > difference to batch the get() and it seems from your answer,
>> it
>> > >> should.
>> > >> >>> > >
>> > >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <
>> larsh@apache.org
>> > >
>> > >> >>> wrote:
>> > >> >>> > >
>> > >> >>> > > > Everything is stored as a KeyValue in HBase.
>> > >> >>> > > > The Key part of a KeyValue contains the row key, column
>> > family,
>> > >> >>> column
>> > >> >>> > > > name, and timestamp in that order.
>> > >> >>> > > > Each column family has it's own store and store files.
>> > >> >>> > > >
>> > >> >>> > > > So in a nutshell a get is executed by starting a scan at
>> the
>> > >> >>> > > > row
>> > >> key
>> > >> >>> > > > (which is a prefix of the key) in each store (CF) and then
>> > >> scanning
>> > >> >>> > > forward
>> > >> >>> > > > in each store until the next row key is reached. (in
>> reality
>> > it
>> > >> is a
>> > >> >>> > bit
>> > >> >>> > > > more complicated due to multiple versions, skipping
>> columns,
>> > >> >>> > > > etc)
>> > >> >>> > > >
>> > >> >>> > > >
>> > >> >>> > > > -- Lars
>> > >> >>> > > > ________________________________
>> > >> >>> > > > From: Varun Sharma <va...@pinterest.com>
>> > >> >>> > > > To: user@hbase.apache.org
>> > >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> > >> >>> > > > Subject: Re: Get on a row with multiple columns
>> > >> >>> > > >
>> > >> >>> > > > Sorry, I was a little unclear with my question.
>> > >> >>> > > >
>> > >> >>> > > > Lets say you have
>> > >> >>> > > >
>> > >> >>> > > > Get get = new Get(row)
>> > >> >>> > > > get.addColumn("1");
>> > >> >>> > > > get.addColumn("2");
>> > >> >>> > > > .
>> > >> >>> > > > .
>> > >> >>> > > > .
>> > >> >>> > > >
>> > >> >>> > > > When internally hbase executes the batch get, it will seek
>> to
>> > >> column
>> > >> >>> > "1",
>> > >> >>> > > > now since data is lexicographically sorted, it does not
>> need
>> > to
>> > >> seek
>> > >> >>> > from
>> > >> >>> > > > the beginning to get to "2", it can continue seeking,
>> > >> >>> > > > henceforth
>> > >> >>> since
>> > >> >>> > > > column "2" will always be after column "1". I want to know
>> > >> whether
>> > >> >>> this
>> > >> >>> > > is
>> > >> >>> > > > how a multicolumn get on a row works or not.
>> > >> >>> > > >
>> > >> >>> > > > Thanks
>> > >> >>> > > > Varun
>> > >> >>> > > >
>> > >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <
>> mlortiz@uci.cu>
>> > >> wrote:
>> > >> >>> > > >
>> > >> >>> > > > > Like Ishan said, a get give an instance of the Result
>> class.
>> > >> >>> > > > > All utility methods that you can use are:
>> > >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> > >> >>> > > > >  byte[] value()
>> > >> >>> > > > >  byte[] getRow()
>> > >> >>> > > > >  int size()
>> > >> >>> > > > >  boolean isEmpty()
>> > >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is
>> sorted
>> > >> >>> > > > >  List<KeyValue> list()
>> > >> >>> > > > >
>> > >> >>> > > > >
>> > >> >>> > > > >
>> > >> >>> > > > >
>> > >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> > >> >>> > > > >
>> > >> >>> > > > >> Based on what I read in Lars' book, a get will return a
>> > >> result a
>> > >> >>> > > Result,
>> > >> >>> > > > >> which is internally a KeyValue[]. This KeyValue[] is
>> sorted
>> > >> by the
>> > >> >>> > key
>> > >> >>> > > > and
>> > >> >>> > > > >> you access this array using raw or list methods on the
>> > >> >>> > > > >> Result
>> > >> >>> > object.
>> > >> >>> > > > >>
>> > >> >>> > > > >>
>> > >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> > >> varun@pinterest.com
>> > >> >>> >
>> > >> >>> > > > wrote:
>> > >> >>> > > > >>
>> > >> >>> > > > >>  +user
>> > >> >>> > > > >>>
>> > >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>> > >> >>> varun@pinterest.com>
>> > >> >>> > > > >>> wrote:
>> > >> >>> > > > >>>
>> > >> >>> > > > >>>  Hi,
>> > >> >>> > > > >>>>
>> > >> >>> > > > >>>> When I do a Get on a row with multiple column
>> qualifiers.
>> > >> Do we
>> > >> >>> > sort
>> > >> >>> > > > the
>> > >> >>> > > > >>>> column qualifers and make use of the sorted order
>> when we
>> > >> get
>> > >> >>> the
>> > >> >>> > > > >>>>
>> > >> >>> > > > >>> results ?
>> > >> >>> > > > >>>
>> > >> >>> > > > >>>> Thanks
>> > >> >>> > > > >>>> Varun
>> > >> >>> > > > >>>>
>> > >> >>> > > > >>>>
>> > >> >>> > > > >>
>> > >> >>> > > > >>
>> > >> >>> > > > > --
>> > >> >>> > > > > Marcos Ortiz Valmaseda,
>> > >> >>> > > > > Product Manager && Data Scientist at UCI
>> > >> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
>> > >> >>> > > > http://marcosluis2186.posterous.com>
>> > >> >>> > > > > Twitter: @marcosluis2186
>> > >> >>> > > > > <http://twitter.com/**marcosluis2186<
>> > >> >>> > > > http://twitter.com/marcosluis2186>
>> > >> >>> > > > > >
>> > >> >>> > > > >
>> > >> >>> > > >
>> > >> >>> > >
>> > >> >>> >
>> > >> >>>
>> > >> >>
>> > >> >>
>> > >> >
>> > >>
>> > >
>> >
>>
>
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
No,

Endpoint executes with normal QoS but it initiates a scan which seems to be
execute on High QoS looking at the handlers. Though, I am not totally sure,
maybe that region server was housing the .META table and those were
actually scan.next operations for the META table. So I will need to confirm
this.

Varun

On Mon, Feb 11, 2013 at 4:50 AM, Anoop Sam John <an...@huawei.com> wrote:

> You mean the end point is geetting executed with high QoS?  You checked
> with some logs?
>
> -Anoop-
> ________________________________________
> From: Varun Sharma [varun@pinterest.com]
> Sent: Monday, February 11, 2013 4:05 AM
> To: user@hbase.apache.org; lars hofhansl
> Subject: Re: Get on a row with multiple columns
>
> Back to BulkDeleteEndpoint, i got it to work but why are the scanner.next()
> calls executing on the Priority handler queue ?
>
> Varun
>
> On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <la...@apache.org> wrote:
>
> > The answer is "probably" :)
> > It's disabled in 0.96 by default. Check out HBASE-7008 (
> > https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
> > there.
> >
> > Also check out the discussion in HBASE-5943 and HADOOP-8069 (
> > https://issues.apache.org/jira/browse/HADOOP-8069)
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Jean-Marc Spaggiari <je...@spaggiari.org>
> > To: user@hbase.apache.org
> > Sent: Saturday, February 9, 2013 5:02 AM
> > Subject: Re: Get on a row with multiple columns
> >
> > Lars, should we always consider disabling Nagle? What's the down side?
> >
> > JM
> >
> > 2013/2/9, Varun Sharma <va...@pinterest.com>:
> > > Yeah, I meant true...
> > >
> > > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > >> Should be set to true. If tcpnodelay is set to true, Nagle's is
> > disabled.
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ________________________________
> > >>  From: Varun Sharma <va...@pinterest.com>
> > >> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> Sent: Saturday, February 9, 2013 12:11 AM
> > >> Subject: Re: Get on a row with multiple columns
> > >>
> > >>
> > >> Okay I did my research - these need to be set to false. I agree.
> > >>
> > >>
> > >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
> > >> wrote:
> > >>
> > >> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and
> the
> > >> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
> > >> network latency ?
> > >> >
> > >> >
> > >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > >> >
> > >> >Sorry.. I meant set these two config parameters to true (not false
> as I
> > >> state below).
> > >> >>
> > >> >>
> > >> >>
> > >> >>
> > >> >>----- Original Message -----
> > >> >>From: lars hofhansl <la...@apache.org>
> > >> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > >> >>Cc:
> > >> >>Sent: Friday, February 8, 2013 11:41 PM
> > >> >>Subject: Re: Get on a row with multiple columns
> > >> >>
> > >> >>Only somewhat related. Seeing the magic 40ms random read time there.
> > >> >> Did
> > >> you disable Nagle's?
> > >> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false
> in
> > >> hbase-site.xml).
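
Taking Lars's correction above into account (both parameters should be set to
true, which is what disables Nagle's algorithm), the hbase-site.xml fragment
would look roughly like the sketch below; the property names are taken from
this thread, so double-check them against your HBase version:

```xml
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```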
> > >> >>
> > >> >>________________________________
> > >> >>From: Varun Sharma <va...@pinterest.com>
> > >> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> > >> >>Sent: Friday, February 8, 2013 10:45 PM
> > >> >>Subject: Re: Get on a row with multiple columns
> > >> >>
> > >> >>The use case is like your twitter feed. Tweets from people you follow.
> > >> >> When
> > >> >>someone unfollows, you need to delete a bunch of his tweets from the
> > >> >>following feed. So it's frequent, and we are essentially running
> into
> > >> some
> > >> >>extreme corner cases like the one above. We need high write
> throughput
> > >> for
> > >> >>this, since when someone tweets, we need to fanout the tweet to all
> > the
> > >> >>followers. We need the ability to do fast deletes (unfollow) and
> fast
> > >> adds
> > >> >>(follow) and also be able to do fast random gets - when a real user
> > >> >> loads
> > >> >>the feed. I doubt we will be able to play much with the schema here
> since
> > >> >> we
> > >> >>need to support a bunch of use cases.
> > >> >>
> > >> >>@lars: It does not take 30 seconds to place 300 delete markers. It
> > >> >> takes
> > >> 30
> > >> >>seconds to first find which of those 300 pins are in the set of
> > columns
> > >> >>present - this invokes 300 gets and then places the appropriate
> delete
> > >> >>markers. Note that we can have tens of thousands of columns in a
> > single
> > >> row
> > >> >>so a single get is not cheap.
> > >> >>
> > >> >>If we were to just place delete markers, that is very fast. But when
> > >> >>we started doing that, our random read performance suffered because of
> > too
> > >> >>many delete markers. The 90th percentile on random reads shot up
> from
> > >> >> 40
> > >> >>milliseconds to 150 milliseconds, which is not acceptable for our
> > >> usecase.
> > >> >>
> > >> >>Thanks
> > >> >>Varun
> > >> >>
> > >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
> > >> >> wrote:
> > >> >>
> > >> >>> Can you organize your columns and then delete by column family?
> > >> >>>
> > >> >>> deleteColumn without specifying a TS is expensive, since HBase
> first
> > >> has
> > >> >>> to figure out what the latest TS is.
> > >> >>>
> > >> >>> Should be better in 0.94.1 or later since deletes are batched like
> > >> >>> Puts
> > >> >>> (still need to retrieve the latest version, though).
> > >> >>>
> > >> >>> In 0.94.3 or later you can also use the BulkDeleteEndpoint, which
> > >> >>> basically
> > >> >>> lets you specify a scan condition and then place a specific delete
> marker
> > >> for
> > >> >>> all KVs encountered.
> > >> >>>
> > >> >>>
> > >> >>> If you wanted to get really
> > >> >>> fancy, you could hook up a coprocessor to the compaction process
> and
> > >> >>> simply filter all KVs you no longer want (without ever placing any
> > >> >>> delete markers).
> > >> >>>
> > >> >>>
> > >> >>> Are you saying it takes 15 seconds to place 300 version delete
> > >> markers?!
> > >> >>>
> > >> >>>
> > >> >>> -- Lars
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> ________________________________
> > >> >>>  From: Varun Sharma <va...@pinterest.com>
> > >> >>> To: user@hbase.apache.org
> > >> >>> Sent: Friday, February 8, 2013 10:05 PM
> > >> >>> Subject: Re: Get on a row with multiple columns
> > >> >>>
> > >> >>> We are given a set of 300 columns to delete. I tested two cases:
> > >> >>>
> > >> >>> 1) deleteColumns() - with the 's'
> > >> >>>
> > >> >>> This function simply adds delete markers for 300 columns; in our
> > >> >>> case,
> > >> >>> typically only a fraction of these columns are actually present -
> > 10.
> > >> After
> > >> >>> starting to use deleteColumns, we started seeing a drop in
> cluster
> > >> wide
> > >> >>> random read performance - 90th percentile latency worsened, so did
> > >> >>> 99th
> > >> >>> probably because of having to traverse delete markers. I attribute
> > >> this to
> > >> >>> a profusion of delete markers in the cluster. Major compactions
> slowed
> > >> down
> > >> >>> by almost 50 percent probably because of having to clean out
> > >> significantly
> > >> >>> more delete markers.
> > >> >>>
> > >> >>> 2) deleteColumn()
> > >> >>>
> > >> >>> Ended up with intolerable 15-second calls, which clogged all the
> > >> handlers,
> > >> >>> making the cluster pretty much unresponsive.
> > >> >>>
> > >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com>
> wrote:
> > >> >>>
> > >> >>> > For the 300 column deletes, can you show us how the Delete(s)
> are
> > >> >>> > constructed ?
> > >> >>> >
> > >> >>> > Do you use this method ?
> > >> >>> >
> > >> >>> >   public Delete deleteColumns(byte [] family, byte []
> qualifier) {
> > >> >>> > Thanks
> > >> >>> >
> > >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <
> varun@pinterest.com
> > >
> > >> >>> wrote:
> > >> >>> >
> > >> >>> > > So a Get call with multiple columns on a single row should be
> > >> >>> > > much
> > >> >>> faster
> > >> >>> > > than independent Get(s) on each of those columns for that
> row. I
> > >> >>> > > am
> > >> >>> > > basically seeing severely poor performance (~ 15 seconds) for
> > >> certain
> > >> >>> > > deleteColumn() calls and I am seeing that there is a
> > >> >>> > > prepareDeleteTimestamps() function in HRegion.java which first
> > >> tries to
> > >> >>> > > locate the column by doing individual gets on each column you
> > >> >>> > > want
> > >> to
> > >> >>> > > delete (I am doing 300 column deletes). Now, I think this
> should
> > >> ideally
> > >> >>> > be
> > >> >>> > > 1 get call with the batch of 300 columns so that one scan can
> > >> retrieve
> > >> >>> > the
> > >> >>> > > columns and the columns that are found, are indeed deleted.
> > >> >>> > >
> > >> >>> > > Before I try this fix, I wanted to get an opinion if it will
> > make
> > >> >>> > > a
> > >> >>> > > difference to batch the get() and it seems from your answer,
> it
> > >> should.
> > >> >>> > >
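
The batched lookup Varun proposes above (one sorted forward pass over the row
instead of one independent get per qualifier) can be sketched with plain
sorted collections. This is a toy simulation of the idea, not the actual
prepareDeleteTimestamps() code in HRegion.java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model: the row's stored columns and the requested qualifiers are
// both sorted, so a single merge-style pass finds every requested column
// that exists, instead of one lookup per qualifier.
public class BatchedLookupDemo {
    static List<String> findExisting(SortedSet<String> rowColumns,
                                     SortedSet<String> requested) {
        List<String> found = new ArrayList<>();
        Iterator<String> cols = rowColumns.iterator();
        String col = cols.hasNext() ? cols.next() : null;
        for (String q : requested) {
            // Only seek forward: sorted order means we never rewind.
            while (col != null && col.compareTo(q) < 0) {
                col = cols.hasNext() ? cols.next() : null;
            }
            if (q.equals(col)) {
                found.add(q);
            }
        }
        return found;
    }

    public static String demo() {
        SortedSet<String> row = new TreeSet<>(Arrays.asList("a", "c", "d", "f"));
        SortedSet<String> wanted = new TreeSet<>(Arrays.asList("b", "c", "f"));
        return findExisting(row, wanted).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());  // [c, f]
    }
}
```

With 300 requested qualifiers against tens of thousands of stored columns,
the single pass touches each stored column at most once, which is the saving
Varun is after.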
> > >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <
> larsh@apache.org
> > >
> > >> >>> wrote:
> > >> >>> > >
> > >> >>> > > > Everything is stored as a KeyValue in HBase.
> > >> >>> > > > The Key part of a KeyValue contains the row key, column
> > family,
> > >> >>> column
> > >> >>> > > > name, and timestamp in that order.
> > >> >>> > > > Each column family has its own store and store files.
> > >> >>> > > >
> > >> >>> > > > So in a nutshell a get is executed by starting a scan at the
> > >> >>> > > > row
> > >> key
> > >> >>> > > > (which is a prefix of the key) in each store (CF) and then
> > >> scanning
> > >> >>> > > forward
> > >> >>> > > > in each store until the next row key is reached. (in reality
> > it
> > >> is a
> > >> >>> > bit
> > >> >>> > > > more complicated due to multiple versions, skipping columns,
> > >> >>> > > > etc)
> > >> >>> > > >
> > >> >>> > > >
> > >> >>> > > > -- Lars
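
Lars's description above (keys sort by row key, column family, qualifier,
timestamp; a get is a scan from the row-key prefix until the next row) can be
illustrated with a small stand-alone sketch. This is a toy model using
made-up "row/family/qualifier" strings, not HBase's actual KeyValue or
scanner code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model of HBase key ordering: composite keys sort lexicographically,
// so a Get is a forward scan from the row's prefix until the next row.
public class KeyOrderDemo {
    static final NavigableSet<String> STORE = new TreeSet<>(Arrays.asList(
        "row1/cf/a", "row1/cf/b", "row2/cf/a", "row2/cf/c", "row3/cf/b"));

    static List<String> get(String row) {
        List<String> result = new ArrayList<>();
        // tailSet positions the cursor at the row's first possible key.
        for (String key : STORE.tailSet(row + "/")) {
            if (!key.startsWith(row + "/")) {
                break;  // reached the next row key, stop scanning
            }
            result.add(key);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(get("row2"));  // [row2/cf/a, row2/cf/c]
    }
}
```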
> > >> >>> > > > ________________________________
> > >> >>> > > > From: Varun Sharma <va...@pinterest.com>
> > >> >>> > > > To: user@hbase.apache.org
> > >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
> > >> >>> > > > Subject: Re: Get on a row with multiple columns
> > >> >>> > > >
> > >> >>> > > > Sorry, I was a little unclear with my question.
> > >> >>> > > >
> > >> >>> > > > Lets say you have
> > >> >>> > > >
> > >> >>> > > > Get get = new Get(row)
> > >> >>> > > > get.addColumn("1");
> > >> >>> > > > get.addColumn("2");
> > >> >>> > > > .
> > >> >>> > > > .
> > >> >>> > > > .
> > >> >>> > > >
> > >> >>> > > > When internally hbase executes the batch get, it will seek
> to
> > >> column
> > >> >>> > "1",
> > >> >>> > > > now since data is lexicographically sorted, it does not need
> > to
> > >> seek
> > >> >>> > from
> > >> >>> > > > the beginning to get to "2", it can continue seeking,
> > >> >>> > > > henceforth
> > >> >>> since
> > >> >>> > > > column "2" will always be after column "1". I want to know
> > >> whether
> > >> >>> this
> > >> >>> > > is
> > >> >>> > > > how a multicolumn get on a row works or not.
> > >> >>> > > >
> > >> >>> > > > Thanks
> > >> >>> > > > Varun
> > >> >>> > > >
> > >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <
> mlortiz@uci.cu>
> > >> wrote:
> > >> >>> > > >
> > >> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
> class.
> > >> >>> > > > > All utility methods that you can use are:
> > >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > >> >>> > > > >  byte[] value()
> > >> >>> > > > >  byte[] getRow()
> > >> >>> > > > >  int size()
> > >> >>> > > > >  boolean isEmpty()
> > >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is
> sorted
> > >> >>> > > > >  List<KeyValue> list()
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > >
> > >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > >> >>> > > > >
> > >> >>> > > > >> Based on what I read in Lars' book, a get will return a
> > >> >>> > > Result,
> > >> >>> > > > >> which is internally a KeyValue[]. This KeyValue[] is
> sorted
> > >> by the
> > >> >>> > key
> > >> >>> > > > and
> > >> >>> > > > >> you access this array using raw or list methods on the
> > >> >>> > > > >> Result
> > >> >>> > object.
> > >> >>> > > > >>
> > >> >>> > > > >>
> > >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> > >> varun@pinterest.com
> > >> >>> >
> > >> >>> > > > wrote:
> > >> >>> > > > >>
> > >> >>> > > > >>  +user
> > >> >>> > > > >>>
> > >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> > >> >>> varun@pinterest.com>
> > >> >>> > > > >>> wrote:
> > >> >>> > > > >>>
> > >> >>> > > > >>>  Hi,
> > >> >>> > > > >>>>
> > >> >>> > > > >>>> When I do a Get on a row with multiple column
> qualifiers.
> > >> Do we
> > >> >>> > sort
> > >> >>> > > > the
> > >> >>> > > > >>>> column qualifiers and make use of the sorted order when
> we
> > >> get
> > >> >>> the
> > >> >>> > > > >>>>
> > >> >>> > > > >>> results ?
> > >> >>> > > > >>>
> > >> >>> > > > >>>> Thanks
> > >> >>> > > > >>>> Varun
> > >> >>> > > > >>>>
> > >> >>> > > > >>>>
> > >> >>> > > > >>
> > >> >>> > > > >>
> > >> >>> > > > > --
> > >> >>> > > > > Marcos Ortiz Valmaseda,
> > >> >>> > > > > Product Manager && Data Scientist at UCI
> > >> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
> > >> >>> > > > http://marcosluis2186.posterous.com>
> > >> >>> > > > > Twitter: @marcosluis2186
> > >> >>> > > > > <http://twitter.com/**marcosluis2186<
> > >> >>> > > > http://twitter.com/marcosluis2186>
> > >> >>> > > > > >
> > >> >>> > > > >
> > >> >>> > > >
> > >> >>> > >
> > >> >>> >
> > >> >>>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >
> >
>

RE: Get on a row with multiple columns

Posted by Anoop Sam John <an...@huawei.com>.
You mean the end point is getting executed with high QoS?  You checked with some logs?

-Anoop-

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
Back to BulkDeleteEndpoint, I got it to work but why are the scanner.next()
calls executing on the Priority handler queue ?

Varun

On Sat, Feb 9, 2013 at 8:46 AM, lars hofhansl <la...@apache.org> wrote:

> The answer is "probably" :)
> It's disabled in 0.96 by default. Check out HBASE-7008 (
> https://issues.apache.org/jira/browse/HBASE-7008) and the discussion
> there.
>
> Also check out the discussion in HBASE-5943 and HADOOP-8069 (
> https://issues.apache.org/jira/browse/HADOOP-8069)
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Jean-Marc Spaggiari <je...@spaggiari.org>
> To: user@hbase.apache.org
> Sent: Saturday, February 9, 2013 5:02 AM
> Subject: Re: Get on a row with multiple columns
>
> Lars, should we always consider disabling Nagle? What's the down side?
>
> JM
>
> 2013/2/9, Varun Sharma <va...@pinterest.com>:
> > Yeah, I meant true...
> >
> > On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org> wrote:
> >
> >> Should be set to true. If tcpnodelay is set to true, Nagle's is
> disabled.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >>  From: Varun Sharma <va...@pinterest.com>
> >> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> >> Sent: Saturday, February 9, 2013 12:11 AM
> >> Subject: Re: Get on a row with multiple columns
> >>
> >>
> >> Okay I did my research - these need to be set to false. I agree.
> >>
> >>
> >> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
> >> wrote:
> >>
> >> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
> >> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
> >> network latency ?
> >> >
> >> >
> >> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org>
> wrote:
> >> >
> >> >Sorry.. I meant set these two config parameters to true (not false as I
> >> state below).
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>----- Original Message -----
> >> >>From: lars hofhansl <la...@apache.org>
> >> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >> >>Cc:
> >> >>Sent: Friday, February 8, 2013 11:41 PM
> >> >>Subject: Re: Get on a row with multiple columns
> >> >>
> >> >>Only somewhat related. Seeing the magic 40ms random read time there.
> >> >> Did
> >> you disable Nagle's?
> >> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
> >> hbase-site.xml).
> >> >>
> >> >>________________________________
> >> >>From: Varun Sharma <va...@pinterest.com>
> >> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> >> >>Sent: Friday, February 8, 2013 10:45 PM
> >> >>Subject: Re: Get on a row with multiple columns
> >> >>
> >> >>The use case is like your twitter feed. Tweets from people u follow.
> >> >> When
> >> >>someone unfollows, you need to delete a bunch of his tweets from the
> >> >>following feed. So, its frequent, and we are essentially running into
> >> some
> >> >>extreme corner cases like the one above. We need high write throughput
> >> for
> >> >>this, since when someone tweets, we need to fanout the tweet to all
> the
> >> >>followers. We need the ability to do fast deletes (unfollow) and fast
> >> adds
> >> >>(follow) and also be able to do fast random gets - when a real user
> >> >> loads
> >> >>the feed. I doubt we will able to play much with the schema here since
> >> >> we
> >> >>need to support a bunch of use cases.
> >> >>
> >> >>@lars: It does not take 30 seconds to place 300 delete markers. It
> >> >> takes
> >> 30
> >> >>seconds to first find which of those 300 pins are in the set of
> columns
> >> >>present - this invokes 300 gets and then place the appropriate delete
> >> >>markers. Note that we can have tens of thousands of columns in a
> single
> >> row
> >> >>so a single get is not cheap.
> >> >>
> >> >>If we were to just place delete markers, that is very fast. But when
> >> >>started doing that, our random read performance suffered because of
> too
> >> >>many delete markers. The 90th percentile on random reads shot up from
> >> >> 40
> >> >>milliseconds to 150 milliseconds, which is not acceptable for our
> >> usecase.
> >> >>
> >> >>Thanks
> >> >>Varun
> >> >>
> >> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
> >> >> wrote:
> >> >>
> >> >>> Can you organize your columns and then delete by column family?
> >> >>>
> >> >>> deleteColumn without specifying a TS is expensive, since HBase first
> >> has
> >> >>> to figure out what the latest TS is.
> >> >>>
> >> >>> Should be better in 0.94.1 or later since deletes are batched like
> >> >>> Puts
> >> >>> (still need to retrieve the latest version, though).
> >> >>>
> >> >>> In 0.94.3 or later you can also the BulkDeleteEndPoint, which
> >> >>> basically
> >> >>> let's specify a scan condition and then place specific delete marker
> >> for
> >> >>> all KVs encountered.
> >> >>>
> >> >>>
> >> >>> If you wanted to get really
> >> >>> fancy, you could hook up a coprocessor to the compaction process and
> >> >>> simply filter all KVs you no longer want (without ever placing any
> >> >>> delete markers).
> >> >>>
> >> >>>
> >> >>> Are you saying it takes 15 seconds to place 300 version delete
> >> markers?!
> >> >>>
> >> >>>
> >> >>> -- Lars
> >> >>>
> >> >>>
> >> >>>
> >> >>> ________________________________
> >> >>>  From: Varun Sharma <va...@pinterest.com>
> >> >>> To: user@hbase.apache.org
> >> >>> Sent: Friday, February 8, 2013 10:05 PM
> >> >>> Subject: Re: Get on a row with multiple columns
> >> >>>
> >> >>> We are given a set of 300 columns to delete. I tested two cases:
> >> >>>
> >> >>> 1) deleteColumns() - with the 's'
> >> >>>
> >> >>> This function simply adds delete markers for 300 columns, in our
> >> >>> case,
> >> >>> typically only a fraction of these columns are actually present -
> 10.
> >> After
> >> >>> starting to use deleteColumns, we starting seeing a drop in cluster
> >> wide
> >> >>> random read performance - 90th percentile latency worsened, so did
> >> >>> 99th
> >> >>> probably because of having to traverse delete markers. I attribute
> >> this to
> >> >>> profusion of delete markers in the cluster. Major compactions slowed
> >> down
> >> >>> by almost 50 percent probably because of having to clean out
> >> significantly
> >> >>> more delete markers.
> >> >>>
> >> >>> 2) deleteColumn()
> >> >>>
> >> >>> Ended up with untolerable 15 second calls, which clogged all the
> >> handlers.
> >> >>> Making the cluster pretty much unresponsive.
> >> >>>
> >> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >> >>>
> >> >>> > For the 300 column deletes, can you show us how the Delete(s) are
> >> >>> > constructed ?
> >> >>> >
> >> >>> > Do you use this method ?
> >> >>> >
> >> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> >> >>> > Thanks
> >> >>> >
> >> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <varun@pinterest.com
> >
> >> >>> wrote:
> >> >>> >
> >> >>> > > So a Get call with multiple columns on a single row should be
> >> >>> > > much faster than independent Get(s) on each of those columns for
> >> >>> > > that row. I am basically seeing severely poor performance (~15
> >> >>> > > seconds) for certain deleteColumn() calls, and I see that there is
> >> >>> > > a prepareDeleteTimestamps() function in HRegion.java which first
> >> >>> > > tries to locate the column by doing individual gets on each column
> >> >>> > > you want to delete (I am doing 300 column deletes). Now, I think
> >> >>> > > this should ideally be one get call with the batch of 300 columns,
> >> >>> > > so that one scan can retrieve the columns, and the columns that
> >> >>> > > are found are indeed deleted.
> >> >>> > >
> >> >>> > > Before I try this fix, I wanted to get an opinion on whether it
> >> >>> > > will make a difference to batch the get(), and it seems from your
> >> >>> > > answer that it should.
> >> >>> > >
> >> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <larsh@apache.org
> >
> >> >>> wrote:
> >> >>> > >
> >> >>> > > > Everything is stored as a KeyValue in HBase.
> >> >>> > > > The Key part of a KeyValue contains the row key, column family,
> >> >>> > > > column name, and timestamp in that order.
> >> >>> > > > Each column family has its own store and store files.
> >> >>> > > >
> >> >>> > > > So in a nutshell, a get is executed by starting a scan at the
> >> >>> > > > row key (which is a prefix of the key) in each store (CF) and
> >> >>> > > > then scanning forward in each store until the next row key is
> >> >>> > > > reached. (In reality it is a bit more complicated due to
> >> >>> > > > multiple versions, skipping columns, etc.)
> >> >>> > > >
> >> >>> > > >
> >> >>> > > > -- Lars
> >> >>> > > > ________________________________
> >> >>> > > > From: Varun Sharma <va...@pinterest.com>
> >> >>> > > > To: user@hbase.apache.org
> >> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
> >> >>> > > > Subject: Re: Get on a row with multiple columns
> >> >>> > > >
> >> >>> > > > Sorry, I was a little unclear with my question.
> >> >>> > > >
> >> >>> > > > Lets say you have
> >> >>> > > >
> >> >>> > > > Get get = new Get(row)
> >> >>> > > > get.addColumn("1");
> >> >>> > > > get.addColumn("2");
> >> >>> > > > .
> >> >>> > > > .
> >> >>> > > > .
> >> >>> > > >
> >> >>> > > > When HBase internally executes the batch get, it will seek to
> >> >>> > > > column "1". Now, since data is lexicographically sorted, it does
> >> >>> > > > not need to seek from the beginning to get to "2"; it can
> >> >>> > > > continue seeking forward, since column "2" will always be after
> >> >>> > > > column "1". I want to know whether this is how a multi-column
> >> >>> > > > get on a row works or not.
> >> >>> > > >
> >> >>> > > > Thanks
> >> >>> > > > Varun
> >> >>> > > >
> >> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
> >> wrote:
> >> >>> > > >
> >> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
> >> >>> > > > > All utility methods that you can use are:
> >> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> >> >>> > > > >  byte[] value()
> >> >>> > > > >  byte[] getRow()
> >> >>> > > > >  int size()
> >> >>> > > > >  boolean isEmpty()
> >> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> >> >>> > > > >  List<KeyValue> list()
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> >> >>> > > > >
> >> >>> > > > >> Based on what I read in Lars' book, a get will return a
> >> >>> > > > >> Result, which is internally a KeyValue[]. This KeyValue[] is
> >> >>> > > > >> sorted by the key, and you access this array using the raw or
> >> >>> > > > >> list methods on the Result object.
> >> >>> > > > >>
> >> >>> > > > >>
> >> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> >> varun@pinterest.com
> >> >>> >
> >> >>> > > > wrote:
> >> >>> > > > >>
> >> >>> > > > >>  +user
> >> >>> > > > >>>
> >> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> >> >>> varun@pinterest.com>
> >> >>> > > > >>> wrote:
> >> >>> > > > >>>
> >> >>> > > > >>>  Hi,
> >> >>> > > > >>>>
> >> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers,
> >> >>> > > > >>>> do we sort the column qualifiers and make use of the sorted
> >> >>> > > > >>>> order when we get the results?
> >> >>> > > > >>>
> >> >>> > > > >>>> Thanks
> >> >>> > > > >>>> Varun
> >> >>> > > > >>>>
> >> >>> > > > >>>>
> >> >>> > > > >>
> >> >>> > > > >>
> >> >>> > > > > --
> >> >>> > > > > Marcos Ortiz Valmaseda,
> >> >>> > > > > Product Manager && Data Scientist at UCI
> >> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
> >> >>> > > > http://marcosluis2186.posterous.com>
> >> >>> > > > > Twitter: @marcosluis2186
> >> >>> > > > > <http://twitter.com/**marcosluis2186<
> >> >>> > > > http://twitter.com/marcosluis2186>
> >> >>> > > > > >
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >>
> >
>
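The seek-forward behavior debated above (one sorted pass over a row's cells matching a sorted list of requested qualifiers, instead of one lookup per qualifier) can be sketched with plain Java collections. This is an illustrative toy model only, not HBase's scanner implementation; the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortedColumnGet {

    // One forward pass over the row's sorted cells matches every requested
    // (also sorted) qualifier; the scan never restarts from the row start
    // for each column, which is the point made in the thread above.
    public static List<String> multiColumnGet(SortedSet<String> rowCells,
                                              SortedSet<String> requested) {
        List<String> found = new ArrayList<>();
        Iterator<String> cells = rowCells.iterator();
        Iterator<String> wanted = requested.iterator();
        if (!cells.hasNext() || !wanted.hasNext()) {
            return found;
        }
        String cell = cells.next();
        String want = wanted.next();
        while (true) {
            int cmp = cell.compareTo(want);
            if (cmp == 0) {              // requested qualifier is present
                found.add(cell);
                if (!cells.hasNext() || !wanted.hasNext()) break;
                cell = cells.next();
                want = wanted.next();
            } else if (cmp < 0) {        // scan forward past smaller cells
                if (!cells.hasNext()) break;
                cell = cells.next();
            } else {                     // requested qualifier is absent
                if (!wanted.hasNext()) break;
                want = wanted.next();
            }
        }
        return found;
    }

    public static void main(String[] args) {
        SortedSet<String> row = new TreeSet<>(Arrays.asList("a", "c", "d", "f"));
        SortedSet<String> req = new TreeSet<>(Arrays.asList("b", "c", "f"));
        System.out.println(multiColumnGet(row, req)); // prints [c, f]
    }
}
```

In real HBase the analogue is issuing a single Get with several addColumn calls rather than several independent Gets, which is the batching Varun proposes for prepareDeleteTimestamps().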

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
The answer is "probably" :)
It's disabled in 0.96 by default. Check out HBASE-7008 (https://issues.apache.org/jira/browse/HBASE-7008) and the discussion there.

Also check out the discussion in HBASE-5943 and HADOOP-8069 (https://issues.apache.org/jira/browse/HADOOP-8069)
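For readers who want to apply this: disabling Nagle's algorithm means setting the tcpnodelay properties to true in hbase-site.xml. The fragment below is a sketch using the property names quoted in this thread (0.94-era); confirm the exact names against your HBase version's documentation:

```xml
<!-- hbase-site.xml: disable Nagle's algorithm on client and server RPC.
     Property names as quoted in this thread; verify for your version. -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```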


-- Lars



________________________________
 From: Jean-Marc Spaggiari <je...@spaggiari.org>
To: user@hbase.apache.org 
Sent: Saturday, February 9, 2013 5:02 AM
Subject: Re: Get on a row with multiple columns
 
Lars, should we always consider disabling Nagle? What's the down side?

JM

2013/2/9, Varun Sharma <va...@pinterest.com>:
> Yeah, I meant true...
>
> On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org> wrote:
>
>> Should be set to true. If tcpnodelay is set to true, Nagle's is disabled.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Sent: Saturday, February 9, 2013 12:11 AM
>> Subject: Re: Get on a row with multiple columns
>>
>>
>> Okay I did my research - these need to be set to false. I agree.
>>
>>
>> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
>> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
>> network latency ?
>> >
>> >
>> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >Sorry.. I meant set these two config parameters to true (not false as I
>> state below).
>> >>
>> >>
>> >>
>> >>
>> >>----- Original Message -----
>> >>From: lars hofhansl <la...@apache.org>
>> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> >>Cc:
>> >>Sent: Friday, February 8, 2013 11:41 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>Only somewhat related. Seeing the magic 40ms random read time there.
>> >> Did
>> you disable Nagle's?
>> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
>> hbase-site.xml).
>> >>
>> >>________________________________
>> >>From: Varun Sharma <va...@pinterest.com>
>> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> >>Sent: Friday, February 8, 2013 10:45 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>The use case is like your Twitter feed: tweets from people you follow.
>> >>When someone unfollows, you need to delete a bunch of their tweets from
>> >>the following feed. So it's frequent, and we are essentially running into
>> >>extreme corner cases like the one above. We need high write throughput
>> >>for this, since when someone tweets, we need to fan out the tweet to all
>> >>the followers. We need the ability to do fast deletes (unfollow) and fast
>> >>adds (follow) and also to be able to do fast random gets - when a real
>> >>user loads the feed. I doubt we will be able to play much with the schema
>> >>here since we need to support a bunch of use cases.
>> >>
>> >>@lars: It does not take 30 seconds to place 300 delete markers. It takes
>> >>30 seconds to first find which of those 300 pins are in the set of
>> >>columns present - this invokes 300 gets - and then place the appropriate
>> >>delete markers. Note that we can have tens of thousands of columns in a
>> >>single row, so a single get is not cheap.
>> >>
>> >>If we were to just place delete markers, that would be very fast. But
>> >>when we started doing that, our random read performance suffered because
>> >>of too many delete markers. The 90th percentile on random reads shot up
>> >>from 40 milliseconds to 150 milliseconds, which is not acceptable for our
>> >>use case.
>> >>
>> >>Thanks
>> >>Varun
>> >>
>> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
>> >> wrote:
>> >>
>> >>> Can you organize your columns and then delete by column family?
>> >>>
>> >>> deleteColumn without specifying a TS is expensive, since HBase first
>> >>> has to figure out what the latest TS is.
>> >>>
>> >>> Should be better in 0.94.1 or later since deletes are batched like
>> >>> Puts
>> >>> (still need to retrieve the latest version, though).
>> >>>
>> >>> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which
>> >>> basically lets you specify a scan condition and then places a specific
>> >>> delete marker for all KVs encountered.
>> >>>
>> >>>
>> >>> If you wanted to get really
>> >>> fancy, you could hook up a coprocessor to the compaction process and
>> >>> simply filter all KVs you no longer want (without ever placing any
>> >>> delete markers).
>> >>>
>> >>>
>> >>> Are you saying it takes 15 seconds to place 300 version delete
>> >>> markers?!
>> >>>
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>>  From: Varun Sharma <va...@pinterest.com>
>> >>> To: user@hbase.apache.org
>> >>> Sent: Friday, February 8, 2013 10:05 PM
>> >>> Subject: Re: Get on a row with multiple columns
>> >>>
>> >>> We are given a set of 300 columns to delete. I tested two cases:
>> >>>
>> >>> 1) deleteColumns() - with the 's'
>> >>>
>> >>> This function simply adds delete markers for 300 columns; in our case,
>> >>> typically only a fraction of these columns are actually present - 10.
>> >>> After starting to use deleteColumns, we started seeing a drop in
>> >>> cluster-wide random read performance - 90th percentile latency
>> >>> worsened, as did 99th, probably because of having to traverse delete
>> >>> markers. I attribute this to the profusion of delete markers in the
>> >>> cluster. Major compactions slowed down by almost 50 percent, probably
>> >>> because of having to clean out significantly more delete markers.
>> >>>
>> >>> 2) deleteColumn()
>> >>>
>> >>> Ended up with intolerable 15-second calls, which clogged all the
>> >>> handlers, making the cluster pretty much unresponsive.
>> >>>
>> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>>
>> >>> > For the 300 column deletes, can you show us how the Delete(s) are
>> >>> > constructed ?
>> >>> >
>> >>> > Do you use this method ?
>> >>> >
>> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> >>> > Thanks
>> >>> >
>> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>> >>> wrote:
>> >>> >
>> >>> > > So a Get call with multiple columns on a single row should be much
>> >>> > > faster than independent Get(s) on each of those columns for that
>> >>> > > row. I am basically seeing severely poor performance (~15 seconds)
>> >>> > > for certain deleteColumn() calls, and I see that there is a
>> >>> > > prepareDeleteTimestamps() function in HRegion.java which first
>> >>> > > tries to locate the column by doing individual gets on each column
>> >>> > > you want to delete (I am doing 300 column deletes). Now, I think
>> >>> > > this should ideally be one get call with the batch of 300 columns,
>> >>> > > so that one scan can retrieve the columns, and the columns that are
>> >>> > > found are indeed deleted.
>> >>> > >
>> >>> > > Before I try this fix, I wanted to get an opinion on whether it
>> >>> > > will make a difference to batch the get(), and it seems from your
>> >>> > > answer that it should.
>> >>> > >
>> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>> >>> wrote:
>> >>> > >
>> >>> > > > Everything is stored as a KeyValue in HBase.
>> >>> > > > The Key part of a KeyValue contains the row key, column family,
>> >>> > > > column name, and timestamp in that order.
>> >>> > > > Each column family has its own store and store files.
>> >>> > > >
>> >>> > > > So in a nutshell, a get is executed by starting a scan at the row
>> >>> > > > key (which is a prefix of the key) in each store (CF) and then
>> >>> > > > scanning forward in each store until the next row key is reached.
>> >>> > > > (In reality it is a bit more complicated due to multiple
>> >>> > > > versions, skipping columns, etc.)
>> >>> > > >
>> >>> > > >
>> >>> > > > -- Lars
>> >>> > > > ________________________________
>> >>> > > > From: Varun Sharma <va...@pinterest.com>
>> >>> > > > To: user@hbase.apache.org
>> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> >>> > > > Subject: Re: Get on a row with multiple columns
>> >>> > > >
>> >>> > > > Sorry, I was a little unclear with my question.
>> >>> > > >
>> >>> > > > Lets say you have
>> >>> > > >
>> >>> > > > Get get = new Get(row)
>> >>> > > > get.addColumn("1");
>> >>> > > > get.addColumn("2");
>> >>> > > > .
>> >>> > > > .
>> >>> > > > .
>> >>> > > >
>> >>> > > > When HBase internally executes the batch get, it will seek to
>> >>> > > > column "1". Now, since data is lexicographically sorted, it does
>> >>> > > > not need to seek from the beginning to get to "2"; it can
>> >>> > > > continue seeking forward, since column "2" will always be after
>> >>> > > > column "1". I want to know whether this is how a multi-column get
>> >>> > > > on a row works or not.
>> >>> > > >
>> >>> > > > Thanks
>> >>> > > > Varun
>> >>> > > >
>> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
>> wrote:
>> >>> > > >
>> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
>> >>> > > > > All utility methods that you can use are:
>> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> >>> > > > >  byte[] value()
>> >>> > > > >  byte[] getRow()
>> >>> > > > >  int size()
>> >>> > > > >  boolean isEmpty()
>> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> >>> > > > >  List<KeyValue> list()
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> >>> > > > >
>> >>> > > > >> Based on what I read in Lars' book, a get will return a
>> >>> > > > >> Result, which is internally a KeyValue[]. This KeyValue[] is
>> >>> > > > >> sorted by the key, and you access this array using the raw or
>> >>> > > > >> list methods on the Result object.
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> varun@pinterest.com
>> >>> >
>> >>> > > > wrote:
>> >>> > > > >>
>> >>> > > > >>  +user
>> >>> > > > >>>
>> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>> >>> varun@pinterest.com>
>> >>> > > > >>> wrote:
>> >>> > > > >>>
>> >>> > > > >>>  Hi,
>> >>> > > > >>>>
>> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do
>> >>> > > > >>>> we sort the column qualifiers and make use of the sorted
>> >>> > > > >>>> order when we get the results?
>> >>> > > > >>>
>> >>> > > > >>>> Thanks
>> >>> > > > >>>> Varun
>> >>> > > > >>>>
>> >>> > > > >>>>
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > > --
>> >>> > > > > Marcos Ortiz Valmaseda,
>> >>> > > > > Product Manager && Data Scientist at UCI
>> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
>> >>> > > > http://marcosluis2186.posterous.com>
>> >>> > > > > Twitter: @marcosluis2186
>> >>> > > > > <http://twitter.com/**marcosluis2186<
>> >>> > > > http://twitter.com/marcosluis2186>
>> >>> > > > > >
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
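The cost difference Ted and Varun discuss (deleteColumns places a marker blindly, while deleteColumn must first read the cell to learn its latest timestamp) can be modeled with a toy in-memory row. This is a sketch of the accounting only; the method names echo the HBase Delete API, but none of this is HBase code:

```java
import java.util.*;

// Toy model of the two Delete flavors discussed above (not HBase code):
// a "delete all versions" marker needs no read, while "delete latest
// version" costs one lookup per column to find the newest timestamp.
public class DeleteMarkerCost {
    // qualifier -> version timestamps, newest first
    final Map<String, NavigableSet<Long>> row = new HashMap<>();
    final List<String> markers = new ArrayList<>();
    int reads = 0;

    void put(String qualifier, long ts) {
        row.computeIfAbsent(qualifier,
                q -> new TreeSet<>(Comparator.reverseOrder())).add(ts);
    }

    // deleteColumns-style: one marker covers every version; no read needed.
    void deleteColumns(String qualifier) {
        markers.add("DeleteColumn:" + qualifier);
    }

    // deleteColumn-style: must read the cell to find the latest timestamp
    // before a version delete marker can be placed.
    void deleteColumn(String qualifier) {
        reads++;                                  // the per-column get
        NavigableSet<Long> versions = row.get(qualifier);
        if (versions != null) {
            markers.add("Delete:" + qualifier + "@" + versions.first());
        }
    }

    public static void main(String[] args) {
        DeleteMarkerCost r = new DeleteMarkerCost();
        r.put("q1", 5); r.put("q1", 9); r.put("q2", 3);
        for (String q : Arrays.asList("q1", "q2", "q3")) r.deleteColumns(q);
        System.out.println(r.reads);              // prints 0: no reads needed
        for (String q : Arrays.asList("q1", "q2", "q3")) r.deleteColumn(q);
        System.out.println(r.reads);              // prints 3: one read per column
        System.out.println(r.markers.size());     // prints 5: q3 had no versions
    }
}
```

With 300 requested columns this toy model performs 300 lookups for the version-delete path, which matches the per-column gets Varun observed in prepareDeleteTimestamps(); batching them into one sorted pass is the proposed fix.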

Re: Get on a row with multiple columns

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.
Lars, should we always consider disabling Nagle? What's the down side?

JM

2013/2/9, Varun Sharma <va...@pinterest.com>:
> Yeah, I meant true...
>
> On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org> wrote:
>
>> Should be set to true. If tcpnodelay is set to true, Nagle's is disabled.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> Sent: Saturday, February 9, 2013 12:11 AM
>> Subject: Re: Get on a row with multiple columns
>>
>>
>> Okay I did my research - these need to be set to false. I agree.
>>
>>
>> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com>
>> wrote:
>>
>> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
>> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
>> network latency ?
>> >
>> >
>> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:
>> >
>> >Sorry.. I meant set these two config parameters to true (not false as I
>> state below).
>> >>
>> >>
>> >>
>> >>
>> >>----- Original Message -----
>> >>From: lars hofhansl <la...@apache.org>
>> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> >>Cc:
>> >>Sent: Friday, February 8, 2013 11:41 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>Only somewhat related. Seeing the magic 40ms random read time there.
>> >> Did
>> you disable Nagle's?
>> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
>> hbase-site.xml).
>> >>
>> >>________________________________
>> >>From: Varun Sharma <va...@pinterest.com>
>> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
>> >>Sent: Friday, February 8, 2013 10:45 PM
>> >>Subject: Re: Get on a row with multiple columns
>> >>
>> >>The use case is like your Twitter feed: tweets from people you follow.
>> >>When someone unfollows, you need to delete a bunch of their tweets from
>> >>the following feed. So it's frequent, and we are essentially running into
>> >>extreme corner cases like the one above. We need high write throughput
>> >>for this, since when someone tweets, we need to fan out the tweet to all
>> >>the followers. We need the ability to do fast deletes (unfollow) and fast
>> >>adds (follow) and also to be able to do fast random gets - when a real
>> >>user loads the feed. I doubt we will be able to play much with the schema
>> >>here since we need to support a bunch of use cases.
>> >>
>> >>@lars: It does not take 30 seconds to place 300 delete markers. It takes
>> >>30 seconds to first find which of those 300 pins are in the set of
>> >>columns present - this invokes 300 gets - and then place the appropriate
>> >>delete markers. Note that we can have tens of thousands of columns in a
>> >>single row, so a single get is not cheap.
>> >>
>> >>If we were to just place delete markers, that would be very fast. But
>> >>when we started doing that, our random read performance suffered because
>> >>of too many delete markers. The 90th percentile on random reads shot up
>> >>from 40 milliseconds to 150 milliseconds, which is not acceptable for our
>> >>use case.
>> >>
>> >>Thanks
>> >>Varun
>> >>
>> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org>
>> >> wrote:
>> >>
>> >>> Can you organize your columns and then delete by column family?
>> >>>
>> >>> deleteColumn without specifying a TS is expensive, since HBase first
>> >>> has to figure out what the latest TS is.
>> >>>
>> >>> Should be better in 0.94.1 or later since deletes are batched like
>> >>> Puts
>> >>> (still need to retrieve the latest version, though).
>> >>>
>> >>> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which
>> >>> basically lets you specify a scan condition and then places a specific
>> >>> delete marker for all KVs encountered.
>> >>>
>> >>>
>> >>> If you wanted to get really
>> >>> fancy, you could hook up a coprocessor to the compaction process and
>> >>> simply filter all KVs you no longer want (without ever placing any
>> >>> delete markers).
>> >>>
>> >>>
>> >>> Are you saying it takes 15 seconds to place 300 version delete
>> >>> markers?!
>> >>>
>> >>>
>> >>> -- Lars
>> >>>
>> >>>
>> >>>
>> >>> ________________________________
>> >>>  From: Varun Sharma <va...@pinterest.com>
>> >>> To: user@hbase.apache.org
>> >>> Sent: Friday, February 8, 2013 10:05 PM
>> >>> Subject: Re: Get on a row with multiple columns
>> >>>
>> >>> We are given a set of 300 columns to delete. I tested two cases:
>> >>>
>> >>> 1) deleteColumns() - with the 's'
>> >>>
>> >>> This function simply adds delete markers for 300 columns; in our case,
>> >>> typically only a fraction of these columns are actually present - 10.
>> >>> After starting to use deleteColumns, we started seeing a drop in
>> >>> cluster-wide random read performance - 90th percentile latency
>> >>> worsened, as did 99th, probably because of having to traverse delete
>> >>> markers. I attribute this to the profusion of delete markers in the
>> >>> cluster. Major compactions slowed down by almost 50 percent, probably
>> >>> because of having to clean out significantly more delete markers.
>> >>>
>> >>> 2) deleteColumn()
>> >>>
>> >>> Ended up with intolerable 15-second calls, which clogged all the
>> >>> handlers, making the cluster pretty much unresponsive.
>> >>>
>> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>>
>> >>> > For the 300 column deletes, can you show us how the Delete(s) are
>> >>> > constructed ?
>> >>> >
>> >>> > Do you use this method ?
>> >>> >
>> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> >>> > Thanks
>> >>> >
>> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>> >>> wrote:
>> >>> >
>> >>> > > So a Get call with multiple columns on a single row should be much
>> >>> > > faster than independent Get(s) on each of those columns for that
>> >>> > > row. I am basically seeing severely poor performance (~15 seconds)
>> >>> > > for certain deleteColumn() calls, and I see that there is a
>> >>> > > prepareDeleteTimestamps() function in HRegion.java which first
>> >>> > > tries to locate the column by doing individual gets on each column
>> >>> > > you want to delete (I am doing 300 column deletes). Now, I think
>> >>> > > this should ideally be one get call with the batch of 300 columns,
>> >>> > > so that one scan can retrieve the columns, and the columns that are
>> >>> > > found are indeed deleted.
>> >>> > >
>> >>> > > Before I try this fix, I wanted to get an opinion on whether it
>> >>> > > will make a difference to batch the get(), and it seems from your
>> >>> > > answer that it should.
>> >>> > >
>> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>> >>> wrote:
>> >>> > >
>> >>> > > > Everything is stored as a KeyValue in HBase.
>> >>> > > > The Key part of a KeyValue contains the row key, column family,
>> >>> > > > column name, and timestamp in that order.
>> >>> > > > Each column family has its own store and store files.
>> >>> > > >
>> >>> > > > So in a nutshell, a get is executed by starting a scan at the row
>> >>> > > > key (which is a prefix of the key) in each store (CF) and then
>> >>> > > > scanning forward in each store until the next row key is reached.
>> >>> > > > (In reality it is a bit more complicated due to multiple
>> >>> > > > versions, skipping columns, etc.)
>> >>> > > >
>> >>> > > >
>> >>> > > > -- Lars
>> >>> > > > ________________________________
>> >>> > > > From: Varun Sharma <va...@pinterest.com>
>> >>> > > > To: user@hbase.apache.org
>> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> >>> > > > Subject: Re: Get on a row with multiple columns
>> >>> > > >
>> >>> > > > Sorry, I was a little unclear with my question.
>> >>> > > >
>> >>> > > > Lets say you have
>> >>> > > >
>> >>> > > > Get get = new Get(row)
>> >>> > > > get.addColumn("1");
>> >>> > > > get.addColumn("2");
>> >>> > > > .
>> >>> > > > .
>> >>> > > > .
>> >>> > > >
>> >>> > > > When HBase internally executes the batch get, it will seek to
>> >>> > > > column "1". Now, since data is lexicographically sorted, it does
>> >>> > > > not need to seek from the beginning to get to "2"; it can
>> >>> > > > continue seeking forward, since column "2" will always be after
>> >>> > > > column "1". I want to know whether this is how a multi-column get
>> >>> > > > on a row works or not.
>> >>> > > >
>> >>> > > > Thanks
>> >>> > > > Varun
>> >>> > > >
>> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
>> wrote:
>> >>> > > >
>> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
>> >>> > > > > All utility methods that you can use are:
>> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> >>> > > > >  byte[] value()
>> >>> > > > >  byte[] getRow()
>> >>> > > > >  int size()
>> >>> > > > >  boolean isEmpty()
>> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> >>> > > > >  List<KeyValue> list()
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> >>> > > > >
>> >>> > > > >> Based on what I read in Lars' book, a get will return a
>> >>> > > > >> Result, which is internally a KeyValue[]. This KeyValue[] is
>> >>> > > > >> sorted by the key, and you access this array using the raw or
>> >>> > > > >> list methods on the Result object.
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> varun@pinterest.com
>> >>> >
>> >>> > > > wrote:
>> >>> > > > >>
>> >>> > > > >>  +user
>> >>> > > > >>>
>> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>> >>> varun@pinterest.com>
>> >>> > > > >>> wrote:
>> >>> > > > >>>
>> >>> > > > >>>  Hi,
>> >>> > > > >>>>
>> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do
>> >>> > > > >>>> we sort the column qualifiers and make use of the sorted
>> >>> > > > >>>> order when we get the results?
>> >>> > > > >>>
>> >>> > > > >>>> Thanks
>> >>> > > > >>>> Varun
>> >>> > > > >>>>
>> >>> > > > >>>>
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > > --
>> >>> > > > > Marcos Ortiz Valmaseda,
>> >>> > > > > Product Manager && Data Scientist at UCI
>> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
>> >>> > > > http://marcosluis2186.posterous.com>
>> >>> > > > > Twitter: @marcosluis2186
>> >>> > > > > <http://twitter.com/**marcosluis2186<
>> >>> > > > http://twitter.com/marcosluis2186>
>> >>> > > > > >
>> >>> > > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
Yeah, I meant true...

On Sat, Feb 9, 2013 at 12:17 AM, lars hofhansl <la...@apache.org> wrote:

> Should be set to true. If tcpnodelay is set to true, Nagle's is disabled.
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Saturday, February 9, 2013 12:11 AM
> Subject: Re: Get on a row with multiple columns
>
>
> Okay I did my research - these need to be set to false. I agree.
>
>
> On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:
>
> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
> network latency ?
> >
> >
> >On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >Sorry.. I meant set these two config parameters to true (not false as I
> state below).
> >>
> >>
> >>
> >>
> >>----- Original Message -----
> >>From: lars hofhansl <la...@apache.org>
> >>To: "user@hbase.apache.org" <us...@hbase.apache.org>
> >>Cc:
> >>Sent: Friday, February 8, 2013 11:41 PM
> >>Subject: Re: Get on a row with multiple columns
> >>
> >>Only somewhat related. Seeing the magic 40ms random read time there. Did
> you disable Nagle's?
> >>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
> hbase-site.xml).
> >>
> >>________________________________
> >>From: Varun Sharma <va...@pinterest.com>
> >>To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> >>Sent: Friday, February 8, 2013 10:45 PM
> >>Subject: Re: Get on a row with multiple columns
> >>
> >>The use case is like your Twitter feed: tweets from people you follow.
> >>When someone unfollows, you need to delete a bunch of their tweets from
> >>the following feed. So it's frequent, and we are essentially running into
> >>extreme corner cases like the one above. We need high write throughput
> >>for this, since when someone tweets, we need to fan out the tweet to all
> >>the followers. We need the ability to do fast deletes (unfollow) and fast
> >>adds (follow) and also to be able to do fast random gets - when a real
> >>user loads the feed. I doubt we will be able to play much with the schema
> >>here since we need to support a bunch of use cases.
> >>
> >>@lars: It does not take 30 seconds to place 300 delete markers. It takes
> >>30 seconds to first find which of those 300 pins are in the set of
> >>columns present - this invokes 300 gets - and then place the appropriate
> >>delete markers. Note that we can have tens of thousands of columns in a
> >>single row, so a single get is not cheap.
> >>
> >>If we were to just place delete markers, that would be very fast. But
> >>when we started doing that, our random read performance suffered because
> >>of too many delete markers. The 90th percentile on random reads shot up
> >>from 40 milliseconds to 150 milliseconds, which is not acceptable for our
> >>use case.
> >>
> >>Thanks
> >>Varun
> >>
> >>On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
> >>
> >>> Can you organize your columns and then delete by column family?
> >>>
> >>> deleteColumn without specifying a TS is expensive, since HBase first
> >>> has to figure out what the latest TS is.
> >>>
> >>> Should be better in 0.94.1 or later since deletes are batched like Puts
> >>> (still need to retrieve the latest version, though).
> >>>
> >>> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which
> >>> basically lets you specify a scan condition and then places a specific
> >>> delete marker for all KVs encountered.
> >>>
> >>>
> >>> If you wanted to get really
> >>> fancy, you could hook up a coprocessor to the compaction process and
> >>> simply filter all KVs you no longer want (without ever placing any
> >>> delete markers).
> >>>
> >>>
> >>> Are you saying it takes 15 seconds to place 300 version delete
> markers?!
> >>>
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: Varun Sharma <va...@pinterest.com>
> >>> To: user@hbase.apache.org
> >>> Sent: Friday, February 8, 2013 10:05 PM
> >>> Subject: Re: Get on a row with multiple columns
> >>>
> >>> We are given a set of 300 columns to delete. I tested two cases:
> >>>
> >>> 1) deleteColumns() - with the 's'
> >>>
> >>> This function simply adds delete markers for 300 columns; in our case,
> >>> typically only a fraction of these columns - about 10 - are actually
> >>> present. After starting to use deleteColumns, we started seeing a drop in
> >>> cluster-wide random read performance - the 90th percentile latency
> >>> worsened, and so did the 99th, probably because of having to traverse
> >>> delete markers. I attribute this to the profusion of delete markers in the
> >>> cluster. Major compactions slowed down by almost 50 percent, probably
> >>> because of having to clean out significantly more delete markers.
> >>>
> >>> 2) deleteColumn()
> >>>
> >>> Ended up with intolerable 15-second calls, which clogged all the handlers,
> >>> making the cluster pretty much unresponsive.
> >>>
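The cost difference between the two delete flavors discussed above can be modeled with plain Java (a sketch of the idea, not the actual HRegion code): deleteColumn() removes only the newest version, so the server must first read the cell to learn its timestamp - one hidden get per qualifier, present or not - while deleteColumns() just writes a marker covering every version, with no read at all.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model (not HBase code) of deleteColumn() vs deleteColumns() cost.
public class DeleteCostModel {
    // qualifier -> timestamps of stored versions
    final Map<String, NavigableSet<Long>> row = new HashMap<>();
    int reads = 0;  // hidden read-before-write operations performed

    void put(String qualifier, long ts) {
        row.computeIfAbsent(qualifier, k -> new TreeSet<>()).add(ts);
    }

    // deleteColumn(): must read first to find the latest timestamp.
    void deleteLatestVersion(String qualifier) {
        reads++;  // the per-qualifier get, charged even if the column is absent
        NavigableSet<Long> versions = row.get(qualifier);
        if (versions != null && !versions.isEmpty()) {
            versions.remove(versions.last());
        }
    }

    // deleteColumns(): drop a marker covering all versions, no read needed.
    void deleteAllVersions(String qualifier) {
        row.remove(qualifier);
    }

    public static void main(String[] args) {
        DeleteCostModel m = new DeleteCostModel();
        m.put("a", 1); m.put("a", 2); m.put("b", 5);
        for (String q : Arrays.asList("a", "b", "c")) {
            m.deleteLatestVersion(q);
        }
        System.out.println(m.reads);  // 3 - one read per qualifier, present or not
    }
}
```

With 300 qualifiers against a row holding tens of thousands of columns, those 300 hidden reads are where the 15 seconds go.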
> >>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>
> >>> > For the 300 column deletes, can you show us how the Delete(s) are
> >>> > constructed ?
> >>> >
> >>> > Do you use this method ?
> >>> >
> >>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> >>> > Thanks
> >>> >
> >>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>> >
> >>> > > So a Get call with multiple columns on a single row should be much
> >>> > > faster than independent Get(s) on each of those columns for that row.
> >>> > > I am basically seeing severely poor performance (~15 seconds) for
> >>> > > certain deleteColumn() calls, and I see that there is a
> >>> > > prepareDeleteTimestamps() function in HRegion.java which first tries
> >>> > > to locate the column by doing individual gets on each column you want
> >>> > > to delete (I am doing 300 column deletes). Now, I think this should
> >>> > > ideally be one get call with the batch of 300 columns, so that one
> >>> > > scan can retrieve the columns, and the columns that are found are
> >>> > > indeed deleted.
> >>> > >
> >>> > > Before I try this fix, I wanted to get an opinion if it will make a
> >>> > > difference to batch the get() and it seems from your answer, it
> should.
> >>> > >
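The batching argument above - one forward pass over the row's sorted columns instead of 300 independent seeks - can be sketched with plain Java collections. This is a model of the idea only, not actual HBase scanner code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Both the row's stored qualifiers and the requested qualifiers are sorted,
// so one merge pass finds every match without ever moving backwards.
public class BatchedLookup {

    public static List<String> mergeLookup(NavigableSet<String> stored,
                                           SortedSet<String> requested) {
        List<String> found = new ArrayList<>();
        Iterator<String> it = stored.iterator();
        String current = it.hasNext() ? it.next() : null;
        for (String want : requested) {
            // advance the store iterator forward; never restart from the top
            while (current != null && current.compareTo(want) < 0) {
                current = it.hasNext() ? it.next() : null;
            }
            if (want.equals(current)) {
                found.add(want);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NavigableSet<String> stored =
            new TreeSet<>(Arrays.asList("a", "c", "d", "f"));
        SortedSet<String> requested =
            new TreeSet<>(Arrays.asList("b", "c", "f", "g"));
        System.out.println(mergeLookup(stored, requested)); // [c, f]
    }
}
```

The pass is O(stored + requested) total, versus one full seek per requested column when each lookup starts over.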
> >>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> >>> wrote:
> >>> > >
> >>> > > > Everything is stored as a KeyValue in HBase.
> >>> > > > The Key part of a KeyValue contains the row key, column family,
> >>> column
> >>> > > > name, and timestamp in that order.
> >>> > > > Each column family has its own store and store files.
> >>> > > >
> >>> > > > So in a nutshell a get is executed by starting a scan at the row
> key
> >>> > > > (which is a prefix of the key) in each store (CF) and then
> scanning
> >>> > > forward
> >>> > > > in each store until the next row key is reached. (in reality it
> is a
> >>> > bit
> >>> > > > more complicated due to multiple versions, skipping columns, etc)
> >>> > > >
> >>> > > >
> >>> > > > -- Lars
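The layout Lars describes - keys sort by (row, family, qualifier), so a Get is just a scan that starts at the row prefix and stops at the next row - can be illustrated with a TreeSet of flattened "row/family/qualifier" strings. Timestamps and versions are omitted, and this is not real HBase code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// A TreeSet stands in for a store file; a Get becomes a bounded range view.
public class KeyOrderDemo {

    public static List<String> get(NavigableSet<String> store, String row) {
        String from = row + "/";  // first possible key of this row
        String to = row + "0";    // '0' sorts just after '/', i.e. past the row
        return new ArrayList<>(store.subSet(from, true, to, false));
    }

    public static void main(String[] args) {
        NavigableSet<String> store = new TreeSet<>(Arrays.asList(
            "r1/cf/a", "r1/cf/b", "r2/cf/a", "r3/cf/x"));
        System.out.println(get(store, "r1")); // [r1/cf/a, r1/cf/b]
    }
}
```

The "get" never examines keys for other rows: the sorted order bounds the scan on both sides.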
> >>> > > > ________________________________
> >>> > > > From: Varun Sharma <va...@pinterest.com>
> >>> > > > To: user@hbase.apache.org
> >>> > > > Sent: Friday, February 8, 2013 9:22 PM
> >>> > > > Subject: Re: Get on a row with multiple columns
> >>> > > >
> >>> > > > Sorry, I was a little unclear with my question.
> >>> > > >
> >>> > > > Let's say you have
> >>> > > >
> >>> > > > Get get = new Get(row)
> >>> > > > get.addColumn("1");
> >>> > > > get.addColumn("2");
> >>> > > > .
> >>> > > > .
> >>> > > > .
> >>> > > >
> >>> > > > When HBase internally executes the batch get, it will seek to column
> >>> > > > "1"; now, since data is lexicographically sorted, it does not need to
> >>> > > > seek from the beginning to get to "2" - it can continue seeking
> >>> > > > forward, since column "2" will always come after column "1". I want
> >>> > > > to know whether this is how a multi-column get on a row works or not.
> >>> > > >
> >>> > > > Thanks
> >>> > > > Varun
> >>> > > >
> >>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
> wrote:
> >>> > > >
> >>> > > > > Like Ishan said, a get gives an instance of the Result class.
> >>> > > > > All utility methods that you can use are:
> >>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> >>> > > > >  byte[] value()
> >>> > > > >  byte[] getRow()
> >>> > > > >  int size()
> >>> > > > >  boolean isEmpty()
> >>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> >>> > > > >  List<KeyValue> list()
> >>> > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > >
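Because the KeyValue[] behind a Result is sorted by key (as the replies above note), a single cell can be located with a binary search rather than a linear scan. A sketch using plain "family:qualifier" strings in place of real KeyValues:

```java
import java.util.Arrays;

// Model of looking up one cell in a key-sorted result array.
public class SortedResultDemo {

    // Returns a non-negative index if the cell exists, negative otherwise,
    // matching Arrays.binarySearch semantics.
    public static int find(String[] sortedCells, String family, String qualifier) {
        return Arrays.binarySearch(sortedCells, family + ":" + qualifier);
    }

    public static void main(String[] args) {
        String[] cells = {"cf:a", "cf:b", "cf:z"};       // already key-sorted
        System.out.println(find(cells, "cf", "b") >= 0); // true
        System.out.println(find(cells, "cf", "q") >= 0); // false
    }
}
```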
> >>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> >>> > > > >
> >>> > > > >> Based on what I read in Lars' book, a get will return a
> result a
> >>> > > Result,
> >>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted
> by the
> >>> > key
> >>> > > > and
> >>> > > > >> you access this array using raw or list methods on the Result
> >>> > object.
> >>> > > > >>
> >>> > > > >>
> >>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> varun@pinterest.com
> >>> >
> >>> > > > wrote:
> >>> > > > >>
> >>> > > > >>  +user
> >>> > > > >>>
> >>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> >>> varun@pinterest.com>
> >>> > > > >>> wrote:
> >>> > > > >>>
> >>> > > > >>>  Hi,
> >>> > > > >>>>
> >>> > > > >>>> When I do a Get on a row with multiple column qualifiers, do we
> >>> > > > >>>> sort the column qualifiers and make use of the sorted order when
> >>> > > > >>>> we get the results?
> >>> > > > >>>
> >>> > > > >>>> Thanks
> >>> > > > >>>> Varun
> >>> > > > >>>>
> >>> > > > >>>>
> >>> > > > >>
> >>> > > > >>
> >>> > > > > --
> >>> > > > > Marcos Ortiz Valmaseda,
> >>> > > > > Product Manager && Data Scientist at UCI
> >>> > > > > Blog: http://marcosluis2186.**posterous.com<
> >>> > > > http://marcosluis2186.posterous.com>
> >>> > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> >>> > > > http://twitter.com/marcosluis2186>
> >>> > > > > >
> >>> > > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
Both should be set to true. If tcpnodelay is set to true, Nagle's algorithm is disabled.

-- Lars
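What these settings ultimately toggle is the standard TCP_NODELAY socket option, which disables Nagle's algorithm so small RPC packets are sent immediately instead of being buffered. A minimal JDK-only sketch (a plain java.net.Socket, not HBase's RPC code):

```java
import java.net.Socket;
import java.net.SocketException;

// Enabling TCP_NODELAY on a socket disables Nagle's algorithm for it.
public class NoDelayDemo {

    public static boolean enableNoDelay(Socket s) {
        try {
            s.setTcpNoDelay(true);   // options may be set before connecting
            return s.getTcpNoDelay();
        } catch (SocketException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket()) {           // unconnected is fine here
            System.out.println(enableNoDelay(s)); // true
        }
    }
}
```

With Nagle's enabled, a small request can sit in the kernel waiting for an ACK or more data, which is exactly the kind of fixed delay that shows up as a "magic" ~40 ms floor on small random reads.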



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Saturday, February 9, 2013 12:11 AM
Subject: Re: Get on a row with multiple columns
 

Okay I did my research - these need to be set to false. I agree.


On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:

I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce network latency ?
>
>
>On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:
>
>Sorry.. I meant set these two config parameters to true (not false as I state below).
>>
>>
>>
>>
>>----- Original Message -----
>>From: lars hofhansl <la...@apache.org>
>>To: "user@hbase.apache.org" <us...@hbase.apache.org>
>>Cc:
>>Sent: Friday, February 8, 2013 11:41 PM
>>Subject: Re: Get on a row with multiple columns
>>
>>Only somewhat related. Seeing the magic 40ms random read time there. Did you disable Nagle's?
>>(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in hbase-site.xml).
>>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
Okay I did my research - these need to be set to false. I agree.

On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <va...@pinterest.com> wrote:

> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
> network latency ?
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
network latency ?

On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <la...@apache.org> wrote:

> Sorry.. I meant set these two config parameters to true (not false as I
> state below).
>
>
>
> ----- Original Message -----
> From: lars hofhansl <la...@apache.org>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Friday, February 8, 2013 11:41 PM
> Subject: Re: Get on a row with multiple columns
>
> Only somewhat related. Seeing the magic 40ms random read time there. Did
> you disable Nagle's?
> (set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
> hbase-site.xml).
>
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
> Sent: Friday, February 8, 2013 10:45 PM
> Subject: Re: Get on a row with multiple columns
>
> The use case is like your twitter feed. Tweets from people u follow. When
> someone unfollows, you need to delete a bunch of his tweets from the
> following feed. So, it's frequent, and we are essentially running into some
> extreme corner cases like the one above. We need high write throughput for
> this, since when someone tweets, we need to fanout the tweet to all the
> followers. We need the ability to do fast deletes (unfollow) and fast adds
> (follow) and also be able to do fast random gets - when a real user loads
> the feed. I doubt we will be able to play much with the schema here since we
> need to support a bunch of use cases.
>
> @lars: It does not take 30 seconds to place 300 delete markers. It takes 30
> seconds to first find which of those 300 pins are in the set of columns
> present - this invokes 300 gets and then place the appropriate delete
> markers. Note that we can have tens of thousands of columns in a single row
> so a single get is not cheap.
>
> If we were to just place delete markers, that is very fast. But when
> started doing that, our random read performance suffered because of too
> many delete markers. The 90th percentile on random reads shot up from 40
> milliseconds to 150 milliseconds, which is not acceptable for our usecase.
>
> Thanks
> Varun
>
> On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Can you organize your columns and then delete by column family?
> >
> > deleteColumn without specifying a TS is expensive, since HBase first has
> > to figure out what the latest TS is.
> >
> > Should be better in 0.94.1 or later since deletes are batched like Puts
> > (still need to retrieve the latest version, though).
> >
> > In 0.94.3 or later you can also use the BulkDeleteEndpoint, which basically
> > lets you specify a scan condition and then places a specific delete marker for
> > all KVs encountered.
> >
> >
> > If you wanted to get really
> > fancy, you could hook up a coprocessor to the compaction process and
> > simply filter all KVs you no longer want (without ever placing any
> > delete markers).
> >
> >
> > Are you saying it takes 15 seconds to place 300 version delete markers?!
> >
> >
> > -- Lars
> >
> >
> >
> > ________________________________
> >  From: Varun Sharma <va...@pinterest.com>
> > To: user@hbase.apache.org
> > Sent: Friday, February 8, 2013 10:05 PM
> > Subject: Re: Get on a row with multiple columns
> >
> > We are given a set of 300 columns to delete. I tested two cases:
> >
> > 1) deleteColumns() - with the 's'
> >
> > This function simply adds delete markers for 300 columns, in our case,
> > typically only a fraction of these columns are actually present - 10.
> After
> > starting to use deleteColumns, we started seeing a drop in cluster wide
> > random read performance - 90th percentile latency worsened, so did 99th
> > probably because of having to traverse delete markers. I attribute this
> to
> > profusion of delete markers in the cluster. Major compactions slowed down
> > by almost 50 percent probably because of having to clean out
> significantly
> > more delete markers.
> >
> > 2) deleteColumn()
> >
> > Ended up with intolerable 15-second calls, which clogged all the
> handlers.
> > Making the cluster pretty much unresponsive.
> >
> > On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > For the 300 column deletes, can you show us how the Delete(s) are
> > > constructed ?
> > >
> > > Do you use this method ?
> > >
> > >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > > Thanks
> > >
> > > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >
> > > > So a Get call with multiple columns on a single row should be much
> > faster
> > > > than independent Get(s) on each of those columns for that row. I am
> > > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > > deleteColumn() calls and I am seeing that there is a
> > > > prepareDeleteTimestamps() function in HRegion.java which first tries
> to
> > > > locate the column by doing individual gets on each column you want to
> > > > delete (I am doing 300 column deletes). Now, I think this should
> > > > ideally be
> > > > 1 get call with the batch of 300 columns so that one scan can
> retrieve
> > > the
> > > > columns and the columns that are found, are indeed deleted.
> > > >
> > > > Before I try this fix, I wanted to get an opinion if it will make a
> > > > difference to batch the get() and it seems from your answer, it
> should.
> > > >
> > > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > > >
> > > > > Everything is stored as a KeyValue in HBase.
> > > > > The Key part of a KeyValue contains the row key, column family,
> > column
> > > > > name, and timestamp in that order.
> > > > > Each column family has its own store and store files.
> > > > >
> > > > > So in a nutshell a get is executed by starting a scan at the row
> key
> > > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > > forward
> > > > > in each store until the next row key is reached. (in reality it is
> a
> > > bit
> > > > > more complicated due to multiple versions, skipping columns, etc)
> > > > >
> > > > >
> > > > > -- Lars
> > > > > ________________________________
> > > > > From: Varun Sharma <va...@pinterest.com>
> > > > > To: user@hbase.apache.org
> > > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > > Subject: Re: Get on a row with multiple columns
> > > > >
> > > > > Sorry, I was a little unclear with my question.
> > > > >
> > > > > Lets say you have
> > > > >
> > > > > Get get = new Get(row)
> > > > > get.addColumn("1");
> > > > > get.addColumn("2");
> > > > > .
> > > > > .
> > > > > .
> > > > >
> > > > > When internally hbase executes the batch get, it will seek to
> column
> > > "1",
> > > > > now since data is lexicographically sorted, it does not need to
> seek
> > > from
> > > > > the beginning to get to "2"; it can continue seeking forward,
> > since
> > > > > column "2" will always sort after column "1". I want to know whether
> > this
> > > > is
> > > > > how a multicolumn get on a row works or not.
> > > > >
> > > > > Thanks
> > > > > Varun
> > > > >
> > > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
> wrote:
> > > > >
> > > > > > Like Ishan said, a get gives an instance of the Result class.
> > > > > > All utility methods that you can use are:
> > > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > > >  byte[] value()
> > > > > >  byte[] getRow()
> > > > > >  int size()
> > > > > >  boolean isEmpty()
> > > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > > >  List<KeyValue> list()
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > > >
> > > > > >> Based on what I read in Lars' book, a get will return a
> > > > Result,
> > > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by
> the
> > > key
> > > > > and
> > > > > >> you access this array using raw or list methods on the Result
> > > object.
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> varun@pinterest.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >>  +user
> > > > > >>>
> > > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> > varun@pinterest.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>  Hi,
> > > > > >>>>
> > > > > >>>> When I do a Get on a row with multiple column qualifiers. Do
> we
> > > sort
> > > > > the
> > > > > >>>> column qualifiers and make use of the sorted order when we get
> > the
> > > > > >>>>
> > > > > >>> results ?
> > > > > >>>
> > > > > >>>> Thanks
> > > > > >>>> Varun
> > > > > >>>>
> > > > > >>>>
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > > Marcos Ortiz Valmaseda,
> > > > > > Product Manager && Data Scientist at UCI
> > > > > > Blog: http://marcosluis2186.posterous.com
> > > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
> > > > > >
> > > > >
> > > >
> > >
> >
>
>

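The sorted forward-seek behavior discussed in the thread above can be illustrated with a small self-contained sketch (plain Java, no HBase dependency; the class and method names are purely illustrative, not HBase's actual scanner code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Toy model of a multi-column Get: the columns of one row are stored in
// sorted order, so a batch of sorted qualifiers can be resolved in a single
// forward pass instead of one full seek per qualifier.
public class SortedSeekDemo {

    // Return the requested qualifiers that exist in the row, walking the
    // row's sorted column set forward only.
    static List<String> multiColumnGet(TreeSet<String> rowColumns,
                                       TreeSet<String> requested) {
        List<String> found = new ArrayList<String>();
        String cursor = null; // last qualifier we sought to
        for (String qualifier : requested) { // requested set is also sorted
            // tailSet(): resume scanning after the cursor, never rewind
            NavigableSet<String> remaining = (cursor == null)
                    ? rowColumns
                    : rowColumns.tailSet(cursor, false);
            if (remaining.contains(qualifier)) {
                found.add(qualifier);
            }
            cursor = qualifier;
        }
        return found;
    }

    public static void main(String[] args) {
        TreeSet<String> row = new TreeSet<String>(
                Arrays.asList("a", "c", "e", "g"));
        TreeSet<String> wanted = new TreeSet<String>(
                Arrays.asList("b", "c", "g"));
        // only "c" and "g" are present in the row
        System.out.println(multiColumnGet(row, wanted)); // prints [c, g]
    }
}
```

The point of the tailSet() call is that each requested qualifier resumes from the previous position, so one sorted batch of columns costs a single forward scan over the row.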
Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
Sorry.. I meant set these two config parameters to true (not false as I state below).
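With the correction applied, the snippet in hbase-site.xml would look roughly like this (property names as quoted in the original message below):

```xml
<!-- Disable Nagle's algorithm on client and server (tcpnodelay = true) -->
<property>
  <name>hbase.ipc.client.tcpnodelay</name>
  <value>true</value>
</property>
<property>
  <name>ipc.server.tcpnodelay</name>
  <value>true</value>
</property>
```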



----- Original Message -----
From: lars hofhansl <la...@apache.org>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Friday, February 8, 2013 11:41 PM
Subject: Re: Get on a row with multiple columns

Only somewhat related. Seeing the magic 40ms random read time there. Did you disable Nagle's?
(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in hbase-site.xml).

________________________________
From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Friday, February 8, 2013 10:45 PM
Subject: Re: Get on a row with multiple columns

The use case is like your twitter feed. Tweets from people u follow. When
someone unfollows, you need to delete a bunch of his tweets from the
following feed. So it's frequent, and we are essentially running into some
extreme corner cases like the one above. We need high write throughput for
this, since when someone tweets, we need to fanout the tweet to all the
followers. We need the ability to do fast deletes (unfollow) and fast adds
(follow) and also be able to do fast random gets - when a real user loads
the feed. I doubt we will be able to play much with the schema here since we
need to support a bunch of use cases.

@lars: It does not take 30 seconds to place 300 delete markers. It takes 30
seconds to first find which of those 300 pins are in the set of columns
present - this invokes 300 gets and then places the appropriate delete
markers. Note that we can have tens of thousands of columns in a single row
so a single get is not cheap.

If we were to just place delete markers, that is very fast. But when we
started doing that, our random read performance suffered because of too
many delete markers. The 90th percentile on random reads shot up from 40
milliseconds to 150 milliseconds, which is not acceptable for our usecase.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:

> Can you organize your columns and then delete by column family?
>
> deleteColumn without specifying a TS is expensive, since HBase first has
> to figure out what the latest TS is.
>
> Should be better in 0.94.1 or later since deletes are batched like Puts
> (still need to retrieve the latest version, though).
>
> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which basically
> lets you specify a scan condition and then place a specific delete marker for
> all KVs encountered.
>
>
> If you wanted to get really
> fancy, you could hook up a coprocessor to the compaction process and
> simply filter all KVs you no longer want (without ever placing any
> delete markers).
>
>
> Are you saying it takes 15 seconds to place 300 version delete markers?!
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Friday, February 8, 2013 10:05 PM
> Subject: Re: Get on a row with multiple columns
>
> We are given a set of 300 columns to delete. I tested two cases:
>
> 1) deleteColumns() - with the 's'
>
> This function simply adds delete markers for 300 columns, in our case,
> typically only a fraction of these columns are actually present - 10. After
> starting to use deleteColumns, we starting seeing a drop in cluster wide
> random read performance - 90th percentile latency worsened, so did 99th
> probably because of having to traverse delete markers. I attribute this to
> a profusion of delete markers in the cluster. Major compactions slowed down
> by almost 50 percent probably because of having to clean out significantly
> more delete markers.
>
> 2) deleteColumn()
>
> Ended up with intolerable 15-second calls, which clogged all the handlers,
> making the cluster pretty much unresponsive.
>
> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > For the 300 column deletes, can you show us how the Delete(s) are
> > constructed ?
> >
> > Do you use this method ?
> >
> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > Thanks
> >
> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > So a Get call with multiple columns on a single row should be much
> faster
> > > than independent Get(s) on each of those columns for that row. I am
> > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > deleteColumn() calls and I am seeing that there is a
> > > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > > locate the column by doing individual gets on each column you want to
> > > delete (I am doing 300 column deletes). Now, I think this should ideally
> > be
> > > 1 get call with the batch of 300 columns so that one scan can retrieve
> > the
> > > columns and the columns that are found, are indeed deleted.
> > >
> > > Before I try this fix, I wanted to get an opinion if it will make a
> > > difference to batch the get() and it seems from your answer, it should.
> > >
> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > > > Everything is stored as a KeyValue in HBase.
> > > > The Key part of a KeyValue contains the row key, column family,
> column
> > > > name, and timestamp in that order.
> > > > Each column family has its own store and store files.
> > > >
> > > > So in a nutshell a get is executed by starting a scan at the row key
> > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > forward
> > > > in each store until the next row key is reached. (in reality it is a
> > bit
> > > > more complicated due to multiple versions, skipping columns, etc)
> > > >
> > > >
> > > > -- Lars
> > > > ________________________________
> > > > From: Varun Sharma <va...@pinterest.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > Subject: Re: Get on a row with multiple columns
> > > >
> > > > Sorry, I was a little unclear with my question.
> > > >
> > > > Lets say you have
> > > >
> > > > Get get = new Get(row)
> > > > get.addColumn("1");
> > > > get.addColumn("2");
> > > > .
> > > > .
> > > > .
> > > >
> > > > When internally hbase executes the batch get, it will seek to column
> > "1",
> > > > now since data is lexicographically sorted, it does not need to seek
> > from
> > > > the beginning to get to "2"; it can continue seeking forward,
> since
> > > > column "2" will always sort after column "1". I want to know whether
> this
> > > is
> > > > how a multicolumn get on a row works or not.
> > > >
> > > > Thanks
> > > > Varun
> > > >
> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > > >
> > > > > Like Ishan said, a get gives an instance of the Result class.
> > > > > All utility methods that you can use are:
> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > >  byte[] value()
> > > > >  byte[] getRow()
> > > > >  int size()
> > > > >  boolean isEmpty()
> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > >  List<KeyValue> list()
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > >
> > > > >> Based on what I read in Lars' book, a get will return a
> > > Result,
> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> > key
> > > > and
> > > > >> you access this array using raw or list methods on the Result
> > object.
> > > > >>
> > > > >>
> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <varun@pinterest.com
> >
> > > > wrote:
> > > > >>
> > > > >>  +user
> > > > >>>
> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> varun@pinterest.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>  Hi,
> > > > >>>>
> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> > sort
> > > > the
> > > > >>>> column qualifiers and make use of the sorted order when we get
> the
> > > > >>>>
> > > > >>> results ?
> > > > >>>
> > > > >>>> Thanks
> > > > >>>> Varun
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > > > --
> > > > > Marcos Ortiz Valmaseda,
> > > > > Product Manager && Data Scientist at UCI
> > > > > Blog: http://marcosluis2186.posterous.com
> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
Only somewhat related. Seeing the magic 40ms random read time there. Did you disable Nagle's?
(set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in hbase-site.xml).

________________________________
From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org> 
Sent: Friday, February 8, 2013 10:45 PM
Subject: Re: Get on a row with multiple columns

The use case is like your twitter feed. Tweets from people u follow. When
someone unfollows, you need to delete a bunch of his tweets from the
following feed. So it's frequent, and we are essentially running into some
extreme corner cases like the one above. We need high write throughput for
this, since when someone tweets, we need to fanout the tweet to all the
followers. We need the ability to do fast deletes (unfollow) and fast adds
(follow) and also be able to do fast random gets - when a real user loads
the feed. I doubt we will be able to play much with the schema here since we
need to support a bunch of use cases.

@lars: It does not take 30 seconds to place 300 delete markers. It takes 30
seconds to first find which of those 300 pins are in the set of columns
present - this invokes 300 gets and then places the appropriate delete
markers. Note that we can have tens of thousands of columns in a single row
so a single get is not cheap.

If we were to just place delete markers, that is very fast. But when we
started doing that, our random read performance suffered because of too
many delete markers. The 90th percentile on random reads shot up from 40
milliseconds to 150 milliseconds, which is not acceptable for our usecase.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:

> Can you organize your columns and then delete by column family?
>
> deleteColumn without specifying a TS is expensive, since HBase first has
> to figure out what the latest TS is.
>
> Should be better in 0.94.1 or later since deletes are batched like Puts
> (still need to retrieve the latest version, though).
>
> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which basically
> lets you specify a scan condition and then place a specific delete marker for
> all KVs encountered.
>
>
> If you wanted to get really
> fancy, you could hook up a coprocessor to the compaction process and
> simply filter all KVs you no longer want (without ever placing any
> delete markers).
>
>
> Are you saying it takes 15 seconds to place 300 version delete markers?!
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Friday, February 8, 2013 10:05 PM
> Subject: Re: Get on a row with multiple columns
>
> We are given a set of 300 columns to delete. I tested two cases:
>
> 1) deleteColumns() - with the 's'
>
> This function simply adds delete markers for 300 columns, in our case,
> typically only a fraction of these columns are actually present - 10. After
> starting to use deleteColumns, we starting seeing a drop in cluster wide
> random read performance - 90th percentile latency worsened, so did 99th
> probably because of having to traverse delete markers. I attribute this to
> a profusion of delete markers in the cluster. Major compactions slowed down
> by almost 50 percent probably because of having to clean out significantly
> more delete markers.
>
> 2) deleteColumn()
>
> Ended up with intolerable 15-second calls, which clogged all the handlers,
> making the cluster pretty much unresponsive.
>
> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > For the 300 column deletes, can you show us how the Delete(s) are
> > constructed ?
> >
> > Do you use this method ?
> >
> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > Thanks
> >
> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > So a Get call with multiple columns on a single row should be much
> faster
> > > than independent Get(s) on each of those columns for that row. I am
> > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > deleteColumn() calls and I am seeing that there is a
> > > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > > locate the column by doing individual gets on each column you want to
> > > delete (I am doing 300 column deletes). Now, I think this should ideally
> > be
> > > 1 get call with the batch of 300 columns so that one scan can retrieve
> > the
> > > columns and the columns that are found, are indeed deleted.
> > >
> > > Before I try this fix, I wanted to get an opinion if it will make a
> > > difference to batch the get() and it seems from your answer, it should.
> > >
> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > > > Everything is stored as a KeyValue in HBase.
> > > > The Key part of a KeyValue contains the row key, column family,
> column
> > > > name, and timestamp in that order.
> > > > Each column family has its own store and store files.
> > > >
> > > > So in a nutshell a get is executed by starting a scan at the row key
> > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > forward
> > > > in each store until the next row key is reached. (in reality it is a
> > bit
> > > > more complicated due to multiple versions, skipping columns, etc)
> > > >
> > > >
> > > > -- Lars
> > > > ________________________________
> > > > From: Varun Sharma <va...@pinterest.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > Subject: Re: Get on a row with multiple columns
> > > >
> > > > Sorry, I was a little unclear with my question.
> > > >
> > > > Lets say you have
> > > >
> > > > Get get = new Get(row)
> > > > get.addColumn("1");
> > > > get.addColumn("2");
> > > > .
> > > > .
> > > > .
> > > >
> > > > When internally hbase executes the batch get, it will seek to column
> > "1",
> > > > now since data is lexicographically sorted, it does not need to seek
> > from
> > > > the beginning to get to "2"; it can continue seeking forward,
> since
> > > > column "2" will always sort after column "1". I want to know whether
> this
> > > is
> > > > how a multicolumn get on a row works or not.
> > > >
> > > > Thanks
> > > > Varun
> > > >
> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > > >
> > > > > Like Ishan said, a get gives an instance of the Result class.
> > > > > All utility methods that you can use are:
> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > >  byte[] value()
> > > > >  byte[] getRow()
> > > > >  int size()
> > > > >  boolean isEmpty()
> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > >  List<KeyValue> list()
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > >
> > > > >> Based on what I read in Lars' book, a get will return a
> > > Result,
> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> > key
> > > > and
> > > > >> you access this array using raw or list methods on the Result
> > object.
> > > > >>
> > > > >>
> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <varun@pinterest.com
> >
> > > > wrote:
> > > > >>
> > > > >>  +user
> > > > >>>
> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> varun@pinterest.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>  Hi,
> > > > >>>>
> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> > sort
> > > > the
> > > > >>>> column qualifiers and make use of the sorted order when we get
> the
> > > > >>>>
> > > > >>> results ?
> > > > >>>
> > > > >>>> Thanks
> > > > >>>> Varun
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > > > --
> > > > > Marcos Ortiz Valmaseda,
> > > > > Product Manager && Data Scientist at UCI
> > > > > Blog: http://marcosluis2186.posterous.com
> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
For 1) The Endpoint ships with HBase (in the example package, included by default).

2) Exactly. It still places delete markers, but you get to control this better.

In your case you want to pass BulkDeleteProtocol.DeleteType.VERSION as the delete type. This places an exact version delete marker for each KV encountered during the scan, and it does this efficiently region by region.


-- Lars
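
For the archive, a client-side call along those lines might look roughly like this in 0.94. This is a sketch only - it cannot run outside a live cluster with the endpoint deployed, the variables (table, startRow, stopRow, family, qualifiersToDelete) are assumed, and the exact signatures should be checked against the shipped javadoc:

```java
// Sketch (assumed variables; verify signatures against the 0.94 javadoc).
// Bound the scan to the rows/columns we want cleaned up.
final Scan scan = new Scan(startRow, stopRow);
for (byte[] qualifier : qualifiersToDelete) {
    scan.addColumn(family, qualifier);   // e.g. the 300 columns to remove
}
Map<byte[], BulkDeleteResponse> results = table.coprocessorExec(
    BulkDeleteProtocol.class, startRow, stopRow,
    new Batch.Call<BulkDeleteProtocol, BulkDeleteResponse>() {
        public BulkDeleteResponse call(BulkDeleteProtocol endpoint)
                throws IOException {
            // VERSION: place an exact version delete marker for each KV
            // the scan encounters, region by region
            return endpoint.delete(scan,
                    BulkDeleteProtocol.DeleteType.VERSION, null, 1);
        }
    });
```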


----- Original Message -----
From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org; lars hofhansl <la...@apache.org>
Cc: 
Sent: Friday, February 8, 2013 10:57 PM
Subject: Re: Get on a row with multiple columns

We are actually doing some filtering already using a coprocessor during
major compactions but we don't really know in advance what is going to be
trimmed out. We only know when an unfollow action happens.

Anyhow this BulkDelete looks promising. I have never done coprocessor
endpoints before, so can you help me with a couple of questions:
1) I am running HBase 0.94.3 - do I need to do anything on the region server
side, any configuration to take advantage of this, or can I simply follow
the javadoc (which is really informative) and use the endpoint in my client?
2) This, as I read it, will run a single scan (with filters etc.) and
simply place delete markers for all the entries that were found during the
scan?

Thanks
Varun

On Fri, Feb 8, 2013 at 10:45 PM, Varun Sharma <va...@pinterest.com> wrote:

> The use case is like your twitter feed. Tweets from people u follow. When
> someone unfollows, you need to delete a bunch of his tweets from the
> following feed. So it's frequent, and we are essentially running into some
> extreme corner cases like the one above. We need high write throughput for
> this, since when someone tweets, we need to fanout the tweet to all the
> followers. We need the ability to do fast deletes (unfollow) and fast adds
> (follow) and also be able to do fast random gets - when a real user loads
> the feed. I doubt we will be able to play much with the schema here since we
> need to support a bunch of use cases.
>
> @lars: It does not take 30 seconds to place 300 delete markers. It takes
> 30 seconds to first find which of those 300 pins are in the set of columns
> present - this invokes 300 gets and then places the appropriate delete
> markers. Note that we can have tens of thousands of columns in a single row
> so a single get is not cheap.
>
> If we were to just place delete markers, that is very fast. But when we
> started doing that, our random read performance suffered because of too
> many delete markers. The 90th percentile on random reads shot up from 40
> milliseconds to 150 milliseconds, which is not acceptable for our usecase.
>
> Thanks
> Varun
>
>
> On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
>
>> Can you organize your columns and then delete by column family?
>>
>> deleteColumn without specifying a TS is expensive, since HBase first has
>> to figure out what the latest TS is.
>>
>> Should be better in 0.94.1 or later since deletes are batched like Puts
>> (still need to retrieve the latest version, though).
>>
>> In 0.94.3 or later you can also use the BulkDeleteEndPoint, which basically
>> lets you specify a scan condition and then place a specific delete marker for
>> all KVs encountered.
>>
>>
>> If you wanted to get really
>> fancy, you could hook up a coprocessor to the compaction process and
>> simply filter all KVs you no longer want (without ever placing any
>> delete markers).
>>
>>
>> Are you saying it takes 15 seconds to place 300 version delete markers?!
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org
>> Sent: Friday, February 8, 2013 10:05 PM
>> Subject: Re: Get on a row with multiple columns
>>
>> We are given a set of 300 columns to delete. I tested two cases:
>>
>> 1) deleteColumns() - with the 's'
>>
>> This function simply adds delete markers for 300 columns, in our case,
>> typically only a fraction of these columns are actually present - 10.
>> After
>> starting to use deleteColumns, we starting seeing a drop in cluster wide
>> random read performance - 90th percentile latency worsened, so did 99th
>> probably because of having to traverse delete markers. I attribute this to
>> a profusion of delete markers in the cluster. Major compactions slowed down
>> by almost 50 percent probably because of having to clean out significantly
>> more delete markers.
>>
>> 2) deleteColumn()
>>
>> Ended up with intolerable 15-second calls, which clogged all the handlers,
>> making the cluster pretty much unresponsive.
>>
>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > For the 300 column deletes, can you show us how the Delete(s) are
>> > constructed ?
>> >
>> > Do you use this method ?
>> >
>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> > Thanks
>> >
>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>> >
>> > > So a Get call with multiple columns on a single row should be much
>> faster
>> > > than independent Get(s) on each of those columns for that row. I am
>> > > basically seeing severely poor performance (~ 15 seconds) for certain
>> > > deleteColumn() calls and I am seeing that there is a
>> > > prepareDeleteTimestamps() function in HRegion.java which first tries
>> to
>> > > locate the column by doing individual gets on each column you want to
>> > > delete (I am doing 300 column deletes). Now, I think this should
>> ideally
>> > be
>> > > 1 get call with the batch of 300 columns so that one scan can retrieve
>> > the
>> > > columns and the columns that are found, are indeed deleted.
>> > >
>> > > Before I try this fix, I wanted to get an opinion if it will make a
>> > > difference to batch the get() and it seems from your answer, it
>> should.
>> > >
>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>> wrote:
>> > >
>> > > > Everything is stored as a KeyValue in HBase.
>> > > > The Key part of a KeyValue contains the row key, column family,
>> column
>> > > > name, and timestamp in that order.
>> > > > Each column family has its own store and store files.
>> > > >
>> > > > So in a nutshell a get is executed by starting a scan at the row key
>> > > > (which is a prefix of the key) in each store (CF) and then scanning
>> > > forward
>> > > > in each store until the next row key is reached. (in reality it is a
>> > bit
>> > > > more complicated due to multiple versions, skipping columns, etc)
>> > > >
>> > > >
>> > > > -- Lars
>> > > > ________________________________
>> > > > From: Varun Sharma <va...@pinterest.com>
>> > > > To: user@hbase.apache.org
>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> > > > Subject: Re: Get on a row with multiple columns
>> > > >
>> > > > Sorry, I was a little unclear with my question.
>> > > >
>> > > > Lets say you have
>> > > >
>> > > > Get get = new Get(row)
>> > > > get.addColumn("1");
>> > > > get.addColumn("2");
>> > > > .
>> > > > .
>> > > > .
>> > > >
>> > > > When internally hbase executes the batch get, it will seek to column
>> > "1",
>> > > > now since data is lexicographically sorted, it does not need to seek
>> > from
>> > > > the beginning to get to "2"; it can continue seeking forward,
>> since
>> > > > column "2" will always sort after column "1". I want to know whether
>> this
>> > > is
>> > > > how a multicolumn get on a row works or not.
>> > > >
>> > > > Thanks
>> > > > Varun
>> > > >
>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
>> wrote:
>> > > >
>> > > > > Like Ishan said, a get gives an instance of the Result class.
>> > > > > All utility methods that you can use are:
>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> > > > >  byte[] value()
>> > > > >  byte[] getRow()
>> > > > >  int size()
>> > > > >  boolean isEmpty()
>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> > > > >  List<KeyValue> list()
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> > > > >
>> > > > >> Based on what I read in Lars' book, a get will return a
>> > > Result,
>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by
>> the
>> > key
>> > > > and
>> > > > >> you access this array using raw or list methods on the Result
>> > object.
>> > > > >>
>> > > > >>
>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> varun@pinterest.com>
>> > > > wrote:
>> > > > >>
>> > > > >>  +user
>> > > > >>>
>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>> varun@pinterest.com>
>> > > > >>> wrote:
>> > > > >>>
>> > > > >>>  Hi,
>> > > > >>>>
>> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
>> > sort
>> > > > the
>> > > > >>>> column qualifiers and make use of the sorted order when we get
>> the
>> > > > >>>>
>> > > > >>> results ?
>> > > > >>>
>> > > > >>>> Thanks
>> > > > >>>> Varun
>> > > > >>>>
>> > > > >>>>
>> > > > >>
>> > > > >>
>> > > > > --
>> > > > > Marcos Ortiz Valmaseda,
>> > > > > Product Manager && Data Scientist at UCI
>> > > > > Blog: http://marcosluis2186.posterous.com
>> > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>


Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
We are actually doing some filtering already using a coprocessor during
major compactions but we don't really know in advance what is going to be
trimmed out. We only know when an unfollow action happens.

Anyhow this BulkDelete looks promising. I have never done coprocessor
endpoints before, so can you help me with a couple of questions:
1) I am running HBase 0.94.3 - do I need to do anything on the region server
side, any configuration to take advantage of this, or can I simply follow
the javadoc (which is really informative) and use the endpoint in my client?
2) This, as I read it, will run a single scan (with filters etc.) and
simply place delete markers for all the entries that were found during the
scan?

Thanks
Varun

On Fri, Feb 8, 2013 at 10:45 PM, Varun Sharma <va...@pinterest.com> wrote:

> The use case is like your twitter feed. Tweets from people u follow. When
> someone unfollows, you need to delete a bunch of his tweets from the
> following feed. So it's frequent, and we are essentially running into some
> extreme corner cases like the one above. We need high write throughput for
> this, since when someone tweets, we need to fanout the tweet to all the
> followers. We need the ability to do fast deletes (unfollow) and fast adds
> (follow) and also be able to do fast random gets - when a real user loads
> the feed. I doubt we will be able to play much with the schema here since we
> need to support a bunch of use cases.
>
> @lars: It does not take 30 seconds to place 300 delete markers. It takes
> 30 seconds to first find which of those 300 pins are in the set of columns
> present - this invokes 300 gets and then places the appropriate delete
> markers. Note that we can have tens of thousands of columns in a single row
> so a single get is not cheap.
>
> If we were to just place delete markers, that is very fast. But when we
> started doing that, our random read performance suffered because of too
> many delete markers. The 90th percentile on random reads shot up from 40
> milliseconds to 150 milliseconds, which is not acceptable for our usecase.
>
> Thanks
> Varun
>
>
> On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:
>
>> Can you organize your columns and then delete by column family?
>>
>> deleteColumn without specifying a TS is expensive, since HBase first has
>> to figure out what the latest TS is.
>>
>> Should be better in 0.94.1 or later since deletes are batched like Puts
>> (still need to retrieve the latest version, though).
>>
>> In 0.94.3 or later you can also the BulkDeleteEndPoint, which basically
>> let's specify a scan condition and then place specific delete marker for
>> all KVs encountered.
>>
>>
>> If you wanted to get really
>> fancy, you could hook up a coprocessor to the compaction process and
>> simply filter all KVs you no longer want (without ever placing any
>> delete markers).
>>
>>
>> Are you saying it takes 15 seconds to place 300 version delete markers?!
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org
>> Sent: Friday, February 8, 2013 10:05 PM
>> Subject: Re: Get on a row with multiple columns
>>
>> We are given a set of 300 columns to delete. I tested two cases:
>>
>> 1) deleteColumns() - with the 's'
>>
>> This function simply adds delete markers for 300 columns, in our case,
>> typically only a fraction of these columns are actually present - 10.
>> After
>> starting to use deleteColumns, we starting seeing a drop in cluster wide
>> random read performance - 90th percentile latency worsened, so did 99th
>> probably because of having to traverse delete markers. I attribute this to
>> profusion of delete markers in the cluster. Major compactions slowed down
>> by almost 50 percent probably because of having to clean out significantly
>> more delete markers.
>>
>> 2) deleteColumn()
>>
>> Ended up with untolerable 15 second calls, which clogged all the handlers.
>> Making the cluster pretty much unresponsive.
>>
>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > For the 300 column deletes, can you show us how the Delete(s) are
>> > constructed ?
>> >
>> > Do you use this method ?
>> >
>> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> > Thanks
>> >
>> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>> wrote:
>> >
>> > > So a Get call with multiple columns on a single row should be much
>> faster
>> > > than independent Get(s) on each of those columns for that row. I am
>> > > basically seeing severely poor performance (~ 15 seconds) for certain
>> > > deleteColumn() calls and I am seeing that there is a
>> > > prepareDeleteTimestamps() function in HRegion.java which first tries
>> to
>> > > locate the column by doing individual gets on each column you want to
>> > > delete (I am doing 300 column deletes). Now, I think this should
>> ideall
>> > by
>> > > 1 get call with the batch of 300 columns so that one scan can retrieve
>> > the
>> > > columns and the columns that are found, are indeed deleted.
>> > >
>> > > Before I try this fix, I wanted to get an opinion if it will make a
>> > > difference to batch the get() and it seems from your answer, it
>> should.
>> > >
>> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>> wrote:
>> > >
>> > > > Everything is stored as a KeyValue in HBase.
>> > > > The Key part of a KeyValue contains the row key, column family,
>> column
>> > > > name, and timestamp in that order.
>> > > > Each column family has it's own store and store files.
>> > > >
>> > > > So in a nutshell a get is executed by starting a scan at the row key
>> > > > (which is a prefix of the key) in each store (CF) and then scanning
>> > > forward
>> > > > in each store until the next row key is reached. (in reality it is a
>> > bit
>> > > > more complicated due to multiple versions, skipping columns, etc)
>> > > >
>> > > >
>> > > > -- Lars
>> > > > ________________________________
>> > > > From: Varun Sharma <va...@pinterest.com>
>> > > > To: user@hbase.apache.org
>> > > > Sent: Friday, February 8, 2013 9:22 PM
>> > > > Subject: Re: Get on a row with multiple columns
>> > > >
>> > > > Sorry, I was a little unclear with my question.
>> > > >
>> > > > Lets say you have
>> > > >
>> > > > Get get = new Get(row)
>> > > > get.addColumn("1");
>> > > > get.addColumn("2");
>> > > > .
>> > > > .
>> > > > .
>> > > >
>> > > > When internally hbase executes the batch get, it will seek to column
>> > "1",
>> > > > now since data is lexicographically sorted, it does not need to seek
>> > from
>> > > > the beginning to get to "2", it can continue seeking, henceforth
>> since
>> > > > column "2" will always be after column "1". I want to know whether
>> this
>> > > is
>> > > > how a multicolumn get on a row works or not.
>> > > >
>> > > > Thanks
>> > > > Varun
>> > > >
>> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
>> wrote:
>> > > >
>> > > > > Like Ishan said, a get give an instance of the Result class.
>> > > > > All utility methods that you can use are:
>> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
>> > > > >  byte[] value()
>> > > > >  byte[] getRow()
>> > > > >  int size()
>> > > > >  boolean isEmpty()
>> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
>> > > > >  List<KeyValue> list()
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> > > > >
>> > > > >> Based on what I read in Lars' book, a get will return a result a
>> > > Result,
>> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by
>> the
>> > key
>> > > > and
>> > > > >> you access this array using raw or list methods on the Result
>> > object.
>> > > > >>
>> > > > >>
>> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> varun@pinterest.com>
>> > > > wrote:
>> > > > >>
>> > > > >>  +user
>> > > > >>>
>> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>> varun@pinterest.com>
>> > > > >>> wrote:
>> > > > >>>
>> > > > >>>  Hi,
>> > > > >>>>
>> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
>> > sort
>> > > > the
>> > > > >>>> column qualifers and make use of the sorted order when we get
>> the
>> > > > >>>>
>> > > > >>> results ?
>> > > > >>>
>> > > > >>>> Thanks
>> > > > >>>> Varun
>> > > > >>>>
>> > > > >>>>
>> > > > >>
>> > > > >>
>> > > > > --
>> > > > > Marcos Ortiz Valmaseda,
>> > > > > Product Manager && Data Scientist at UCI
>> > > > > Blog: http://marcosluis2186.**posterous.com<
>> > > > http://marcosluis2186.posterous.com>
>> > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
>> > > > http://twitter.com/marcosluis2186>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
The use case is like your Twitter feed: tweets from people you follow. When
someone unfollows, you need to delete a bunch of his tweets from the
following feed. So it's frequent, and we are essentially running into some
extreme corner cases like the one above. We need high write throughput for
this, since when someone tweets, we need to fan out the tweet to all the
followers. We need the ability to do fast deletes (unfollow) and fast adds
(follow) and also be able to do fast random gets - when a real user loads
the feed. I doubt we will be able to play much with the schema here since we
need to support a bunch of use cases.

@lars: It does not take 30 seconds to place 300 delete markers. It takes 30
seconds to first find which of those 300 pins are in the set of columns
present - this invokes 300 gets - and then to place the appropriate delete
markers. Note that we can have tens of thousands of columns in a single row,
so a single get is not cheap.
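
The batched alternative - one forward merge over sorted qualifiers instead
of 300 independent point lookups - can be sketched with plain collections
(illustrative names only, not HBase internals). Because both the row's
cells and the requested qualifiers are sorted, a single pass visits each
cell at most once instead of restarting a seek per qualifier:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.TreeSet;

public class BatchedGetSketch {

    // Returns the requested qualifiers that actually exist in the row,
    // found in ONE forward pass over the row's sorted cells.
    public static List<String> foundColumns(TreeSet<String> rowCells,
                                            TreeSet<String> requested) {
        List<String> found = new ArrayList<>();
        Iterator<String> cells = rowCells.iterator();
        String cell = cells.hasNext() ? cells.next() : null;
        for (String want : requested) {                 // both sides sorted
            // Advance forward only; never re-seek from the row start.
            while (cell != null && cell.compareTo(want) < 0) {
                cell = cells.hasNext() ? cells.next() : null;
            }
            if (want.equals(cell)) {
                found.add(want);
            }
        }
        return found;
    }
}
```

With tens of thousands of cells per row, the difference between one such
merge and 300 separate seeks is exactly the 15-second pathology described
above.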

If we were to just place delete markers, that would be very fast. But when we
started doing that, our random read performance suffered because of too
many delete markers. The 90th percentile on random reads shot up from 40
milliseconds to 150 milliseconds, which is not acceptable for our use case.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <la...@apache.org> wrote:

> Can you organize your columns and then delete by column family?
>
> deleteColumn without specifying a TS is expensive, since HBase first has
> to figure out what the latest TS is.
>
> Should be better in 0.94.1 or later since deletes are batched like Puts
> (still need to retrieve the latest version, though).
>
> In 0.94.3 or later you can also the BulkDeleteEndPoint, which basically
> let's specify a scan condition and then place specific delete marker for
> all KVs encountered.
>
>
> If you wanted to get really
> fancy, you could hook up a coprocessor to the compaction process and
> simply filter all KVs you no longer want (without ever placing any
> delete markers).
>
>
> Are you saying it takes 15 seconds to place 300 version delete markers?!
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Friday, February 8, 2013 10:05 PM
> Subject: Re: Get on a row with multiple columns
>
> We are given a set of 300 columns to delete. I tested two cases:
>
> 1) deleteColumns() - with the 's'
>
> This function simply adds delete markers for 300 columns, in our case,
> typically only a fraction of these columns are actually present - 10. After
> starting to use deleteColumns, we starting seeing a drop in cluster wide
> random read performance - 90th percentile latency worsened, so did 99th
> probably because of having to traverse delete markers. I attribute this to
> profusion of delete markers in the cluster. Major compactions slowed down
> by almost 50 percent probably because of having to clean out significantly
> more delete markers.
>
> 2) deleteColumn()
>
> Ended up with untolerable 15 second calls, which clogged all the handlers.
> Making the cluster pretty much unresponsive.
>
> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > For the 300 column deletes, can you show us how the Delete(s) are
> > constructed ?
> >
> > Do you use this method ?
> >
> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > Thanks
> >
> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > So a Get call with multiple columns on a single row should be much
> faster
> > > than independent Get(s) on each of those columns for that row. I am
> > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > deleteColumn() calls and I am seeing that there is a
> > > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > > locate the column by doing individual gets on each column you want to
> > > delete (I am doing 300 column deletes). Now, I think this should ideall
> > by
> > > 1 get call with the batch of 300 columns so that one scan can retrieve
> > the
> > > columns and the columns that are found, are indeed deleted.
> > >
> > > Before I try this fix, I wanted to get an opinion if it will make a
> > > difference to batch the get() and it seems from your answer, it should.
> > >
> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > > > Everything is stored as a KeyValue in HBase.
> > > > The Key part of a KeyValue contains the row key, column family,
> column
> > > > name, and timestamp in that order.
> > > > Each column family has it's own store and store files.
> > > >
> > > > So in a nutshell a get is executed by starting a scan at the row key
> > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > forward
> > > > in each store until the next row key is reached. (in reality it is a
> > bit
> > > > more complicated due to multiple versions, skipping columns, etc)
> > > >
> > > >
> > > > -- Lars
> > > > ________________________________
> > > > From: Varun Sharma <va...@pinterest.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > Subject: Re: Get on a row with multiple columns
> > > >
> > > > Sorry, I was a little unclear with my question.
> > > >
> > > > Lets say you have
> > > >
> > > > Get get = new Get(row)
> > > > get.addColumn("1");
> > > > get.addColumn("2");
> > > > .
> > > > .
> > > > .
> > > >
> > > > When internally hbase executes the batch get, it will seek to column
> > "1",
> > > > now since data is lexicographically sorted, it does not need to seek
> > from
> > > > the beginning to get to "2", it can continue seeking, henceforth
> since
> > > > column "2" will always be after column "1". I want to know whether
> this
> > > is
> > > > how a multicolumn get on a row works or not.
> > > >
> > > > Thanks
> > > > Varun
> > > >
> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > > >
> > > > > Like Ishan said, a get give an instance of the Result class.
> > > > > All utility methods that you can use are:
> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > >  byte[] value()
> > > > >  byte[] getRow()
> > > > >  int size()
> > > > >  boolean isEmpty()
> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > >  List<KeyValue> list()
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > >
> > > > >> Based on what I read in Lars' book, a get will return a result a
> > > Result,
> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> > key
> > > > and
> > > > >> you access this array using raw or list methods on the Result
> > object.
> > > > >>
> > > > >>
> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <varun@pinterest.com
> >
> > > > wrote:
> > > > >>
> > > > >>  +user
> > > > >>>
> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> varun@pinterest.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>  Hi,
> > > > >>>>
> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> > sort
> > > > the
> > > > >>>> column qualifers and make use of the sorted order when we get
> the
> > > > >>>>
> > > > >>> results ?
> > > > >>>
> > > > >>>> Thanks
> > > > >>>> Varun
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > > > --
> > > > > Marcos Ortiz Valmaseda,
> > > > > Product Manager && Data Scientist at UCI
> > > > > Blog: http://marcosluis2186.**posterous.com<
> > > > http://marcosluis2186.posterous.com>
> > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > > http://twitter.com/marcosluis2186>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
Can you organize your columns and then delete by column family?

deleteColumn without specifying a TS is expensive, since HBase first has to figure out what the latest TS is.
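
A minimal sketch of why (a TreeMap of timestamped versions standing in for
one column's store; illustrative only, not real HBase code): a version
delete marker must carry the newest timestamp, so the server has to read it
first, whereas a deleteColumns-style marker covers every version and needs
no prior read.

```java
import java.util.TreeMap;

public class DeleteCostSketch {

    // versions: timestamp -> value, as stored for one column.
    // This lookup is the read HBase must do before a deleteColumn
    // (latest-version) marker can be placed.
    public static long latestTimestamp(TreeMap<Long, String> versions) {
        return versions.lastKey();
    }

    // A deleteColumns marker needs no read: it simply covers all
    // timestamps up to "now" (modeled here as Long.MAX_VALUE).
    public static long columnsMarkerTimestamp() {
        return Long.MAX_VALUE;
    }
}
```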

Should be better in 0.94.1 or later since deletes are batched like Puts (still need to retrieve the latest version, though).

In 0.94.3 or later you can also use the BulkDeleteEndpoint, which basically lets you specify a scan condition and then places a specific delete marker for all KVs encountered.


If you wanted to get really 
fancy, you could hook up a coprocessor to the compaction process and 
simply filter all KVs you no longer want (without ever placing any 
delete markers).


Are you saying it takes 15 seconds to place 300 version delete markers?!


-- Lars



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org 
Sent: Friday, February 8, 2013 10:05 PM
Subject: Re: Get on a row with multiple columns
 
We are given a set of 300 columns to delete. I tested two cases:

1) deleteColumns() - with the 's'

This function simply adds delete markers for 300 columns, in our case,
typically only a fraction of these columns are actually present - 10. After
starting to use deleteColumns, we started seeing a drop in cluster-wide
random read performance - 90th percentile latency worsened, so did 99th
probably because of having to traverse delete markers. I attribute this to
profusion of delete markers in the cluster. Major compactions slowed down
by almost 50 percent probably because of having to clean out significantly
more delete markers.

2) deleteColumn()

Ended up with intolerable 15-second calls, which clogged all the handlers.
Making the cluster pretty much unresponsive.

On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:

> For the 300 column deletes, can you show us how the Delete(s) are
> constructed ?
>
> Do you use this method ?
>
>   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> Thanks
>
> On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > So a Get call with multiple columns on a single row should be much faster
> > than independent Get(s) on each of those columns for that row. I am
> > basically seeing severely poor performance (~ 15 seconds) for certain
> > deleteColumn() calls and I am seeing that there is a
> > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > locate the column by doing individual gets on each column you want to
> > delete (I am doing 300 column deletes). Now, I think this should ideall
> by
> > 1 get call with the batch of 300 columns so that one scan can retrieve
> the
> > columns and the columns that are found, are indeed deleted.
> >
> > Before I try this fix, I wanted to get an opinion if it will make a
> > difference to batch the get() and it seems from your answer, it should.
> >
> > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > Everything is stored as a KeyValue in HBase.
> > > The Key part of a KeyValue contains the row key, column family, column
> > > name, and timestamp in that order.
> > > Each column family has it's own store and store files.
> > >
> > > So in a nutshell a get is executed by starting a scan at the row key
> > > (which is a prefix of the key) in each store (CF) and then scanning
> > forward
> > > in each store until the next row key is reached. (in reality it is a
> bit
> > > more complicated due to multiple versions, skipping columns, etc)
> > >
> > >
> > > -- Lars
> > > ________________________________
> > > From: Varun Sharma <va...@pinterest.com>
> > > To: user@hbase.apache.org
> > > Sent: Friday, February 8, 2013 9:22 PM
> > > Subject: Re: Get on a row with multiple columns
> > >
> > > Sorry, I was a little unclear with my question.
> > >
> > > Lets say you have
> > >
> > > Get get = new Get(row)
> > > get.addColumn("1");
> > > get.addColumn("2");
> > > .
> > > .
> > > .
> > >
> > > When internally hbase executes the batch get, it will seek to column
> "1",
> > > now since data is lexicographically sorted, it does not need to seek
> from
> > > the beginning to get to "2", it can continue seeking, henceforth since
> > > column "2" will always be after column "1". I want to know whether this
> > is
> > > how a multicolumn get on a row works or not.
> > >
> > > Thanks
> > > Varun
> > >
> > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > >
> > > > Like Ishan said, a get give an instance of the Result class.
> > > > All utility methods that you can use are:
> > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > >  byte[] value()
> > > >  byte[] getRow()
> > > >  int size()
> > > >  boolean isEmpty()
> > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > >  List<KeyValue> list()
> > > >
> > > >
> > > >
> > > >
> > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > >
> > > >> Based on what I read in Lars' book, a get will return a result a
> > Result,
> > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> key
> > > and
> > > >> you access this array using raw or list methods on the Result
> object.
> > > >>
> > > >>
> > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>  +user
> > > >>>
> > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
> > > >>> wrote:
> > > >>>
> > > >>>  Hi,
> > > >>>>
> > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> sort
> > > the
> > > >>>> column qualifers and make use of the sorted order when we get the
> > > >>>>
> > > >>> results ?
> > > >>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > > > --
> > > > Marcos Ortiz Valmaseda,
> > > > Product Manager && Data Scientist at UCI
> > > > Blog: http://marcosluis2186.**posterous.com<
> > > http://marcosluis2186.posterous.com>
> > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > http://twitter.com/marcosluis2186>
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
You could use the KeyOnly filter to only retrieve the key part of the KVs.
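
A toy illustration of the saving (plain collections, not the HBase filter
itself, and the byte accounting is simplified): a key-only projection drops
the values before returning cells, so a wide row with large values costs
only its keys on the wire.

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyOnlySketch {

    // Approximate bytes shipped back for one row, with and without a
    // key-only projection. Values are skipped entirely when keyOnly is set.
    public static long bytesReturned(TreeMap<String, byte[]> row,
                                     boolean keyOnly) {
        long total = 0;
        for (Map.Entry<String, byte[]> cell : row.entrySet()) {
            total += cell.getKey().length();      // key is always shipped
            if (!keyOnly) {
                total += cell.getValue().length;  // value dropped if keyOnly
            }
        }
        return total;
    }
}
```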



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org 
Sent: Friday, February 8, 2013 10:16 PM
Subject: Re: Get on a row with multiple columns
 
Using hbase 0.94.3. Tried that too, ran into performance issues with having
to retrieve the entire row first (this was getting slow when one particular
row is hammered) since the row can be big (a few megs, sometimes 10s of megs)
and then finding the columns and then doing a delete.

To me, it looks like the current implementation of deleteColumn is
suboptimal because of the 300 gets vs doing 1.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:09 PM, Ted Yu <yu...@gmail.com> wrote:

> Which HBase version are you using ?
>
> Is there a way to place 10 delete markers from application side instead of
> 300 ?
>
> Thanks
>
> On Fri, Feb 8, 2013 at 10:05 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > We are given a set of 300 columns to delete. I tested two cases:
> >
> > 1) deleteColumns() - with the 's'
> >
> > This function simply adds delete markers for 300 columns, in our case,
> > typically only a fraction of these columns are actually present - 10.
> After
> > starting to use deleteColumns, we starting seeing a drop in cluster wide
> > random read performance - 90th percentile latency worsened, so did 99th
> > probably because of having to traverse delete markers. I attribute this
> to
> > profusion of delete markers in the cluster. Major compactions slowed down
> > by almost 50 percent probably because of having to clean out
> significantly
> > more delete markers.
> >
> > 2) deleteColumn()
> >
> > Ended up with untolerable 15 second calls, which clogged all the
> handlers.
> > Making the cluster pretty much unresponsive.
> >
> > On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > For the 300 column deletes, can you show us how the Delete(s) are
> > > constructed ?
> > >
> > > Do you use this method ?
> > >
> > >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > > Thanks
> > >
> > > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >
> > > > So a Get call with multiple columns on a single row should be much
> > faster
> > > > than independent Get(s) on each of those columns for that row. I am
> > > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > > deleteColumn() calls and I am seeing that there is a
> > > > prepareDeleteTimestamps() function in HRegion.java which first tries
> to
> > > > locate the column by doing individual gets on each column you want to
> > > > delete (I am doing 300 column deletes). Now, I think this should
> ideall
> > > by
> > > > 1 get call with the batch of 300 columns so that one scan can
> retrieve
> > > the
> > > > columns and the columns that are found, are indeed deleted.
> > > >
> > > > Before I try this fix, I wanted to get an opinion if it will make a
> > > > difference to batch the get() and it seems from your answer, it
> should.
> > > >
> > > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > > >
> > > > > Everything is stored as a KeyValue in HBase.
> > > > > The Key part of a KeyValue contains the row key, column family,
> > column
> > > > > name, and timestamp in that order.
> > > > > Each column family has it's own store and store files.
> > > > >
> > > > > So in a nutshell a get is executed by starting a scan at the row
> key
> > > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > > forward
> > > > > in each store until the next row key is reached. (in reality it is
> a
> > > bit
> > > > > more complicated due to multiple versions, skipping columns, etc)
> > > > >
> > > > >
> > > > > -- Lars
> > > > > ________________________________
> > > > > From: Varun Sharma <va...@pinterest.com>
> > > > > To: user@hbase.apache.org
> > > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > > Subject: Re: Get on a row with multiple columns
> > > > >
> > > > > Sorry, I was a little unclear with my question.
> > > > >
> > > > > Lets say you have
> > > > >
> > > > > Get get = new Get(row)
> > > > > get.addColumn("1");
> > > > > get.addColumn("2");
> > > > > .
> > > > > .
> > > > > .
> > > > >
> > > > > When internally hbase executes the batch get, it will seek to
> column
> > > "1",
> > > > > now since data is lexicographically sorted, it does not need to
> seek
> > > from
> > > > > the beginning to get to "2", it can continue seeking, henceforth
> > since
> > > > > column "2" will always be after column "1". I want to know whether
> > this
> > > > is
> > > > > how a multicolumn get on a row works or not.
> > > > >
> > > > > Thanks
> > > > > Varun
> > > > >
> > > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
> wrote:
> > > > >
> > > > > > Like Ishan said, a get give an instance of the Result class.
> > > > > > All utility methods that you can use are:
> > > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > > >  byte[] value()
> > > > > >  byte[] getRow()
> > > > > >  int size()
> > > > > >  boolean isEmpty()
> > > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > > >  List<KeyValue> list()
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > > >
> > > > > >> Based on what I read in Lars' book, a get will return a result a
> > > > Result,
> > > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by
> the
> > > key
> > > > > and
> > > > > >> you access this array using raw or list methods on the Result
> > > object.
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> varun@pinterest.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >>  +user
> > > > > >>>
> > > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> > varun@pinterest.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>  Hi,
> > > > > >>>>
> > > > > >>>> When I do a Get on a row with multiple column qualifiers. Do
> we
> > > sort
> > > > > the
> > > > > >>>> column qualifers and make use of the sorted order when we get
> > the
> > > > > >>>>
> > > > > >>> results ?
> > > > > >>>
> > > > > >>>> Thanks
> > > > > >>>> Varun
> > > > > >>>>
> > > > > >>>>
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > > Marcos Ortiz Valmaseda,
> > > > > > Product Manager && Data Scientist at UCI
> > > > > > Blog: http://marcosluis2186.**posterous.com<
> > > > > http://marcosluis2186.posterous.com>
> > > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > > > http://twitter.com/marcosluis2186>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by Ted <yu...@gmail.com>.
How often do you need to perform such a delete operation?

Is there a way to use TTL so that you can avoid deletions?

Pardon me for not knowing your use case very well. 

On Feb 8, 2013, at 10:16 PM, Varun Sharma <va...@pinterest.com> wrote:

> Using hbase 0.94.3. Tried that too, ran into performance issues with having
> to retrieve the entire row first (this was getting slow when one particular
> row is hammered) since row can be big (few megs, some times 10s of megs)
> and then finding the columns and then doing a delete.
> 
> To me, it looks like the current implementation of deleteColumn is
> suboptimal because of the 300 gets vs doing 1.
> 
> Thanks
> Varun
> 
> On Fri, Feb 8, 2013 at 10:09 PM, Ted Yu <yu...@gmail.com> wrote:
> 
>> Which HBase version are you using ?
>> 
>> Is there a way to place 10 delete markers from application side instead of
>> 300 ?
>> 
>> Thanks
>> 
>> On Fri, Feb 8, 2013 at 10:05 PM, Varun Sharma <va...@pinterest.com> wrote:
>> 
>>> We are given a set of 300 columns to delete. I tested two cases:
>>> 
>>> 1) deleteColumns() - with the 's'
>>> 
>>> This function simply adds delete markers for 300 columns, in our case,
>>> typically only a fraction of these columns are actually present - 10.
>> After
>>> starting to use deleteColumns, we starting seeing a drop in cluster wide
>>> random read performance - 90th percentile latency worsened, so did 99th
>>> probably because of having to traverse delete markers. I attribute this
>> to
>>> profusion of delete markers in the cluster. Major compactions slowed down
>>> by almost 50 percent probably because of having to clean out
>> significantly
>>> more delete markers.
>>> 
>>> 2) deleteColumn()
>>> 
>>> Ended up with untolerable 15 second calls, which clogged all the
>> handlers.
>>> Making the cluster pretty much unresponsive.
>>> 
>>> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>>> 
>>>> For the 300 column deletes, can you show us how the Delete(s) are
>>>> constructed ?
>>>> 
>>>> Do you use this method ?
>>>> 
>>>>  public Delete deleteColumns(byte [] family, byte [] qualifier) {
>>>> Thanks
>>>> 
>>>> On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>> 
>>>>> So a Get call with multiple columns on a single row should be much
>>> faster
>>>>> than independent Get(s) on each of those columns for that row. I am
>>>>> basically seeing severely poor performance (~ 15 seconds) for certain
>>>>> deleteColumn() calls and I am seeing that there is a
>>>>> prepareDeleteTimestamps() function in HRegion.java which first tries
>> to
>>>>> locate the column by doing individual gets on each column you want to
>>>>> delete (I am doing 300 column deletes). Now, I think this should
>> ideall
>>>> by
>>>>> 1 get call with the batch of 300 columns so that one scan can
>> retrieve
>>>> the
>>>>> columns and the columns that are found, are indeed deleted.
>>>>> 
>>>>> Before I try this fix, I wanted to get an opinion if it will make a
>>>>> difference to batch the get() and it seems from your answer, it
>> should.
>>>>> 
>>>>> On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
>>> wrote:
>>>>> 
>>>>>> Everything is stored as a KeyValue in HBase.
>>>>>> The Key part of a KeyValue contains the row key, column family,
>>> column
>>>>>> name, and timestamp in that order.
>>>>>> Each column family has it's own store and store files.
>>>>>> 
>>>>>> So in a nutshell a get is executed by starting a scan at the row
>> key
>>>>>> (which is a prefix of the key) in each store (CF) and then scanning
>>>>> forward
>>>>>> in each store until the next row key is reached. (in reality it is
>> a
>>>> bit
>>>>>> more complicated due to multiple versions, skipping columns, etc)
>>>>>> 
>>>>>> 
>>>>>> -- Lars
>>>>>> ________________________________
>>>>>> From: Varun Sharma <va...@pinterest.com>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Friday, February 8, 2013 9:22 PM
>>>>>> Subject: Re: Get on a row with multiple columns
>>>>>> 
>>>>>> Sorry, I was a little unclear with my question.
>>>>>> 
>>>>>> Lets say you have
>>>>>> 
>>>>>> Get get = new Get(row)
>>>>>> get.addColumn("1");
>>>>>> get.addColumn("2");
>>>>>> .
>>>>>> .
>>>>>> .
>>>>>> 
>>>>>> When internally hbase executes the batch get, it will seek to
>> column
>>>> "1",
>>>>>> now since data is lexicographically sorted, it does not need to
>> seek
>>>> from
>>>>>> the beginning to get to "2", it can continue seeking, henceforth
>>> since
>>>>>> column "2" will always be after column "1". I want to know whether
>>> this
>>>>> is
>>>>>> how a multicolumn get on a row works or not.
>>>>>> 
>>>>>> Thanks
>>>>>> Varun
>>>>>> 
>>>>>> On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
>> wrote:
>>>>>> 
>>>>>>> Like Ishan said, a get give an instance of the Result class.
>>>>>>> All utility methods that you can use are:
>>>>>>> byte[] getValue(byte[] family, byte[] qualifier)
>>>>>>> byte[] value()
>>>>>>> byte[] getRow()
>>>>>>> int size()
>>>>>>> boolean isEmpty()
>>>>>>> KeyValue[] raw() # Like Ishan said, all data here is sorted
>>>>>>> List<KeyValue> list()
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>>>>>>> 
>>>>>>>> Based on what I read in Lars' book, a get will return a result a
>>>>> Result,
>>>>>>>> which is internally a KeyValue[]. This KeyValue[] is sorted by
>> the
>>>> key
>>>>>> and
>>>>>>>> you access this array using raw or list methods on the Result
>>>> object.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
>> varun@pinterest.com
>>>> 
>>>>>> wrote:
>>>>>>>> 
>>>>>>>> +user
>>>>>>>>> 
>>>>>>>>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
>>> varun@pinterest.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> When I do a Get on a row with multiple column qualifiers. Do
>> we
>>>> sort
>>>>>> the
>>>>>>>>>> column qualifers and make use of the sorted order when we get
>>> the
>>>>>>>>> results ?
>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Varun
>>>>>>> --
>>>>>>> Marcos Ortiz Valmaseda,
>>>>>>> Product Manager && Data Scientist at UCI
>>>>>>> Blog: http://marcosluis2186.**posterous.com<
>>>>>> http://marcosluis2186.posterous.com>
>>>>>>> Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
>>>>>> http://twitter.com/marcosluis2186>
>> 

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
Using HBase 0.94.3. Tried that too; it ran into performance issues because it
has to retrieve the entire row first (this got slow when one particular row
was hammered), since a row can be big (a few megabytes, sometimes tens of
megabytes), and then find the columns and then do the delete.

To me, it looks like the current implementation of deleteColumn is
suboptimal because it issues 300 individual gets instead of 1 batched get.

Thanks
Varun

On Fri, Feb 8, 2013 at 10:09 PM, Ted Yu <yu...@gmail.com> wrote:

> Which HBase version are you using ?
>
> Is there a way to place 10 delete markers from application side instead of
> 300 ?
>
> Thanks
>
> On Fri, Feb 8, 2013 at 10:05 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > We are given a set of 300 columns to delete. I tested two cases:
> >
> > 1) deleteColumns() - with the 's'
> >
> > This function simply adds delete markers for 300 columns, in our case,
> > typically only a fraction of these columns are actually present - 10.
> After
> > starting to use deleteColumns, we starting seeing a drop in cluster wide
> > random read performance - 90th percentile latency worsened, so did 99th
> > probably because of having to traverse delete markers. I attribute this
> to
> > profusion of delete markers in the cluster. Major compactions slowed down
> > by almost 50 percent probably because of having to clean out
> significantly
> > more delete markers.
> >
> > 2) deleteColumn()
> >
> > Ended up with untolerable 15 second calls, which clogged all the
> handlers.
> > Making the cluster pretty much unresponsive.
> >
> > On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > For the 300 column deletes, can you show us how the Delete(s) are
> > > constructed ?
> > >
> > > Do you use this method ?
> > >
> > >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > > Thanks
> > >
> > > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >
> > > > So a Get call with multiple columns on a single row should be much
> > faster
> > > > than independent Get(s) on each of those columns for that row. I am
> > > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > > deleteColumn() calls and I am seeing that there is a
> > > > prepareDeleteTimestamps() function in HRegion.java which first tries
> to
> > > > locate the column by doing individual gets on each column you want to
> > > > delete (I am doing 300 column deletes). Now, I think this should
> ideall
> > > by
> > > > 1 get call with the batch of 300 columns so that one scan can
> retrieve
> > > the
> > > > columns and the columns that are found, are indeed deleted.
> > > >
> > > > Before I try this fix, I wanted to get an opinion if it will make a
> > > > difference to batch the get() and it seems from your answer, it
> should.
> > > >
> > > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> > wrote:
> > > >
> > > > > Everything is stored as a KeyValue in HBase.
> > > > > The Key part of a KeyValue contains the row key, column family,
> > column
> > > > > name, and timestamp in that order.
> > > > > Each column family has it's own store and store files.
> > > > >
> > > > > So in a nutshell a get is executed by starting a scan at the row
> key
> > > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > > forward
> > > > > in each store until the next row key is reached. (in reality it is
> a
> > > bit
> > > > > more complicated due to multiple versions, skipping columns, etc)
> > > > >
> > > > >
> > > > > -- Lars
> > > > > ________________________________
> > > > > From: Varun Sharma <va...@pinterest.com>
> > > > > To: user@hbase.apache.org
> > > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > > Subject: Re: Get on a row with multiple columns
> > > > >
> > > > > Sorry, I was a little unclear with my question.
> > > > >
> > > > > Lets say you have
> > > > >
> > > > > Get get = new Get(row)
> > > > > get.addColumn("1");
> > > > > get.addColumn("2");
> > > > > .
> > > > > .
> > > > > .
> > > > >
> > > > > When internally hbase executes the batch get, it will seek to
> column
> > > "1",
> > > > > now since data is lexicographically sorted, it does not need to
> seek
> > > from
> > > > > the beginning to get to "2", it can continue seeking, henceforth
> > since
> > > > > column "2" will always be after column "1". I want to know whether
> > this
> > > > is
> > > > > how a multicolumn get on a row works or not.
> > > > >
> > > > > Thanks
> > > > > Varun
> > > > >
> > > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu>
> wrote:
> > > > >
> > > > > > Like Ishan said, a get give an instance of the Result class.
> > > > > > All utility methods that you can use are:
> > > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > > >  byte[] value()
> > > > > >  byte[] getRow()
> > > > > >  int size()
> > > > > >  boolean isEmpty()
> > > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > > >  List<KeyValue> list()
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > > >
> > > > > >> Based on what I read in Lars' book, a get will return a result a
> > > > Result,
> > > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by
> the
> > > key
> > > > > and
> > > > > >> you access this array using raw or list methods on the Result
> > > object.
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <
> varun@pinterest.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >>  +user
> > > > > >>>
> > > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> > varun@pinterest.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>  Hi,
> > > > > >>>>
> > > > > >>>> When I do a Get on a row with multiple column qualifiers. Do
> we
> > > sort
> > > > > the
> > > > > >>>> column qualifers and make use of the sorted order when we get
> > the
> > > > > >>>>
> > > > > >>> results ?
> > > > > >>>
> > > > > >>>> Thanks
> > > > > >>>> Varun
> > > > > >>>>
> > > > > >>>>
> > > > > >>
> > > > > >>
> > > > > > --
> > > > > > Marcos Ortiz Valmaseda,
> > > > > > Product Manager && Data Scientist at UCI
> > > > > > Blog: http://marcosluis2186.**posterous.com<
> > > > > http://marcosluis2186.posterous.com>
> > > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > > > http://twitter.com/marcosluis2186>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by Ted Yu <yu...@gmail.com>.
Which HBase version are you using ?

Is there a way to place 10 delete markers from application side instead of
300 ?

Thanks

On Fri, Feb 8, 2013 at 10:05 PM, Varun Sharma <va...@pinterest.com> wrote:

> We are given a set of 300 columns to delete. I tested two cases:
>
> 1) deleteColumns() - with the 's'
>
> This function simply adds delete markers for 300 columns, in our case,
> typically only a fraction of these columns are actually present - 10. After
> starting to use deleteColumns, we starting seeing a drop in cluster wide
> random read performance - 90th percentile latency worsened, so did 99th
> probably because of having to traverse delete markers. I attribute this to
> profusion of delete markers in the cluster. Major compactions slowed down
> by almost 50 percent probably because of having to clean out significantly
> more delete markers.
>
> 2) deleteColumn()
>
> Ended up with untolerable 15 second calls, which clogged all the handlers.
> Making the cluster pretty much unresponsive.
>
> On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > For the 300 column deletes, can you show us how the Delete(s) are
> > constructed ?
> >
> > Do you use this method ?
> >
> >   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> > Thanks
> >
> > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >
> > > So a Get call with multiple columns on a single row should be much
> faster
> > > than independent Get(s) on each of those columns for that row. I am
> > > basically seeing severely poor performance (~ 15 seconds) for certain
> > > deleteColumn() calls and I am seeing that there is a
> > > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > > locate the column by doing individual gets on each column you want to
> > > delete (I am doing 300 column deletes). Now, I think this should ideall
> > by
> > > 1 get call with the batch of 300 columns so that one scan can retrieve
> > the
> > > columns and the columns that are found, are indeed deleted.
> > >
> > > Before I try this fix, I wanted to get an opinion if it will make a
> > > difference to batch the get() and it seems from your answer, it should.
> > >
> > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org>
> wrote:
> > >
> > > > Everything is stored as a KeyValue in HBase.
> > > > The Key part of a KeyValue contains the row key, column family,
> column
> > > > name, and timestamp in that order.
> > > > Each column family has it's own store and store files.
> > > >
> > > > So in a nutshell a get is executed by starting a scan at the row key
> > > > (which is a prefix of the key) in each store (CF) and then scanning
> > > forward
> > > > in each store until the next row key is reached. (in reality it is a
> > bit
> > > > more complicated due to multiple versions, skipping columns, etc)
> > > >
> > > >
> > > > -- Lars
> > > > ________________________________
> > > > From: Varun Sharma <va...@pinterest.com>
> > > > To: user@hbase.apache.org
> > > > Sent: Friday, February 8, 2013 9:22 PM
> > > > Subject: Re: Get on a row with multiple columns
> > > >
> > > > Sorry, I was a little unclear with my question.
> > > >
> > > > Lets say you have
> > > >
> > > > Get get = new Get(row)
> > > > get.addColumn("1");
> > > > get.addColumn("2");
> > > > .
> > > > .
> > > > .
> > > >
> > > > When internally hbase executes the batch get, it will seek to column
> > "1",
> > > > now since data is lexicographically sorted, it does not need to seek
> > from
> > > > the beginning to get to "2", it can continue seeking, henceforth
> since
> > > > column "2" will always be after column "1". I want to know whether
> this
> > > is
> > > > how a multicolumn get on a row works or not.
> > > >
> > > > Thanks
> > > > Varun
> > > >
> > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > > >
> > > > > Like Ishan said, a get give an instance of the Result class.
> > > > > All utility methods that you can use are:
> > > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > > >  byte[] value()
> > > > >  byte[] getRow()
> > > > >  int size()
> > > > >  boolean isEmpty()
> > > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > > >  List<KeyValue> list()
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > > >
> > > > >> Based on what I read in Lars' book, a get will return a result a
> > > Result,
> > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> > key
> > > > and
> > > > >> you access this array using raw or list methods on the Result
> > object.
> > > > >>
> > > > >>
> > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <varun@pinterest.com
> >
> > > > wrote:
> > > > >>
> > > > >>  +user
> > > > >>>
> > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <
> varun@pinterest.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>  Hi,
> > > > >>>>
> > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> > sort
> > > > the
> > > > >>>> column qualifers and make use of the sorted order when we get
> the
> > > > >>>>
> > > > >>> results ?
> > > > >>>
> > > > >>>> Thanks
> > > > >>>> Varun
> > > > >>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > > > --
> > > > > Marcos Ortiz Valmaseda,
> > > > > Product Manager && Data Scientist at UCI
> > > > > Blog: http://marcosluis2186.**posterous.com<
> > > > http://marcosluis2186.posterous.com>
> > > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > > http://twitter.com/marcosluis2186>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
We are given a set of 300 columns to delete. I tested two cases:

1) deleteColumns() - with the 's'

This function simply adds delete markers for all 300 columns; in our case,
typically only a fraction of these columns - about 10 - are actually present.
After starting to use deleteColumns, we started seeing a drop in cluster-wide
random read performance: 90th percentile latency worsened, and so did the
99th, probably because reads have to traverse the delete markers. I attribute
this to the profusion of delete markers in the cluster. Major compactions
slowed down by almost 50 percent, probably because they have to clean out
significantly more delete markers.

2) deleteColumn()

Ended up with intolerable 15-second calls, which clogged all the handlers,
making the cluster pretty much unresponsive.
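
The trade-off described above can be modeled in a few lines. This is a toy
model in plain Java, not the HBase API: deleteColumns()-style deletion
writes one marker per *requested* qualifier without reading anything, while
deleteColumn()-style deletion does one internal point get per qualifier and
only marks versions that actually exist. The qualifier names and counts are
illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DeleteMarkerModel {
    // deleteColumns()-style: a marker per requested qualifier, no reads.
    static int blindMarkers(List<String> toDelete) {
        return toDelete.size();
    }

    // deleteColumn()-style: an internal point get per requested qualifier,
    // and a marker only for qualifiers that actually exist in the store.
    static int preciseMarkers(Set<String> store, List<String> toDelete) {
        int markers = 0;
        for (String q : toDelete) {
            if (store.contains(q)) markers++;   // one read happened either way
        }
        return markers;
    }

    public static void main(String[] args) {
        Set<String> store = new TreeSet<>();            // 10 columns present
        for (int i = 0; i < 10; i++) store.add(String.format("q%03d", i));
        List<String> toDelete = new ArrayList<>();      // 300 columns requested
        for (int i = 0; i < 300; i++) toDelete.add(String.format("q%03d", i));

        System.out.println(blindMarkers(toDelete));          // 300 markers, 0 reads
        System.out.println(preciseMarkers(store, toDelete)); // 10 markers, 300 reads
    }
}
```

The 300-versus-10 marker count is why the first approach hurts read latency
and compactions, while the 300 internal reads are why the second approach
clogs the handlers.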

On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <yu...@gmail.com> wrote:

> For the 300 column deletes, can you show us how the Delete(s) are
> constructed ?
>
> Do you use this method ?
>
>   public Delete deleteColumns(byte [] family, byte [] qualifier) {
> Thanks
>
> On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > So a Get call with multiple columns on a single row should be much faster
> > than independent Get(s) on each of those columns for that row. I am
> > basically seeing severely poor performance (~ 15 seconds) for certain
> > deleteColumn() calls and I am seeing that there is a
> > prepareDeleteTimestamps() function in HRegion.java which first tries to
> > locate the column by doing individual gets on each column you want to
> > delete (I am doing 300 column deletes). Now, I think this should ideall
> by
> > 1 get call with the batch of 300 columns so that one scan can retrieve
> the
> > columns and the columns that are found, are indeed deleted.
> >
> > Before I try this fix, I wanted to get an opinion if it will make a
> > difference to batch the get() and it seems from your answer, it should.
> >
> > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > > Everything is stored as a KeyValue in HBase.
> > > The Key part of a KeyValue contains the row key, column family, column
> > > name, and timestamp in that order.
> > > Each column family has it's own store and store files.
> > >
> > > So in a nutshell a get is executed by starting a scan at the row key
> > > (which is a prefix of the key) in each store (CF) and then scanning
> > forward
> > > in each store until the next row key is reached. (in reality it is a
> bit
> > > more complicated due to multiple versions, skipping columns, etc)
> > >
> > >
> > > -- Lars
> > > ________________________________
> > > From: Varun Sharma <va...@pinterest.com>
> > > To: user@hbase.apache.org
> > > Sent: Friday, February 8, 2013 9:22 PM
> > > Subject: Re: Get on a row with multiple columns
> > >
> > > Sorry, I was a little unclear with my question.
> > >
> > > Lets say you have
> > >
> > > Get get = new Get(row)
> > > get.addColumn("1");
> > > get.addColumn("2");
> > > .
> > > .
> > > .
> > >
> > > When internally hbase executes the batch get, it will seek to column
> "1",
> > > now since data is lexicographically sorted, it does not need to seek
> from
> > > the beginning to get to "2", it can continue seeking, henceforth since
> > > column "2" will always be after column "1". I want to know whether this
> > is
> > > how a multicolumn get on a row works or not.
> > >
> > > Thanks
> > > Varun
> > >
> > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> > >
> > > > Like Ishan said, a get give an instance of the Result class.
> > > > All utility methods that you can use are:
> > > >  byte[] getValue(byte[] family, byte[] qualifier)
> > > >  byte[] value()
> > > >  byte[] getRow()
> > > >  int size()
> > > >  boolean isEmpty()
> > > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > > >  List<KeyValue> list()
> > > >
> > > >
> > > >
> > > >
> > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > > >
> > > >> Based on what I read in Lars' book, a get will return a result a
> > Result,
> > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the
> key
> > > and
> > > >> you access this array using raw or list methods on the Result
> object.
> > > >>
> > > >>
> > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > > >>
> > > >>  +user
> > > >>>
> > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
> > > >>> wrote:
> > > >>>
> > > >>>  Hi,
> > > >>>>
> > > >>>> When I do a Get on a row with multiple column qualifiers. Do we
> sort
> > > the
> > > >>>> column qualifers and make use of the sorted order when we get the
> > > >>>>
> > > >>> results ?
> > > >>>
> > > >>>> Thanks
> > > >>>> Varun
> > > >>>>
> > > >>>>
> > > >>
> > > >>
> > > > --
> > > > Marcos Ortiz Valmaseda,
> > > > Product Manager && Data Scientist at UCI
> > > > Blog: http://marcosluis2186.**posterous.com<
> > > http://marcosluis2186.posterous.com>
> > > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > > http://twitter.com/marcosluis2186>
> > > > >
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by Ted Yu <yu...@gmail.com>.
For the 300 column deletes, can you show us how the Delete(s) are
constructed ?

Do you use this method ?

  public Delete deleteColumns(byte [] family, byte [] qualifier) {
Thanks

On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <va...@pinterest.com> wrote:

> So a Get call with multiple columns on a single row should be much faster
> than independent Get(s) on each of those columns for that row. I am
> basically seeing severely poor performance (~ 15 seconds) for certain
> deleteColumn() calls and I am seeing that there is a
> prepareDeleteTimestamps() function in HRegion.java which first tries to
> locate the column by doing individual gets on each column you want to
> delete (I am doing 300 column deletes). Now, I think this should ideall by
> 1 get call with the batch of 300 columns so that one scan can retrieve the
> columns and the columns that are found, are indeed deleted.
>
> Before I try this fix, I wanted to get an opinion if it will make a
> difference to batch the get() and it seems from your answer, it should.
>
> On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Everything is stored as a KeyValue in HBase.
> > The Key part of a KeyValue contains the row key, column family, column
> > name, and timestamp in that order.
> > Each column family has it's own store and store files.
> >
> > So in a nutshell a get is executed by starting a scan at the row key
> > (which is a prefix of the key) in each store (CF) and then scanning
> forward
> > in each store until the next row key is reached. (in reality it is a bit
> > more complicated due to multiple versions, skipping columns, etc)
> >
> >
> > -- Lars
> > ________________________________
> > From: Varun Sharma <va...@pinterest.com>
> > To: user@hbase.apache.org
> > Sent: Friday, February 8, 2013 9:22 PM
> > Subject: Re: Get on a row with multiple columns
> >
> > Sorry, I was a little unclear with my question.
> >
> > Lets say you have
> >
> > Get get = new Get(row)
> > get.addColumn("1");
> > get.addColumn("2");
> > .
> > .
> > .
> >
> > When internally hbase executes the batch get, it will seek to column "1",
> > now since data is lexicographically sorted, it does not need to seek from
> > the beginning to get to "2", it can continue seeking, henceforth since
> > column "2" will always be after column "1". I want to know whether this
> is
> > how a multicolumn get on a row works or not.
> >
> > Thanks
> > Varun
> >
> > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
> >
> > > Like Ishan said, a get give an instance of the Result class.
> > > All utility methods that you can use are:
> > >  byte[] getValue(byte[] family, byte[] qualifier)
> > >  byte[] value()
> > >  byte[] getRow()
> > >  int size()
> > >  boolean isEmpty()
> > >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> > >  List<KeyValue> list()
> > >
> > >
> > >
> > >
> > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> > >
> > >> Based on what I read in Lars' book, a get will return a result a
> Result,
> > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the key
> > and
> > >> you access this array using raw or list methods on the Result object.
> > >>
> > >>
> > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com>
> > wrote:
> > >>
> > >>  +user
> > >>>
> > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
> > >>> wrote:
> > >>>
> > >>>  Hi,
> > >>>>
> > >>>> When I do a Get on a row with multiple column qualifiers. Do we sort
> > the
> > >>>> column qualifers and make use of the sorted order when we get the
> > >>>>
> > >>> results ?
> > >>>
> > >>>> Thanks
> > >>>> Varun
> > >>>>
> > >>>>
> > >>
> > >>
> > > --
> > > Marcos Ortiz Valmaseda,
> > > Product Manager && Data Scientist at UCI
> > > Blog: http://marcosluis2186.**posterous.com<
> > http://marcosluis2186.posterous.com>
> > > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> > http://twitter.com/marcosluis2186>
> > > >
> > >
> >
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
So a Get call with multiple columns on a single row should be much faster
than independent Get(s) on each of those columns for that row. I am
basically seeing severely poor performance (~ 15 seconds) for certain
deleteColumn() calls and I am seeing that there is a
prepareDeleteTimestamps() function in HRegion.java which first tries to
locate the column by doing individual gets on each column you want to
delete (I am doing 300 column deletes). Now, I think this should ideally be
1 get call with the batch of 300 columns, so that one scan can retrieve the
columns, and the columns that are found are then deleted.

Before I try this fix, I wanted to get an opinion if it will make a
difference to batch the get() and it seems from your answer, it should.

On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <la...@apache.org> wrote:

> Everything is stored as a KeyValue in HBase.
> The Key part of a KeyValue contains the row key, column family, column
> name, and timestamp in that order.
> Each column family has it's own store and store files.
>
> So in a nutshell a get is executed by starting a scan at the row key
> (which is a prefix of the key) in each store (CF) and then scanning forward
> in each store until the next row key is reached. (in reality it is a bit
> more complicated due to multiple versions, skipping columns, etc)
>
>
> -- Lars
> ________________________________
> From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Friday, February 8, 2013 9:22 PM
> Subject: Re: Get on a row with multiple columns
>
> Sorry, I was a little unclear with my question.
>
> Lets say you have
>
> Get get = new Get(row)
> get.addColumn("1");
> get.addColumn("2");
> .
> .
> .
>
> When internally hbase executes the batch get, it will seek to column "1",
> now since data is lexicographically sorted, it does not need to seek from
> the beginning to get to "2", it can continue seeking, henceforth since
> column "2" will always be after column "1". I want to know whether this is
> how a multicolumn get on a row works or not.
>
> Thanks
> Varun
>
> On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:
>
> > Like Ishan said, a get give an instance of the Result class.
> > All utility methods that you can use are:
> >  byte[] getValue(byte[] family, byte[] qualifier)
> >  byte[] value()
> >  byte[] getRow()
> >  int size()
> >  boolean isEmpty()
> >  KeyValue[] raw() # Like Ishan said, all data here is sorted
> >  List<KeyValue> list()
> >
> >
> >
> >
> > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> >
> >> Based on what I read in Lars' book, a get will return a result a Result,
> >> which is internally a KeyValue[]. This KeyValue[] is sorted by the key
> and
> >> you access this array using raw or list methods on the Result object.
> >>
> >>
> >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com>
> wrote:
> >>
> >>  +user
> >>>
> >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
> >>> wrote:
> >>>
> >>>  Hi,
> >>>>
> >>>> When I do a Get on a row with multiple column qualifiers. Do we sort
> the
> >>>> column qualifers and make use of the sorted order when we get the
> >>>>
> >>> results ?
> >>>
> >>>> Thanks
> >>>> Varun
> >>>>
> >>>>
> >>
> >>
> > --
> > Marcos Ortiz Valmaseda,
> > Product Manager && Data Scientist at UCI
> > Blog: http://marcosluis2186.**posterous.com<
> http://marcosluis2186.posterous.com>
> > Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<
> http://twitter.com/marcosluis2186>
> > >
> >
>

Re: Get on a row with multiple columns

Posted by lars hofhansl <la...@apache.org>.
Everything is stored as a KeyValue in HBase.
The Key part of a KeyValue contains the row key, column family, column name, and timestamp in that order.
Each column family has its own store and store files.

So in a nutshell a get is executed by starting a scan at the row key (which is a prefix of the key) in each store (CF) and then scanning forward in each store until the next row key is reached. (in reality it is a bit more complicated due to multiple versions, skipping columns, etc)
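
A minimal, self-contained sketch of that ordering, using textual keys in
plain Java rather than HBase's actual binary KeyValue layout (the `/`
separator and the key strings are illustrative only): because all cells of
one row sort together, a get is just a bounded forward scan starting at the
row-key prefix.

```java
import java.util.Arrays;

public class KeyOrderDemo {
    // Simplified textual key: row + "/" + family + "/" + qualifier. The
    // real key is binary (row, family, qualifier, timestamp), but the
    // consequence is the same: every cell of a row is contiguous on disk.
    static String[] sorted(String[] keys) {
        String[] copy = keys.clone();
        Arrays.sort(copy);   // lexicographic, analogous to HBase's comparator
        return copy;
    }

    public static void main(String[] args) {
        String[] keys = {"row2/cf/a", "row1/cf/b", "row1/cf/a", "row3/cf/a"};
        System.out.println(Arrays.toString(sorted(keys)));
        // → [row1/cf/a, row1/cf/b, row2/cf/a, row3/cf/a]
    }
}
```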


-- Lars
________________________________
From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org 
Sent: Friday, February 8, 2013 9:22 PM
Subject: Re: Get on a row with multiple columns

Sorry, I was a little unclear with my question.

Lets say you have

Get get = new Get(row)
get.addColumn("1");
get.addColumn("2");
.
.
.

When internally hbase executes the batch get, it will seek to column "1",
now since data is lexicographically sorted, it does not need to seek from
the beginning to get to "2", it can continue seeking, henceforth since
column "2" will always be after column "1". I want to know whether this is
how a multicolumn get on a row works or not.

Thanks
Varun

On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:

> Like Ishan said, a get give an instance of the Result class.
> All utility methods that you can use are:
>  byte[] getValue(byte[] family, byte[] qualifier)
>  byte[] value()
>  byte[] getRow()
>  int size()
>  boolean isEmpty()
>  KeyValue[] raw() # Like Ishan said, all data here is sorted
>  List<KeyValue> list()
>
>
>
>
> On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>
>> Based on what I read in Lars' book, a get will return a result a Result,
>> which is internally a KeyValue[]. This KeyValue[] is sorted by the key and
>> you access this array using raw or list methods on the Result object.
>>
>>
>> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>  +user
>>>
>>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>  Hi,
>>>>
>>>> When I do a Get on a row with multiple column qualifiers. Do we sort the
>>>> column qualifers and make use of the sorted order when we get the
>>>>
>>> results ?
>>>
>>>> Thanks
>>>> Varun
>>>>
>>>>
>>
>>
> --
> Marcos Ortiz Valmaseda,
> Product Manager && Data Scientist at UCI
> Blog: http://marcosluis2186.**posterous.com<http://marcosluis2186.posterous.com>
> Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<http://twitter.com/marcosluis2186>
> >
>

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
Sorry, I was a little unclear with my question.

Lets say you have

Get get = new Get(row);
get.addColumn(family, Bytes.toBytes("1"));
get.addColumn(family, Bytes.toBytes("2"));
// ... more qualifiers ...

When HBase internally executes the get, it will seek to column "1"; now,
since data is lexicographically sorted, it does not need to seek from the
beginning of the row to reach "2": it can simply continue seeking forward,
since column "2" will always sort after column "1". I want to know whether
this is how a multi-column get on a row works or not.

Thanks
Varun
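
The forward-seek pattern asked about above can be sketched in plain Java
(this is an illustrative two-pointer model, not the actual HBase scanner
code; the qualifier names are made up): with the stored cells and the
requested qualifiers both sorted, one forward pass finds every match without
ever re-seeking back to the start of the row.

```java
import java.util.ArrayList;
import java.util.List;

public class MultiColumnGetDemo {
    // sortedCells: qualifiers present in the row, in stored (sorted) order.
    // sortedWanted: qualifiers requested via Get.addColumn(...), sorted.
    static List<String> find(String[] sortedCells, String[] sortedWanted) {
        List<String> found = new ArrayList<>();
        int i = 0;                                     // never moves backward
        for (String w : sortedWanted) {
            while (i < sortedCells.length && sortedCells[i].compareTo(w) < 0) {
                i++;                                   // seek forward only
            }
            if (i < sortedCells.length && sortedCells[i].equals(w)) {
                found.add(w);                          // requested and present
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String[] cells = {"a", "c", "d", "g", "z"};    // stored qualifiers
        String[] wanted = {"b", "c", "g"};             // requested qualifiers
        System.out.println(find(cells, wanted));       // [c, g]
    }
}
```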

On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <ml...@uci.cu> wrote:

> Like Ishan said, a get give an instance of the Result class.
> All utility methods that you can use are:
>  byte[] getValue(byte[] family, byte[] qualifier)
>  byte[] value()
>  byte[] getRow()
>  int size()
>  boolean isEmpty()
>  KeyValue[] raw() # Like Ishan said, all data here is sorted
>  List<KeyValue> list()
>
>
>
>
> On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>
>> Based on what I read in Lars' book, a get will return a result a Result,
>> which is internally a KeyValue[]. This KeyValue[] is sorted by the key and
>> you access this array using raw or list methods on the Result object.
>>
>>
>> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>  +user
>>>
>>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com>
>>> wrote:
>>>
>>>  Hi,
>>>>
>>>> When I do a Get on a row with multiple column qualifiers. Do we sort the
>>>> column qualifers and make use of the sorted order when we get the
>>>>
>>> results ?
>>>
>>>> Thanks
>>>> Varun
>>>>
>>>>
>>
>>
> --
> Marcos Ortiz Valmaseda,
> Product Manager && Data Scientist at UCI
> Blog: http://marcosluis2186.**posterous.com<http://marcosluis2186.posterous.com>
> Twitter: @marcosluis2186 <http://twitter.com/**marcosluis2186<http://twitter.com/marcosluis2186>
> >
>

Re: Get on a row with multiple columns

Posted by Marcos Ortiz <ml...@uci.cu>.
Like Ishan said, a get gives you an instance of the Result class.
Among the utility methods that you can use are:
  byte[] getValue(byte[] family, byte[] qualifier)
  byte[] value()
  byte[] getRow()
  int size()
  boolean isEmpty()
  KeyValue[] raw() # Like Ishan said, all data here is sorted
  List<KeyValue> list()



On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
> Based on what I read in Lars' book, a Get will return a Result, which
> internally is a KeyValue[]. This KeyValue[] is sorted by key, and you access
> the array using the raw() or list() methods on the Result object.
>
>
> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com> wrote:
>
>> +user
>>
>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> Hi,
>>>
>>> When I do a Get on a row with multiple column qualifiers, do we sort the
>>> column qualifiers and make use of the sorted order when we get the
>>> results?
>>> Thanks
>>> Varun
>>>
>
>

-- 
Marcos Ortiz Valmaseda,
Product Manager && Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>

Re: Get on a row with multiple columns

Posted by Ishan Chhabra <ic...@rocketfuel.com>.
Based on what I read in Lars' book, a Get will return a Result, which
internally is a KeyValue[]. This KeyValue[] is sorted by key, and you access
the array using the raw() or list() methods on the Result object.


On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <va...@pinterest.com> wrote:

> +user
>
> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com> wrote:
>
> > Hi,
> >
> > When I do a Get on a row with multiple column qualifiers, do we sort the
> > column qualifiers and make use of the sorted order when we get the
> > results?
> >
> > Thanks
> > Varun
> >
>



-- 
*Ishan Chhabra* | Rocket Scientist | RocketFuel Inc.

Re: Get on a row with multiple columns

Posted by Varun Sharma <va...@pinterest.com>.
+user

On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> When I do a Get on a row with multiple column qualifiers, do we sort the
> column qualifiers and make use of the sorted order when we get the results?
>
> Thanks
> Varun
>
