Posted to user@hbase.apache.org by Kristoffer Sjögren <st...@gmail.com> on 2015/04/07 18:13:50 UTC

Rowkey design question

Hi

I have a row with around 100,000 qualifiers, mostly small values of around
1-5 KB and maybe 5 larger ones of around 1-5 MB. A coprocessor does random
access of 1-10 qualifiers per row.

I would like to understand how HBase loads the data into memory. Will the
entire row be loaded, or only the qualifiers I ask for (like pointer access
into a direct ByteBuffer)?

Cheers,
-Kristoffer

Re: Rowkey design question

Posted by Imants Cekusins <im...@gmail.com>.
> how HBase loads the data into memory.

If you init Get and specify columns with addColumn, it is likely that only
data for these columns is read and loaded in memory.

Rowkey is best kept short. So are column qualifiers.
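
A rough way to picture why a qualified Get can avoid loading the whole row: as described elsewhere in this thread, cells are packed into HFile blocks (64k target by default) and a read only pulls in the blocks that hold the requested cells. The sketch below is a plain-Python toy model, not HBase code; `pack_blocks`, `blocks_touched`, and the 4 KB cell size are hypothetical and chosen only for illustration.

```python
# Toy model of HFile block packing; this is not HBase code.
BLOCK_SIZE = 64 * 1024  # default target block size mentioned in this thread


def pack_blocks(cells, block_size=BLOCK_SIZE):
    """Pack (qualifier, size) cells, in sorted order, into blocks.

    A block is closed once its size reaches or exceeds block_size, so a
    single large cell can make a block much bigger than the target.
    """
    blocks, current, current_size = [], [], 0
    for qualifier, size in sorted(cells):
        current.append(qualifier)
        current_size += size
        if current_size >= block_size:
            blocks.append(current)
            current, current_size = [], 0
    if current:
        blocks.append(current)
    return blocks


def blocks_touched(blocks, wanted):
    """Return indexes of the blocks that hold at least one wanted qualifier."""
    wanted = set(wanted)
    return [i for i, block in enumerate(blocks) if wanted & set(block)]


# One row with 100,000 small cells of 4 KB each (illustrative size).
cells = [("q%06d" % i, 4 * 1024) for i in range(100_000)]
blocks = pack_blocks(cells)
touched = blocks_touched(blocks, ["q000010", "q050000", "q099999"])
print(len(blocks), "blocks in the row;", len(touched), "needed for 3 qualifiers")
```

In this model the row spans thousands of blocks, but a read of three qualifiers touches at most three of them. Real HBase adds indexes, bloom filters, and multiple HFiles on top of this picture.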

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Sorry, but … 

We are in violent agreement. 
If done wrong it can and will kill you. 
Murphy’s law. If there’s more than one way to do something … the wrong way will be chosen, so where does that leave you? 

What hasn’t been said yet is the security concern, which is odd, because with XASecure (now Ranger, until they try a new and different name) you need to use coprocessors as triggers to check whether you have permission to write to, or read data from, a table. 

If you’re new to HBase… don’t use a coprocessor. That’s just asking for trouble.  (And I know everyone here knows that to be the truth.) 

> On Apr 12, 2015, at 1:45 AM, lars hofhansl <la...@apache.org> wrote:
> 
> After the fun interlude (sorry about that) let me get back to the issue.
> 
> There are multiple considerations:
> 
> 1. Row vs. column. If in doubt, err on the side of more rows. Only use many columns in a row when you need transactions over the data in the columns.
> 2. Value sizes. HBase is good at dealing with many small things. 1-5 MB values here and there are OK, but most rows should be smaller than a few dozen KB. Otherwise you'll see too much write amplification.
> 3. Column families. Place columns you typically access together in the same column family, and try to keep columns you don't access together mostly in different families.
> HBase can then efficiently rule out a large body of data to scan, by avoiding scanning families that are not needed.
> 4. Coprocessors and filters let you transform/filter things where the data is. The benefit can be huge.  With coprocessors you can "trap" scan requests (next() calls) and inject your own logic.
> That's what Phoenix does, for example, and it's pretty efficient if done right (if you do it wrong you can kill your region server).
> 
> On #2. You might want to invent a scheme where you store smaller values by value (i.e. in HBase) and larger ones by reference.
> 
> I would put the column with the large value in its own family so that you could scan the rest of the metadata without requiring HBase to read the large value.
> You can follow a simple protocol:
> A. If the value is small (pick some notion of small between 1 and 10 MB), store it in HBase, in a separate family.
> B. Otherwise:
> 1. Write a row with the intended location of the file holding the value in HDFS.
> 2. Write the value into the HDFS file. Make sure the file location has a random element to avoid races.
> 3. Update the row created in #1 with a commit column (just a column you set to true); this acts as the commit.
> (only when a writer reaches this point should the value be considered written)
> 
> Note that everything is idempotent. The worst that can happen is that the process fails between #2 and #3. Now you have orphaned data in HDFS. Since the HDFS location has a random element in it, you can just retry.
> You can either leave the orphaned data (since the commit bit is not set, it's not visible to a client), or periodically look for it and clean it up.
> 
> Hope this helps. Please let us know how it goes.
> 
> -- Lars
> 
> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Rowkey design question

Posted by lars hofhansl <la...@apache.org>.
After the fun interlude (sorry about that) let me get back to the issue.

There are multiple considerations:

1. Row vs. column. If in doubt, err on the side of more rows. Only use many columns in a row when you need transactions over the data in the columns.
2. Value sizes. HBase is good at dealing with many small things. 1-5 MB values here and there are OK, but most rows should be smaller than a few dozen KB. Otherwise you'll see too much write amplification.
3. Column families. Place columns you typically access together in the same column family, and try to keep columns you don't access together mostly in different families.
HBase can then efficiently rule out a large body of data to scan, by avoiding scanning families that are not needed.
4. Coprocessors and filters let you transform/filter things where the data is. The benefit can be huge.  With coprocessors you can "trap" scan requests (next() calls) and inject your own logic.
That's what Phoenix does, for example, and it's pretty efficient if done right (if you do it wrong you can kill your region server).

On #2. You might want to invent a scheme where you store smaller values by value (i.e. in HBase) and larger ones by reference.

I would put the column with the large value in its own family so that you could scan the rest of the metadata without requiring HBase to read the large value.
You can follow a simple protocol:
A. If the value is small (pick some notion of small between 1 and 10 MB), store it in HBase, in a separate family.
B. Otherwise:
1. Write a row with the intended location of the file holding the value in HDFS.
2. Write the value into the HDFS file. Make sure the file location has a random element to avoid races.
3. Update the row created in #1 with a commit column (just a column you set to true); this acts as the commit.
(only when a writer reaches this point should the value be considered written)

Note that everything is idempotent. The worst that can happen is that the process fails between #2 and #3. Now you have orphaned data in HDFS. Since the HDFS location has a random element in it, you can just retry.
You can either leave the orphaned data (since the commit bit is not set, it's not visible to a client), or periodically look for it and clean it up.
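
The protocol above can be sketched roughly as follows. This is an in-memory toy, not HBase or HDFS client code: the two dicts stand in for the table and the file system, and every name here (`write_value`, `read_value`, `orphaned_paths`, the `inline:`/`ref:` column names, the 1 MB threshold) is a hypothetical choice for illustration. It also assumes a retry replaces the uncommitted row wholesale.

```python
import uuid

SMALL_LIMIT = 1 * 1024 * 1024  # assumed threshold; the post suggests 1-10 MB

hbase = {}  # stands in for an HBase table: row key -> {column: value}
hdfs = {}   # stands in for HDFS: path -> bytes


def write_value(row_key, value, fail_before_commit=False):
    """Store small values inline; store large ones in 'HDFS' by reference."""
    if len(value) <= SMALL_LIMIT:
        hbase[row_key] = {"inline:value": value}
        return
    # Step 1: record the intended location; the random part avoids races.
    path = "/blobs/%s/%s" % (row_key, uuid.uuid4().hex)
    hbase[row_key] = {"ref:path": path}
    # Step 2: write the value to the file.
    hdfs[path] = value
    if fail_before_commit:
        raise RuntimeError("simulated crash between steps 2 and 3")
    # Step 3: set the commit column; only now is the value visible.
    hbase[row_key]["ref:committed"] = True


def read_value(row_key):
    """Return the value, or None if it is missing or not yet committed."""
    row = hbase.get(row_key, {})
    if "inline:value" in row:
        return row["inline:value"]
    if row.get("ref:committed"):
        return hdfs[row["ref:path"]]
    return None


def orphaned_paths():
    """Paths in 'HDFS' that no committed row points to (cleanup candidates)."""
    live = {r["ref:path"] for r in hbase.values() if r.get("ref:committed")}
    return sorted(set(hdfs) - live)


big = b"x" * (2 * SMALL_LIMIT)
try:
    write_value("row1", big, fail_before_commit=True)  # crash before commit
except RuntimeError:
    pass
print(read_value("row1"))     # the uncommitted value is not visible
write_value("row1", big)      # the retry writes to a fresh random path
print(len(orphaned_paths()))  # the first attempt's file is left as an orphan
```

The retry is safe because each attempt picks a new random path, so a failed attempt can only leave an unreferenced file behind, which a periodic sweep like `orphaned_paths` could find.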

Hope this helps. Please let us know how it goes.

-- Lars


________________________________
From: Kristoffer Sjögren <st...@gmail.com>
To: user@hbase.apache.org 
Sent: Wednesday, April 8, 2015 6:41 AM
Subject: Re: Rowkey design question


Yes, I think you're right. Adding one or more dimensions to the rowkey
would indeed make the table narrower.

And I guess it also makes sense to store actual values (bigger qualifiers)
outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out on SSD
caches would be an interesting solution. And quite a bit simpler.

Good call and thanks for the tip! :-)




On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Ok…
>
> First, I’d suggest you rethink your schema by adding an additional
> dimension.
> You’ll end up with more rows, but a narrower table.
>
> In terms of compaction… if the data is relatively static, you won’t have
> compactions because nothing changed.
> But if your data is that static… why not put the data in sequence files
> and use HBase as the index. Could be faster.
>
> HTH
>
> -Mike
>
> > On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
> >
> > I just read through HBase MOB design document and one thing that caught
> > my attention was the following statement.
> >
> > "When HBase deals with large numbers of values > 100kb and up to ~10MB of
> > data, it encounters performance degradations due to write amplification
> > caused by splits and compactions."
> >
> > Is there any chance to run into this problem in the read path for data
> > that is written infrequently and never changed?
> >
> > On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <st...@gmail.com>
> > wrote:
> >
> >> A small set of qualifiers will be accessed frequently, so keeping them
> >> in block cache would be very beneficial. Some will be accessed very
> >> seldom. So this sounds very promising!
> >>
> >> The reason why I'm considering a coprocessor is that I need to provide
> >> very specific information in the query request. Same thing with the
> >> response. Queries are also highly parallelizable across rows and each
> >> individual query produces a valid result that may or may not be
> >> aggregated with other results in the client, maybe even inside the
> >> region if it contained multiple rows targeted by the query.
> >>
> >> So it's a bit like Phoenix but with a different storage format and query
> >> engine.
> >>
> >> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <nd...@gmail.com> wrote:
> >>
> >>> Those rows are written out into HBase blocks on cell boundaries. Your
> >>> column family has a BLOCK_SIZE attribute, which you may or may not have
> >>> overridden from the default of 64k. Cells are written into a block until
> >>> it is >= the target block size. So your single 500mb row will be broken
> >>> down into thousands of HFile blocks in some number of HFiles. Some of
> >>> those blocks may contain just a cell or two and be a couple MB in size,
> >>> to hold the largest of your cells. Those blocks will be loaded into the
> >>> Block Cache as they're accessed. If you're careful with your access
> >>> patterns and only request cells that you need to evaluate, you'll only
> >>> ever load the blocks containing those cells into the cache.
> >>>
> >>>> Will the entire row be loaded or only the qualifiers I ask for?
> >>>
> >>> So then, the answer to your question is: it depends on how you're
> >>> interacting with the row from your coprocessor. The read path will only
> >>> load blocks that your scanner requests. If your coprocessor is
> >>> producing scanners that seek to specific qualifiers, you'll only load
> >>> those blocks.
> >>>
> >>> Related question: Is there a reason you're using a coprocessor instead
> >>> of a regular filter, or a simple qualified get/scan to access data from
> >>> these rows? The "default stuff" is already tuned to load data sparsely,
> >>> as would be desirable for your schema.
> >>>
> >>> -n
> >>>
> >>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <st...@gmail.com>
> >>> wrote:
> >>>
> >>>> Sorry I should have explained my use case a bit more.
> >>>>
> >>>> Yes, it's a pretty big row and it's "close" to worst case. Normally
> >>>> there would be fewer qualifiers and the largest qualifiers would be
> >>>> smaller.
> >>>>
> >>>> The reason why these rows get big is that they store aggregated data
> >>>> in indexed compressed form. This format allows for extremely fast
> >>>> queries (on local disk format) over billions of rows (not rows in
> >>>> HBase speak) when touching smaller areas of the data. If I stored the
> >>>> data as regular HBase rows, things would get very slow unless I had
> >>>> many, many region servers.
> >>>>
> >>>> The coprocessor is used for doing custom queries on the indexed data
> >>>> inside the region servers. These queries are not like a regular row
> >>>> scan, but very specific as to how the data is formatted within each
> >>>> column qualifier.
> >>>>
> >>>> Yes, this is not possible if HBase loads the whole 500MB each time I
> >>>> want to perform this custom query on a row. Hence my question :-)
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel
> >>>> <michael_segel@hotmail.com> wrote:
> >>>>
> >>>>> Sorry, but your initial problem statement doesn’t seem to parse …
> >>>>>
> >>>>> Are you saying that you have a single row with approximately 100,000
> >>>>> elements, where each element is roughly 1-5KB in size, and in
> >>>>> addition there are ~5 elements which will be between one and five MB
> >>>>> in size?
> >>>>>
> >>>>> And you then mention a coprocessor?
> >>>>>
> >>>>> Just looking at the numbers… 100K * 5KB means that each row would end
> >>>>> up being 500MB in size.
> >>>>>
> >>>>> That’s a pretty fat row.
> >>>>>
> >>>>> I would suggest rethinking your strategy.
> >>>>>
> >>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <st...@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>> Hi
> >>>>>>
> >>>>>> I have a row with around 100,000 qualifiers, mostly small values of
> >>>>>> around 1-5 KB and maybe 5 larger ones of around 1-5 MB. A
> >>>>>> coprocessor does random access of 1-10 qualifiers per row.
> >>>>>>
> >>>>>> I would like to understand how HBase loads the data into memory.
> >>>>>> Will the entire row be loaded, or only the qualifiers I ask for
> >>>>>> (like pointer access into a direct ByteBuffer)?
> >>>>>>
> >>>>>> Cheers,
> >>>>>> -Kristoffer
> >>>>>

Re: Rowkey design question

Posted by lars hofhansl <la...@apache.org>.
One day I'll post an email with all of Michael's hot button topics.
1. Salting. Yeah, we call it salting and it works nicely to avoid hot spotting and to parallelize reads.
2. Local indexes. They're nice, because you can easily and cheaply keep them 100% consistent.
3. Coprocessors. Uhm... They are an extension mechanism to avoid subclassing the Java class implementing the region server. If you call System.exit(), the region server process will happily exit. That's by design. They're not stored procedures. Don't like 'em? Don't use 'em. Or implement another mechanism, we take patches.

Oops. I just did post an email with all of Michael's hot button topics. :)  Did I miss anything?
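
For readers who have not met the term, salting can be sketched as below: a deterministic, hash-derived bucket prefix is added to each row key so that sequential keys spread across regions, at the cost of fanning range reads out over every bucket. This is a plain-Python illustration, not HBase or Phoenix code; `salted_key`, `scan_prefixes`, the bucket count, and the key format are all hypothetical.

```python
import zlib

NUM_BUCKETS = 8  # assumed bucket count, chosen only for illustration


def salted_key(row_key, buckets=NUM_BUCKETS):
    """Prefix the key with a stable bucket id derived from a hash of the key."""
    bucket = zlib.crc32(row_key.encode("utf-8")) % buckets
    return "%02d-%s" % (bucket, row_key)


def scan_prefixes(buckets=NUM_BUCKETS):
    """A range read has to cover every bucket prefix; the scans can run in parallel."""
    return ["%02d-" % b for b in range(buckets)]


# Sequential keys would otherwise sort next to each other and hit one region.
keys = ["event-%05d" % i for i in range(1000)]
used = {salted_key(k)[:2] for k in keys}
print(sorted(used))
print(scan_prefixes())
```

Because the prefix is computed from the key itself, a point lookup can recompute it and go straight to the right bucket; only range scans pay the fan-out.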

-- Lars

From: Michael Segel <mi...@hotmail.com>
To: user@hbase.apache.org
Sent: Thursday, April 9, 2015 2:26 PM
Subject: Re: Rowkey design question

Andrew, 

In a nutshell, running end user code within the RS JVM is a bad design. 
To be clear, this is not just my opinion… I just happen to be more vocal about it. ;-)
We’ve covered this ground before, and just because the code runs doesn’t mean it’s good, or that the design is good.

I would love to see how you can justify HBase as being secure when you have end user code running in the same JVM as the RS. 
I can think of several ways to hack HBase security because of this… 

Note: I’m not saying server side extensibility is bad, I’m saying how it was implemented was bad. 
Hint: You could have sandboxed the end user code which makes it a lot easier to manage.

MapR has avoided this in their MapRDB. They’re adding the extensibility in a different manner and this issue is nothing new. 


And yes, you’ve hit the nail on the head. Rethink your design if you want to use coprocessors, and use them only as a last resort. 

> On Apr 9, 2015, at 3:02 PM, Andrew Purtell <ap...@apache.org> wrote:
> 
> This is one person's opinion, to which he is absolutely entitled, but a
> blanket black-and-white statement like "coprocessors are poorly
> implemented" is obviously not an opinion shared by all those who have used
> them successfully, nor by the HBase committers, or we would remove the
> feature. On the other hand, you should really ask yourself if in-server
> extension is necessary. That should be a last resort, really, for the
> security and performance considerations Michael mentions.
> 
> 
> On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel <mi...@hotmail.com>
> wrote:
> 
>> Ok…
>> Coprocessors are poorly implemented in HBase.
>> If you work in a secure environment, you don’t want to use them outside
>> of the system coprocessors (the ones that you load from hbase-site.xml).
>> The coprocessor code runs in the same JVM as the RS, which means that if
>> you have a poorly written coprocessor, you will kill performance for all
>> of HBase. If you’re not using them in a secure environment, you have to
>> consider how they are going to be used.
>> 
>> 
>> Without really knowing more about your use case, it’s impossible to say
>> if the coprocessor would be a good idea.
>> 
>> 
>> It sounds like you may have an unrealistic expectation as to how well
>> HBase performs.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
>>> 
>>> An HBase coprocessor. My idea is to move as much pre-aggregation as
>>> possible to where the data lives in the region servers, instead of doing
>>> it in the client. If there is good data locality inside and across rows
>>> within regions then I would expect aggregation to be faster in the
>>> coprocessor (utilizing many region servers in parallel) than
>>> transferring data over the network from multiple region servers to a
>>> single client that would do the same calculation on its own.
>>> 
>>> 
>>> On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <michael_segel@hotmail.com>
>>> wrote:
>>> 
>>>> When you say coprocessor, do you mean HBase coprocessors or do you mean
>>>> a physical hardware coprocessor?
>>>> 
>>>> In terms of queries…
>>>> 
>>>> HBase can perform a single get() and return the result back quickly.
>>>> (The size of the data being returned will impact the overall timing.)
>>>> 
>>>> HBase also caches the results so that your first hit will take the
>>>> longest, but as long as the row is cached, the results are returned
>>>> quickly.
>>>> 
>>>> If you’re trying to do a scan with a start/stop row set … your timing
>>>> then could vary between sub-second and minutes depending on the query.
>>>> 
>>>> 
>>>>> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <st...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>> But if the coprocessor is omitted then CPU cycles from region servers
>>>>> are lost, so where would the query execution go?
>>>>> 
>>>>> Queries need to be quick (sub-second rather than seconds) and HDFS is
>>>>> quite latency hungry, unless there are optimizations that I'm unaware
>>>>> of?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel
>>>>> <michael_segel@hotmail.com> wrote:
>>>>> 
>>>>>> I think you misunderstood.
>>>>>> 
>>>>>> The suggestion was to put the data into HDFS sequence files and to
>>>>>> use HBase to store an index into the file. (URL to the file, then
>>>>>> offset into the file for the start of the record…)
>>>>>> 
>>>>>> The reason you want to do this is that you’re reading in large
>>>>>> amounts of data and it’s more efficient to do this from HDFS than
>>>>>> through HBase.
>>>>>> 
>>>>>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <st...@gmail.com>
>>>> wrote:
>>>>>>> 
>>>>>>> Yes, I think you're right. Adding one or more dimensions to the
>> rowkey
>>>>>>> would indeed make the table narrower.
>>>>>>> 
>>>>>>> And I guess it also make sense to store actual values (bigger
>>>> qualifiers)
>>>>>>> outside HBase. Keeping them in Hadoop why not? Pulling hot ones out
>> on
>>>>>> SSD
>>>>>>> caches would be an interesting solution. And quite a bit simpler.
>>>>>>> 
>>>>>>> Good call and thanks for the tip! :-)
>>>>>>> 
>>>>>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <
>>>> michael_segel@hotmail.com
>>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Ok…
>>>>>>>> 
>>>>>>>> First, I’d suggest you rethink your schema by adding an additional
>>>>>>>> dimension.
>>>>>>>> You’ll end up with more rows, but a narrower table.
>>>>>>>> 
>>>>>>>> In terms of compaction… if the data is relatively static, you won’t
>>>> have
>>>>>>>> compactions because nothing changed.
>>>>>>>> But if your data is that static… why not put the data in sequence
>>>> files
>>>>>>>> and use HBase as the index. Could be faster.
>>>>>>>> 
>>>>>>>> HTH
>>>>>>>> 
>>>>>>>> -Mike
>>>>>>>> 
>>>>>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <st...@gmail.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> I just read through HBase MOB design document and one thing that
>>>> caught
>>>>>>>> my
>>>>>>>>> attention was the following statement.
>>>>>>>>> 
>>>>>>>>> "When HBase deals with large numbers of values > 100kb and up to
>>>> ~10MB
>>>>>> of
>>>>>>>>> data, it encounters performance degradations due to write
>>>> amplification
>>>>>>>>> caused by splits and compactions."
>>>>>>>>> 
>>>>>>>>> Is there any chance to run into this problem in the read path for
>>>> data
>>>>>>>> that
>>>>>>>>> is written infrequently and never changed?
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <
>> stoffe@gmail.com
>>>>> 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> A small set of qualifiers will be accessed frequently so keeping
>>>> them
>>>>>> in
>>>>>>>>>> block cache would be very beneficial. Some very seldom. So this
>>>> sounds
>>>>>>>> very
>>>>>>>>>> promising!
>>>>>>>>>> 
>>>>>>>>>> The reason why i'm considering a coprocessor is that I need to
>>>> provide
>>>>>>>>>> very specific information in the query request. Same thing with
>> the
>>>>>>>>>> response. Queries are also highly parallelizable across rows and
>>>> each
>>>>>>>>>> individual query produce a valid result that may or may not be
>>>>>>>> aggregated
>>>>>>>>>> with other results in the client, maybe even inside the region if
>> it
>>>>>>>>>> contained multiple rows targeted by the query.
>>>>>>>>>> 
>>>>>>>>>> So it's a bit like Phoenix but with a different storage format and
>>>>>> query
>>>>>>>>>> engine.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimiduk@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Those rows are written out into HBase blocks on cell boundaries.
>>>> Your
>>>>>>>>>>> column family has a BLOCK_SIZE attribute, which you may or may
>> have
>>>>>> no
>>>>>>>>>>> overridden the default of 64k. Cells are written into a block
>> until
>>>>>> is
>>>>>>>> it
>>>>>>>>>>>> = the target block size. So your single 500mb row will be broken
>>>>>> down
>>>>>>>>>>> into
>>>>>>>>>>> thousands of HFile blocks in some number of HFiles. Some of those
>>>>>>>> blocks
>>>>>>>>>>> may contain just a cell or two and be a couple MB in size, to
>> hold
>>>>>> the
>>>>>>>>>>> largest of your cells. Those blocks will be loaded into the Block
>>>>>>>> Cache as
>>>>>>>>>>> they're accessed. If your careful with your access patterns and
>>>> only
>>>>>>>>>>> request cells that you need to evaluate, you'll only ever load
>> the
>>>>>>>> blocks
>>>>>>>>>>> containing those cells into the cache.
>>>>>>>>>>> 
>>>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for?
>>>>>>>>>>> 
>>>>>>>>>>> So then, the answer to your question is: it depends on how you're
>>>>>>>>>>> interacting with the row from your coprocessor. The read path
>> will
>>>>>> only
>>>>>>>>>>> load blocks that your scanner requests. If your coprocessor is
>>>>>>>> producing
>>>>>>>>>>> scanner with to seek to specific qualifiers, you'll only load
>> those
>>>>>>>>>>> blocks.
>>>>>>>>>>> 
>>>>>>>>>>> Related question: Is there a reason you're using a coprocessor
>>>>>> instead
>>>>>>>> of
>>>>>>>>>>> a
>>>>>>>>>>> regular filter, or a simple qualified get/scan to access data
>> from
>>>>>>>> these
>>>>>>>>>>> rows? The "default stuff" is already tuned to load data sparsely,
>>>> as
>>>>>>>> would
>>>>>>>>>>> be desirable for your schema.
>>>>>>>>>>> 
>>>>>>>>>>> -n
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <
>>>> stoffe@gmail.com
>>>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Sorry I should have explained my use case a bit more.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, it's a pretty big row and it's "close" to worst case. Normally
>>>>>>>>>>>> there would be fewer qualifiers and the largest qualifiers would be
>>>>>>>>>>>> smaller.
>>>>>>>>>>>> 
>>>>>>>>>>>> The reason why these rows get big is because they store aggregated
>>>>>>>>>>>> data in indexed compressed form. This format allows for extremely fast
>>>>>>>>>>>> queries (on local disk format) over billions of rows (not rows in HBase
>>>>>>>>>>>> speak), when touching smaller areas of the data. If I would store the
>>>>>>>>>>>> data as regular HBase rows things would get very slow unless I had many
>>>>>>>>>>>> many region servers.
>>>>>>>>>>>> 
>>>>>>>>>>>> The coprocessor is used for doing custom queries on the indexed data
>>>>>>>>>>>> inside the region servers. These queries are not like a regular row
>>>>>>>>>>>> scan, but very specific as to how the data is formatted within each
>>>>>>>>>>>> column qualifier.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each time I
>>>>>>>>>>>> want to perform this custom query on a row. Hence my question :-)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <
>>>>>>>>>>> michael_segel@hotmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry, but your initial problem statement doesn’t seem to parse …
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Are you saying that you have a single row with approximately 100,000
>>>>>>>>>>>>> elements where each element is roughly 1-5KB in size and in addition
>>>>>>>>>>>>> there are ~5 elements which will be between one and five MB in size?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And you then mention a coprocessor?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row would end
>>>>>>>>>>>>> up being 500MB in size.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That’s a pretty fat row.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I would suggest rethinking your strategy.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <
>>>> stoffe@gmail.com
>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I have a row with around 100.000 qualifiers with mostly small
>>>>>> values
>>>>>>>>>>>>> around
>>>>>>>>>>>>>> 1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor do
>>>>>>>>>>> random
>>>>>>>>>>>>>> access of 1-10 qualifiers per row.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would like to understand how HBase loads the data into
>> memory.
>>>>>>>>>>> Will
>>>>>>>>>>>> the
>>>>>>>>>>>>>> entire row be loaded or only the qualifiers I ask for (like
>>>>>> pointer
>>>>>>>>>>>>> access
>>>>>>>>>>>>>> into a direct ByteBuffer) ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> -Kristoffer
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The opinions expressed here are mine, while they may reflect a
>>>>>>>>>>> cognitive
>>>>>>>>>>>>> thought, that is purely accidental.
>>>>>>>>>>>>> Use at your own risk.
>>>>>>>>>>>>> Michael Segel
>>>>>>>>>>>>> michael_segel (AT) hotmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
> 
> 
> -- 
> Best regards,
> 
>  - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)



The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
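
As a rough illustration of the block-level read behavior Nick describes in the quoted text above (cells packed into blocks of about 64k, with only the blocks that hold requested cells being loaded), here is a minimal Python sketch. It is a toy model, not HBase code: the function names, the random cell sizes, and the single-file layout are all assumptions made for illustration. Real HFiles add block indexes, several files per store, and per-cell key overhead.

```python
import random

BLOCK_SIZE = 64 * 1024  # 64k default block size, as stated in the thread


def pack_into_blocks(cell_sizes, block_size=BLOCK_SIZE):
    """Return the block index of each cell.

    Cells are appended in order and never split; a block is closed as soon
    as its accumulated size reaches block_size (toy model of the write path).
    """
    block_of_cell = []
    block, filled = 0, 0
    for size in cell_sizes:
        block_of_cell.append(block)
        filled += size
        if filled >= block_size:
            block += 1
            filled = 0
    return block_of_cell


def blocks_touched(block_of_cell, wanted_cells):
    """Set of block indexes that must be read to serve the wanted cells."""
    return {block_of_cell[i] for i in wanted_cells}


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical worst-case row: 100,000 cells of 1-5 KB, five of them 1-5 MB.
cell_sizes = [rng.randint(1024, 5 * 1024) for _ in range(100_000)]
for i in rng.sample(range(100_000), 5):
    cell_sizes[i] = rng.randint(1024 * 1024, 5 * 1024 * 1024)

block_of_cell = pack_into_blocks(cell_sizes)
total_blocks = block_of_cell[-1] + 1

wanted = rng.sample(range(100_000), 10)  # random access to 10 qualifiers
touched = blocks_touched(block_of_cell, wanted)

print(f"row size: {sum(cell_sizes) / 2**20:.0f} MB in {total_blocks} blocks")
print(f"blocks read for 10 qualifiers: {len(touched)}")
```

In this model, ten random qualifiers touch at most ten blocks out of the thousands that make up the row, which is the reason a qualified get or a seeking scanner does not need to pull the whole row into the block cache.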






  

Re: Rowkey design question

Posted by Andrew Purtell <ap...@apache.org>.
Yes, the tone is a problem here, but the good news is "rectally induced
hypoxia" isn't a real medical condition, a patch seems possible, social
grace isn't required for producing a patch, and patches are always welcome.
What else is there to say, really? I think we're done.

On Saturday, April 11, 2015, Sean Busbey <bu...@cloudera.com> wrote:

> Lars, Andrew, Michael,
>
> This particular discussion isn't bearing fruit for the user@hbase
> audience.
> If you wish to continue it, especially with the current tone, please do so
> on dev@.
>
> Michael, IANAL but the ASF offers indemnification as a means of encouraging
> development and adoption of the projects it hosts. If you'd like to know
> about the specific protections afforded you as a contributor please take it
> up with legal@apache.
>
> --
> Sean
> On Apr 11, 2015 12:59 PM, "Michael Segel" <michael_segel@hotmail.com> wrote:
>
> > Well Lars, looks like that hypoxia has set in…
> >
> > If you’ve paid attention, its not that I’m against server side
> > extensibility.
> >
> > Its how its been implemented which is a bit brain dead.
> >
> > I suggest you think more about why having end user code running in the
> > same JVM as the RS is not a good thing.
> > (Which is why in Feb. Andrew made a patch that allowed one to turn off
> the
> > coprocessor function completely or after the system coprocessors loaded.
> )
> >
> > The sad truth is that you could have run the coprocessor code in a
> > separate JVM.
> > You have to remember coprocessors are triggers, stored procedures and
> > extensibility all rolled in to one.
> >
> > As to providing a patch… will you indemnify me if I get sued?  ;-)
> > Didn’t think so.
> >
> > > On Apr 9, 2015, at 10:13 PM, lars hofhansl <larsh@apache.org> wrote:
> > >
> > >> if you lecture people and call them stupid (as you did in an earlier
> > email)
> > > He said (quote) "committers are suffering from rectal induced hypoxia",
> > we can let that pass as "stupid", I think. :)Maybe Michael can explain
> some
> > day what "rectal induced hypoxia" is. I'm dying to know what I suffer
> from.
> > >
> > > In any case and in all seriousness. Michael, feel free to educate
> > yourself about what the intended use of coprocessors is - preferably
> before
> > you come here and start an argument ... again. We're more than happy to
> > accept a patch from you with a "correct" implementation.
> > >
> > > Can we just let this thread die? It didn't start with a useful
> > proposition.
> > >
> > > -- Lars
> > >
> > >     From: Andrew Purtell <apurtell@apache.org>
> > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > Sent: Thursday, April 9, 2015 4:53 PM
> > > Subject: Re: Rowkey design question
> > >
> > > On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <michael_segel@hotmail.com>
> > > wrote:
> > >
> > >> Hint: You could have sandboxed the end user code which makes it a lot
> > >> easier to manage.
> > >>
> > >
> > > I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
> > > social grace, if you lecture people and call them stupid (as you did in
> > an
> > > earlier email) while making the same fucking argument the other person
> > > made, this doesn't work.
> > >
> > > The reason I never did finish HBASE-4047 is I didn't need to. Nobody
> here
> > > or where I worked, ultimately, was banging down the door for an
> external
> > > coprocessor host. What we have works well enough for people today.
> > >
> > > If you do think the external coprocessor host is essential, try taking
> on
> > > the actual engineering challenges involved. Hint: They are not easy.
> Put
> > up
> > > a patch. Writing words in an email is easy. ​
> > >
> > >
> > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >   - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Rowkey design question

Posted by Sean Busbey <bu...@cloudera.com>.
Lars, Andrew, Michael,

This particular discussion isn't bearing fruit for the user@hbase audience.
If you wish to continue it, especially with the current tone, please do so
on dev@.

Michael, IANAL but the ASF offers indemnification as a means of encouraging
development and adoption of the projects it hosts. If you'd like to know
about the specific protections afforded you as a contributor please take it
up with legal@apache.

-- 
Sean
On Apr 11, 2015 12:59 PM, "Michael Segel" <mi...@hotmail.com> wrote:

> Well Lars, looks like that hypoxia has set in…
>
> If you’ve paid attention, its not that I’m against server side
> extensibility.
>
> Its how its been implemented which is a bit brain dead.
>
> I suggest you think more about why having end user code running in the
> same JVM as the RS is not a good thing.
> (Which is why in Feb. Andrew made a patch that allowed one to turn off the
> coprocessor function completely or after the system coprocessors loaded. )
>
> The sad truth is that you could have run the coprocessor code in a
> separate JVM.
> You have to remember coprocessors are triggers, stored procedures and
> extensibility all rolled in to one.
>
> As to providing a patch… will you indemnify me if I get sued?  ;-)
> Didn’t think so.
>
> > On Apr 9, 2015, at 10:13 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> if you lecture people and call them stupid (as you did in an earlier
> email)
> > He said (quote) "committers are suffering from rectal induced hypoxia",
> we can let that pass as "stupid", I think. :)Maybe Michael can explain some
> day what "rectal induced hypoxia" is. I'm dying to know what I suffer from.
> >
> > In any case and in all seriousness. Michael, feel free to educate
> yourself about what the intended use of coprocessors is - preferably before
> you come here and start an argument ... again. We're more than happy to
> accept a patch from you with a "correct" implementation.
> >
> > Can we just let this thread die? It didn't start with a useful
> proposition.
> >
> > -- Lars
> >
> >     From: Andrew Purtell <ap...@apache.org>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Sent: Thursday, April 9, 2015 4:53 PM
> > Subject: Re: Rowkey design question
> >
> > On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <michael_segel@hotmail.com
> >
> > wrote:
> >
> >> Hint: You could have sandboxed the end user code which makes it a lot
> >> easier to manage.
> >>
> >
> > I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
> > social grace, if you lecture people and call them stupid (as you did in
> an
> > earlier email) while making the same fucking argument the other person
> > made, this doesn't work.
> >
> > The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
> > or where I worked, ultimately, was banging down the door for an external
> > coprocessor host. What we have works well enough for people today.
> >
> > If you do think the external coprocessor host is essential, try taking on
> > the actual engineering challenges involved. Hint: They are not easy. Put
> up
> > a patch. Writing words in an email is easy. ​
> >
> >
> >
> >
> >
> >
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Well Lars, looks like that hypoxia has set in… 

If you’ve paid attention, it’s not that I’m against server-side extensibility.

It’s how it’s been implemented that is a bit brain dead.

I suggest you think more about why having end user code running in the same JVM as the RS is not a good thing.
(Which is why in Feb. Andrew made a patch that allowed one to turn off the coprocessor function completely or after the system coprocessors loaded.)

The sad truth is that you could have run the coprocessor code in a separate JVM.
You have to remember coprocessors are triggers, stored procedures and extensibility all rolled into one.

As to providing a patch… will you indemnify me if I get sued?  ;-) 
Didn’t think so.

> On Apr 9, 2015, at 10:13 PM, lars hofhansl <la...@apache.org> wrote:
> 
>> if you lecture people and call them stupid (as you did in an earlier email) 
> He said (quote) "committers are suffering from rectal induced hypoxia", we can let that pass as "stupid", I think. :)Maybe Michael can explain some day what "rectal induced hypoxia" is. I'm dying to know what I suffer from.
> 
> In any case and in all seriousness. Michael, feel free to educate yourself about what the intended use of coprocessors is - preferably before you come here and start an argument ... again. We're more than happy to accept a patch from you with a "correct" implementation.
> 
> Can we just let this thread die? It didn't start with a useful proposition.
> 
> -- Lars
> 
>     From: Andrew Purtell <ap...@apache.org>
> To: "user@hbase.apache.org" <us...@hbase.apache.org> 
> Sent: Thursday, April 9, 2015 4:53 PM
> Subject: Re: Rowkey design question
> 
> On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <mi...@hotmail.com>
> wrote:
> 
>> Hint: You could have sandboxed the end user code which makes it a lot
>> easier to manage.
>> 
> 
> I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
> social grace, if you lecture people and call them stupid (as you did in an
> earlier email) while making the same fucking argument the other person
> made, this doesn't work.
> 
> The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
> or where I worked, ultimately, was banging down the door for an external
> coprocessor host. What we have works well enough for people today.
> 
> If you do think the external coprocessor host is essential, try taking on
> the actual engineering challenges involved. Hint: They are not easy. Put up
> a patch. Writing words in an email is easy. ​
> 
> 
> 
> 
> 
> 
> -- 
> Best regards,
> 
>   - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Rowkey design question

Posted by lars hofhansl <la...@apache.org>.
> if you lecture people and call them stupid (as you did in an earlier email) 
He said (quote) "committers are suffering from rectal induced hypoxia", we can let that pass as "stupid", I think. :) Maybe Michael can explain some day what "rectal induced hypoxia" is. I'm dying to know what I suffer from.

In any case, and in all seriousness: Michael, feel free to educate yourself about what the intended use of coprocessors is - preferably before you come here and start an argument ... again. We're more than happy to accept a patch from you with a "correct" implementation.

Can we just let this thread die? It didn't start with a useful proposition.

-- Lars

     From: Andrew Purtell <ap...@apache.org>
 To: "user@hbase.apache.org" <us...@hbase.apache.org> 
 Sent: Thursday, April 9, 2015 4:53 PM
 Subject: Re: Rowkey design question
   
On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Hint: You could have sandboxed the end user code which makes it a lot
> easier to manage.
>

I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
social grace, if you lecture people and call them stupid (as you did in an
earlier email) while making the same fucking argument the other person
made, this doesn't work.

The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
or where I worked, ultimately, was banging down the door for an external
coprocessor host. What we have works well enough for people today.

If you do think the external coprocessor host is essential, try taking on
the actual engineering challenges involved. Hint: They are not easy. Put up
a patch. Writing words in an email is easy. ​






-- 
Best regards,

  - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

  

Re: Rowkey design question

Posted by Andrew Purtell <ap...@apache.org>.
On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Hint: You could have sandboxed the end user code which makes it a lot
> easier to manage.
>

I filed the fucking JIRA for that. Look at HBASE-4047. As a matter of
social grace, if you lecture people and call them stupid (as you did in an
earlier email) while making the same fucking argument the other person
made, this doesn't work.

The reason I never did finish HBASE-4047 is I didn't need to. Nobody here
or where I worked, ultimately, was banging down the door for an external
coprocessor host. What we have works well enough for people today.

If you do think the external coprocessor host is essential, try taking on
the actual engineering challenges involved. Hint: They are not easy. Put up
a patch. Writing words in an email is easy.




-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Rowkey design question

Posted by Andrew Purtell <ap...@apache.org>.
> I would love to see how you can justify HBase as being secure when you
> have end user code running in the same JVM as the RS.

Because it's completely optional. We don't disagree except where you take
an absolutist position and then beat us over the head for supposed sins
from that position, which may be _your_ one true answer, but is not _the_
one true answer. But we've already discussed this enough, you're not going
to change, and reality isn't going to change.

I would ask that you not write in to disparage us every time this subject
comes up, but I know from firsthand experience that is somehow asking too
much of you. But hey, maybe this time will be different.



On Thu, Apr 9, 2015 at 2:26 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Andrew,
>
> In a nutshell running end user code within the RS JVM is a bad design.
> To be clear, this is not just my opinion… I just happen to be more vocal
> about it. ;-)
> We’ve covered this ground before and just because the code runs doesn’t
> mean its good. Or that the design is good.
>
> I would love to see how you can justify HBase as being secure when you
> have end user code running in the same JVM as the RS.
> I can think of several ways to hack HBase security because of this…
>
> Note: I’m not saying server side extensibility is bad, I’m saying how it
> was implemented was bad.
> Hint: You could have sandboxed the end user code which makes it a lot
> easier to manage.
>
> MapR has avoided this in their MapRDB. They’re adding the extensibility in
> a different manner and this issue is nothing new.
>
>
> And yes. you’ve hit the nail on the head. Rethink your design if you want
> to use coprocessors and use them as a last resort.
>
> > On Apr 9, 2015, at 3:02 PM, Andrew Purtell <ap...@apache.org> wrote:
> >
> > This is one person's opinion, to which he is absolutely entitled to, but
> > blanket black and white statements like "coprocessors are poorly
> > implemented" is obviously not an opinion shared by all those who have
> used
> > them successfully, nor the HBase committers, or we would remove the
> > feature. On the other hand, you should really ask yourself if in-server
> > extension is necessary. That should be a last resort, really, for the
> > security and performance considerations Michael mentions.
> >
> >
> > On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel <michael_segel@hotmail.com
> >
> > wrote:
> >
> >> Ok…
> >> Coprocessors are poorly implemented in HBase.
> >> If you work in a secure environment, outside of the system coprocessors…
> >> (ones that you load from hbase-site.xml) , you don’t want to use them.
> (The
> >> coprocessor code runs on the same JVM as the RS.)  This means that if
> you
> >> have a poorly written coprocessor, you will kill performance for all of
> >> HBase. If you’re not using them in a secure environment, you have to
> >> consider how they are going to be used.
> >>
> >>
> >> Without really knowing more about your use case..., its impossible to
> say
> >> of the coprocessor would be a good idea.
> >>
> >>
> >> It sounds like you may have an unrealistic expectation as to how well
> >> HBase performs.
> >>
> >> HTH
> >>
> >> -Mike
> >>
> >>> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <st...@gmail.com>
> wrote:
> >>>
> >>> An HBase coprocessor. My idea is to move as much pre-aggregation as
> >>> possible to where the data lives in the region servers, instead of
> doing
> >> it
> >>> in the client. If there is good data locality inside and across rows
> >> within
> >>> regions then I would expect aggregation to be faster in the coprocessor
> >>> (utilize many region servers in parallel) rather than transfer data
> over
> >>> the network from multiple region servers to a single client that would
> do
> >>> the same calculation on its own.
> >>>
> >>>
> >>> On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <
> michael_segel@hotmail.com
> >>>
> >>> wrote:
> >>>
> >>>> When you say coprocessor, do you mean HBase coprocessors or do you
> mean
> >> a
> >>>> physical hardware coprocessor?
> >>>>
> >>>> In terms of queries…
> >>>>
> >>>> HBase can perform a single get() and return the result back quickly.
> >> (The
> >>>> size of the data being returned will impact the overall timing.)
> >>>>
> >>>> HBase also caches the results so that your first hit will take the
> >>>> longest, but as long as the row is cached, the results are returned
> >> quickly.
> >>>>
> >>>> If you’re trying to do a scan with a start/stop row set … your timing
> >> then
> >>>> could vary between sub-second and minutes depending on the query.
> >>>>
> >>>>
> >>>>> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <st...@gmail.com>
> >> wrote:
> >>>>>
> >>>>> But if the coprocessor is omitted then CPU cycles from region servers
> >> are
> >>>>> lost, so where would the query execution go?
> >>>>>
> >>>>> Queries needs to be quick (sub-second rather than seconds) and HDFS
> is
> >>>>> quite latency hungry, unless there are optimizations that i'm unaware
> >> of?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <
> >> michael_segel@hotmail.com
> >>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> I think you misunderstood.
> >>>>>>
> >>>>>> The suggestion was to put the data in to HDFS sequence files and to
> >> use
> >>>>>> HBase to store an index in to the file. (URL to the file, then
> offset
> >>>> in to
> >>>>>> the file for the start of the record…)
> >>>>>>
> >>>>>> The reason you want to do this is that you’re reading in large
> amounts
> >>>> of
> >>>>>> data and its more efficient to do this from HDFS than through HBase.
> >>>>>>
> >>>>>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <st...@gmail.com>
> >>>> wrote:
> >>>>>>>
> >>>>>>> Yes, I think you're right. Adding one or more dimensions to the
> >> rowkey
> >>>>>>> would indeed make the table narrower.
> >>>>>>>
> >>>>>>> And I guess it also make sense to store actual values (bigger
> >>>> qualifiers)
> >>>>>>> outside HBase. Keeping them in Hadoop why not? Pulling hot ones out
> >> on
> >>>>>> SSD
> >>>>>>> caches would be an interesting solution. And quite a bit simpler.
> >>>>>>>
> >>>>>>> Good call and thanks for the tip! :-)
> >>>>>>>
> >>>>>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <
> >>>> michael_segel@hotmail.com
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Ok…
> >>>>>>>>
> >>>>>>>> First, I’d suggest you rethink your schema by adding an additional
> >>>>>>>> dimension.
> >>>>>>>> You’ll end up with more rows, but a narrower table.
> >>>>>>>>
> >>>>>>>> In terms of compaction… if the data is relatively static, you
> won’t
> >>>> have
> >>>>>>>> compactions because nothing changed.
> >>>>>>>> But if your data is that static… why not put the data in sequence
> >>>> files
> >>>>>>>> and use HBase as the index. Could be faster.
> >>>>>>>>
> >>>>>>>> HTH
> >>>>>>>>
> >>>>>>>> -Mike
> >>>>>>>>
> >>>>>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <stoffe@gmail.com
> >
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> I just read through HBase MOB design document and one thing that
> >>>> caught
> >>>>>>>> my
> >>>>>>>>> attention was the following statement.
> >>>>>>>>>
> >>>>>>>>> "When HBase deals with large numbers of values > 100kb and up to
> >>>> ~10MB
> >>>>>> of
> >>>>>>>>> data, it encounters performance degradations due to write
> >>>> amplification
> >>>>>>>>> caused by splits and compactions."
> >>>>>>>>>
> >>>>>>>>> Is there any chance to run into this problem in the read path for
> >>>> data
> >>>>>>>> that
> >>>>>>>>> is written infrequently and never changed?
> >>>>>>>>>
> >>>>>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <
> >> stoffe@gmail.com
> >>>>>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> A small set of qualifiers will be accessed frequently so keeping
> >>>> them
> >>>>>> in
> >>>>>>>>>> block cache would be very beneficial. Some very seldom. So this
> >>>> sounds
> >>>>>>>> very
> >>>>>>>>>> promising!
> >>>>>>>>>>
> >>>>>>>>>> The reason why i'm considering a coprocessor is that I need to
> >>>> provide
> >>>>>>>>>> very specific information in the query request. Same thing with
> >> the
> >>>>>>>>>> response. Queries are also highly parallelizable across rows and
> >>>> each
> >>>>>>>>>> individual query produce a valid result that may or may not be
> >>>>>>>> aggregated
> >>>>>>>>>> with other results in the client, maybe even inside the region
> if
> >> it
> >>>>>>>>>> contained multiple rows targeted by the query.
> >>>>>>>>>>
> >>>>>>>>>> So it's a bit like Phoenix but with a different storage format
> and
> >>>>>> query
> >>>>>>>>>> engine.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <
> ndimiduk@gmail.com
> >>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Those rows are written out into HBase blocks on cell boundaries. Your
> >>>>>>>>>>> column family has a BLOCK_SIZE attribute, which you may or may not have
> >>>>>>>>>>> overridden from the default of 64k. Cells are written into a block until
> >>>>>>>>>>> it is >= the target block size. So your single 500mb row will be broken
> >>>>>>>>>>> down into thousands of HFile blocks in some number of HFiles. Some of
> >>>>>>>>>>> those blocks may contain just a cell or two and be a couple MB in size,
> >>>>>>>>>>> to hold the largest of your cells. Those blocks will be loaded into the
> >>>>>>>>>>> Block Cache as they're accessed. If you're careful with your access
> >>>>>>>>>>> patterns and only request cells that you need to evaluate, you'll only
> >>>>>>>>>>> ever load the blocks containing those cells into the cache.
> >>>>>>>>>>>
> >>>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask
> for?
> >>>>>>>>>>>
> >>>>>>>>>>> So then, the answer to your question is: it depends on how you're
> >>>>>>>>>>> interacting with the row from your coprocessor. The read path will only
> >>>>>>>>>>> load blocks that your scanner requests. If your coprocessor is producing
> >>>>>>>>>>> a scanner that seeks to specific qualifiers, you'll only load those
> >>>>>>>>>>> blocks.
> >>>>>>>>>>>
> >>>>>>>>>>> Related question: Is there a reason you're using a coprocessor instead
> >>>>>>>>>>> of a regular filter, or a simple qualified get/scan to access data from
> >>>>>>>>>>> these rows? The "default stuff" is already tuned to load data sparsely,
> >>>>>>>>>>> as would be desirable for your schema.
> >>>>>>>>>>>
> >>>>>>>>>>> -n
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <
> >>>> stoffe@gmail.com
> >>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Sorry I should have explained my use case a bit more.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, it's a pretty big row and it's "close" to worst case. Normally
> >>>>>>>>>>>> there would be fewer qualifiers and the largest qualifiers would be
> >>>>>>>>>>>> smaller.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The reason why these rows get big is because they store aggregated
> >>>>>>>>>>>> data in indexed compressed form. This format allows for extremely fast
> >>>>>>>>>>>> queries (on local disk format) over billions of rows (not rows in HBase
> >>>>>>>>>>>> speak), when touching smaller areas of the data. If I would store the
> >>>>>>>>>>>> data as regular HBase rows things would get very slow unless I had many
> >>>>>>>>>>>> many region servers.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The coprocessor is used for doing custom queries on the indexed data
> >>>>>>>>>>>> inside the region servers. These queries are not like a regular row
> >>>>>>>>>>>> scan, but very specific as to how the data is formatted within each
> >>>>>>>>>>>> column qualifier.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each time I
> >>>>>>>>>>>> want to perform this custom query on a row. Hence my question :-)
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <
> >>>>>>>>>>> michael_segel@hotmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Sorry, but your initial problem statement doesn’t seem to
> >> parse …
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Are you saying that you a single row with approximately
> 100,000
> >>>>>>>>>>> elements
> >>>>>>>>>>>>> where each element is roughly 1-5KB in size and in addition
> >> there
> >>>>>> are
> >>>>>>>>>>> ~5
> >>>>>>>>>>>>> elements which will be between one and five MB in size?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> And you then mention a coprocessor?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row
> >> would
> >>>>>> end
> >>>>>>>>>>> up
> >>>>>>>>>>>>> being 500MB in size.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That’s a pretty fat row.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I would suggest rethinking your strategy.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <
> >>>> stoffe@gmail.com
> >>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I have a row with around 100.000 qualifiers with mostly
> small
> >>>>>> values
> >>>>>>>>>>>>> around
> >>>>>>>>>>>>>> 1-5KB and maybe 5 largers ones around 1-5 MB. A coprocessor
> do
> >>>>>>>>>>> random
> >>>>>>>>>>>>>> access of 1-10 qualifiers per row.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I would like to understand how HBase loads the data into
> >> memory.
> >>>>>>>>>>> Will
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> entire row be loaded or only the qualifiers I ask for (like
> >>>>>> pointer
> >>>>>>>>>>>>> access
> >>>>>>>>>>>>>> into a direct ByteBuffer) ?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Cheers,
> >>>>>>>>>>>>>> -Kristoffer
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The opinions expressed here are mine, while they may reflect
> a
> >>>>>>>>>>> cognitive
> >>>>>>>>>>>>> thought, that is purely accidental.
> >>>>>>>>>>>>> Use at your own risk.
> >>>>>>>>>>>>> Michael Segel
> >>>>>>>>>>>>> michael_segel (AT) hotmail.com
> > --
> > Best regards,
> >
> >   - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
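
A side note on Michael Segel's suggestion in this thread to keep the bulky values in HDFS sequence files and let HBase store only an index (file URL plus offset of the record): the idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions. A local length-prefixed file stands in for the sequence file, a plain dict stands in for the HBase index table, and the names write_records and read_record are made up for this sketch. It is not the SequenceFile format and not HBase client code.

```python
import os
import struct
import tempfile
from typing import Dict, Tuple

# Stand-in for the HBase index table: record key -> (file path, byte offset).
Index = Dict[str, Tuple[str, int]]


def write_records(path: str, records: Dict[str, bytes]) -> Index:
    """Append every record to one file and remember where each one starts."""
    index: Index = {}
    with open(path, "wb") as f:
        for key, payload in records.items():
            index[key] = (path, f.tell())
            f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
            f.write(payload)
    return index


def read_record(index: Index, key: str) -> bytes:
    """Look up the offset in the index, seek there, and read one record."""
    path, offset = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.bin")
        index = write_records(path, {"small": b"x" * 10, "large": b"y" * 2_000_000})
        print(len(read_record(index, "large")), "bytes read with a single seek")
```

The index entries stay tiny regardless of how large the payloads are, which is the property the suggestion relies on. Whether this beats reading through HBase for a given latency target is exactly what the thread debates.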

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Andrew, 

In a nutshell, running end-user code within the RS JVM is a bad design. 
To be clear, this is not just my opinion… I just happen to be more vocal about it. ;-)
We’ve covered this ground before, and just because the code runs doesn’t mean it’s good, or that the design is good.

I would love to see how you can justify HBase as being secure when you have end user code running in the same JVM as the RS. 
I can think of several ways to hack HBase security because of this… 

Note: I’m not saying server-side extensibility is bad; I’m saying that how it was implemented is bad. 
Hint: You could have sandboxed the end-user code, which would make it a lot easier to manage.

MapR has avoided this in their MapRDB. They’re adding the extensibility in a different manner and this issue is nothing new. 


And yes, you’ve hit the nail on the head. Rethink your design if you want to use coprocessors and use them as a last resort. 

> On Apr 9, 2015, at 3:02 PM, Andrew Purtell <ap...@apache.org> wrote:
> 
> This is one person's opinion, to which he is absolutely entitled, but a
> blanket black-and-white statement like "coprocessors are poorly
> implemented" is obviously not an opinion shared by all those who have used
> them successfully, nor by the HBase committers, or we would remove the
> feature. On the other hand, you should really ask yourself if in-server
> extension is necessary. That should be a last resort, really, for the
> security and performance considerations Michael mentions.
> 
> 
> On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel <mi...@hotmail.com>
> wrote:
> 
>> Ok…
>> Coprocessors are poorly implemented in HBase.
>> If you work in a secure environment, outside of the system coprocessors…
>> (ones that you load from hbase-site.xml), you don’t want to use them. (The
>> coprocessor code runs in the same JVM as the RS.) This means that if you
>> have a poorly written coprocessor, you will kill performance for all of
>> HBase. If you’re not using them in a secure environment, you have to
>> consider how they are going to be used.
>> 
>> 
>> Without really knowing more about your use case, it’s impossible to say
>> if the coprocessor would be a good idea.
>> 
>> 
>> It sounds like you may have an unrealistic expectation as to how well
>> HBase performs.
>> 
>> HTH
>> 
>> -Mike
>> 
>>> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
>>> 
>>> An HBase coprocessor. My idea is to move as much pre-aggregation as
>>> possible to where the data lives in the region servers, instead of doing
>>> it in the client. If there is good data locality inside and across rows
>>> within regions then I would expect aggregation to be faster in the
>>> coprocessor (utilize many region servers in parallel) rather than
>>> transfer data over the network from multiple region servers to a single
>>> client that would do the same calculation on its own.
>>> 
>>> 
>>> On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <michael_segel@hotmail.com>
>>> wrote:
>>> 
>>>> When you say coprocessor, do you mean HBase coprocessors or do you mean
>>>> a physical hardware coprocessor?
>>>> 
>>>> In terms of queries…
>>>> 
>>>> HBase can perform a single get() and return the result back quickly.
>>>> (The size of the data being returned will impact the overall timing.)
>>>> 
>>>> HBase also caches the results so that your first hit will take the
>>>> longest, but as long as the row is cached, the results are returned
>>>> quickly.
>>>> 
>>>> If you’re trying to do a scan with a start/stop row set … your timing
>>>> then could vary between sub-second and minutes depending on the query.
>>>> 
>>>> 
>>>>> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <st...@gmail.com> wrote:
>>>>> 
>>>>> But if the coprocessor is omitted then CPU cycles from region servers
>>>>> are lost, so where would the query execution go?
>>>>> 
>>>>> Queries need to be quick (sub-second rather than seconds) and HDFS is
>>>>> quite latency hungry, unless there are optimizations that I'm unaware
>>>>> of?
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <michael_segel@hotmail.com>
>>>>> wrote:
>>>>> 
>>>>>> I think you misunderstood.
>>>>>> 
>>>>>> The suggestion was to put the data into HDFS sequence files and to use
>>>>>> HBase to store an index into the file. (URL to the file, then offset
>>>>>> into the file for the start of the record…)
>>>>>> 
>>>>>> The reason you want to do this is that you’re reading in large amounts
>>>>>> of data and it’s more efficient to do this from HDFS than through HBase.
>>>>>> 
>>>>>>> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Yes, I think you're right. Adding one or more dimensions to the rowkey
>>>>>>> would indeed make the table narrower.
>>>>>>> 
>>>>>>> And I guess it also makes sense to store actual values (bigger
>>>>>>> qualifiers) outside HBase. Keeping them in Hadoop, why not? Pulling hot
>>>>>>> ones out on SSD caches would be an interesting solution. And quite a
>>>>>>> bit simpler.
>>>>>>> 
>>>>>>> Good call and thanks for the tip! :-)
>>>>>>> 
>>>>>>> On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <michael_segel@hotmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Ok…
>>>>>>>> 
>>>>>>>> First, I’d suggest you rethink your schema by adding an additional
>>>>>>>> dimension.
>>>>>>>> You’ll end up with more rows, but a narrower table.
>>>>>>>> 
>>>>>>>> In terms of compaction… if the data is relatively static, you won’t
>>>>>>>> have compactions because nothing changed.
>>>>>>>> But if your data is that static… why not put the data in sequence
>>>>>>>> files and use HBase as the index. Could be faster.
>>>>>>>> 
>>>>>>>> HTH
>>>>>>>> 
>>>>>>>> -Mike
>>>>>>>> 
>>>>>>>>> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> I just read through the HBase MOB design document and one thing that
>>>>>>>>> caught my attention was the following statement.
>>>>>>>>> 
>>>>>>>>> "When HBase deals with large numbers of values > 100kb and up to
>>>>>>>>> ~10MB of data, it encounters performance degradations due to write
>>>>>>>>> amplification caused by splits and compactions."
>>>>>>>>> 
>>>>>>>>> Is there any chance to run into this problem in the read path for
>>>>>>>>> data that is written infrequently and never changed?
>>>>>>>>> 
>>>>>>>>> On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> A small set of qualifiers will be accessed frequently so keeping
>>>>>>>>>> them in block cache would be very beneficial. Some very seldom. So
>>>>>>>>>> this sounds very promising!
>>>>>>>>>> 
>>>>>>>>>> The reason why I'm considering a coprocessor is that I need to
>>>>>>>>>> provide very specific information in the query request. Same thing
>>>>>>>>>> with the response. Queries are also highly parallelizable across
>>>>>>>>>> rows and each individual query produces a valid result that may or
>>>>>>>>>> may not be aggregated with other results in the client, maybe even
>>>>>>>>>> inside the region if it contained multiple rows targeted by the
>>>>>>>>>> query.
>>>>>>>>>> 
>>>>>>>>>> So it's a bit like Phoenix but with a different storage format and
>>>>>>>>>> query engine.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <ndimiduk@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Those rows are written out into HBase blocks on cell boundaries. Your
>>>>>>>>>>> column family has a BLOCK_SIZE attribute, which you may or may not have
>>>>>>>>>>> overridden from the default of 64k. Cells are written into a block until
>>>>>>>>>>> it is >= the target block size. So your single 500mb row will be broken
>>>>>>>>>>> down into thousands of HFile blocks in some number of HFiles. Some of
>>>>>>>>>>> those blocks may contain just a cell or two and be a couple MB in size,
>>>>>>>>>>> to hold the largest of your cells. Those blocks will be loaded into the
>>>>>>>>>>> Block Cache as they're accessed. If you're careful with your access
>>>>>>>>>>> patterns and only request cells that you need to evaluate, you'll only
>>>>>>>>>>> ever load the blocks containing those cells into the cache.
>>>>>>>>>>> 
>>>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for?
>>>>>>>>>>> 
>>>>>>>>>>> So then, the answer to your question is: it depends on how you're
>>>>>>>>>>> interacting with the row from your coprocessor. The read path will only
>>>>>>>>>>> load blocks that your scanner requests. If your coprocessor is producing
>>>>>>>>>>> scanners that seek to specific qualifiers, you'll only load those
>>>>>>>>>>> blocks.
>>>>>>>>>>> 
>>>>>>>>>>> Related question: Is there a reason you're using a coprocessor instead
>>>>>>>>>>> of a regular filter, or a simple qualified get/scan to access data from
>>>>>>>>>>> these rows? The "default stuff" is already tuned to load data sparsely,
>>>>>>>>>>> as would be desirable for your schema.
>>>>>>>>>>> 
>>>>>>>>>>> -n
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Sorry I should have explained my use case a bit more.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, it's a pretty big row and it's "close" to worst case. Normally
>>>>>>>>>>>> there would be fewer qualifiers and the largest qualifiers would be
>>>>>>>>>>>> smaller.
>>>>>>>>>>>> 
>>>>>>>>>>>> The reason why these rows get big is that they store aggregated
>>>>>>>>>>>> data in indexed compressed form. This format allows for extremely
>>>>>>>>>>>> fast queries (on local disk format) over billions of rows (not rows
>>>>>>>>>>>> in HBase speak), when touching smaller areas of the data. If I
>>>>>>>>>>>> stored the data as regular HBase rows things would get very slow
>>>>>>>>>>>> unless I had many many region servers.
>>>>>>>>>>>> 
>>>>>>>>>>>> The coprocessor is used for doing custom queries on the indexed
>>>>>>>>>>>> data inside the region servers. These queries are not like a
>>>>>>>>>>>> regular row scan, but very specific as to how the data is formatted
>>>>>>>>>>>> within each column qualifier.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes, this is not possible if HBase loads the whole 500MB each time
>>>>>>>>>>>> I want to perform this custom query on a row. Hence my question :-)
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <michael_segel@hotmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry, but your initial problem statement doesn’t seem to parse …
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Are you saying that you have a single row with approximately 100,000
>>>>>>>>>>>>> elements where each element is roughly 1-5KB in size and in addition
>>>>>>>>>>>>> there are ~5 elements which will be between one and five MB in size?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> And you then mention a coprocessor?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Just looking at the numbers… 100K * 5KB means that each row would
>>>>>>>>>>>>> end up being 500MB in size.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> That’s a pretty fat row.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I would suggest rethinking your strategy.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <stoffe@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I have a row with around 100,000 qualifiers with mostly small
>>>>>>>>>>>>>> values around 1-5KB and maybe 5 larger ones around 1-5 MB. A
>>>>>>>>>>>>>> coprocessor does random access of 1-10 qualifiers per row.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would like to understand how HBase loads the data into memory.
>>>>>>>>>>>>>> Will the entire row be loaded or only the qualifiers I ask for
>>>>>>>>>>>>>> (like pointer access into a direct ByteBuffer)?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> -Kristoffer

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
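
A side note on Nick Dimiduk's explanation quoted above (cells are packed into blocks on cell boundaries until the block reaches BLOCK_SIZE, and a read only loads the blocks it asks for): here is a minimal Python sketch of that idea. It is a toy model, not HBase code. The names pack_cells, blocks_touched and toy_row, the qualifier names, and the exact cell sizes are assumptions made for illustration. Real HFiles also carry key overhead, compression, index and bloom blocks, and a row can be spread over several HFiles, none of which is modeled here.

```python
from typing import List, Set, Tuple

Cell = Tuple[str, int]  # (qualifier, value size in bytes)

BLOCK_SIZE = 64 * 1024  # the 64k default mentioned in the thread


def pack_cells(cells: List[Cell], block_size: int = BLOCK_SIZE) -> List[List[Cell]]:
    """Pack cells into blocks on cell boundaries.

    A block is closed once its accumulated size is >= block_size, so one
    oversized cell yields a block that is larger than the target.
    """
    blocks: List[List[Cell]] = []
    current: List[Cell] = []
    current_size = 0
    for cell in cells:
        current.append(cell)
        current_size += cell[1]
        if current_size >= block_size:
            blocks.append(current)
            current, current_size = [], 0
    if current:  # last, partially filled block
        blocks.append(current)
    return blocks


def blocks_touched(blocks: List[List[Cell]], wanted: Set[str]) -> List[int]:
    """Return the indexes of the blocks holding at least one wanted qualifier."""
    return [i for i, block in enumerate(blocks)
            if any(qualifier in wanted for qualifier, _ in block)]


def toy_row() -> List[Cell]:
    """Hypothetical worst-case row: 100,000 cells of 5 KB plus 5 cells of 5 MB."""
    cells = [(f"q{i:06d}", 5 * 1024) for i in range(100_000)]
    cells += [(f"big{i}", 5 * 1024 * 1024) for i in range(5)]
    return sorted(cells)  # keep cells ordered by qualifier within the row


if __name__ == "__main__":
    blocks = pack_cells(toy_row())
    touched = blocks_touched(blocks, {"q000000", "q050000", "big3"})
    print(len(blocks), "blocks in the row;", len(touched), "touched by the read")
```

Each qualifier lives in exactly one block in this model, so a read of a handful of qualifiers touches at most that many blocks out of the several thousand that make up the row. That is the sparse-access behaviour Nick describes, as opposed to loading the whole ~500MB row.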






Re: Rowkey design question

Posted by Kevin O'dell <ke...@cloudera.com>.
Trying to figure out the best place to jump in here...

Kristoffer,

  I would like to echo what Michael and Andrew have said.  While a
pre-aggregation co-proc may "work", in my experience co-procs are
typically more trouble than they are worth.  I would first try this outside
the client, taking advantage of filters.

How is this data coming in?  Could we help with some pre-aggregation with
Flume interceptors or Storm and enrich the events in flight?  This could
help take some work off the client and give you the speed you need without
dropping custom code into the RS JVM...which should ALWAYS be the last
resort.

On Thu, Apr 9, 2015 at 4:02 PM, Andrew Purtell <ap...@apache.org> wrote:

> This is one person's opinion, to which he is absolutely entitled, but a
> blanket black-and-white statement like "coprocessors are poorly
> implemented" is obviously not an opinion shared by all those who have used
> them successfully, nor by the HBase committers, or we would remove the
> feature. On the other hand, you should really ask yourself if in-server
> extension is necessary. That should be a last resort, really, for the
> security and performance considerations Michael mentions.



-- 
Kevin O'Dell
Field Enablement, Cloudera
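
For readers following the pre-aggregation argument in this thread (Kristoffer's idea of aggregating next to the data and only merging in the client), the data flow can be sketched in Python as below. This is an illustration under stated assumptions, not HBase code. The function names aggregate_region and merge_partials, the metric names, and the use of plain lists as "regions" are all made up, and the sketch says nothing about whether the per-region step should run in a coprocessor, in an ingest pipeline, or elsewhere.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_region(rows: Iterable[Dict[str, int]]) -> Counter:
    """Per-region step: sum the values per key for the rows of one region."""
    partial: Counter = Counter()
    for row in rows:
        partial.update(row)  # adds the row's values to the running sums
    return partial


def merge_partials(partials: Iterable[Counter]) -> Counter:
    """Client step: combine one small partial result per region."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Two hypothetical regions, each holding a few rows of metric -> value.
    regions = [
        [{"clicks": 3, "views": 10}, {"clicks": 1}],
        [{"views": 7}, {"clicks": 2, "views": 1}],
    ]
    partials = [aggregate_region(rows) for rows in regions]
    print(merge_partials(partials))
```

The client only ever sees one partial result per region instead of every row, which is the network saving Kristoffer is after. The objections in the thread concern where the per-region step runs, not the merge itself.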

Re: Rowkey design question

Posted by Andrew Purtell <ap...@apache.org>.
This is one person's opinion, to which he is absolutely entitled, but a
blanket black-and-white statement like "coprocessors are poorly
implemented" is obviously not an opinion shared by all those who have used
them successfully, nor by the HBase committers, or we would remove the
feature. On the other hand, you should really ask yourself if in-server
extension is necessary. That should be a last resort, really, for the
security and performance considerations Michael mentions.


On Thu, Apr 9, 2015 at 5:05 AM, Michael Segel <mi...@hotmail.com>
wrote:

> Ok…
> Coprocessors are poorly implemented in HBase.
> If you work in a secure environment, outside of the system coprocessors…
> (ones that you load from hbase-site.xml), you don’t want to use them. (The
> coprocessor code runs in the same JVM as the RS.) This means that if you
> have a poorly written coprocessor, you will kill performance for all of
> HBase. If you’re not using them in a secure environment, you have to
> consider how they are going to be used.
>
>
> Without really knowing more about your use case, it’s impossible to say
> if the coprocessor would be a good idea.
>
>
> It sounds like you may have an unrealistic expectation as to how well
> HBase performs.
>
> HTH
>
> -Mike


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 
Coprocessors are poorly implemented in HBase. 
If you work in a secure environment, you don’t want to use them, apart from the system coprocessors (the ones that you load from hbase-site.xml). The coprocessor code runs in the same JVM as the RegionServer (RS), which means that a poorly written coprocessor will kill performance for all of HBase. If you’re not using them in a secure environment, you still have to consider how they are going to be used.


Without really knowing more about your use case, it’s impossible to say whether a coprocessor would be a good idea.


It sounds like you may have an unrealistic expectation as to how well HBase performs. 

HTH

-Mike

> On Apr 9, 2015, at 1:05 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
> 
> An HBase coprocessor. My idea is to move as much pre-aggregation as
> possible to where the data lives in the region servers, instead of doing it
> in the client. If there is good data locality inside and across rows within
> regions then I would expect aggregation to be faster in the coprocessor
> (utilize many region servers in parallel) rather than transfer data over
> the network from multiple region servers to a single client that would do
> the same calculation on its own.

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
An HBase coprocessor. My idea is to move as much pre-aggregation as
possible to where the data lives, in the region servers, instead of doing it
in the client. If there is good data locality inside and across rows within
regions, then I would expect aggregation to be faster in the coprocessor
(utilizing many region servers in parallel) than transferring data over
the network from multiple region servers to a single client that would do
the same calculation on its own.
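
To make the idea concrete, here is a minimal sketch in plain Python. It is not HBase code: the regions are hypothetical in-memory lists, and Partial, aggregate_region and merge_partials are names made up for illustration. It only shows why a region-side partial aggregate gives the same answer while sending fewer items to the client.

```python
from dataclasses import dataclass


@dataclass
class Partial:
    """Partial aggregate a region could return instead of raw values."""
    total: int
    count: int


def aggregate_region(values):
    """Would run where the data lives: reduce one region's values."""
    return Partial(total=sum(values), count=len(values))


def merge_partials(partials):
    """Would run in the client: combine one small Partial per region."""
    partials = list(partials)
    return Partial(
        total=sum(p.total for p in partials),
        count=sum(p.count for p in partials),
    )


# Hypothetical data: three regions, each holding part of the values.
regions = [[1, 2, 3], [10, 20], [100]]

# Client-side aggregation: every raw value crosses the network.
raw_items_sent = sum(len(r) for r in regions)
client_side = aggregate_region([v for r in regions for v in r])

# Region-side pre-aggregation: one Partial per region crosses the network.
partials_sent = len(regions)
region_side = merge_partials(aggregate_region(r) for r in regions)

assert client_side == region_side
print(raw_items_sent, partials_sent, region_side)
```

The gain grows with the number of values per region, since each region always sends a single Partial. This only works directly for aggregates that can be merged this way (sum, count, min, max).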


On Thu, Apr 9, 2015 at 4:43 AM, Michael Segel <mi...@hotmail.com>
wrote:

> When you say coprocessor, do you mean HBase coprocessors or do you mean a
> physical hardware coprocessor?
>
> In terms of queries…
>
> HBase can perform a single get() and return the result back quickly. (The
> size of the data being returned will impact the overall timing.)
>
> HBase also caches the results so that your first hit will take the
> longest, but as long as the row is cached, the results are returned quickly.
>
> If you’re trying to do a scan with a start/stop row set … your timing then
> could vary between sub-second and minutes depending on the query.

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
When you say coprocessor, do you mean HBase coprocessors or do you mean a physical hardware coprocessor? 

In terms of queries… 

HBase can perform a single get() and return the result quickly. (The size of the data being returned will impact the overall timing.)

HBase also caches the blocks it reads, so your first hit will take the longest, but as long as the blocks holding the row stay in the block cache, the results are returned quickly.

If you’re trying to do a scan with a start/stop row set … your timing could then vary between sub-second and minutes depending on the query.
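
To make the difference between the two access patterns concrete, here is a toy in-memory model in Python. It is only a sketch: the sorted list stands in for a table's sorted rowkeys, point_get and range_scan are hypothetical helpers rather than HBase client calls, and the only HBase behaviour assumed is that a scan's start row is inclusive and its stop row is exclusive.

```python
import bisect

# Toy stand-in for a table: rowkeys kept in sorted order, as HBase keeps them.
ROWS = sorted(f"user{n:04d}" for n in range(1000))
VALUES = {key: f"value-for-{key}" for key in ROWS}


def point_get(rowkey):
    """Look up exactly one row; the work does not grow with the table size."""
    return VALUES.get(rowkey)


def range_scan(start_row, stop_row):
    """Return every (rowkey, value) with start_row <= rowkey < stop_row."""
    lo = bisect.bisect_left(ROWS, start_row)
    hi = bisect.bisect_left(ROWS, stop_row)
    return [(key, VALUES[key]) for key in ROWS[lo:hi]]


narrow = range_scan("user0010", "user0020")  # touches 10 rows
wide = range_scan("user0000", "user9999")    # touches all 1000 rows
```

The get touches one key regardless of how many rows exist, while the cost of the scan follows the number of rows between the start and stop rows, which is why scan timings can spread so widely.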


> On Apr 8, 2015, at 3:10 PM, Kristoffer Sjögren <st...@gmail.com> wrote:
> 
> But if the coprocessor is omitted then CPU cycles from region servers are
> lost, so where would the query execution go?
> 
> Queries need to be quick (sub-second rather than seconds) and HDFS is
> quite latency hungry, unless there are optimizations that I'm unaware of?

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
But if the coprocessor is omitted then CPU cycles from region servers are
lost, so where would the query execution go?

Queries need to be quick (sub-second rather than seconds) and HDFS is
quite latency hungry, unless there are optimizations that I'm unaware of?



On Wed, Apr 8, 2015 at 7:43 PM, Michael Segel <mi...@hotmail.com>
wrote:

> I think you misunderstood.
>
> The suggestion was to put the data into HDFS sequence files and to use
> HBase to store an index into the file. (URL to the file, then offset into
> the file for the start of the record…)
>
> The reason you want to do this is that you’re reading in large amounts of
> data and it’s more efficient to do this from HDFS than through HBase.

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
I think you misunderstood. 

The suggestion was to put the data into HDFS sequence files and to use HBase to store an index into the file. (URL to the file, then offset into the file for the start of the record…)

The reason you want to do this is that you’re reading in large amounts of data and it’s more efficient to do this from HDFS than through HBase.
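
A rough Python sketch of that layout, under loud assumptions: a local file plays the role of the HDFS sequence file, a plain dict plays the role of the HBase index table, and the length-prefixed record layout is made up for illustration (it is not the real SequenceFile format). The names append_record and read_record are hypothetical.

```python
import os
import struct
import tempfile

# Hypothetical stand-ins: a local file plays the HDFS sequence file and a
# dict plays the HBase index table (rowkey -> file path and byte offset).
index = {}


def append_record(path, key, payload):
    """Append a length-prefixed record and remember where it starts."""
    with open(path, "ab") as f:
        offset = f.seek(0, os.SEEK_END)
        f.write(struct.pack(">I", len(payload)))
        f.write(payload)
    index[key] = (path, offset)


def read_record(key):
    """Use the index to seek straight to one record instead of scanning."""
    path, offset = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = struct.unpack(">I", f.read(4))
        return f.read(length)


data_file = os.path.join(tempfile.mkdtemp(), "records.bin")
append_record(data_file, "row1/big-a", b"x" * 2_000_000)
append_record(data_file, "row1/big-b", b"hello")
```

The index entry stays tiny (a path and an offset) however large the record is, so the index lookup is cheap and the large read is a single seek into the data file.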

> On Apr 8, 2015, at 8:41 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
> 
> Yes, I think you're right. Adding one or more dimensions to the rowkey
> would indeed make the table narrower.
> 
> And I guess it also makes sense to store the actual values (bigger qualifiers)
> outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out to SSD
> caches would be an interesting solution. And quite a bit simpler.
> 
> Good call and thanks for the tip! :-)

Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
Yes, I think you're right. Adding one or more dimensions to the rowkey
would indeed make the table narrower.

And I guess it also makes sense to store the actual values (bigger qualifiers)
outside HBase. Keeping them in Hadoop, why not? Pulling hot ones out to SSD
caches would be an interesting solution. And quite a bit simpler.
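
One possible shape for such a hot-value cache, sketched in Python as a small LRU read-through cache. Everything here is an assumption for illustration: HotValueCache is a hypothetical class, the backing_store dict stands in for the slow store (HDFS in this discussion), and an in-memory OrderedDict stands in for the SSD tier.

```python
from collections import OrderedDict


class HotValueCache:
    """Tiny LRU read-through cache; `load` stands in for the slow store."""

    def __init__(self, load, capacity):
        self.load = load
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = self.load(key)                # slow path: fetch from backing store
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return value


backing_store = {"big-a": b"a" * 1024, "big-b": b"b" * 1024, "big-c": b"c" * 1024}
cache = HotValueCache(backing_store.__getitem__, capacity=2)
for key in ["big-a", "big-a", "big-b", "big-a", "big-c", "big-a"]:
    cache.get(key)
```

With this access sequence the frequently read "big-a" stays cached while the rarely read "big-b" is evicted, which is the behaviour wanted for a small set of hot values.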

Good call and thanks for the tip! :-)

On Wed, Apr 8, 2015 at 1:45 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Ok…
>
> First, I’d suggest you rethink your schema by adding an additional
> dimension.
> You’ll end up with more rows, but a narrower table.
>
> In terms of compaction… if the data is relatively static, you won’t have
> compactions because nothing changed.
> But if your data is that static… why not put the data in sequence files
> and use HBase as the index. Could be faster.
>
> HTH
>
> -Mike

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 

First, I’d suggest you rethink your schema by adding an additional dimension. 
You’ll end up with more rows, but a narrower table. 
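
A minimal sketch of what "adding a dimension" could look like, assuming the qualifiers can be numbered and grouped into fixed-size buckets. The function name `narrow_key`, the `#` separator, the example key `agg-2015-04` and the bucket size of 1000 are all hypothetical; nothing here is HBase API, it only shows the key arithmetic.

```python
def narrow_key(row_key: str, qualifier_index: int, bucket_size: int = 1000):
    """Map (wide row, qualifier index) to (narrower row key, qualifier)."""
    bucket = qualifier_index // bucket_size
    # Zero-pad the bucket so the new row keys sort in numeric order.
    return f"{row_key}#{bucket:04d}", f"q{qualifier_index:06d}"

# One wide row with 100,000 qualifiers becomes 100 rows of 1,000 qualifiers.
rows = {}
for i in range(100_000):
    key, qual = narrow_key("agg-2015-04", i)
    rows.setdefault(key, []).append(qual)

print(len(rows), len(rows["agg-2015-04#0000"]))
```

The trade-off is that HBase only guarantees atomicity within a single row, so anything that must change together has to land in the same bucket.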

In terms of compaction… if the data is relatively static, you won’t have compactions because nothing changed. 
But if your data is that static… why not put the data in sequence files and use HBase as the index? That could be faster. 
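
A rough sketch of the "HBase as the index" idea, under loud assumptions: a plain local file stands in for the sequence file, and a Python dict stands in for the HBase table that would hold one small (offset, length) entry per value. The helper names `write_values` and `read_value` are hypothetical, and a real SequenceFile has its own record framing that is not modeled here.

```python
import os
import tempfile

def write_values(path, values):
    """Write values back to back into a flat file; return {key: (offset, length)}."""
    index = {}
    with open(path, "wb") as f:
        for key, value in values.items():
            index[key] = (f.tell(), len(value))
            f.write(value)
    return index

def read_value(path, index, key):
    """Fetch one value by seeking straight to its recorded offset."""
    offset, length = index[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

values = {"q1": b"a" * 10, "q2": b"b" * 5000, "q3": b"c" * 3}
with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "values.dat")
    index = write_values(data_path, values)  # this dict stands in for the HBase table
    fetched = read_value(data_path, index, "q2")
```

Only the small index entries would live in HBase; the large values are read with a single seek.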

HTH 

-Mike

> On Apr 8, 2015, at 3:26 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
> 
> I just read through the HBase MOB design document and one thing that caught
> my attention was the following statement.
> 
> "When HBase deals with large numbers of values > 100kb and up to ~10MB of
> data, it encounters performance degradations due to write amplification
> caused by splits and compactions."
> 
> Is there any chance of running into this problem in the read path for data
> that is written infrequently and never changed?
> 

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
I just read through the HBase MOB design document and one thing that caught
my attention was the following statement.

"When HBase deals with large numbers of values > 100kb and up to ~10MB of
data, it encounters performance degradations due to write amplification
caused by splits and compactions."

Is there any chance of running into this problem in the read path for data
that is written infrequently and never changed?

On Wed, Apr 8, 2015 at 9:30 AM, Kristoffer Sjögren <st...@gmail.com> wrote:

> A small set of qualifiers will be accessed frequently, so keeping them in
> the block cache would be very beneficial. Others will be accessed very
> seldom. So this sounds very promising!
>
> The reason why I'm considering a coprocessor is that I need to provide
> very specific information in the query request. Same thing with the
> response. Queries are also highly parallelizable across rows, and each
> individual query produces a valid result that may or may not be aggregated
> with other results in the client, or maybe even inside the region if it
> contains multiple rows targeted by the query.
>
> So it's a bit like Phoenix but with a different storage format and query
> engine.
>

Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
A small set of qualifiers will be accessed frequently, so keeping them in
the block cache would be very beneficial. Others will be accessed very
seldom. So this sounds very promising!
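
To illustrate why a small hot set benefits from the block cache, here is a toy least-recently-used cache keyed by block id. It is a simplified stand-in, not HBase's block cache implementation; the class name `ToyBlockCache`, the capacity of 4 and the block ids are all made up for the example.

```python
from collections import OrderedDict

class ToyBlockCache:
    """Toy LRU cache keyed by block id; counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.blocks[block_id] = True  # pretend the block was read from disk
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict the least recently used

cache = ToyBlockCache(capacity=4)
hot, cold = [1, 2, 3], [100, 200, 300, 400]
# Each query touches the three hot blocks plus one rarely used block.
for cold_block in cold:
    for block in hot + [cold_block]:
        cache.get(block)

print(cache.hits, cache.misses)
```

After the first query every access to a hot block is a hit, while each cold block costs one miss and simply displaces the previous cold block.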

The reason why I'm considering a coprocessor is that I need to provide very
specific information in the query request. Same thing with the response.
Queries are also highly parallelizable across rows, and each individual
query produces a valid result that may or may not be aggregated with other
results in the client, or maybe even inside the region if it contains
multiple rows targeted by the query.
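
The query shape described above (independent per-row results, optionally merged inside a region and again in the client) can be sketched as follows. Everything here is hypothetical: the `regions` dict, the per-row "query" (a simple count and sum) and the function names only stand in for the custom query engine; no coprocessor API is used.

```python
from concurrent.futures import ThreadPoolExecutor

def query_row(row):
    """Hypothetical per-row query: returns a partial (count, total)."""
    values = row["values"]
    return len(values), sum(values)

def merge(partials):
    """Combine partial (count, total) results into one."""
    partials = list(partials)
    return sum(c for c, _ in partials), sum(t for _, t in partials)

def query_region(rows):
    # Region-side step: each row yields a valid result on its own,
    # and the region may pre-aggregate the rows it holds.
    return merge(query_row(r) for r in rows)

regions = {
    "region-a": [{"values": [1, 2, 3]}, {"values": [4]}],
    "region-b": [{"values": [10, 20]}],
}

with ThreadPoolExecutor() as pool:
    per_region = list(pool.map(query_region, regions.values()))

result = merge(per_region)  # client-side step
print(per_region, result)
```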

So it's a bit like Phoenix but with a different storage format and query
engine.

On Wed, Apr 8, 2015 at 12:46 AM, Nick Dimiduk <nd...@gmail.com> wrote:

> Those rows are written out into HBase blocks on cell boundaries. Your
> column family has a BLOCKSIZE attribute, which you may or may not have
> overridden from the default of 64KB. Cells are written into a block until
> it is >= the target block size. So your single 500MB row will be broken down
> into thousands of HFile blocks in some number of HFiles. Some of those blocks
> may contain just a cell or two and be a couple of MB in size, to hold the
> largest of your cells. Those blocks will be loaded into the Block Cache as
> they're accessed. If you're careful with your access patterns and only
> request cells that you need to evaluate, you'll only ever load the blocks
> containing those cells into the cache.
>
> > Will the entire row be loaded or only the qualifiers I ask for?
>
> So then, the answer to your question is: it depends on how you're
> interacting with the row from your coprocessor. The read path will only
> load blocks that your scanner requests. If your coprocessor produces
> scanners that seek to specific qualifiers, you'll only load those blocks.
>
> Related question: Is there a reason you're using a coprocessor instead of a
> regular filter, or a simple qualified get/scan to access data from these
> rows? The "default stuff" is already tuned to load data sparsely, as would
> be desirable for your schema.
>
> -n
>

Re: Rowkey design question

Posted by Nick Dimiduk <nd...@gmail.com>.
Those rows are written out into HBase blocks on cell boundaries. Your
column family has a BLOCKSIZE attribute, which you may or may not have
overridden from the default of 64KB. Cells are written into a block until
it is >= the target block size. So your single 500MB row will be broken down
into thousands of HFile blocks in some number of HFiles. Some of those blocks
may contain just a cell or two and be a couple of MB in size, to hold the
largest of your cells. Those blocks will be loaded into the Block Cache as
they're accessed. If you're careful with your access patterns and only
request cells that you need to evaluate, you'll only ever load the blocks
containing those cells into the cache.
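
The packing rule above can be sketched in a few lines of Python. This is only an illustration, not HBase code: `pack_blocks` is a hypothetical name, the row layout (100,000 cells of 5 KB followed by 5 cells of 5 MB) is an assumed worst case, and key overhead, compression and encoding are ignored.

```python
def pack_blocks(cell_sizes, target_block_size=64 * 1024):
    """Group cell sizes into blocks, closing a block once it is >= target."""
    blocks, current, current_bytes = [], [], 0
    for size in cell_sizes:
        current.append(size)  # a cell is never split across blocks
        current_bytes += size
        if current_bytes >= target_block_size:
            blocks.append(current)
            current, current_bytes = [], 0
    if current:
        blocks.append(current)
    return blocks

KB, MB = 1024, 1024 * 1024
cells = [5 * KB] * 100_000 + [5 * MB] * 5
blocks = pack_blocks(cells)

print(len(blocks))                      # number of blocks in the row
print(max(sum(b) for b in blocks))      # size in bytes of the largest block
```

Under these assumptions the row ends up in several thousand blocks of about 65 KB each, plus a handful of multi-megabyte blocks that hold the large cells.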

> Will the entire row be loaded or only the qualifiers I ask for?

So then, the answer to your question is: it depends on how you're
interacting with the row from your coprocessor. The read path will only
load blocks that your scanner requests. If your coprocessor produces
scanners that seek to specific qualifiers, you'll only load those blocks.
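
A small sketch of why seeking to specific qualifiers touches only a few blocks, assuming a simplified block index that records the first qualifier stored in each block. The function name `blocks_for` and the zero-padded qualifier names are hypothetical; a real HFile index is keyed on full cell keys and can be multi-level.

```python
from bisect import bisect_right

def blocks_for(qualifiers, block_first_keys):
    """Return the sorted indexes of the blocks that hold the requested qualifiers.

    block_first_keys[i] is the first qualifier stored in block i (sorted).
    """
    needed = set()
    for q in qualifiers:
        # The block holding q is the last one whose first key is <= q.
        idx = bisect_right(block_first_keys, q) - 1
        if idx >= 0:
            needed.add(idx)
    return sorted(needed)

# Hypothetical index: 8 blocks starting at q000000, q000010, ..., q000070.
index = [f"q{n:06d}" for n in range(0, 80, 10)]
touched = blocks_for(["q000003", "q000042", "q000047"], index)
print(touched)
```

Three requested qualifiers map to two of the eight blocks here; the other six are never read or cached.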

Related question: Is there a reason you're using a coprocessor instead of a
regular filter, or a simple qualified get/scan to access data from these
rows? The "default stuff" is already tuned to load data sparsely, as would
be desirable for your schema.

-n

On Tue, Apr 7, 2015 at 2:22 PM, Kristoffer Sjögren <st...@gmail.com> wrote:

> Sorry, I should have explained my use case a bit more.
>
> Yes, it's a pretty big row and it's "close" to the worst case. Normally there
> would be fewer qualifiers and the largest qualifiers would be smaller.
>
> The reason these rows get big is that they store aggregated data
> in indexed, compressed form. This format allows for extremely fast queries
> (on local disk format) over billions of rows (not rows in HBase speak)
> when touching smaller areas of the data. If I stored the data as regular
> HBase rows, things would get very slow unless I had many, many region
> servers.
>
> The coprocessor is used for doing custom queries on the indexed data inside
> the region servers. These queries are not like a regular row scan, but are
> very specific to how the data is formatted within each column qualifier.
>
> Yes, this is not possible if HBase loads the whole 500MB each time I want
> to perform this custom query on a row. Hence my question :-)
>

Re: Rowkey design question

Posted by Kristoffer Sjögren <st...@gmail.com>.
Sorry, I should have explained my use case a bit more.

Yes, it's a pretty big row and it's "close" to the worst case. Normally there
would be fewer qualifiers and the largest qualifiers would be smaller.

The reason these rows get big is that they store aggregated data
in indexed, compressed form. This format allows for extremely fast queries
(on local disk format) over billions of rows (not rows in HBase speak)
when touching smaller areas of the data. If I stored the data as regular
HBase rows, things would get very slow unless I had many, many region servers.

The coprocessor is used for doing custom queries on the indexed data inside
the region servers. These queries are not like a regular row scan, but are
very specific to how the data is formatted within each column qualifier.

Yes, this is not possible if HBase loads the whole 500MB each time I want
to perform this custom query on a row. Hence my question :-)




On Tue, Apr 7, 2015 at 11:03 PM, Michael Segel <mi...@hotmail.com>
wrote:

> Sorry, but your initial problem statement doesn’t seem to parse …
>
> Are you saying that you have a single row with approximately 100,000 elements
> where each element is roughly 1-5KB in size and in addition there are ~5
> elements which will be between one and five MB in size?
>
> And you then mention a coprocessor?
>
> Just looking at the numbers… 100K * 5KB means that each row would end up
> being 500MB in size.
>
> That’s a pretty fat row.
>
> I would suggest rethinking your strategy.
>

Re: Rowkey design question

Posted by Michael Segel <mi...@hotmail.com>.
Sorry, but your initial problem statement doesn’t seem to parse … 

Are you saying that you have a single row with approximately 100,000 elements where each element is roughly 1-5KB in size and in addition there are ~5 elements which will be between one and five MB in size? 

And you then mention a coprocessor? 

Just looking at the numbers… 100K * 5KB means that each row would end up being 500MB in size. 

That’s a pretty fat row.
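
For reference, the same back-of-the-envelope estimate worked out for both ends of the stated ranges. The function name `row_size_range` is made up for the example, and the figures count value bytes only (no key, timestamp or storage overhead), using binary units.

```python
KB, MB = 1024, 1024 * 1024

def row_size_range(n_small=100_000, small=(1 * KB, 5 * KB),
                   n_large=5, large=(1 * MB, 5 * MB)):
    """Return (smallest, largest) total value bytes for the described row."""
    low = n_small * small[0] + n_large * large[0]
    high = n_small * small[1] + n_large * large[1]
    return low, high

low, high = row_size_range()
print(f"{low / MB:.1f} MB to {high / MB:.1f} MB")
```

The small values dominate either way: the five large values add at most 25 MB to a row that is already in the 100 to 500 MB range.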

I would suggest rethinking your strategy. 

> On Apr 7, 2015, at 11:13 AM, Kristoffer Sjögren <st...@gmail.com> wrote:
> 
> Hi
> 
> I have a row with around 100,000 qualifiers with mostly small values around
> 1-5KB and maybe 5 larger ones around 1-5 MB. A coprocessor does random
> access of 1-10 qualifiers per row.
> 
> I would like to understand how HBase loads the data into memory. Will the
> entire row be loaded or only the qualifiers I ask for (like pointer access
> into a direct ByteBuffer) ?
> 
> Cheers,
> -Kristoffer

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com