
Posted to user@hbase.apache.org by Chris Tarnas <cf...@email.com> on 2011/01/07 19:01:41 UTC

Column family data distribution and performance

I was wondering how much impact a column family has on read and write performance for rows that don't contain any data in it?

I'm testing out an indexing method where rather than have a separate table for storing indexes I just keep them in the same table in an INDEX column family. The construction of the rowkeys guarantees that an index value will never be the same as a rowkey of a normal row. This allows us to send all mutations for one row and its indexes in a single thrift call with a batch mutation rather than two thrift calls. Are there any serious back end downsides to this methodology?

many thanks,
-chris
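
A minimal sketch of the batching idea in the question above, assuming a prefix-based rowkey scheme. The original post does not say how its rowkeys are built, so the `d|` and `i|` prefixes, the `INDEX:row` column name, and the `build_batch` helper are all hypothetical. The two namedtuples are stand-ins for the Thrift structs, not the generated classes.

```python
from collections import namedtuple

# Stand-ins for the Thrift Mutation/BatchMutation structs (assumption:
# only the fields needed for this illustration are modelled).
Mutation = namedtuple("Mutation", ["column", "value"])
BatchMutation = namedtuple("BatchMutation", ["row", "mutations"])

# Hypothetical prefixes that keep index rowkeys disjoint from data rowkeys.
DATA_PREFIX = b"d|"
INDEX_PREFIX = b"i|"


def build_batch(row_id, columns, indexed_values):
    """Build one list holding the data-row mutations and its index rows."""
    data_key = DATA_PREFIX + row_id
    batch = [
        BatchMutation(
            data_key,
            [Mutation(col, val) for col, val in sorted(columns.items())],
        )
    ]
    for value in indexed_values:
        # Each index row holds a single cell: the rowkey it points to.
        index_key = INDEX_PREFIX + value
        batch.append(
            BatchMutation(index_key, [Mutation(b"INDEX:row", data_key)])
        )
    return batch


batch = build_batch(b"sample42", {b"data:name": b"alice"}, [b"alice"])
```

Because every row lives in the same table, the whole list could then go to the Thrift gateway in one batch call (for example `mutateRows`) instead of one call per table.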

Re: Column family data distribution and performance

Posted by Sean Bigdatafun <se...@gmail.com>.

On Fri, Jan 7, 2011 at 10:01 AM, Chris Tarnas <cf...@email.com> wrote:

> I was wondering how much impact a column family has on read and write
> performance for rows that don't contain any data in it?
>
> I'm testing out an indexing method where rather than have a separate table
> for storing indexes I just keep them in the same table in an INDEX column
> family. The construction of the rowkeys guarantees that an index value will
> never be the same as a rowkey of a normal row. This allows us to send all
> mutations for one row and its indexes in a single thrift call with a batch
> mutation rather than two thrift calls. Are there any serious back end
> downsides to this methodology?
>

How are you going to guarantee that the update to the actual data row and
the update to the index row happen as one transaction? Do you update them
together in a put list so that the transaction is guaranteed?

It might be beneficial to provide more details and ask people to suggest
improvements.

>
> many thanks,
> -chris

-- 
--Sean

Re: Column family data distribution and performance

Posted by Chris Tarnas <cf...@email.com>.

On Jan 7, 2011, at 10:14 AM, Stack wrote:

> On Fri, Jan 7, 2011 at 10:01 AM, Chris Tarnas <cf...@email.com> wrote:
>> I was wondering how much impact a column family has on read and write performance for rows that don't contain any data in it?
>> 
> 
> The index column family would have data, right, just not data for every row?
> 

Yes - the row in the INDEX column family would have one key value: the rowkey that the index points to.

> If you don't query this index cf, then there should be next to no impact.
> 

The index column family would not be requested when getting a "normal" row.

> You'd be querying the index and data independently?
> 

Yes - any single scanner or get would read either from the INDEX column family or from the "data" column families, but never from both. Of course, index lookups require two gets (one to look up the index, the other to get the desired row based on the index lookup), but that seems inevitable at this time.
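
The two-get lookup described above can be sketched as follows. This is an in-memory illustration only: the dict stands in for the table, and the `d|`/`i|` rowkey prefixes, the `INDEX:row` column, and the `get` and `lookup_by_index` helpers are assumptions, not anything from the HBase API.

```python
# In-memory stand-in for a single table: rowkey -> {column: value}.
# Index rows and data rows share the table but never share a rowkey.
table = {
    b"d|sample42": {b"data:name": b"alice"},
    b"i|alice": {b"INDEX:row": b"d|sample42"},
}


def get(rowkey, family):
    """Return only the cells of one column family for a row."""
    row = table.get(rowkey, {})
    prefix = family + b":"
    return {col: val for col, val in row.items() if col.startswith(prefix)}


def lookup_by_index(value):
    """Resolve an indexed value to its data row using two gets."""
    # First get: read only the INDEX family of the index row.
    index_cells = get(b"i|" + value, b"INDEX")
    if not index_cells:
        return None
    # Second get: read only the data family of the row it points to.
    target_rowkey = index_cells[b"INDEX:row"]
    return get(target_rowkey, b"data")


result = lookup_by_index(b"alice")
```

Each get names exactly one family, so no single request touches both the INDEX family and the data families.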

> 
>> I'm testing out an indexing method where rather than have a separate table for storing indexes I just keep them in the same table in an INDEX column family. The construction of the rowkeys guarantees that an index value will never be the same as a rowkey of a normal row. This allows us to send all mutations for one row and its indexes in a single thrift call with a batch mutation rather than two thrift calls. Are there any serious back end downsides to this methodology?
>> 
> 
> I can't think of any.  It's definitely all upside compared to keeping two tables.
> 
> St.Ack

Thanks!
-chris

Re: Column family data distribution and performance

Posted by Stack <st...@duboce.net>.

On Fri, Jan 7, 2011 at 10:01 AM, Chris Tarnas <cf...@email.com> wrote:
> I was wondering how much impact a column family has on read and write performance for rows that don't contain any data in it?
>

The index column family would have data, right, just not data for every row?

If you don't query this index cf, then there should be next to no impact.

You'd be querying the index and data independently?


> I'm testing out an indexing method where rather than have a separate table for storing indexes I just keep them in the same table in an INDEX column family. The construction of the rowkeys guarantees that an index value will never be the same as a rowkey of a normal row. This allows us to send all mutations for one row and its indexes in a single thrift call with a batch mutation rather than two thrift calls. Are there any serious back end downsides to this methodology?
>

I can't think of any.  It's definitely all upside compared to keeping two tables.

St.Ack