Posted to user@hbase.apache.org by Shimon <sh...@dpolls.com> on 2008/05/29 14:47:43 UTC

querying by column value rather than key value

Hi,
What is the most efficient way to retrieve a row when the value of a certain
column is specified (rather than the row key)?

Thanks in advance

Re: querying by column value rather than key value

Posted by stack <st...@duboce.net>.
Jim R. Wilson wrote:
> Hi Shimon,
>
> You may also want to look into a text indexing solution such as Lucene
> or Ferret.  These systems might be able to provide the search
> capabilities you require.  Of course, it'd be your responsibility to
> keep the indexes up to date.
>   

For the initial indexing of your table, you might look at 
http://hadoop.apache.org/hbase/docs/current/org/apache/hadoop/hbase/mapred/BuildTableIndex.html

St.Ack



Re: querying by column value rather than key value

Posted by "Jim R. Wilson" <wi...@gmail.com>.
Hi Shimon,

You may also want to look into a text indexing solution such as Lucene
or Ferret.  These systems might be able to provide the search
capabilities you require.  Of course, it'd be your responsibility to
keep the indexes up to date.
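
To make "keeping the indexes up to date" concrete, here is a minimal sketch of
an external text index kept next to the table. It is a toy inverted index in
plain Python, not the Lucene or Ferret API; the class and method names are
made up for illustration. The point is that every write or delete against the
table must be mirrored by an update() or remove() call on the index.

```python
import re
from collections import defaultdict


class TinyTextIndex:
    """Toy inverted index: token -> set of row keys (hypothetical, not Lucene)."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> row keys containing it
        self.indexed = {}                 # row key -> tokens currently indexed

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def update(self, row_key, text):
        # Re-index a row: drop its old tokens first, then add the new ones.
        self.remove(row_key)
        tokens = self.tokenize(text)
        for token in tokens:
            self.postings[token].add(row_key)
        self.indexed[row_key] = tokens

    def remove(self, row_key):
        # Must be called whenever the row is deleted from the table.
        for token in self.indexed.pop(row_key, set()):
            self.postings[token].discard(row_key)
            if not self.postings[token]:
                del self.postings[token]

    def search(self, query):
        # Return row keys whose text contains every token of the query.
        tokens = self.tokenize(query)
        if not tokens:
            return set()
        hits = [self.postings.get(token, set()) for token in tokens]
        return set.intersection(*hits)
```

The row keys returned by search() would then be used for ordinary row-key
lookups against the table.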

-- Jim R. Wilson (jimbojw)

On Thu, May 29, 2008 at 10:51 AM, Bryan Duxbury <br...@rapleaf.com> wrote:
> Right now, we don't have a plan to add indexing. It's a particularly
> complicated area of functionality. We'd love to get suggestions (or patches
> :) for how to do this the right way.
>
> Map/reduce would be the perfect way to find all records that matched on a
> column value, so long as you're able to do all the accesses in batch. If
> they need to be random and low-latency, then you're going to have some
> trouble using map/reduce. HBase provides an interface to map/reduce that
> lets you split up by a region at a time, so you can highly parallelize your
> processing, so it should be reasonably fast.
>
> -Bryan

Re: querying by column value rather than key value

Posted by "news.gmane.org" <sa...@pearsonwholesale.com>.
Yes, the columns are stored together on disk, but you would still have to
scan all the records to find the ones that match, since they are stored in
this order:
key -> column -> timestamp
HBase only knows where to find key XYZ. It knows nothing beforehand about
the column data until it reads the column data requested for key XYZ.
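
A rough illustration of why the sort order matters, as a sketch only: the
cells below are plain Python tuples sorted by (key, column, timestamp), which
is an assumed stand-in for the real storage format. A lookup by key can binary
search to the right place; a lookup by value has nothing to search on and must
touch every cell. The function names are hypothetical.

```python
import bisect


def get_by_key(cells, row_key):
    """Binary search to the first cell of row_key, then read its cells.

    cells is a list of (row_key, column, timestamp, value) tuples that is
    already sorted, so this costs O(log n) plus the size of the row.
    """
    i = bisect.bisect_left(cells, (row_key,))
    found = []
    while i < len(cells) and cells[i][0] == row_key:
        found.append(cells[i])
        i += 1
    return found


def find_by_value(cells, column, wanted):
    """No ordering on values exists, so every cell has to be examined: O(n)."""
    return [cell[0] for cell in cells if cell[1] == column and cell[3] == wanted]
```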

Billy



"shimon golan" <sg...@gmail.com> wrote in 
message news:a80abb8f0805300406x5cdc53b5g915257e111e0d857@mail.gmail.com...
> But tables in HBase are column family oriented , and thus searching for an
> item in a column should be less expensive. Am I wrong ?



Re: querying by column value rather than key value

Posted by shimon golan <sg...@gmail.com>.
But tables in HBase are column-family-oriented, so searching for an
item in a column should be less expensive. Am I wrong?

On Thu, May 29, 2008 at 9:24 PM, Jim R. Wilson <wi...@gmail.com>
wrote:

> He's saying that if you're going to be searching for one random thing
> at a time, and you want it to be /fast/, then issuing a full table
> scan to a hadoop cluster as a map/reduce job is not a good solution.
> It's not really practical to send out a swarm of nodes hunting through
> your whole dataset each time you want to find one (or several) items
> out of your whole table.
>
> For example, say you have a web-form where users can search for other
> users by last name, "name:last" being one of your columns.  This is an
> especially bad case for using mapreduce to scan the full table.
>
> -- Jim

Re: querying by column value rather than key value

Posted by "Jim R. Wilson" <wi...@gmail.com>.
He's saying that if you're going to be searching for one random thing
at a time, and you want it to be /fast/, then issuing a full table
scan to a hadoop cluster as a map/reduce job is not a good solution.
It's not really practical to send out a swarm of nodes hunting through
your whole dataset each time you want to find one (or several) items
out of your whole table.

For example, say you have a web-form where users can search for other
users by last name, "name:last" being one of your columns.  This is an
especially bad case for using mapreduce to scan the full table.

-- Jim

On Thu, May 29, 2008 at 1:06 PM, shimon golan <sg...@gmail.com> wrote:
> You said "If they need to be random and low-latency, then you're going to
> have some trouble using map/reduce" -
> why is that ?
>
> Thanks
> Shimon

Re: querying by column value rather than key value

Posted by shimon golan <sg...@gmail.com>.
You said "If they need to be random and low-latency, then you're going to
have some trouble using map/reduce".
Why is that?

Thanks
Shimon

On Thu, May 29, 2008 at 5:51 PM, Bryan Duxbury <br...@rapleaf.com> wrote:

> Right now, we don't have a plan to add indexing. It's a particularly
> complicated area of functionality. We'd love to get suggestions (or patches
> :) for how to do this the right way.
>
> Map/reduce would be the perfect way to find all records that matched on a
> column value, so long as you're able to do all the accesses in batch. If
> they need to be random and low-latency, then you're going to have some
> trouble using map/reduce. HBase provides an interface to map/reduce that
> lets you split up by a region at a time, so you can highly parallelize your
> processing, so it should be reasonably fast.
>
> -Bryan

Re: querying by column value rather than key value

Posted by Bryan Duxbury <br...@rapleaf.com>.
Right now, we don't have a plan to add indexing. It's a particularly  
complicated area of functionality. We'd love to get suggestions (or  
patches :) for how to do this the right way.

Map/reduce would be the perfect way to find all records that matched  
on a column value, so long as you're able to do all the accesses in  
batch. If they need to be random and low-latency, then you're going  
to have some trouble using map/reduce. HBase provides an interface to  
map/reduce that lets you split up by a region at a time, so you can  
highly parallelize your processing, so it should be reasonably fast.
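
As a sketch of the shape of such a batch job (not the actual HBase map/reduce
interface): each region is modelled below as a plain dict of row key -> row,
one worker scans one region, and the per-region matches are merged at the end.
The function names, the thread pool and the worker count are assumptions made
for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_region(region, column, wanted):
    """Full scan of one region: return row keys whose column equals wanted."""
    return [key for key, row in region.items() if row.get(column) == wanted]


def parallel_find(regions, column, wanted, workers=4):
    """Scan every region concurrently and merge the per-region matches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda region: scan_region(region, column, wanted), regions)
        return sorted(key for part in parts for key in part)
```

Note that every call still reads every row of every region, which is why this
suits batch work and not one-off low-latency lookups.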

-Bryan

On May 29, 2008, at 7:56 AM, shimon golan wrote:

> Thanx Bryan for the quick response !
> My table is very sparse and contains 100 column families , each containing
> about 20 items. Also, I need to search the table by each of the columns so
> the solution you suggested seems somewhat complicated for this purpose.
>
> So my ensuing questions are  :
> 1. Do you plan to support having indexes on columns and thus avoiding the
> necessity of scanning when a column value is queried ?
> 2. I thought about using map reduce for the purpose of parallelizing the
> search over different regions of the table - could this be accomplished  ?


Re: querying by column value rather than key value

Posted by shimon golan <sg...@gmail.com>.
Thanks, Bryan, for the quick response!
My table is very sparse and contains 100 column families, each containing
about 20 items. Also, I need to search the table by each of the columns, so
the solution you suggested seems somewhat complicated for this purpose.

So my follow-up questions are:
1. Do you plan to support indexes on columns, thus avoiding the need to
scan when a column value is queried?
2. I thought about using map/reduce to parallelize the search over
different regions of the table. Could this be accomplished?

On Thu, May 29, 2008 at 5:42 PM, Bryan Duxbury <br...@rapleaf.com> wrote:

> In a regular hbase table, there is no "most efficient" way to do this. The
> only operation you have available is a table scan.
>
> If you find yourself looking for rows by their column values, then you must
> do some extra work to make that possible. First, make absolutely sure that
> you have the right row key picked out. If your accesses are dominated by
> searching on a column value, then perhaps that column should be your primary
> key.
>
> If you must have both the existing primary key and the column value -based
> lookups, then probably your best bet is to make an "index" table. The way
> that works is that every time you write or delete some value to the primary
> table, you also write or delete a value to the index table with the column
> value as the key and the row key as a column value. Then, when you are
> trying to find the row by its column value, you look in the index table to
> find the row key, and then you query the main table with the row key. It's
> more work, but this is the best that HBase can offer at the moment.
>
> -Bryan

Re: querying by column value rather than key value

Posted by Bryan Duxbury <br...@rapleaf.com>.
In a regular HBase table, there is no "most efficient" way to do
this. The only operation you have available is a table scan.

If you find yourself looking for rows by their column values, then
you must do some extra work to make that possible. First, make
absolutely sure that you have the right row key picked out. If your
accesses are dominated by searching on a column value, then perhaps
that column should be your primary key.

If you must have both the existing primary key and the
column-value-based lookups, then your best bet is probably to make an
"index" table. The way that works is that every time you write a value
to, or delete a value from, the primary table, you also write or delete
the matching entry in the index table, with the column value as the key
and the row key as a column value. Then, when you are trying to find the
row by its column value, you look in the index table to find the row
key, and then you query the main table with the row key. It's more work,
but this is the best that HBase can offer at the moment.
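
A minimal sketch of that index-table pattern, under loud assumptions: two
in-memory Python dicts stand in for the primary table and the index table, and
the class and method names are hypothetical, not an HBase client API. Since
several rows can share a column value, the index maps each value to a set of
row keys.

```python
class IndexedTable:
    """Toy primary table plus index table for one indexed column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.primary = {}  # row key -> {column: value}
        self.index = {}    # column value -> set of row keys

    def put(self, row_key, column, value):
        row = self.primary.setdefault(row_key, {})
        if column == self.indexed_column:
            old = row.get(column)
            if old is not None:
                self._unindex(old, row_key)  # drop the stale index entry
            self.index.setdefault(value, set()).add(row_key)
        row[column] = value

    def delete(self, row_key, column):
        row = self.primary.get(row_key, {})
        value = row.pop(column, None)
        if column == self.indexed_column and value is not None:
            self._unindex(value, row_key)
        if not row:
            self.primary.pop(row_key, None)  # no columns left, drop the row

    def _unindex(self, value, row_key):
        keys = self.index.get(value)
        if keys is not None:
            keys.discard(row_key)
            if not keys:
                del self.index[value]

    def find_by_value(self, value):
        # Step 1: index lookup by column value. Step 2: fetch rows by row key.
        return {key: self.primary[key] for key in sorted(self.index.get(value, ()))}
```

In a real deployment the two writes go to separate tables and are not atomic,
so the application has to cope with a failure between them.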

-Bryan


On May 29, 2008, at 5:47 AM, Shimon wrote:

> Hi,
> What is most efficient way to retrieve a row when the value of a certain
> column is specified ( rather than the row key ) ?
>
> Thanks in advance