Posted to user@hbase.apache.org by Barış Can Daylık <ba...@iletken.com.tr> on 2011/07/28 13:31:22 UTC

Column Indexing - Top N Columns

Hi everyone,

I have a column family in which each column stores the count for one
item, and I need the top N columns (items) sorted by count in descending
order. I know HBase doesn't sort columns by value and doesn't have an
indexing option to do so. While searching I found a patch (IHbase) for
this kind of indexing, but I can't find a way to get only the top N
columns even with IHbase.

Can you suggest an example usage? Or another patch or tool for this job?
Can Lucene be used in such a scenario?

Thanks
Baris

p.s. Column values are positive integers.

RE: Column Indexing - Top N Columns

Posted by "Buttler, David" <bu...@llnl.gov>.
Do you mean that all items are in one row? That implies that there are not very many items, and you should be able to do a quick sort in memory (since the values from one row are in memory anyway).

Dave
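
A minimal sketch of the in-memory approach suggested above, assuming the row's columns have already been fetched and decoded into a map from item name to count. The class and method names (TopNColumns, topN) are hypothetical, and the HBase read itself is left out. A bounded min-heap holds at most N entries at a time, so the row never has to be fully sorted:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopNColumns {

    /** Returns the n entries with the largest counts, highest count first. */
    public static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        Comparator<Map.Entry<String, Long>> byCount = Map.Entry.comparingByValue();
        // Min-heap on count: the smallest of the current top n sits at the head.
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(n, byCount);
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            if (heap.size() < n) {
                heap.add(e);
            } else if (e.getValue() > heap.peek().getValue()) {
                // The new entry beats the weakest survivor, so swap it in.
                heap.poll();
                heap.add(e);
            }
        }
        result.addAll(heap);
        result.sort(byCount.reversed());
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for one row's columns (item name -> count), already decoded.
        Map<String, Long> counts = new HashMap<>();
        counts.put("apple", 12L);
        counts.put("pear", 40L);
        counts.put("fig", 99L);
        counts.put("kiwi", 3L);
        // Prints "fig 99", then "pear 40".
        for (Map.Entry<String, Long> e : topN(counts, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

When N is much smaller than the number of columns, this avoids sorting every column; for a row of around 100K columns a plain full sort would likely be fast enough as well.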

Re: Column Indexing - Top N Columns

Posted by Barış Can Daylık <ba...@iletken.com.tr>.
The number of rows will be 6 million. So in the worst case the table will
be square (6 million columns per row), but the average row won't exceed
100K columns.

If I'm not mistaken, columns are sorted by column name and not by value.
Does result.raw() return columns sorted by their values? If so, does it
sort them when I call result.raw(), or are they pre-sorted?

On 07/28/2011 08:18 PM, Xian Woo wrote:
> As far as I know, you first create a Get instance for the row you want,
> then call HTable.get() to fetch a Result from the cluster. You can then
> iterate over the key-values with for (KeyValue kv : result.raw()). But
> since result.raw() only returns the KeyValues in ascending key order, you
> may need some extra processing. Or you can call result.getMap() directly
> to get a sorted map of all the KeyValues and work on that yourself.
> You can also use a ColumnPaginationFilter if you know exactly how many
> columns the row has.
>
> And speaking of the number of columns, why do you need so many columns? I
> hear that HBase "hopes" the number of rows is larger than the number of
> columns. So may I ask whether the number of rows is much larger than
> 6 million?


Re: Column Indexing - Top N Columns

Posted by Xian Woo <in...@gmail.com>.
As far as I know, you first create a Get instance for the row you want,
then call HTable.get() to fetch a Result from the cluster. You can then
iterate over the key-values with for (KeyValue kv : result.raw()). But
since result.raw() only returns the KeyValues in ascending key order, you
may need some extra processing. Or you can call result.getMap() directly
to get a sorted map of all the KeyValues and work on that yourself.
You can also use a ColumnPaginationFilter if you know exactly how many
columns the row has.
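
To illustrate the ordering discussed above: the cells of a row come back ordered by column qualifier bytes, not by value, so a top-N-by-count query needs a client-side re-sort. The sketch below imitates that ordering with a plain TreeMap and no HBase dependency. The class name QualifierOrderDemo, its helper methods, and the 8-byte big-endian encoding of counts are assumptions made for illustration. It needs Java 9 or later for Arrays.compareUnsigned.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class QualifierOrderDemo {

    /** Encodes a count as 8 big-endian bytes (an assumed storage format). */
    public static byte[] encode(long count) {
        return ByteBuffer.allocate(Long.BYTES).putLong(count).array();
    }

    /** Decodes a value written by encode(). */
    public static long decode(byte[] value) {
        return ByteBuffer.wrap(value).getLong();
    }

    /** Creates a map ordered by qualifier bytes (unsigned, lexicographic). */
    public static TreeMap<byte[], byte[]> newRow() {
        Comparator<byte[]> byQualifier = Arrays::compareUnsigned;
        return new TreeMap<>(byQualifier);
    }

    /** Returns the qualifiers reordered by decoded count, highest first. */
    public static List<String> byCountDescending(Map<byte[], byte[]> row) {
        List<Map.Entry<byte[], byte[]>> entries = new ArrayList<>(row.entrySet());
        entries.sort((a, b) -> Long.compare(decode(b.getValue()), decode(a.getValue())));
        List<String> names = new ArrayList<>();
        for (Map.Entry<byte[], byte[]> e : entries) {
            names.add(new String(e.getKey(), StandardCharsets.UTF_8));
        }
        return names;
    }

    public static void main(String[] args) {
        TreeMap<byte[], byte[]> row = newRow();
        row.put("pear".getBytes(StandardCharsets.UTF_8), encode(40));
        row.put("apple".getBytes(StandardCharsets.UTF_8), encode(7));
        row.put("fig".getBytes(StandardCharsets.UTF_8), encode(99));

        // Iteration follows qualifier order (apple, fig, pear), whatever the counts are.
        for (Map.Entry<byte[], byte[]> e : row.entrySet()) {
            System.out.println(new String(e.getKey(), StandardCharsets.UTF_8)
                    + " " + decode(e.getValue()));
        }
        // Re-sorted by count on the client: [fig, pear, apple]
        System.out.println(byCountDescending(row));
    }
}
```

The map is already in key order when it is iterated, which mirrors the "pre-sorted" case asked about earlier in the thread; only the ordering by count has to be computed on the client.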

And speaking of the number of columns, why do you need so many columns? I
hear that HBase "hopes" the number of rows is larger than the number of
columns. So may I ask whether the number of rows is much larger than
6 million?

2011/7/29 Barış Can Daylık <ba...@iletken.com.tr>

> There can be at most 6 million columns, but I don't think it would exceed
> 100K on average.  What would result.raw() produce?

Re: Column Indexing - Top N Columns

Posted by Barış Can Daylık <ba...@iletken.com.tr>.
There can be at most 6 million columns, but I don't think it would 
exceed 100K on average.  What would result.raw() produce?

On 07/28/2011 07:11 PM, Xian Woo wrote:
> I don't know how many columns there are in your column family. If there are
> not too many, using Result.raw() may be an option.


Re: Column Indexing - Top N Columns

Posted by Xian Woo <in...@gmail.com>.
I don't know how many columns there are in your column family. If there are
not too many, using Result.raw() may be an option.
