
Posted to user@hbase.apache.org by Andrei Savu <sa...@gmail.com> on 2009/08/14 17:55:38 UTC

Feed Aggregator Schema

Hello,

I am working on a project that monitors a large number of RSS/Atom
feeds. I want to use HBase for data storage, and I am having trouble
designing the schema. For the first iteration I want to be able to
generate an aggregated feed (the last 100 posts from all feeds, in
reverse chronological order).

Currently I am using two tables:

Feeds: column families Content and Meta; the raw feed is stored in
       Content:raw
Urls:  column families Content and Meta; the raw post is stored in
       Content:raw and the rest of the data found in the RSS in Meta

I need some sort of index table for the aggregated feed. How should I
build that? Is HBase a good choice for this kind of application?

In other words: is it possible (in HBase) to design a schema that
can efficiently answer queries like the one below?

SELECT data FROM Urls ORDER BY date DESC LIMIT 100

Thanks.

--
Savu Andrei

Website: http://www.andreisavu.ro/

Re: Feed Aggregator Schema

Posted by Andrei Savu <sa...@gmail.com>.

Thanks for your answer, Peter.

I will try this approach and let you know how it works.

(In reply to Peter Rietzler's message of Mon, Aug 17, 2009 at 10:26 AM.)


Re: Feed Aggregator Schema

Posted by Peter Rietzler <pe...@smarter-ecommerce.com>.

Hi 

In our project we handle event lists with similar requirements. We get
the ordering from the row keys themselves, since HBase stores rows sorted
by key. We use the following key for our events (they should be ordered
by time in ascending order):

eventListName/yyyyMMddHHmmssSSS-000[-111]

where eventListName is the name of the event list, 000 is a three-digit
instance id that disambiguates between different running instances of the
application, and -111 is an optional suffix that disambiguates events that
occurred in the same millisecond on one instance.
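
A minimal Python sketch of the key format above, only to show that plain
lexicographic ordering of such keys matches chronological order. The
function name event_key and the sample list name "orders" are made up for
illustration; nothing here calls HBase.

```python
from datetime import datetime

def event_key(list_name, when, instance_id, seq=None):
    """Build a key shaped like eventListName/yyyyMMddHHmmssSSS-000[-111]."""
    # strftime has no millisecond field, so derive it from microseconds.
    stamp = when.strftime("%Y%m%d%H%M%S") + f"{when.microsecond // 1000:03d}"
    key = f"{list_name}/{stamp}-{instance_id:03d}"
    if seq is not None:
        # Optional suffix for events in the same millisecond on one instance.
        key += f"-{seq:03d}"
    return key

keys = [
    event_key("orders", datetime(2009, 8, 17, 10, 26, 0, 5000), 1),
    event_key("orders", datetime(2009, 8, 14, 17, 55, 38), 2),
    event_key("orders", datetime(2009, 8, 17, 10, 26, 0, 5000), 1, seq=1),
]

# Fixed-width, zero-padded fields make string order equal time order.
for key in sorted(keys):
    print(key)
```

Zero-padding every field to a fixed width is what makes this work; a
variable-width timestamp would sort incorrectly.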

We additionally insert an artificial row for each day with the id

eventListName/yyyyMMddHHmmssSSS

This allows us to start scanning at the beginning of each day without
searching through the event list.
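
As a rough illustration of the day-marker idea, the sketch below models
the table as a sorted Python list and starts a scan at the marker row. It
assumes the marker carries an all-zero time part (midnight), which the
message does not state explicitly; day_marker and scan_day are
hypothetical helper names, not HBase calls.

```python
from bisect import bisect_left

def day_marker(list_name, yyyymmdd):
    # Assumption: the marker row uses midnight, i.e. HHmmssSSS all zeros.
    return f"{list_name}/{yyyymmdd}000000000"

def scan_day(sorted_keys, list_name, yyyymmdd):
    """Return the event keys of one day, starting the scan at its marker."""
    start = day_marker(list_name, yyyymmdd)
    prefix = f"{list_name}/{yyyymmdd}"
    i = bisect_left(sorted_keys, start)  # jump straight to the start row
    events = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        if sorted_keys[i] != start:      # skip the artificial marker row
            events.append(sorted_keys[i])
        i += 1
    return events

rows = sorted([
    "orders/20090814000000000",
    "orders/20090814175538000-002",
    "orders/20090817000000000",
    "orders/20090817102600005-001",
    "orders/20090817102600005-001-001",
    "returns/20090817000000000",
])

print(scan_day(rows, "orders", "20090817"))
```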

Be aware that these keys increase monotonically, so new rows always land
in the region that holds the most recent keys. Under a very high insert
load, one HBase region server is busy inserting while the others sit
idle. If that is a problem for you, you will have to find different keys
for your purpose.
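
The message does not say what such different keys would look like. One
commonly used option, shown here purely as an assumption and not as
something used in the project described, is to prefix each key with a
small bucket number derived from a hash, so that writes spread over
several key ranges. The cost is that a time-ordered read has to scan
every bucket and merge the results. NUM_BUCKETS, salted_key and
merged_scan are hypothetical names.

```python
import heapq
import zlib

NUM_BUCKETS = 4  # hypothetical; would be tuned to the number of region servers

def salted_key(key):
    """Prefix the key with a stable bucket number so writes are spread out."""
    bucket = zlib.crc32(key.encode("ascii")) % NUM_BUCKETS
    return f"{bucket:02d}/{key}"

def unsalt(salted):
    return salted.split("/", 1)[1]

def merged_scan(buckets):
    """Merge per-bucket sorted scans back into one time-ordered stream."""
    return list(heapq.merge(*buckets, key=unsalt))

keys = [
    "orders/20090814175538000-002",
    "orders/20090817102600005-001",
    "orders/20090817102600005-001-001",
    "orders/20090817102600007-003",
]

# Group the salted keys by bucket; each bucket stays sorted, as in a scan.
buckets = [[] for _ in range(NUM_BUCKETS)]
for salted in sorted(salted_key(k) for k in keys):
    buckets[int(salted[:2])].append(salted)

for salted in merged_scan(buckets):
    print(salted)
```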

You could also use an HBase index table, but I have no experience with
it. I remember an email on this mailing list saying that it would double
all requests, because the API first looks up the index table and then the
original table (please correct me if this is not right).
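
To make the two-step lookup concrete, here is a small in-memory sketch of
a hand-maintained index for the "last 100 posts" query from the original
question. It is not the HBase index table API. It assumes an index keyed
by a reversed timestamp (a large constant minus the publication time, a
technique not mentioned in this thread) so that the newest post sorts
first, and it counts reads to show the extra request per post. FeedStore,
put_post and latest are hypothetical names.

```python
from bisect import insort

MAX_MILLIS = 10**13 - 1  # keeps the reversed timestamp at a fixed 13 digits

class FeedStore:
    def __init__(self):
        self.urls = {}   # stands in for the Urls table: post URL -> data
        self.index = []  # stands in for the index table, kept sorted by key
        self.reads = 0   # number of simulated read requests

    def put_post(self, url, published_millis, data):
        self.urls[url] = data
        # Newer posts get smaller keys, so an ascending scan is newest-first.
        index_key = f"{MAX_MILLIS - published_millis:013d}/{url}"
        insort(self.index, (index_key, url))

    def latest(self, limit=100):
        self.reads += 1                  # one scan over the index table
        hits = self.index[:limit]
        posts = []
        for _, url in hits:
            self.reads += 1              # one extra get per post in Urls
            posts.append(self.urls[url])
        return posts

store = FeedStore()
store.put_post("http://example.org/a", 1000, "post a")
store.put_post("http://example.org/c", 3000, "post c")
store.put_post("http://example.org/b", 2000, "post b")

print(store.latest(limit=2))
print(store.reads)
```

Storing the post data directly in the index row would avoid the second
lookup, at the price of keeping two copies of the data.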

Kind regards, 
Peter
