Posted to user@hbase.apache.org by Jigar Shah <ji...@infodesk.com> on 2014/03/03 08:23:39 UTC

HBase Schema for IPTC News ML G2

I work in the news processing industry. Our current system processes more
than a million articles per week and provides this data to users in real
time; it also provides search capabilities via Lucene.

We convert all news to the standard IPTC NewsML-G2 format
<http://www.iptc.org/site/News_Exchange_Formats/NewsML-G2/> before
providing it to users (in real time or via search).

We need a component that answers analytical queries on the news data. I
plan to load all of this data into HBase and then run MapReduce jobs to
compute the analytical queries. The current system is built on PostgreSQL
and stores only 3 months of data; anything more than that is big data for
us, as it doesn't fit on one server.

But I am a bit confused about how to design the schema for it.

Every news article has:

*"messageID" as guid*, a unique ID for the news message.
*"version" as int*, incremented if a newer version of the same news message
is published.
There are other fields such as location, channels, title, content, source, etc.

The current database's primary key is a composite of (messageID, version).

I thought I should use "messageID" as the "rowKey" in HBase and "version"
as the "columnFamily", with all the columns being fields of the news item
(like location, channels, title, body, sentTimestamp, ...).

Is keeping "version" as the "columnFamily" a good idea?

In reality, a single message may have thousands of versions.

Re: HBase Schema for IPTC News ML G2

Posted by Jigar Shah <ji...@infodesk.com>.

Thanks James, this seems very interesting.

On 03/04/2014 03:02 AM, James Taylor wrote:
> Hi Jigar,
> Take a look at Apache Phoenix: http://phoenix.incubator.apache.org/
> It allows you to use SQL to query over your HBase data and supports
> composite primary keys, so you could create a schema like this:
>
> create table news_message(guid varchar not null, version bigint not null,
>      constraint pk primary key (guid, version desc));
>
> The rows will then sort by guid plus version descending. Then you can issue
> SQL queries directly against your HBase data without writing MapReduce jobs.
> Note that we don't yet support all the SQL constructs that PostgreSQL does.
>
> HTH,
> James

Re: HBase Schema for IPTC News ML G2

Posted by James Taylor <jt...@salesforce.com>.

Hi Jigar,
Take a look at Apache Phoenix: http://phoenix.incubator.apache.org/
It allows you to use SQL to query over your HBase data and supports
composite primary keys, so you could create a schema like this:

create table news_message(guid varchar not null, version bigint not null,
    constraint pk primary key (guid, version desc));

The rows will then sort by guid plus version descending. Then you can issue
SQL queries directly against your HBase data without writing MapReduce jobs.
Note that we don't yet support all the SQL constructs that PostgreSQL does.
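
To make the "guid plus version descending" ordering concrete, here is a
minimal Python sketch of how a composite row key could be laid out as bytes
so that plain lexicographic ordering (which is how HBase sorts row keys)
puts the newest version of each message first. This is only an illustration
and not necessarily the encoding Phoenix uses internally. The names
make_row_key, parse_row_key, latest_version and SEPARATOR are hypothetical,
and the sketch assumes versions are non-negative and guids never contain a
zero byte.

```python
import bisect
import struct
from typing import List, Optional, Tuple

SEPARATOR = b"\x00"      # assumed terminator after the variable-length guid
MAX_INT64 = 2**63 - 1    # largest version this sketch accepts


def make_row_key(guid: str, version: int) -> bytes:
    """Build a key that sorts by guid ascending, then version descending."""
    if not 0 <= version <= MAX_INT64:
        raise ValueError("version must be between 0 and 2**63 - 1")
    # Storing (MAX_INT64 - version) big-endian makes a larger version
    # compare as a smaller byte string, so the newest version sorts first.
    inverted = struct.pack(">Q", MAX_INT64 - version)
    return guid.encode("utf-8") + SEPARATOR + inverted


def parse_row_key(key: bytes) -> Tuple[str, int]:
    """Split a key back into (guid, version)."""
    # The last 8 bytes are the inverted version; the byte before them is
    # the separator. Parsing from the end is safe even though the version
    # bytes themselves may contain zero bytes.
    (inverted,) = struct.unpack(">Q", key[-8:])
    return key[:-9].decode("utf-8"), MAX_INT64 - inverted


def latest_version(sorted_keys: List[bytes], guid: str) -> Optional[int]:
    """Mimic a prefix scan: the first key for a guid holds its newest version."""
    prefix = guid.encode("utf-8") + SEPARATOR
    i = bisect.bisect_left(sorted_keys, prefix)
    if i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        return parse_row_key(sorted_keys[i])[1]
    return None


if __name__ == "__main__":
    rows = [("msg-b", 1), ("msg-a", 1), ("msg-a", 3), ("msg-a", 2)]
    keys = sorted(make_row_key(g, v) for g, v in rows)
    for key in keys:
        print(parse_row_key(key))
    print("latest msg-a:", latest_version(keys, "msg-a"))
```

With a layout like this, every version is its own row rather than its own
column family, so a message with thousands of versions only adds rows, and
reading the latest version is a short scan starting at the guid prefix.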

HTH,
James

