You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Adarsh Sharma <ad...@orkash.com> on 2012/02/22 06:33:40 UTC

Corresponding table in Hbase

Dear all,

After some R n D on schema design in hbase. I am confused how to design 
corresponding schema of a table in mysql.

CREATE TABLE `page_content` (
  `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' 'unique value for 
each url,
  `link_level` tinyint(4) DEFAULT NULL,
  `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
  `isvalid` tinyint(4) NOT NULL DEFAULT '1',
  `isanalyzed` tinyint(4) NOT NULL DEFAULT '0' COMMENT ,
  `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the 
records are in analyzing phase',
  `content_language` varchar(10) DEFAULT NULL,
  `url_id` bigint(20) NOT NULL,
  `publishing_date` varchar(40) DEFAULT NULL,
  `heading` varchar(150) DEFAULT NULL,
  `category` varchar(150) DEFAULT NULL,
  `crawled_page_url` varchar(500) NOT NULL,
  `keywords` varchar(500) DEFAULT NULL,
  `dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `content` mediumtext,
  PRIMARY KEY (`crawled_page_id`),
  KEY `idx_url` (`crawled_page_url`),
  KEY `idx_head` (`heading`),
  KEY `idx_dtstamp` (`dt_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

In all the examples , I find the reverse url is the row key in hbase but 
in mysql i create an auto increment column that uniquly locate a document.
Can anyone suggest what is the corresponding table in hbase.

Thanks

Re: Corresponding table in Hbase

Posted by Jacques <wh...@gmail.com>.

I'll be... didn't even know that existed in the HBase code base

:)

On Wed, Feb 22, 2012 at 9:25 AM, Stack <st...@duboce.net> wrote:

> On Wed, Feb 22, 2012 at 8:40 AM, Jacques <wh...@gmail.com> wrote:
> > We have a crawl table and here are a couple quick thoughts:
> >
> > - I'd suggest that you use reverse url as your primary key.
>  Specifically,
> > reversed host name but normal path and query string.
>
> Maybe this utility in hbase helps do what Jacque suggests above?
>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
>
> St.Ack
>

Re: Corresponding table in Hbase

Posted by Stack <st...@duboce.net>.

On Wed, Feb 22, 2012 at 8:40 AM, Jacques <wh...@gmail.com> wrote:
> We have a crawl table and here are a couple quick thoughts:
>
> - I'd suggest that you use reverse url as your primary key.  Specifically,
> reversed host name but normal path and query string.

Maybe this utility in hbase helps do what Jacque suggests above?
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)

St.Ack

Re: Corresponding table in Hbase

Posted by Jacques <wh...@gmail.com>.

We have a crawl table and here are a couple quick thoughts:

- I'd suggest that you use reverse url as your primary key.  Specifically,
reversed host name but normal path and query string.
- Rather than maintaining separate rows for the same url using timestamp or
similar, I'd recommend that you use hbase versions if you want to maintain
multiple version of the same url.

Jacques




On Wed, Feb 22, 2012 at 7:32 AM, Ian Varley <iv...@salesforce.com> wrote:

> Adarsh,
>
> HBase doesn't have the concept of a globally unique auto-incrementing "ID"
> column; that would require that all PUTs to any region of a table first go
> through some central ID authority to get a unique ID, and that sort of goes
> against the general HBase approach (in which operations on regions are
> totally independent of each other, for unbounded horizontal scalability).
>
> That said, there are a couple ways to achieve what it seems like you want:
>  - You could create a natural compound row key composed of (for example) a
> hash of the URL plus a timestamp. That way, you would be guaranteed that
> two crawls of the same URL appear as different rows (assuming they can't
> happen at the same millisecond).
>  - You could alternately use a UUID of some sort as the row key, but the
> advantage of using URL_hash + timestamp is that you can find all the rows
> for a particular URL just by knowing the URL; you don't need any external
> index.
>  - You could also "roll your own" global ID creation counter in HBase
> using a table with a counter in it, and use the atomic increment function
> to get unique values. That would still serialize all PUT operations, but it
> would be done in your code (not automatically in HBase).
>
> Remember that HBase doesn't have any secondary indexes, like the 3 you've
> added below. If you want to be able to access the data in HBase by these
> fields, you must either write it in that order according to the row key, or
> else manually write the information, denormalized, into "index-like" tables
> that you maintain yourself (noting that there's no transactional protection
> on this operation like in a relational database, so you must account for
> more failure scenarios). These are reminders that unless your data size is
> so massive that a relational database simply can't accommodate it, you're
> likely giving yourself more problems by using HBase rather than an RDBMS.
>
> Also: you might see the O'Reilly book, "HBase: The Definitive Guide" by
> the esteemed Mr. Lars George; in it, he uses a running example of a URL
> shortener application that might give you some ideas about your use case.
>
> Ian
>
> On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:
>
> Dear all,
>
> After some R n D on schema design in hbase. I am confused how to design
> corresponding schema of a table in mysql.
>
> CREATE TABLE `page_content` (
>  `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' 'unique value for
> each url,
>  `link_level` tinyint(4) DEFAULT NULL,
>  `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
>  `isvalid` tinyint(4) NOT NULL DEFAULT '1',
>  `isanalyzed` tinyint(4) NOT NULL DEFAULT '0' COMMENT ,
>  `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the
> records are in analyzing phase',
>  `content_language` varchar(10) DEFAULT NULL,
>  `url_id` bigint(20) NOT NULL,
>  `publishing_date` varchar(40) DEFAULT NULL,
>  `heading` varchar(150) DEFAULT NULL,
>  `category` varchar(150) DEFAULT NULL,
>  `crawled_page_url` varchar(500) NOT NULL,
>  `keywords` varchar(500) DEFAULT NULL,
>  `dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
>  `content` mediumtext,
>  PRIMARY KEY (`crawled_page_id`),
>  KEY `idx_url` (`crawled_page_url`),
>  KEY `idx_head` (`heading`),
>  KEY `idx_dtstamp` (`dt_stamp`)
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1
>
> In all the examples , I find the reverse url is the row key in hbase but
> in mysql i create an auto increment column that uniquly locate a document.
> Can anyone suggest what is the corresponding table in hbase.
>
> Thanks
>
>

Re: Corresponding table in Hbase

Posted by Ian Varley <iv...@salesforce.com>.

Adarsh,

HBase doesn't have the concept of a globally unique auto-incrementing "ID" column; that would require that all PUTs to any region of a table first go through some central ID authority to get a unique ID, and that sort of goes against the general HBase approach (in which operations on regions are totally independent of each other, for unbounded horizontal scalability).

That said, there are a couple ways to achieve what it seems like you want:
 - You could create a natural compound row key composed of (for example) a hash of the URL plus a timestamp. That way, you would be guaranteed that two crawls of the same URL appear as different rows (assuming they can't happen at the same millisecond).
 - You could alternately use a UUID of some sort as the row key, but the advantage of using URL_hash + timestamp is that you can find all the rows for a particular URL just by knowing the URL; you don't need any external index.
 - You could also "roll your own" global ID creation counter in HBase using a table with a counter in it, and use the atomic increment function to get unique values. That would still serialize all PUT operations, but it would be done in your code (not automatically in HBase).

Remember that HBase doesn't have any secondary indexes, like the 3 you've added below. If you want to be able to access the data in HBase by these fields, you must either write it in that order according to the row key, or else manually write the information, denormalized, into "index-like" tables that you maintain yourself (noting that there's no transactional protection on this operation like in a relational database, so you must account for more failure scenarios). These are reminders that unless your data size is so massive that a relational database simply can't accommodate it, you're likely giving yourself more problems by using HBase rather than an RDBMS.

Also: you might see the O'Reilly book, "HBase: The Definitive Guide" by the esteemed Mr. Lars George; in it, he uses a running example of a URL shortener application that might give you some ideas about your use case.

Ian

On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:

Dear all,

After some R n D on schema design in hbase. I am confused how to design
corresponding schema of a table in mysql.

CREATE TABLE `page_content` (
 `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' 'unique value for
each url,
 `link_level` tinyint(4) DEFAULT NULL,
 `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
 `isvalid` tinyint(4) NOT NULL DEFAULT '1',
 `isanalyzed` tinyint(4) NOT NULL DEFAULT '0' COMMENT ,
 `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the
records are in analyzing phase',
 `content_language` varchar(10) DEFAULT NULL,
 `url_id` bigint(20) NOT NULL,
 `publishing_date` varchar(40) DEFAULT NULL,
 `heading` varchar(150) DEFAULT NULL,
 `category` varchar(150) DEFAULT NULL,
 `crawled_page_url` varchar(500) NOT NULL,
 `keywords` varchar(500) DEFAULT NULL,
 `dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
 `content` mediumtext,
 PRIMARY KEY (`crawled_page_id`),
 KEY `idx_url` (`crawled_page_url`),
 KEY `idx_head` (`heading`),
 KEY `idx_dtstamp` (`dt_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

In all the examples , I find the reverse url is the row key in hbase but
in mysql i create an auto increment column that uniquly locate a document.
Can anyone suggest what is the corresponding table in hbase.

Thanks