Posted to user@hbase.apache.org by Fuad Efendi <fu...@efendi.ca> on 2008/09/22 20:30:25 UTC

HBase Sample Schema

Hi,

I found this basic sample and I'd like to confirm my understanding of the
use cases and best practices (applicability) of HBase... Thanks!
=============


Sample (Ankur Goel, 27-March-08,  
http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via  
hbase-user@hadoop.apache.org or Nabble):
=============

DESCRIPTION: Used to store seed URLs (both old and newly discovered).
             Initially populated with some seed URLs. The crawl controller
             picks up the seeds from this table that have status=0 (Not Visited)
             or status=2 (Visited, but ready for re-crawl) and feeds these seeds
             in batch to different crawl engines that it knows about.

SCHEMA:      Column families below

         {"referer_id:", "100"},   // the integer is the max length
         {"url:","1500"},
         {"site:","500"},
         {"last_crawl_date:", "1000"},
         {"next_crawl_date:", "1000"},
         {"create_date:","100"},
         {"status:","100"},
         {"strike:", "100"},
         {"language:","150"},
         {"topic:","500"},
         {"depth:","100000"}


======================
Modified Schema & Analysis (Fuad Efendi):

My understanding is that we need to scan the whole table in order to find
records where (for instance) "last_crawl_date" is less than a specific
point in time... Additionally, the crawler should be polite, and the list of
URLs to fetch should be evenly distributed across domains, hosts and IPs.

A few solutions for finding records by "last_crawl_date" have been
discussed a little in blogs, on the mailing list, etc.:
- use a scanner (a rough sketch of this option follows below)
- maintain an additional Lucene index
- run a MapReduce job (multithreaded, parallel) outputting the list of URLs
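
For the scanner option, a rough sketch with the current Java client; the
"seed_urls" name is carried over from the sketch above, and storing
last_crawl_date as an 8-byte big-endian long is my own assumption. It also
shows why this is expensive: the filter runs server-side, but every row of
the table still has to be read.

import java.io.IOException;

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FindDueUrls {
  // Prints URLs whose last_crawl_date is older than the given cutoff.
  // The empty qualifier mirrors the "url:" / "last_crawl_date:" notation above.
  public static void printDueUrls(Connection connection, long cutoffMillis) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("seed_urls"))) {
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("url"), Bytes.toBytes(""));
      scan.addColumn(Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""));
      // Assumes last_crawl_date values were written with Bytes.toBytes(long).
      // Rows without the column are also returned by default, which
      // conveniently covers never-crawled URLs.
      scan.setFilter(new SingleColumnValueFilter(
          Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""),
          CompareOperator.LESS,
          new BinaryComparator(Bytes.toBytes(cutoffMillis))));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          System.out.println(
              Bytes.toString(row.getValue(Bytes.toBytes("url"), Bytes.toBytes(""))));
        }
      }
    }
  }
}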


My own possible solution; I'd appreciate your feedback:
====================

Simplified schema with two tables (non-transactional):

1. URL_TO_FETCH
         {"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY  
(sorted row_id),
         {"url:","1500"},

2. URL_CONTENT
         {"url:","1500"}  PRIMARY KEY (sorted row_id),
         {"site:","500"},
         ... ... ...,
         {"language:","150"},
         {"topic:","500"},
         {"depth:","100000"}


Table URL_TO_FETCH is initially seeded with root domain names and a
"dummy" last_crawl_date (a unique-per-host 'old' timestamp):
00000000000000000001  www.website1.com
00000000000000000002  www.website2.com
00000000000000000003  www.website3.com
00000000000000000004  www.website4.com
...


After a successful fetch of the initial URLs:
00000000010000000001  www.website1.com/page1
00000000010000000002  www.website2.com/page1
00000000010000000003  www.website3.com/page1
00000000010000000004  www.website4.com/page1
...
00000000020000000001  www.website1.com/page2
00000000020000000002  www.website2.com/page2
00000000020000000003  www.website3.com/page2
00000000020000000004  www.website4.com/page2
...
00000000030000000001  www.website1.com/page3
00000000030000000002  www.website2.com/page3
00000000030000000003  www.website3.com/page3
00000000030000000004  www.website4.com/page3
...
...
...
0000000000xxxxxxxxxx  www.website1.com
0000000000xxxxxxxxxx  www.website2.com
0000000000xxxxxxxxxx  www.website3.com
0000000000xxxxxxxxxx  www.website4.com
...

(xxxxxxxxxx is the "current time in milliseconds", i.e. the timestamp recorded after a successful fetch)
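
A minimal key-building sketch (plain Java), assuming ten zero-padded digits
for each part and seconds rather than milliseconds so the timestamp fits in
ten characters; the exact widths and encoding are my own assumptions:

// Minimal sketch of the proposed URL_TO_FETCH row key: internal_link_id
// first, last_crawl_date second, both as ten zero-padded decimal digits
// (the widths and the seconds-based timestamp are assumptions).
public final class UrlToFetchKey {

  public static String rowKey(long internalLinkId, long lastCrawlSeconds) {
    return String.format("%010d%010d", internalLinkId, lastCrawlSeconds);
  }

  public static void main(String[] args) {
    // Seed row: link id 0 plus a dummy unique-per-host "old" timestamp.
    System.out.println(rowKey(0, 1) + "  www.website1.com");
    // Newly discovered link: per-host link counter 1, same dummy timestamp.
    System.out.println(rowKey(1, 1) + "  www.website1.com/page1");
    // Re-inserted after a successful fetch: the real crawl time.
    System.out.println(rowKey(0, System.currentTimeMillis() / 1000L) + "  www.website1.com");
  }
}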

What we get: we don't need an additional Lucene index; we don't need a
MapReduce job to populate the list of items to be fetched (the way it's
done in Nutch); we don't need thousands of per-host scanners; we have a
mutable primary key; all new records are inserted at the beginning of
the table; fetched items are moved to the end of the table.
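
Since unfetched URLs sort to the front, building the fetch list becomes a
plain scan from the start of the table. A rough sketch, again with the
current Java client; the batch size and the empty qualifier are illustrative
assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FetchListBuilder {
  // Reads the first batchSize rows of URL_TO_FETCH; because unfetched URLs
  // sort first, no filter, Lucene index or MapReduce job is needed.
  public static List<String> nextBatch(Connection connection, int batchSize) throws IOException {
    List<String> urls = new ArrayList<>();
    try (Table table = connection.getTable(TableName.valueOf("URL_TO_FETCH"));
         ResultScanner scanner = table.getScanner(new Scan().setLimit(batchSize))) {
      for (Result row : scanner) {
        urls.add(Bytes.toString(row.getValue(Bytes.toBytes("url"), Bytes.toBytes(""))));
      }
    }
    return urls;
  }
}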

The second (helper) table is indexed by URL:
         {"url:","1500"}  PRIMARY KEY (sorted row_id),
         ...


Am I right? It looks cool that, at extremely low cost, I can maintain a
specific "reordering" via the mutable primary key, following
crawl-specific requirements...

Thanks,
Fuad





Re: HBase Sample Schema

Posted by Funtick <fu...@efendi.ca>.
Probably this is a mistake in the design:

1. URL_TO_FETCH 
         {"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY   


The order should be reversed: "last_crawl_date" + "per_host_link_counter" + "host"

[0000 +  0002 + www.website1.com]: www.website1.com/page2
[0000 +  0002 + www.website2.com]: www.website2.com/page2
[0000 +  0002 + www.website3.com]: www.website3.com/page2
...
[0000 +  0003 + www.website1.com]: www.website1.com/page3
[0000 +  0003 + www.website2.com]: www.website2.com/page3
[0000 +  0003 + www.website3.com]: www.website3.com/page3
...
[XXXX +  0000 + www.website1.com]: www.website1.com
[XXXX +  0000 + www.website2.com]: www.website2.com
[XXXX +  0000 + www.website3.com]: www.website3.com
[XXXX +  0001 + www.website1.com]: www.website1.com/page1
[XXXX +  0001 + www.website2.com]: www.website2.com/page1
[XXXX +  0001 + www.website3.com]: www.website3.com/page1


where XXXX is the timestamp, i.e. the last_crawl_date of a successful crawl

Doing "delete" with "insert" instead of modifying PK; although it does not
matter for HBase (?)
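
The sketch uses the current Java client; the fixed-width, zero-padded string
encoding is the same assumption as before, and the method and table names
are illustrative:

import java.io.IOException;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReversedFetchKey {

  // last_crawl_date first, then the per-host link counter, then the host,
  // so that the oldest (or never-crawled) URLs sort to the front and URLs
  // of the same age interleave across hosts for politeness.
  public static String rowKey(long lastCrawlSeconds, long perHostLinkCounter, String host) {
    return String.format("%010d%010d%s", lastCrawlSeconds, perHostLinkCounter, host);
  }

  // Row keys are immutable in HBase, so "updating" the key after a fetch is
  // a delete of the old row plus an insert of a new one.
  public static void markFetched(Connection connection, String oldKey, long perHostLinkCounter,
                                 String host, String url, long crawledAtSeconds) throws IOException {
    try (Table table = connection.getTable(TableName.valueOf("URL_TO_FETCH"))) {
      table.delete(new Delete(Bytes.toBytes(oldKey)));
      Put put = new Put(Bytes.toBytes(rowKey(crawledAtSeconds, perHostLinkCounter, host)));
      put.addColumn(Bytes.toBytes("url"), Bytes.toBytes(""), Bytes.toBytes(url));
      table.put(put);
    }
  }
}

The delete and the put are two separate, non-atomic operations, which matches
the "non-transactional" caveat in the proposed schema above.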

Thanks




