Posted to user@hbase.apache.org by Fuad Efendi <fu...@efendi.ca> on 2008/09/22 20:30:25 UTC
HBase Sample Schema
Hi,
I found this basic sample and I'd like to confirm my understanding of the
use cases and best practices (applicability) of HBase... Thanks!
=============
Sample (Ankur Goel, 27-March-08,
http://markmail.org/message/kbm3ys2eqnjn3ipe - I can't reply via
hbase-user@hadoop.apache.org or Nabble):
=============
DESCRIPTION: Used to store seed URLs (both old and newly discovered). Initially
             populated with some seed URLs. The crawl controller picks up the
             seeds from this table that have status=0 (Not Visited) or status=2
             (Visited, but ready for re-crawl) and feeds these seeds in batch
             to the different crawl engines that it knows about.
SCHEMA: Column families below
{"referer_id:", "100"}, // Integer here is Max_Length
{"url:","1500"},
{"site:","500"},
{"last_crawl_date:", "1000"},
{"next_crawl_date:", "1000"},
{"create_date:","100"},
{"status:","100"},
{"strike:", "100"},
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
======================
Modified Schema & Analysis (Fuad Efendi):
My understanding is that we need to scan the whole table in order to find
records where (for instance) "last_crawl_date" is less than a specific
point in time... Additionally, the crawler should be polite, and the list of
URLs to fetch should be evenly distributed across domains/hosts/IPs.
A few solutions for finding records by "last_crawl_date" have been
discussed in blogs, on the distribution list, etc.:
- to have a scanner (sketched below)
- to have an additional Lucene index
- to have a MapReduce job (multithreaded, parallel) outputting the list of URLs
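To make the first option concrete, here is a minimal sketch of the scanner
approach, assuming last_crawl_date is stored as an epoch-milliseconds string
under an empty qualifier, and reusing the later client API and the hypothetical
"crawl_seeds" table from the sketch above. Its weakness is exactly the problem
described: it has to touch every row in the table.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FindDueUrls {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "crawl_seeds"); // hypothetical table name
    long cutoff = System.currentTimeMillis() - 24L * 60 * 60 * 1000; // due if older than one day

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""));
    scan.addColumn(Bytes.toBytes("url"), Bytes.toBytes(""));

    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] date = row.getValue(Bytes.toBytes("last_crawl_date"), Bytes.toBytes(""));
      byte[] url = row.getValue(Bytes.toBytes("url"), Bytes.toBytes(""));
      if (date != null && url != null && Long.parseLong(Bytes.toString(date)) < cutoff) {
        System.out.println(Bytes.toString(url)); // candidate for re-crawl
      }
    }
    scanner.close();
  }
}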
My own possible solution, need your feedback:
====================
Simplified schema with two tables (non-transactional):
1. URL_TO_FETCH
{"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY
(sorted row_id),
{"url:","1500"},
2. URL_CONTENT
{"url:","1500"} PRIMARY KEY (sorted row_id),
{"site:","500"},
... ... ...,
{"language:","150"},
{"topic:","500"},
{"depth:","100000"}
Table URL_TO_FETCH is initially seeded with root domain names and a
"dummy" last_crawl_date (a unique-per-host 'old' timestamp):
00000000000000000001 www.website1.com
00000000000000000002 www.website2.com
00000000000000000003 www.website3.com
00000000000000000004 www.website4.com
...
After successful fetch of initial URLs:
00000000010000000001 www.website1.com/page1
00000000010000000002 www.website2.com/page1
00000000010000000003 www.website3.com/page1
00000000010000000004 www.website4.com/page1
...
00000000020000000001 www.website1.com/page2
00000000020000000002 www.website2.com/page2
00000000020000000003 www.website3.com/page2
00000000020000000004 www.website4.com/page2
...
00000000030000000001 www.website1.com/page3
00000000030000000002 www.website2.com/page3
00000000030000000003 www.website3.com/page3
00000000030000000004 www.website4.com/page3
...
...
...
0000000000xxxxxxxxxx www.website1.com
0000000000xxxxxxxxxx www.website2.com
0000000000xxxxxxxxxx www.website3.com
0000000000xxxxxxxxxx www.website4.com
...
(xxxxxxxxxx is the current-time timestamp set on a successful fetch; with
10 digits this is epoch seconds rather than milliseconds)
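Reading the 20-character keys above as two zero-padded 10-digit fields
(internal_link_id followed by last_crawl_date, per the PRIMARY KEY definition),
a minimal sketch of building them; the zero-padding is what makes HBase's
lexicographic row ordering agree with the intended numeric ordering:

public class FetchKey {
  // Key = internal_link_id (10 digits) + last_crawl_date (10 digits), both
  // zero-padded so byte-wise lexicographic order equals numeric order.
  static String toFetchKey(long internalLinkId, long lastCrawlDate) {
    return String.format("%010d%010d", internalLinkId, lastCrawlDate);
  }

  public static void main(String[] args) {
    System.out.println(toFetchKey(0, 1));           // seed row:  00000000000000000001
    System.out.println(toFetchKey(1, 1));           // page1 row: 00000000010000000001
    System.out.println(toFetchKey(0, 1222110625L)); // fetched root with a real timestamp, sorts toward the end
  }
}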
What we have: we don't need an additional Lucene index; we don't need a
MapReduce job to populate the list of items to be fetched (the way it's
done in Nutch); we don't need thousands of per-host scanners; we have a
mutable primary key; all new records are inserted at the beginning of
the table; fetched items are moved to the end of the table.
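With the keys laid out this way, handing the crawl engines their next batch is
just a scan over the first N rows of URL_TO_FETCH; a minimal sketch, again with
a later client API and a hypothetical batch size:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NextFetchBatch {
  public static void main(String[] args) throws Exception {
    HTable toFetch = new HTable(HBaseConfiguration.create(), "URL_TO_FETCH");
    int batchSize = 1000; // hypothetical batch size handed to the crawl engines

    // New / not-yet-fetched rows sort first, so the head of the table is the work queue.
    ResultScanner scanner = toFetch.getScanner(new Scan());
    int taken = 0;
    for (Result row : scanner) {
      byte[] url = row.getValue(Bytes.toBytes("url"), Bytes.toBytes(""));
      if (url != null) {
        System.out.println(Bytes.toString(url));
      }
      if (++taken >= batchSize) {
        break;
      }
    }
    scanner.close();
  }
}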
Second (helper) table is indexed by URL:
{"url:","1500"} PRIMARY KEY (sorted row_id),
...
Am I right? It looks cool that, at extremely low cost, I can maintain
crawl-specific "reordering" via this mutable primary key...
Thanks,
Fuad
Re: HBase Sample Schema
Posted by Funtick <fu...@efendi.ca>.
Probably there is a mistake in this design:
1. URL_TO_FETCH
{"internal_link_id" + "last_crawl_date", "1000"} PRIMARY KEY
It should be reversed: "last_crawl_date" + "per_host_link_counter" + "host"
[0000 + 0002 + www.website1.com]: www.website1.com/page2
[0000 + 0002 + www.website2.com]: www.website2.com/page2
[0000 + 0002 + www.website3.com]: www.website3.com/page2
...
[0000 + 0003 + www.website1.com]: www.website1.com/page3
[0000 + 0003 + www.website2.com]: www.website2.com/page3
[0000 + 0003 + www.website3.com]: www.website3.com/page3
...
[XXXX + 0000 + www.website1.com]: www.website1.com
[XXXX + 0000 + www.website2.com]: www.website2.com
[XXXX + 0000 + www.website3.com]: www.website3.com
[XXXX + 0001 + www.website1.com]: www.website1.com/page1
[XXXX + 0001 + www.website2.com]: www.website2.com/page1
[XXXX + 0001 + www.website3.com]: www.website3.com/page1
where XXXX is the last_crawl_date timestamp (of the last successful crawl)
Doing "delete" with "insert" instead of modifying PK; although it does not
matter for HBase (?)
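A minimal sketch of that delete-plus-insert rekeying with the reversed key
(last_crawl_date + per_host_link_counter + host), again assuming a later client
API; the zero-padding widths and the example URL are illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RekeyAfterFetch {
  // Reversed key: zero-padded last_crawl_date, then per-host link counter, then host.
  static String fetchKey(long lastCrawlDate, long perHostLinkCounter, String host) {
    return String.format("%010d%04d%s", lastCrawlDate, perHostLinkCounter, host);
  }

  public static void main(String[] args) throws Exception {
    HTable toFetch = new HTable(HBaseConfiguration.create(), "URL_TO_FETCH");

    String host = "www.website1.com";
    String url = "www.website1.com/page1";
    long counter = 1;

    // The row as it existed before this fetch (old / dummy last_crawl_date).
    String oldKey = fetchKey(0, counter, host);
    // After a successful fetch, re-insert under the new timestamp and delete the
    // old row, which "moves" the URL from the head of the table to its new position.
    String newKey = fetchKey(System.currentTimeMillis() / 1000, counter, host);

    Put put = new Put(Bytes.toBytes(newKey));
    put.add(Bytes.toBytes("url"), Bytes.toBytes(""), Bytes.toBytes(url));
    toFetch.put(put);
    toFetch.delete(new Delete(Bytes.toBytes(oldKey)));
  }
}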
Thanks