Posted to user@hbase.apache.org by Li Li <fa...@gmail.com> on 2014/01/09 11:42:12 UTC

is this rowkey schema feasible?

hi all,
    I want to use hbase to store all urls for a distributed crawler.
there is a central scheduler to schedule all unCrawled urls by
priority. Following is my design of rowkey and common data access
pattern, is there any better rowkey design for my usecase?

    the row key is: reverse_host--status--priority--MD5(path). some examples:
    com.google.www/-0-10-MD5(path1)
    com.google.www/-0-9-MD5(path2)
    ...
    com.google.www/-1-10-MD5(path3)
    status 0 means not crawled and 1 means crawled
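
For concreteness, a small sketch of how such a rowkey could be built. The reverseHost/md5Hex helpers below are illustrative, not part of the original design; also note that string priorities like "9" vs "10" will not sort numerically unless zero-padded.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class RowKeys {
        // reverse_host + "/-" + status + "-" + priority + "-" + MD5(path), as described above
        static String rowKey(String host, int status, int priority, String path) throws Exception {
            return reverseHost(host) + "/-" + status + "-" + priority + "-" + md5Hex(path);
        }

        // "www.google.com" -> "com.google.www"
        static String reverseHost(String host) {
            String[] parts = host.split("\\.");
            StringBuilder sb = new StringBuilder(host.length());
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            return sb.toString();
        }

        static String md5Hex(String s) throws Exception {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        }
    }
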
    my scheduler:
    int batchSize=10000;
    Map<String,Integer> hostCount=calcHostPriority(batchSize);
    List<String> toBeCrawledUrls=new ArrayList<String>(batchSize);
    for(Map.Entry<String,Integer> entry:hostCount.entrySet()){
         //select top N priority uncrawled urls for this host
         byte[] startRow=Bytes.toBytes(reverse(entry.getKey())+"/-0");
         byte[] stopRow=Bytes.toBytes(reverse(entry.getKey())+"/-1");
         Scan s = new Scan(startRow, stopRow);
         //note: Scan.setMaxResultSize limits bytes, not rows, so count rows instead
         ResultScanner scanner = table.getScanner(s);
         int n = 0;
         for(Result r : scanner){
              if(n++ >= entry.getValue()) break;
              toBeCrawledUrls.add(Bytes.toString(r.getRow()));
         }
         scanner.close();
    }

    //update after crawling
    for(String url:crawledUrls){
         delete url //com.google.www/-0-10-MD5(path)
         put url //com.google.www/-1-10-MD5(path)
    }

    //check url exists
    any better method than this?
     assuming only 1-10 priority
   try get:
        com.google.www/-0-10-MD5(path)
        com.google.www/-1-10-MD5(path)
        com.google.www/-0-9-MD5(path)
        ....
        com.google.www/-1-1-MD5(path)
    if any exists, then true
    else false
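
For reference, a minimal sketch of the batched existence check described above, issuing all 20 gets in one round trip; the Table handle and the reverseHost/md5Path arguments are assumed to come from elsewhere. (Stack's later replies suggest arranging the key so a scan can replace these point gets.)

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UrlExists {
        // True if the url is present under any (status, priority) combination.
        static boolean exists(Table table, String reverseHost, String md5Path) throws IOException {
            List<Get> gets = new ArrayList<Get>();
            for (int status = 0; status <= 1; status++) {
                for (int priority = 10; priority >= 1; priority--) {
                    gets.add(new Get(Bytes.toBytes(
                            reverseHost + "/-" + status + "-" + priority + "-" + md5Path)));
                }
            }
            // One batched call instead of 20 single round trips.
            for (Result r : table.get(gets)) {
                if (r != null && !r.isEmpty()) {
                    return true;
                }
            }
            return false;
        }
    }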

Re: is this rowkey schema feasible?

Posted by Li Li <fa...@gmail.com>.
Here is the detailed architecture of my distributed vertical crawler.
http://www.flickr.com/photos/114261973@N07/

I hope you can give me some advice.

1. goal
    I want to implement a distributed vertical (topical) crawler. It
will only store webpages on a certain topic; I will have a classifier
to decide this.
    I estimate the number of webpages that need to be stored is in the
tens of millions (maybe hundreds of millions over time).
    A vertical crawler should crawl the pages most likely related to my
target topics, so I need a frontier that can dispatch tasks by priority.
    for now the priority is simple, but we hope it can handle more
complicated priority algorithms (a rough scoring sketch follows the
list below).
    1.1 host priority
          we should crawl many hosts rather than only one single host
at the same time. Initially each host should be crawled equally, but
over time we can calculate the priority of a host dynamically.
          e.g. we can control the crawl speed for a certain host based
on its crawl history (some sites will ban our crawler if we use too
many concurrent threads against them), or we can adjust the priority
of a host by whether it is relevant to our topic (we can calculate the
relevance of the crawled pages).
    1.2 enqueue time
          webpages enqueued earlier should get higher priority
    1.3 depth
          webpages with smaller depth should get higher priority
(something like a BFS traversal)
    1.4 other page priorities
          e.g. page rank, list page/detail page ...
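
As referenced above, a rough illustrative sketch of how signals 1.1-1.4 could be folded into one comparable score; the weights and the formula are placeholders, not part of the design:

    public class UrlPriority {
        // Higher score = crawl sooner. Weights below are placeholders for tuning.
        static double score(double hostPriority,    // 1.1 dynamic per-host priority
                            long enqueueTimeMillis, // 1.2 earlier enqueue -> higher score
                            int depth,              // 1.3 smaller depth -> higher score (BFS-like)
                            double pagePriority) {  // 1.4 e.g. page rank, list vs detail page
            double waitedHours = (System.currentTimeMillis() - enqueueTimeMillis) / 3600000.0;
            return 1.0 * hostPriority
                 + 0.1 * waitedHours
                 - 0.5 * depth
                 + 1.0 * pagePriority;
        }
    }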

2. architecture
   see picture:   http://www.flickr.com/photos/114261973@N07/
    2.1 Seed Discover
          use google or other websites to find some seed urls
    2.2 Url DB
          a distributed DB to store all metadata about urls (this is
the most hbase-related part)
    2.3 Task Scheduler
         as described before, the task scheduler selects the top N
priority webpages and dispatches them to the fetcher cluster
    2.4 Message Queues
         we use ActiveMQ to decouple the different modules and also
for load balancing
    2.5 Fetchers
          download webpages
    2.6 WebPageDB
          stores crawled webpages and the metadata extracted from them
(such as title, content, pub_time, author, etc.). we are considering
hbase for this too.
    2.7 Extractors
          use the classifier to judge whether a page is related to our
topics, extract metadata from it, and store the results in WebPageDB


3. main challenges
    3.1 Url DB
       as described before, this store (maybe hbase) should support
sophisticated priority algorithms, and we also use it to avoid
crawling a webpage more than once.
    3.2 task scheduler
       how to achieve the goals described above

4. current solution
    4.1 use hbase (maybe together with phoenix) to store urls (we have
not done the schema design yet and hope to get some advice here)
    4.2 scheduler algorithm
          int batchSize=10000;
          //dispatch batchSize tasks to different hosts by host priorities;
          Map<String,Integer> hostCount=...
          //select top priority urls from each host
          List<String> toBeCrawledUrls=new ArrayList<String>(batchSize);
          for(Entry<String,Integer> entry:hostCount.entrySet()){
               //select top priority N urls from a given host
               List<String>
urls=selectTopNUrlsFromHost(entry.getKey(), entry.getValue());
               toBeCrawledUrls.addAll(urls);
          }
          //dispatch these urls to the message queue (see the JMS sketch below)
          //monitor the message queue status
          //if the queue is fully (or 3/4) consumed, go back to the top
          //and dispatch another batch of urls
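
Since 2.4 above uses ActiveMQ, a minimal sketch of what dispatching one scheduled batch over JMS might look like; the broker URL and queue name are placeholders, and error handling is omitted:

    import java.util.List;
    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class UrlDispatcher {
        // Push one batch of scheduled urls onto the crawl queue consumed by the fetchers.
        static void dispatch(List<String> toBeCrawledUrls) throws Exception {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("crawler.urls")); // placeholder queue
            for (String url : toBeCrawledUrls) {
                producer.send(session.createTextMessage(url));
            }
            producer.close();
            session.close();
            connection.close();
        }
    }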

5. using map-reduce or hbase?
     we discussed the possible usage of map-reduce or only hbase
     if the scheduling algorithm is very complicated and should
consider many things, maybe we should use map-reduce
     But for now our algorithm is simple, and using an hbase
coprocessor (or phoenix) can be thought of as a simple online map-reduce.
     we can use a coprocessor to implement simple aggregation functions,
or use phoenix sql (select ... count(*) ... where ... group by ...
having ...), as in the sketch below.
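
For example, a small sketch of such an aggregation through the Phoenix JDBC driver, assuming the url_db schema from James Taylor's example quoted later in this message; the zookeeper quorum "localhost" is a placeholder:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class UncrawledCounts {
        public static void main(String[] args) throws Exception {
            // Phoenix JDBC URL is jdbc:phoenix:<zookeeper quorum>
            Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
            // Server-side aggregation instead of a full map-reduce job:
            // how many uncrawled urls are waiting at each priority level.
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT priority, COUNT(*) FROM url_db " +
                    "WHERE status = 0 " +
                    "GROUP BY priority " +
                    "HAVING COUNT(*) > 0 " +
                    "ORDER BY priority DESC");
            ResultSet rs = stmt.executeQuery();
            while (rs.next()) {
                System.out.println("priority " + rs.getInt(1) + ": " + rs.getLong(2) + " uncrawled urls");
            }
            conn.close();
        }
    }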




On Fri, Jan 3, 2014 at 8:19 PM, Jean-Marc Spaggiari
<je...@spaggiari.org> wrote:
> Interesting. This is exactly what I'm doing ;)
>
> I'm using 3 tables to achieve this.
>
> One table with the URLs already crawled (80 million), one table with the URLs
> to crawl (2 billion) and one table with the URLs being processed. I'm not
> running any SQL requests against my dataset but I have MR jobs doing many
> different things. I have many other tables to help with the work on the
> URLs.
>
> I'm "salting" the keys using the URL hash so I can find them back very
> quickly. There can be some collisions so I store also the URL itself on the
> key. So very small scans returning 1 or something 2 rows allow me to
> quickly find a row knowing the URL.
>
> I also have secondary index tables to store the CRCs of the pages to
> identify duplicate pages based on this value.
>
> And so on ;) Working on that for 2 years now. I might have been able to use
> Nutch and others, but my goal was to learn and do that with a distributed
> client on a single dataset...
>
> Enjoy.
>
> JM
>
>
> 2014/1/3 James Taylor <jt...@salesforce.com>
>
>> Sure, no problem. One addition: depending on the cardinality of your
>> priority column, you may want to salt your table to prevent hotspotting,
>> since you'll have a monotonically increasing date in the key. To do that,
>> just add " SALT_BUCKETS=<n>" on to your query, where <n> is the number of
>> machines in your cluster. You can read more about salting here:
>> http://phoenix.incubator.apache.org/salted.html
>>
>>
>> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <fa...@gmail.com> wrote:
>>
>> > thank you. it's great.
>> >
>> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <jt...@salesforce.com>
>> > wrote:
>> > > Hi LiLi,
>> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a
>> > SQL
>> > > skin on top of HBase. You can model your schema and issue your queries
>> > just
>> > > like you would with MySQL. Something like this:
>> > >
>> > > // Create table that optimizes for your most common query
>> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
>> your
>> > > rows ordered)
>> > > CREATE TABLE url_db (
>> > >     status TINYINT,
>> > >     priority INTEGER NOT NULL,
>> > >     added_time DATE,
>> > >     url VARCHAR NOT NULL
>> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
>> > >
>> > > int lastStatus = 0;
>> > > int lastPriority = 0;
>> > > Date lastAddedTime = new Date(0);
>> > > String lastUrl = "";
>> > >
>> > > while (true) {
>> > >     // Use row value constructor to page through results in batches of
>> > 1000
>> > >     String query = "
>> > >         SELECT * FROM url_db
>> > >         WHERE status=0 AND (status, priority, added_time, url) > (?, ?,
>> > ?,
>> > > ?)
>> > >         ORDER BY status, priority, added_time, url
>> > >         LIMIT 1000"
>> > >     PreparedStatement stmt = connection.prepareStatement(query);
>> > >
>> > >     // Bind parameters
>> > >     stmt.setInt(1, lastStatus);
>> > >     stmt.setInt(2, lastPriority);
>> > >     stmt.setDate(3, lastAddedTime);
>> > >     stmt.setString(4, lastUrl);
>> > >     ResultSet resultSet = stmt.executeQuery();
>> > >
>> > >     while (resultSet.next()) {
>> > >         // Remember last row processed so that you can start after that
>> > for
>> > > next batch
>> > >         lastStatus = resultSet.getInt(1);
>> > >         lastPriority = resultSet.getInt(2);
>> > >         lastAddedTime = resultSet.getDate(3);
>> > >         lastUrl = resultSet.getString(4);
>> > >
>> > >         doSomethingWithUrls();
>> > >
>> > >         UPSERT INTO url_db(status, priority, added_time, url)
>> > >         VALUES (1, ?, CURRENT_DATE(), ?);
>> > >
>> > >     }
>> > > }
>> > >
>> > > If you need to efficiently query on url, add a secondary index like
>> this:
>> > >
>> > > CREATE INDEX url_index ON url_db (url);
>> > >
>> > > Please let me know if you have questions.
>> > >
>> > > Thanks,
>> > > James
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <fa...@gmail.com> wrote:
>> > >
>> > >> thank you. But I can't use nutch. could you tell me how hbase is used
>> > >> in nutch? or hbase is only used to store webpage.
>> > >>
>> > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
>> > >> <ot...@gmail.com> wrote:
>> > >> > Hi,
>> > >> >
>> > >> > Have a look at http://nutch.apache.org .  Version 2.x uses HBase
>> > under
>> > >> the
>> > >> > hood.
>> > >> >
>> > >> > Otis
>> > >> > --
>> > >> > Performance Monitoring * Log Analytics * Search Analytics
>> > >> > Solr & Elasticsearch Support * http://sematext.com/
>> > >> >
>> > >> >
>> > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <fa...@gmail.com> wrote:
>> > >> >
>> > >> >> hi all,
>> > >> >>      I want to use hbase to store all urls(crawled or not crawled).
>> > >> >> And each url will have a column named priority which represents the
>> > >> >> priority of the url. I want to get the top N urls order by
>> > priority(if
>> > >> >> priority is the same then url whose timestamp is earlier is
>> preferred).
>> > >> >>      when using something like mysql, my client application may look like:
>> > >> >>      while true:
>> > >> >>          select url from url_db where status='not_crawled'
>> > >> >> order by priority, addedTime limit 1000;
>> > >> >>          do something with this urls;
>> > >> >>          extract more urls and insert them into url_db;
>> > >> >>      How should I design hbase schema for this application? Is
>> hbase
>> > >> >> suitable for me?
>> > >> >>      I found in this article
>> > >> >>
>> > >>
>> >
>> http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
>> > >> >> ,
>> > >> >> they use redis to store urls. I think hbase is originated from
>> > >> >> bigtable and google use bigtable to store webpage, so for huge
>> number
>> > >> >> of urls, I prefer distributed system like hbase.
>> > >> >>
>> > >>
>> >
>>


On Sat, Jan 11, 2014 at 1:38 AM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 9, 2014 at 6:08 PM, Li Li <fa...@gmail.com> wrote:
>
>> thanks.
>> 1. this is just a url frontier for url duplication and scheduler usage.
>>
>
>
> I have made a few attempts at trying to reply to this note but I keep
> running into the fact that I know little about how you are thinking of
> architecting the crawler, what scale you are hoping to achieve, how the
> crawlers will be fed and where they will put the crawled content when done,
> what is doing the page parse for new URLs to crawl, is it inline w/ the
> crawl or done offline, whether the crawl is continuous or stepped (what is
> wrong w/ nutch), how revisit priority is done, page fingerprinting and how
> you will resolve when you find the same page via two different URLs,
> etc....... and so my response keeps running off into speculation.
>
> If you have a one pager you'd like to share off-list, I could give some
> feedback (no problem).  I have an interest in this topic.
>
> High level I'd consider HBase as a candidate repository for all URLs ever
> seen keeping all state related to an URL for a distributed crawler.  Ditto
> for domains.
>
> Regards already-seen/dup URLs lookup and for the queues the crawlers pull
> from allowing inserts of higher priority items, etc., while you could do
> this in HBase -- and it might be ok to start here -- I think you will soon
> find that you will want to do more purposed implementations.  For example,
> if the already-seen check is being done inline w/ the page crawl, a lookup
> into hbase would be sub-millisecond if out of cache but milliseconds if
> not.  You do not want your crawler to stall on a millisecond lookup per URL
> discovered, etc.
>
> On 6., try and make it so you can scan rather than point get for each URL.
>  Less i/o.
>
> St.Ack

Re: is this rowkey schema feasible?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 9, 2014 at 6:08 PM, Li Li <fa...@gmail.com> wrote:

> thanks.
> 1. this is just a url frontier for url duplication and scheduler usage.
>


I have made a few attempts at trying to reply to this note but I keep
running into the fact that I know little about how you are thinking of
architecting the crawler, what scale you are hoping to achieve, how the
crawlers will be fed and where they will put the crawled content when done,
what is doing the page parse for new URLs to crawl, is it inline w/ the
crawl or done offline, whether the crawl is continuous or stepped (what is
wrong w/ nutch), how revisit priority is done, page fingerprinting and how
you will resolve when you find the same page via two different URLs,
etc....... and so my response keeps running off into speculation.

If you have a one pager you'd like to share off-list, I could give some
feedback (no problem).  I have an interest in this topic.

High level I'd consider HBase as a candidate repository for all URLs ever
seen keeping all state related to an URL for a distributed crawler.  Ditto
for domains.

Regards already-seen/dup URLs lookup and for the queues the crawlers pull
from allowing inserts of higher priority items, etc., while you could do
this in HBase -- and it might be ok to start here -- I think you will soon
find that you will want to do more purposed implementations.  For example,
if the already-seen check is being done inline w/ the page crawl, a lookup
into hbase would be sub-millisecond if out of cache but milliseconds if
not.  You do not want your crawler to stall on a millisecond lookup per URL
discovered, etc.

On 6., try and make it so you can scan rather than point get for each URL.
 Less i/o.
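
One way to make that possible: if MD5(path) came right after the host in the rowkey (with status and priority after it, or moved into a column), all rows for one URL would sit in one small key range, and a single narrow scan would answer the already-seen check. A rough sketch under that assumed, re-ordered key layout; the table handle and arguments are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UrlSeen {
        // Assumes a re-ordered rowkey: reverse_host + "/" + MD5(path) + "-" + status + "-" + priority,
        // so every row for one url falls inside one small key range.
        static boolean seen(Table table, String reverseHost, String md5Path) throws IOException {
            byte[] start = Bytes.toBytes(reverseHost + "/" + md5Path + "-");
            byte[] stop  = Bytes.toBytes(reverseHost + "/" + md5Path + "."); // '.' sorts just after '-'
            ResultScanner scanner = table.getScanner(new Scan(start, stop));
            try {
                Result first = scanner.next();
                return first != null; // any row in the range means the url has been seen
            } finally {
                scanner.close();
            }
        }
    }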

St.Ack

Re: is this rowkey schema feasible?

Posted by Li Li <fa...@gmail.com>.
thanks.
1. this is just a url frontier for url duplication and scheduler usage.
2. after getting the top N priority urls, I will send them to a message
queue (activemq) and the fetcher cluster will do the rest.
3. you said keep the 'crawl status' in a separate column family; I
think it's a good idea, so updating the status will not require
deleting a row and then inserting a new one.
4. you said using an hbase table as a queue is usually not a good
idea; is there any better solution for this? redis? memcached? I need
a distributed frontier which supports ordering (by priority first,
then enqueue time).
5. yes, I should use checkAndPut to avoid duplicate inserts (see the
sketch after the example rows below).
6. maybe scan is not efficient?
    there are many urls, and I don't know the status and priority of
an existing url (the priority is calculated dynamically). Because
MD5(path) is at the end of the rowkey, I would have to scan all the
urls in the host (www.google.com). Maybe scanning all urls with the
same host is quicker than 20 gets because the urls in a domain are
likely located in a single region (block), but 10M urls for a single
host is not uncommon.

    com.google.www/-0-10-MD5(path1)
    com.google.www/-0-10-MD5(path2)
    ...
    com.google.www/-0-10-MD5(path10,000,000)

    com.google.www/-1-10-MD5(path1)
    com.google.www/-1-10-MD5(path2)
    ....
    com.google.www/-1-10-MD5(path10,000,000)

    com.google.www/-0-9-MD5(path1)
    com.google.www/-0-9-MD5(path2)
    .....
    com.google.www/-1-1-MD5(path1)
    .....
    com.google.www/-1-1-MD5(path10,000,000)
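
A rough sketch of how points 3 and 5 could fit together: keep the crawl status / claim flag in its own small column family and claim a url atomically with checkAndPut. The table handle, the "s" family and the "claimed" qualifier are illustrative assumptions:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ClaimUrl {
        private static final byte[] STATUS_CF = Bytes.toBytes("s");       // small, hot column family
        private static final byte[] CLAIMED   = Bytes.toBytes("claimed"); // which crawler took the url

        // Atomically mark a url as assigned to one crawler; returns false if another crawler got it first.
        static boolean claim(Table table, byte[] rowKey, String crawlerId) throws IOException {
            Put put = new Put(rowKey);
            put.addColumn(STATUS_CF, CLAIMED, Bytes.toBytes(crawlerId));
            // checkAndPut with a null expected value succeeds only if the cell does not exist yet.
            return table.checkAndPut(rowKey, STATUS_CF, CLAIMED, null, put);
        }
    }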

On Fri, Jan 10, 2014 at 2:02 AM, Stack <st...@duboce.net> wrote:
> On Thu, Jan 9, 2014 at 2:42 AM, Li Li <fa...@gmail.com> wrote:
>
>> hi all,
>>     I want to use hbase to store all urls for a distributed crawler.
>> there is a central scheduler to schedule all unCrawled urls by
>> priority.
>
>
>
> Are you building from scratch?  If so, have you looked at nutch?
>
>
>
>> Following is my design of rowkey and common data access
>> pattern, is there any better rowkey design for my usecase?
>>
>>     the row key is: reverse_host--status--priority--MD5(path). some
>> example:
>>     com.google.www/-0-10-MD5(path1)
>>     com.google.www/-0-9-MD5(path2)
>>     ...
>>     com.google.www/-1-10-MD5(path3)
>>     status 0 means not crawled and 1 means crawled
>>
>
> Is this your total schema?  Where is crawl history kept and the crawled
> content and rate of change for the content and when to recrawl, etc?  Is
> this data elsewhere in other tables and this table is just a frontier
> 'queue' with crawl URLs added to the head and when the page is crawled, you
> add a new row to this table w/ state set to 1?  How are you going to request
> that a page be recrawled in such a scheme?
>
> How do the distributed crawlers divvy up the work?
>
> Generally you do not want to keep state in the key itself.
>
> Using an hbase table as a queue is usually not a good idea, especially when
> there is lots of churn, as will be the case in a distributed crawler.
>
> You could keep the 'crawl status' in a separate column family with nothing
> but this in it so your crawlers can scan fast and update this one attribute
> only after the page is pulled, or, you might want to use something else
> altogether for the list-of-urls to crawl by crawler since it is a small
> dataset and you need to go real fast against it.
>
>
>
>
>
>>     my scheduler:
>>     int batchSize=10000;
>>     Map<String,Integer> hostCount=calcHostPriority(batchSize);
>>     List<String> toBeCrawledUrls=..
>>     for(Map.Entry<String,Integer> entry:hostCount.entrySet()){
>>          //select top N priority uncrawled urls for this host
>>         startRow=Bytes.toString(reverse(entry.getKey())+"/-0");
>>         stopRow=Bytes.toString(reverse(entry.getKey())+"/-1");
>>          Scan s = new Scan(startRow, stopRow);
>>          s.setMaxResultSize(entry.getValue());
>>          for(String url:scanResult){
>>               toBeCrawledUrls.add(url);
>>          }
>>     }
>>
>>     //update after crawling
>>     for(String url:crawledUrls){
>>          delete url //com.google.www/-0-10-MD5(path)
>>          put url //com.google.www/-1-10-MD5(path)
>>
>
> Each delete adds a new entry.
>
> How do you intend to make it so that two distributed crawlers do not pull the same
> url to fetch?  (Use checkAndPut and set a 'currently-assigned-to-a-crawler'
> flag).
>
>
>
>
>>     }
>>
>>     //check url exists
>>     any better method than this?
>>      assuming only 1-10 priority
>>    try get:
>>         com.google.www/-0-10-MD5(path)
>>         com.google.www/-1-10-MD5(path)
>>         com.google.www/-0-9-MD5(path)
>>         ....
>>         com.google.www/-1-1-MD5(path)
>>     if any exists, then true
>>     else false
>>
>
>
> You don't want to 'get', you want to 'scan'.
>
> St.Ack

Re: is this rowkey schema feasible?

Posted by Stack <st...@duboce.net>.
On Thu, Jan 9, 2014 at 2:42 AM, Li Li <fa...@gmail.com> wrote:

> hi all,
>     I want to use hbase to store all urls for a distributed crawler.
> there is a central scheduler to schedule all unCrawled urls by
> priority.



Are you building from scratch?  If so, have you looked at nutch?



> Following is my design of rowkey and common data access
> pattern, is there any better rowkey design for my usecase?
>
>     the row key is: reverse_host--status--priority--MD5(path). some
> example:
>     com.google.www/-0-10-MD5(path1)
>     com.google.www/-0-9-MD5(path2)
>     ...
>     com.google.www/-1-10-MD5(path3)
>     status 0 means not crawled and 1 means crawled
>

Is this your total schema?  Where is crawl history kept and the crawled
content and rate of change for the content and when to recrawl, etc?  Is
this data elsewhere in other tables and this table is just a frontier
'queue' with crawl URLs added to the head and when the page is crawled, you
add a new row to this table w/ state set to 1?  How are you going to request
that a page be recrawled in such a scheme?

How do the distributed crawlers divvy up the work?

Generally you do not want to keep state in the key itself.

Using an hbase table as a queue is usually not a good idea, especially when
there is lots of churn, as will be the case in a distributed crawler.

You could keep the 'crawl status' in a separate column family with nothing
but this in it so your crawlers can scan fast and update this one attribute
only after the page is pulled, or, you might want to use something else
altogether for the list-of-urls to crawl by crawler since it is a small
dataset and you need to go real fast against it.





>     my scheduler:
>     int batchSize=10000;
>     Map<String,Integer> hostCount=calcHostPriority(batchSize);
>     List<String> toBeCrawledUrls=..
>     for(Map.Entry<String,Integer> entry:hostCount.entrySet()){
>          //select top N priority uncrawled urls for this host
>         startRow=Bytes.toString(reverse(entry.getKey())+"/-0");
>         stopRow=Bytes.toString(reverse(entry.getKey())+"/-1");
>          Scan s = new Scan(startRow, stopRow);
>          s.setMaxResultSize(entry.getValue());
>          for(String url:scanResult){
>               toBeCrawledUrls.add(url);
>          }
>     }
>
>     //update after crawling
>     for(String url:crawledUrls){
>          delete url //com.google.www/-0-10-MD5(path)
>          put url //com.google.www/-1-10-MD5(path)
>

Each delete adds a new entry.

How do you intend to make it so that two distributed crawlers do not pull the same
url to fetch?  (Use checkAndPut and set a 'currently-assigned-to-a-crawler'
flag).




>     }
>
>     //check url exists
>     any better method than this?
>      assuming only 1-10 priority
>    try get:
>         com.google.www/-0-10-MD5(path)
>         com.google.www/-1-10-MD5(path)
>         com.google.www/-0-9-MD5(path)
>         ....
>         com.google.www/-1-1-MD5(path)
>     if any exists, then true
>     else false
>


You don't want to 'get', you want to 'scan'.

St.Ack