Posted to user@hbase.apache.org by maxjar10 <jc...@gmail.com> on 2009/07/05 02:21:23 UTC

HBase schema for crawling

Hi All,

I am developing a schema that will be used for crawling. All of the examples
that I have seen to date use a webcrawl table that looks like the below:

Table: webcrawl
rowkey           columns
com.yahoo.www    lastFetchDate:timestamp    content:somedownloadedpage

I understand wanting to use the rowkey in reverse domain order so that it's
easy to recrawl all of a specific site, including its subdomains. However,
it seems inefficient to scan through a large table checking
"lastFetchDate" just to decide which pages to refetch.

In my case I'm not as concerned with recrawling a particular domain as I am
with efficiently locating the urls that I need to recrawl because I haven't
crawled them in a while.

rowkey                       columns
20090630;com.google.www      contents:somedownloadedgooglepage
20090701;com.yahoo.www       contents:somedownloadedyahoopage

This would allow you to quickly get to the content needed to recrawl, and to
do it by date so that you recrawl the stalest items first.
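For example, finding everything that is due would just be a scan whose stop
row is today's date prefix (rough sketch assuming the newer Put/Scan style
client API; the 'table' variable and family name are made up here):

// Every row keyed with a date prefix before today is due for a recrawl.
// 'table' is an HTable opened on the webcrawl table.
String today = new SimpleDateFormat( "yyyyMMdd" ).format( new Date() );
Scan scan = new Scan();
scan.setStopRow( Bytes.toBytes( today ) );   // start row defaults to the beginning of the table
scan.addFamily( Bytes.toBytes( "contents" ) );
ResultScanner scanner = table.getScanner( scan );
for ( Result r : scanner )
{
    // r.getRow() is e.g. "20090701;com.yahoo.www" -- put it on the fetch list
}
scanner.close();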

Now, here's the dilemma I have... When I create a MapReduce job to go
through each row in the above I want to schedule the url to be recrawled
again at some date in the future. For example,

// Simple pseudocode
Map( row, rowResult )
{
      BatchUpdate update = new BatchUpdate( row.get() );
      update.put( "contents:content", downloadPage( pageUrl ) );
      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
}

1) Does HBase allow you to update the key for a row? Are HBase row keys immutable?

2) If I can't update a key what's the easiest way to copy a row and assign
it a different key?

3) What are the implications of updating/deleting from a table that you are
currently scanning as part of the MapReduce job?

It seems to me that I may want both a map and a reduce: during the map
phase I would record the rows that I fetched, and in the reduce phase I
would take those rows, re-add them keyed by the nextFetchDate, and then
remove the old rows.

I would probably want to do this process in phases (e.g. scan only 5,000
rows at a time) so that if my Mapper died for any reason I could
address the issue and, in the worst case, only lose the work I had
done on those 5,000 rows.
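A rough sketch of that batching (made-up names; assuming the newer Put/Scan
style client API): remember the last key you processed and start the next
batch just past it.

// Process the table in batches of 5,000 rows, checkpointing the last row key
// so a failed run only costs the current batch.
byte[] startRow = loadCheckpoint();                  // empty byte[] on the very first run
Scan scan = new Scan( startRow );
scan.setCaching( 100 );
ResultScanner scanner = table.getScanner( scan );
byte[] lastRow = null;
int count = 0;
for ( Result r : scanner )
{
    recrawl( r );                                    // the per-row fetch/update work
    lastRow = r.getRow();
    if ( ++count >= 5000 ) break;
}
scanner.close();
// Persist lastRow plus a trailing 0x00 byte as the next start row so the
// next batch begins at the row after lastRow, not at lastRow again.
saveCheckpoint( Bytes.add( lastRow, new byte[] { 0 } ) );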

Thanks!

-- 
View this message in context: http://www.nabble.com/HBase-schema-for-crawling-tp24339168p24339168.html
Sent from the HBase User mailing list archive at Nabble.com.


Re: HBase schema for crawling

Posted by Marcus Herou <ma...@tailsweep.com>.
Hi.

I suggest you build an index with two columns: nextFetchDate and rowKey.
Only update the index with the newly fetched items and optimize every night
or so.

If I am not totally mistaken, I think these days you already have some index
structure within HBase? Which means you might not need Lucene.
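If the index lived in HBase itself, the simplest thing I can picture is a
second table keyed by nextFetchDate (rough sketch; the table name, family
name and variables are made up, newer Put/Scan style client API):

// Index table "fetch_index": row key = nextFetchDate;originalRowKey, with one
// small cell pointing back at the main webcrawl row.
Put idx = new Put( Bytes.toBytes( nextFetchDate + ";" + crawlRowKey ) );
idx.add( Bytes.toBytes( "ref" ), Bytes.toBytes( "row" ), Bytes.toBytes( crawlRowKey ) );
indexTable.put( idx );

// A scan over fetch_index with stopRow = today's date then yields everything
// that is due, without ever touching the big content rows.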

Cheers

//Marcus


On Sun, Jul 5, 2009 at 11:26 PM, stack <st...@duboce.net> wrote:

> On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 <jc...@gmail.com> wrote:
>
> >
> > Hi All,
> >
> > I am developing a schema that will be used for crawling.
>
>
> Out of interest, what crawler are you using?
>
>
> >
> > Now, here's the dilemma I have... When I create a MapReduce job to go
> > through each row in the above I want to schedule the url to be recrawled
> > again at some date in the future. For example,
> >
> > // Simple pseudocode
> > Map( row, rowResult )
> > {
> >      BatchUpdate update = new BatchUpdate( row.get() );
> >      update.put( "contents:content", downloadPage( pageUrl ) );
> >      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> > }
>
>
> So you want to write a new row with a nextFetchDate prefix?
>
> FYI, have you seen
>
> http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
> ?
>
> (You might also find http://sourceforge.net/projects/publicsuffix/ useful)
>
>
>
> > 1) Does HBase allow you to update the key for a row? Are HBase row keys
> > immutable?
> >
>
>
> Yes.
>
> If you 'update' a row key, changing it, you will create a new row.
>
>
>
> >
> > 2) If I can't update a key what's the easiest way to copy a row and
> assign
> > it a different key?
> >
>
>
> Get all of the row and then put it all with the new key (Billy Pearson's
> suggestion would be the way to go I'd suggest -- keeping a column with
> timestamp in it or using hbase versions -- in TRUNK you can ask for data
> within a timerange.  Running a scanner asking for rows > some timestamp
> should be fast).
>
>
>
> >
> > 3) What are the implications for updating/deleting from a table that you
> > are
> > currently scanning as part of the mapReduce job?
> >
>
>
> Scanners return the state of the row at the time they trip over it.
>
>
>
> >
> > It seems to me that I may want to do a map and a reduce and during the
> map
> > phase I would record the rows that I fetched while in the reduce phase I
> > would then take those rows, re-add them with the nextFetchDate and then
> > remove the old row.
>
>
> Do you have to remove old data?  You could let it age or be removed when
> the
> number of versions of pages are > configured maximum.
>
>
> > I would probably want to do this process in phases (e.g. scan only 5,000
> > rows at a time) so that if my Mapper died for any particular reason I
> could
> > address the issue and in the worst case only have lost the work that I
> had
> > done on 5,000 rows.
>
>
> You could keep an already-seen list in another HBase table and just rerun the
> whole job if the first job failed.  Check the already-seen table before crawling
> a page to see whether you'd crawled it recently.
>
> St.Ack
>



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.herou@tailsweep.com
http://www.tailsweep.com/

Re: HBase schema for crawling

Posted by stack <st...@duboce.net>.
On Sat, Jul 4, 2009 at 5:21 PM, maxjar10 <jc...@gmail.com> wrote:

>
> Hi All,
>
> I am developing a schema that will be used for crawling.


Out of interest, what crawler are you using?


>
> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above I want to schedule the url to be recrawled
> again at some date in the future. For example,
>
> // Simple pseudocode
> Map( row, rowResult )
> {
>      BatchUpdate update = new BatchUpdate( row.get() );
>      update.put( "contents:content", downloadPage( pageUrl ) );
>      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> }


So you want to write a new row with a nextFetchDate prefix?

FYI, have you seen
http://hadoop.apache.org/hbase/docs/r0.19.3/api/org/apache/hadoop/hbase/util/Keying.html#createKey(java.lang.String)
?

(You might also find http://sourceforge.net/projects/publicsuffix/ useful)



> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?
>


Yes, row keys are immutable.

If you 'update' a row key, changing it, you will create a new row.



>
> 2) If I can't update a key what's the easiest way to copy a row and assign
> it a different key?
>


Get all of the row and then put it all under the new key. (Billy Pearson's
suggestion would be the way to go -- keep a column with a
timestamp in it, or use HBase versions; in TRUNK you can ask for data
within a timerange.  Running a scanner asking for rows > some timestamp
should be fast.)
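In newer Put/Scan style client API terms the copy-then-delete might look
roughly like this (just a sketch; oldKey, newKey and 'table' are made up):

// Read the whole old row, rewrite every cell under the new key, then drop the old row.
Result old = table.get( new Get( oldKey ) );
Put copy = new Put( newKey );
for ( KeyValue kv : old.raw() )
{
    copy.add( kv.getFamily(), kv.getQualifier(), kv.getValue() );
}
table.put( copy );
table.delete( new Delete( oldKey ) );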



>
> 3) What are the implications for updating/deleting from a table that you
> are
> currently scanning as part of the mapReduce job?
>


Scanners return the state of the row at the time they trip over it.



>
> It seems to me that I may want to do a map and a reduce and during the map
> phase I would record the rows that I fetched while in the reduce phase I
> would then take those rows, re-add them with the nextFetchDate and then
> remove the old row.


Do you have to remove old data?  You could let it age out, or let it be removed
when the number of versions of a page exceeds the configured maximum.
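For example, the number of versions kept (and, if you want aging, a TTL) is
part of the column family definition (sketch; 'admin' is an HBaseAdmin and the
names are made up):

// Keep at most 3 versions of each page; older ones fall away at compaction time.
HColumnDescriptor contents = new HColumnDescriptor( "contents" );
contents.setMaxVersions( 3 );
contents.setTimeToLive( 30 * 24 * 60 * 60 );   // or simply let cells age out after 30 days
HTableDescriptor desc = new HTableDescriptor( "webcrawl" );
desc.addFamily( contents );
admin.createTable( desc );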


> I would probably want to do this process in phases (e.g. scan only 5,000
> rows at a time) so that if my Mapper died for any particular reason I could
> address the issue and in the worst case only have lost the work that I had
> done on 5,000 rows.


You could keep an already-seen list in another HBase table and just rerun the
whole job if the first job failed.  Check the already-seen table before crawling
a page to see whether you'd crawled it recently.
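Something along these lines (rough sketch; the already_seen table, family
names and the tooOld/crawl helpers are made up):

// Probe the already-seen table before fetching the page again.
byte[] rowKey = Bytes.toBytes( reverseDomain( pageUrl ) );
Result seen = alreadySeenTable.get( new Get( rowKey ) );
byte[] last = seen.getValue( Bytes.toBytes( "meta" ), Bytes.toBytes( "lastFetch" ) );
if ( last == null || tooOld( Bytes.toLong( last ) ) )
{
    crawl( pageUrl );
    Put mark = new Put( rowKey );
    mark.add( Bytes.toBytes( "meta" ), Bytes.toBytes( "lastFetch" ),
              Bytes.toBytes( System.currentTimeMillis() ) );
    alreadySeenTable.put( mark );
}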

St.Ack

Re: HBase schema for crawling

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
I have stored the spider time in a column like stime: to keep from having to
fetch the page's content in the map just for the timestamp;
then I just scan over that one column to get the last spider time, etc.

In my setup I did not spider from the MapReduce job. I built a spider
list, then ran the spider in a different language that I know better than
Java, so I have no experience with that.
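That style of scan might look roughly like this with the newer Put/Scan style
client API (the 'table' variable and family name are made up; this assumes the
spider time is stored as a long under the bare stime: family):

// Only pull the small stime: cell, not the page content, when looking for stale rows.
Scan scan = new Scan();
scan.addFamily( Bytes.toBytes( "stime" ) );
ResultScanner scanner = table.getScanner( scan );
for ( Result r : scanner )
{
    byte[] stime = r.getValue( Bytes.toBytes( "stime" ), Bytes.toBytes( "" ) );
    if ( stime == null ) continue;               // row has no recorded spider time yet
    long lastSpidered = Bytes.toLong( stime );
    // decide whether r.getRow() needs to go back on the spider list
}
scanner.close();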


"maxjar10" <jc...@gmail.com> wrote in 
message news:24339168.post@talk.nabble.com...
>
> Hi All,
>
> I am developing a schema that will be used for crawling. All of the 
> examples
> that I have seen to date use a webcrawl table that looks like the below:
>
> Table: webcrawl
> rowkey                details                                   family
> com.yahoo.www    lastFetchDate:timestamp 
> content:somedownloadedpage
>
> I understand wanting to use the rowkey in reverse domain order so that 
> it's
> easy to recrawl all of a specific site including it's subdomains. However,
> it seems inefficient to scan through a large table looking for
> "lastFetchDate" where you want to refetch the page.
>
> In my case I'm not concerned with having to recrawl a particular domain as 
> I
> am about efficiently locating the urls that I need to recrawl because I
> haven't crawled them in a while.
>
> rowkey                              family
> 20090630;com.google.www   contents:somedownloadedgooglepage
> 20090701;com.yahoo.www    contents:somedownloadedyahoopage
>
> This would allow you to quickly get to the content needed to recrawl and 
> do
> it by date so that you ensure that you recrawl the most stale item first.
>
> Now, here's the dilemma I have... When I create a MapReduce job to go
> through each row in the above I want to schedule the url to be recrawled
> again at some date in the future. For example,
>
> // Simple pseudocode
> Map( row, rowResult )
> {
>      BatchUpdate update = new BatchUpdate( row.get() );
>      update.put( "contents:content", downloadPage( pageUrl ) );
>      update.updateKey( nextFetchDate + ":" + reverseDomain( pageUrl ) ); // ???? No idea how to do this
> }
>
> 1) Does HBase allow you to update the key for a row? Are HBase row keys
> immutable?
>
> 2) If I can't update a key what's the easiest way to copy a row and assign
> it a different key?
>
> 3) What are the implications for updating/deleting from a table that you 
> are
> currently scanning as part of the mapReduce job?
>
> It seems to me that I may want to do a map and a reduce and during the map
> phase I would record the rows that I fetched while in the reduce phase I
> would then take those rows, re-add them with the nextFetchDate and then
> remove the old row.
>
> I would probably want to do this process in phases (e.g. scan only 5,000
> rows at a time) so that if my Mapper died for any particular reason I 
> could
> address the issue and in the worst case only have lost the work that I had
> done on 5,000 rows.
>
> Thanks!
>
> -- 
> View this message in context: 
> http://www.nabble.com/HBase-schema-for-crawling-tp24339168p24339168.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>