Posted to user@hbase.apache.org by Ninad Raut <hb...@gmail.com> on 2009/04/20 11:49:11 UTC

Crawling Using HBase as a back end --Issue

Hi,

I have been trying to crawl data using MapReduce on HBase. Here is the scenario:

1) I have a fetch list with all the permalinks to be fetched. They are
stored in a PermalinkTable.

2) A MapReduce job scans over each permalink, fetches the data, and
dumps it into a ContentTable.
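
Roughly, the job is shaped like the following minimal sketch (it assumes
the newer org.apache.hadoop.hbase.mapreduce API; the "page:raw" column
and the fetchContent() helper are just placeholders):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class CrawlJob {

      // One map call per row of PermalinkTable; the row key is the permalink.
      // Each call fetches the page and emits a Put bound for ContentTable.
      static class FetchMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          byte[] permalink = row.copyBytes();
          byte[] content = fetchContent(Bytes.toString(permalink)); // placeholder HTTP fetch
          Put put = new Put(permalink);
          put.addColumn(Bytes.toBytes("page"), Bytes.toBytes("raw"), content);
          context.write(row, put);
        }

        private byte[] fetchContent(String url) throws IOException {
          // Stub: real code would use an HTTP client with sane connect/read timeouts.
          return Bytes.toBytes("content of " + url);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "permalink-crawl");
        job.setJarByClass(CrawlJob.class);

        Scan scan = new Scan();
        scan.setCaching(1);         // per-row work is slow; fetch one row per RPC
        scan.setCacheBlocks(false); // do not churn the block cache from a full scan

        TableMapReduceUtil.initTableMapperJob("PermalinkTable", scan,
            FetchMapper.class, ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("ContentTable", null, job); // map-only, writes direct
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }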

Here are the issues I face:

The PermalinkTable is not split, so I have just one map task running on a
single machine. The benefit of MapReduce is nullified.

The MapReduce job keeps giving scanner timeout exceptions, causing task
failures and further delays.


If anyone can give me tips for this use case it would really help me.

Re: Crawling Using HBase as a back end --Issue

Posted by Andrew Purtell <ap...@apache.org>.
Hi Ninad,

I developed a crawling application for HBase with the same
basic design, if I understand you correctly.

First, you can set the split threshold lower for your work
table (the one you run the TableMap job against). See this
JIRA for more info in that regard:
    https://issues.apache.org/jira/browse/HBASE-903
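
For example, with a recent shell the threshold can be lowered for just
that table (a sketch only; the syntax and the 128 MB value are
illustrative, and some older releases require the table to be disabled
before altering):

    hbase> alter 'PermalinkTable', MAX_FILESIZE => '134217728'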

As stack suggests, you can also manually split the work table.
Really, you should also prime it with > 1M jobs or similar,
enough data for the splits to be meaningful. However, you also
have to increase the scanner timeout, and perhaps also the
mapred job timeout, to compensate for crawler maps which stall
for long periods of time.
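
For example (a sketch only; the property names are from that era's
hbase-default.xml and mapred-default.xml, the values are illustrative,
and current releases call the scanner lease
hbase.client.scanner.timeout.period):

    <!-- hbase-site.xml: a longer scanner lease so a stalled crawler map
         does not lose its scanner mid-scan -->
    <property>
      <name>hbase.regionserver.lease.period</name>
      <value>300000</value>
    </property>

    <!-- mapred-site.xml: allow a map 20 minutes without reporting progress -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1200000</value>
    </property>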

After tinkering with this, however, I went in a different
direction and used Heritrix 2.0 and the hbase-writer. See:
    http://code.google.com/p/hbase-writer/

Nutch would have been another option for me.

Hope this helps,

   - Andy


Re: Crawling Using HBase as a back end --Issue

Posted by Ninad Raut <hb...@gmail.com>.
NUTCH-650 looks good. I'll test it. Thanks for the direction. ...

Re: Crawling Using HBase as a back end --Issue

Posted by stack <st...@duboce.net>.
Ninad:

Are you using Nutch for crawling?  If not, out of interest, why not?  Have you
seen NUTCH-650 -- it works, I believe (jdcryans?).

Your PermalinkTable is small?  Has only a few rows?   Maybe lower the size at
which this table splits by changing the flush and maximum file sizes -- see
hbase-default.xml.
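
For example, something like this in hbase-site.xml makes the table split
much earlier (a sketch; the values are only illustrative, and older
releases name the flush setting hbase.hregion.memcache.flush.size):

    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>16777216</value>   <!-- flush at 16 MB -->
    </property>
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>67108864</value>   <!-- split once a store file passes 64 MB -->
    </property>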

St.Ack

Re: Crawling Using HBase as a back end --Issue

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Ninad,

Regarding the timeouts, I recently gave a tip in the thread "Tip when
scanning and spending a lot of time on each row" which should solve
your problem.

Regarding your table, you should split it. In the shell, type the
command "tools" to see how to use the "split" command. Issue a couple
of splits, waiting a bit between each call.
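
Something along these lines, for example (a sketch; the command list
printed by "tools" and the split syntax vary a bit by version):

    hbase> tools
    hbase> split 'PermalinkTable'
    # wait for the new regions to show up in the master web UI, then:
    hbase> split 'PermalinkTable'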

J-D
