Posted to user@nutch.apache.org by Mariam Salloum <ma...@gmail.com> on 2013/07/10 21:52:07 UTC

Using Batch Id

Hi All,


I'm using Nutch 2.x along with HBase and Solr, and I have the following
question.

(a) Let's say I run a crawl (generate, fetch, parse, update, etc.) with
Batch ID '1' and set the depth to 3.
(b) After this, I may still have some unfetched pages, and they should be
marked with Batch ID '1'.

(c) I then inject additional URLs.
(d) I run another crawl (generate, fetch, parse, update, etc.) with Batch ID '2'.

My question is: which pages get assigned this new batch ID? Do the unfetched
pages from the previous crawl get assigned it as well, or only the newly
injected pages?

I guess I don't fully understand the concept of batch id and how to utilize
it. I already searched the Nutch site and past posts, but could not find
clarification on this.

Thank you for your help

Re: Using Batch Id

Posted by Mariam Salloum <ms...@cs.ucr.edu>.

Thanks, Bai, for your explanation; it makes a lot of sense.

I have another question. I see you posted a question on how to query all
unfetched pages from HBase. Were you able to get the query below to work?

<<I'm trying to check HBase for URLs that have unfetched status, but my
query isn't working correctly.  No matter what I try, I don't get a match.

scan 'webpage', {COLUMNS=>['f:bas', 'f:st'],
FILTER=>SingleColumnValueFilter.new(Bytes.toBytes('f'),
Bytes.toBytes('st'), CompareFilter::CompareOp.valueOf('EQUAL'),
Bytes.toBytes('1'))} >>
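
One possible reason the filter above never matches, offered as a guess and
not a confirmed diagnosis: Bytes.toBytes('1') encodes the one-character
string '1'. If the status column actually holds a 4-byte integer (an
assumption about how the field is written, not something confirmed in this
thread), the stored bytes would be different. The Python sketch below only
compares the two encodings; it does not talk to HBase, and STATUS_UNFETCHED
is a hypothetical name for the value being searched for.

```python
import struct

# Assumption: the numeric code being searched for is 1.
STATUS_UNFETCHED = 1

# Bytes for the one-character string '1' (what a string argument encodes to).
as_text = "1".encode("utf-8")

# Bytes for a 4-byte big-endian integer, the layout HBase's
# Bytes.toBytes(int) produces.
as_int32 = struct.pack(">i", STATUS_UNFETCHED)

print(as_text.hex())        # 31
print(as_int32.hex())       # 00000001
print(as_text == as_int32)  # False
```

If the column does hold an integer, an EQUAL comparison against the string
bytes could never succeed, which would explain the empty result.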

Thanks a lot for your help

Mariam





On Thu, Jul 11, 2013 at 4:16 AM, Bai Shen <ba...@gmail.com> wrote:

> The crawl script doesn't accept Batch ID.  So in order to use Batch ID you
> would run the commands separately which would not involve depth.  Depth is
> just the number of times to run the generate, fetch, parse, update cycle.
>
> Any unfetched pages will not have a Batch ID.  The Batch ID only applies to
> the pages that were generated.  By default all of the unfetched and
> injected pages are available to be generated with Batch ID 2.
>
> Batch ID is useful because it allows you to run fetch, parse, and index
> commands only on the generated urls instead of the entire database.
>
> Hope that makes sense.
>
>
> On Wed, Jul 10, 2013 at 3:52 PM, Mariam Salloum <mariam.salloum@gmail.com
> >wrote:
>
> > Hi All,
> >
> >
> > I'm using Nutch 2.x along with Hbase and SOLR. I have the following
> > question.
> >
> > (a) Lets say I run a crawl (generate, fetch, parse, update, etc.) with
> > Batch ID  '1' and set the depth to 3.
> > (b) After this, I may still have some pages unfetched and they should be
> > marked with Batch ID 1'
> >
> > (c) I then inject additional URLS
> > (d) Run a crawl (generate, fetch, parse, update, etc.) with Batch ID  '2'
> >
> > My question is what pages get assigned this new batch id? Do the pages
> from
> > the previous crawl (unfetched pages) get assigned this new batch id? Or
> > only newly injected pages.
> >
> > I guess I don't fully understand the concept of batch id and how to
> utilize
> > it. I already searched the Nutch site and past posts, but could not find
> > clarification on this.
> >
> > Thank you for your help
> >
>

Re: Using Batch Id

Posted by Bai Shen <ba...@gmail.com>.

The crawl script doesn't accept a Batch ID, so to use one you have to run
the commands separately, and depth doesn't come into it.  Depth is just the
number of times the generate, fetch, parse, update cycle is run.

Any unfetched pages will not have a Batch ID.  The Batch ID only applies to
the pages that were generated.  By default, all of the unfetched and
injected pages are available to be generated with Batch ID 2.

Batch ID is useful because it lets you run the fetch, parse, and index
commands only on the generated URLs instead of the entire database.
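
The behaviour described above can be sketched with a toy model. This is not
Nutch code: Page, generate, and fetch are hypothetical names, and the list
stands in for the webpage table. It only illustrates that generate stamps a
Batch ID on the pages it selects, and that fetch then touches only the pages
carrying that ID.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    url: str
    fetched: bool = False
    batch_id: Optional[str] = None  # None until the page is generated


def generate(db: List[Page], batch_id: str, top_n: int) -> None:
    """Stamp batch_id on up to top_n pages that have not been fetched yet."""
    for page in [p for p in db if not p.fetched][:top_n]:
        page.batch_id = batch_id


def fetch(db: List[Page], batch_id: str) -> List[str]:
    """Fetch only the pages stamped with batch_id; return their URLs."""
    done = []
    for page in db:
        if page.batch_id == batch_id and not page.fetched:
            page.fetched = True
            done.append(page.url)
    return done


# First cycle: three injected pages, but only two are generated.
db = [Page("a"), Page("b"), Page("c")]
generate(db, "1", top_n=2)
print(fetch(db, "1"))   # ['a', 'b']
print(db[2].batch_id)   # None: the leftover page has no Batch ID

# Inject one more page, then run a second cycle.
db.append(Page("d"))
generate(db, "2", top_n=10)
print(fetch(db, "2"))   # ['c', 'd']
```

In this model the page left over from the first cycle and the newly injected
page are treated the same way by the second generate.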

Hope that makes sense.

On Wed, Jul 10, 2013 at 3:52 PM, Mariam Salloum <ma...@gmail.com> wrote:

> Hi All,
>
>
> I'm using Nutch 2.x along with Hbase and SOLR. I have the following
> question.
>
> (a) Lets say I run a crawl (generate, fetch, parse, update, etc.) with
> Batch ID  '1' and set the depth to 3.
> (b) After this, I may still have some pages unfetched and they should be
> marked with Batch ID 1'
>
> (c) I then inject additional URLS
> (d) Run a crawl (generate, fetch, parse, update, etc.) with Batch ID  '2'
>
> My question is what pages get assigned this new batch id? Do the pages from
> the previous crawl (unfetched pages) get assigned this new batch id? Or
> only newly injected pages.
>
> I guess I don't fully understand the concept of batch id and how to utilize
> it. I already searched the Nutch site and past posts, but could not find
> clarification on this.
>
> Thank you for your help
>