Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/08/02 14:59:12 UTC

Re: Different batch id

I just tried running this with the actual batch ID instead of using -all,
and I'm still getting similar results.

On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <ba...@gmail.com> wrote:

> I set up Nutch 2.x with a new instance of HBase.  I ran the following
> commands.
>
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
>
> When looking at the parse log, I'm seeing a bunch of "different batch id"
> messages.  These are all on urls that I did not inject into the database.
>
> Any ideas what's causing this?
>
> Thanks.
>

Re: Different batch id

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

It depends on the expectation ;)

I agree that it may be confusing, but currently the -all option in the
various Nutch tools only processes "all with a mark". A separate option
processes "all, regardless of whether a mark is present". For the parser
this is -reparse; for the indexer, -reindex (at least in the current
branch). There is no such option for the fetcher. Whether a "-refetch"
option would be useful is up for discussion, but if such an option
existed, the generator would no longer serve a purpose.
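
To make the distinction concrete, here is a minimal Java sketch of the two
selection modes. It is not Nutch code: the class, the Mode enum, and
select() are hypothetical names, and each row is reduced to a URL plus the
mark left by the previous step (null when there is none).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SelectionModes {
    // Hypothetical modes for illustration only; these are not Nutch types.
    // MARKED_ONLY models "-all"; REGARDLESS_OF_MARK models -reparse / -reindex.
    enum Mode { MARKED_ONLY, REGARDLESS_OF_MARK }

    // rows maps a URL to the mark written by the previous step,
    // or to null if the previous step never touched that URL.
    static List<String> select(Map<String, String> rows, Mode mode) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            boolean hasMark = row.getValue() != null;
            if (mode == Mode.REGARDLESS_OF_MARK || hasMark) {
                selected.add(row.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://example.com/a", "batch-1"); // marked by the previous step
        rows.put("http://example.com/b", null);      // never marked
        System.out.println(select(rows, Mode.MARKED_ONLY));
        System.out.println(select(rows, Mode.REGARDLESS_OF_MARK));
    }
}
```

With one marked and one unmarked row, the first call selects only the
marked URL and the second selects both.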

Ferdy.

On Thu, Aug 2, 2012 at 8:47 PM, <al...@aim.com> wrote:

> Hi,
>
> I found that after running
>
> bin/nutch generate -topN 1000
>
> only 1000 of the URLs are marked with gnmrk.
>
> Then
>
> bin/nutch fetch -all
>
> skips every URL that has no gnmrk, according to this code:
>
> Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
>         + "; different batch id (" + mark + ")");
>   }
>   return;
> }
>
> since shouldProcess(mark, batchId) returns false if mark is null.
>
> Then
>
> bin/nutch parse -all
>
> skips every URL that has no fetch mark, according to this code:
>
> Utf8 mark = Mark.FETCH_MARK.checkMark(page);
> String unreverseKey = TableUtil.unreverseUrl(key);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   LOG.info("Skipping " + unreverseKey + "; different batch id");
>   return;
> }
>
> The parser logs this at INFO level (the fetcher only at DEBUG), so these
> are the messages you see in the log file.
>
> So it seems to me that the -all option of fetch, parse and solrindex does
> not work as expected.
>
> Alex.

Re: Different batch id

Posted by al...@aim.com.
Hi,

I found that after running

bin/nutch generate -topN 1000

only 1000 of the URLs are marked with gnmrk.

Then

bin/nutch fetch -all

skips every URL that has no gnmrk, according to this code:

Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
if (!NutchJob.shouldProcess(mark, batchId)) {
  if (LOG.isDebugEnabled()) {
    LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
        + "; different batch id (" + mark + ")");
  }
  return;
}

since shouldProcess(mark, batchId) returns false if mark is null.
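
For illustration, the check can be approximated by the sketch below. It is
a simplified stand-in, not the actual NutchJob source: it uses String where
Nutch uses Utf8, and it assumes that -all reaches the check as the literal
batch id "-all" and that any other batch id must equal the mark exactly.

```java
public class ShouldProcessSketch {
    // Assumption: the -all command-line option arrives as this literal batch id.
    static final String ALL = "-all";

    // Simplified stand-in for the check discussed above (String instead of Utf8).
    static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false; // no mark from the previous step: the row is skipped
        }
        if (ALL.equals(batchId)) {
            return true;  // -all: every row that carries a mark is processed
        }
        return mark.equals(batchId); // assumed: otherwise the mark must match
    }

    public static void main(String[] args) {
        System.out.println(shouldProcess(null, ALL));            // unmarked row
        System.out.println(shouldProcess("batch-1", ALL));       // marked row
        System.out.println(shouldProcess("batch-1", "batch-2")); // other batch
    }
}
```

Under these assumptions an unmarked row is skipped even with -all, which
matches the "Skipping ... different batch id" messages for URLs that the
generator never selected.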

Then

bin/nutch parse -all

skips every URL that has no fetch mark, according to this code:

Utf8 mark = Mark.FETCH_MARK.checkMark(page);
String unreverseKey = TableUtil.unreverseUrl(key);
if (!NutchJob.shouldProcess(mark, batchId)) {
  LOG.info("Skipping " + unreverseKey + "; different batch id");
  return;
}

The parser logs this at INFO level (the fetcher only at DEBUG), so these
are the messages you see in the log file.

So it seems to me that the -all option of fetch, parse and solrindex does
not work as expected.

Alex. 


