Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/08/02 14:59:12 UTC
Re: Different batch id
I just tried running this with the actual batch Id instead of using -all,
and I'm still getting similar results.
On Mon, Jul 30, 2012 at 1:12 PM, Bai Shen <ba...@gmail.com> wrote:
> I set up Nutch 2.x with a new instance of HBase. I ran the following
> commands.
>
> bin/nutch inject urls
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
>
> When looking at the parse log, I'm seeing a bunch of "different batch id"
> messages. These are all on urls that I did not inject into the database.
>
> Any ideas what's causing this?
>
> Thanks.
>
Re: Different batch id
Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,
It depends on the expectation ;)
I agree that it may be confusing, but currently the -all option in the
various Nutch tools only processes "all with a mark". There is a separate
option that processes "all, regardless of whether a mark is present".
For the parser this is -reparse; for the indexer, -reindex (at least in the
current branch). There is no such option for the fetcher. It is up for
discussion whether a "-refetch" option would be useful here. If there were
such an option, the generator would no longer serve a purpose.
Ferdy.
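The distinction above between "all with a mark" and "all regardless of mark" can be illustrated with a minimal sketch. This is not Nutch code: the class name, the `select` method and the `regardlessOfMark` flag are hypothetical, and plain Strings stand in for the marks stored on each row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParseSelectionSketch {
    /**
     * Returns the URLs a parse-like run would pick up.
     * fetchMarks maps URL -> fetch mark, with null meaning "no mark on this row".
     * regardlessOfMark = false models the "-all" behaviour described above,
     * regardlessOfMark = true models a "-reparse"-style run (hypothetical flag name).
     */
    public static List<String> select(Map<String, String> fetchMarks, boolean regardlessOfMark) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> row : fetchMarks.entrySet()) {
            if (regardlessOfMark || row.getValue() != null) {
                selected.add(row.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("http://example.com/a", "batch-1"); // fetched, so it carries a mark
        rows.put("http://example.com/b", null);      // present in the table, never fetched
        System.out.println(select(rows, false));     // only rows that carry a mark
        System.out.println(select(rows, true));      // every row
    }
}
```

Under these assumptions, a row without a mark is simply passed over unless the run is explicitly told to ignore marks.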
On Thu, Aug 2, 2012 at 8:47 PM, <al...@aim.com> wrote:
> Hi,
>
> I have found that after
>
> bin/nutch generate -topN 1000
>
> only 1000 of the urls have been marked with gnmrk (the generate mark).
>
> Then
> bin/nutch fetch -all
>
> skips all urls that do not have gnmrk,
> according to the code
> Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   if (LOG.isDebugEnabled()) {
>     LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
>         + "; different batch id (" + mark + ")");
>   }
>   return;
> }
>
> since shouldProcess(mark, batchId) returns false if mark is null.
>
> Then
>
> bin/nutch parse -all
> skips all urls that do not have a fetch mark,
> according to the code
> Utf8 mark = Mark.FETCH_MARK.checkMark(page);
> String unreverseKey = TableUtil.unreverseUrl(key);
> if (!NutchJob.shouldProcess(mark, batchId)) {
>   LOG.info("Skipping " + unreverseKey + "; different batch id");
>   return;
> }
>
> These are logged at INFO level, and they are the messages you see in the
> log file.
>
> So it seems to me that the -all option to fetch, parse and solrindex does
> not work as expected.
>
> Alex.
>
Re: Different batch id
Posted by al...@aim.com.
Hi,
I have found that after
bin/nutch generate -topN 1000
only 1000 of the urls have been marked with gnmrk (the generate mark).
Then
bin/nutch fetch -all
skips all urls that do not have gnmrk,
according to the code
    Utf8 mark = Mark.GENERATE_MARK.checkMark(page);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Skipping " + TableUtil.unreverseUrl(key)
            + "; different batch id (" + mark + ")");
      }
      return;
    }
since shouldProcess(mark, batchId) returns false if mark is null.
Then
bin/nutch parse -all
skips all urls that do not have a fetch mark,
according to the code
    Utf8 mark = Mark.FETCH_MARK.checkMark(page);
    String unreverseKey = TableUtil.unreverseUrl(key);
    if (!NutchJob.shouldProcess(mark, batchId)) {
      LOG.info("Skipping " + unreverseKey + "; different batch id");
      return;
    }
These are logged at INFO level, and they are the messages you see in the log file.
So it seems to me that the -all option to fetch, parse and solrindex does not work as expected.
Alex.
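The check described above can be modelled with a small self-contained sketch. It is not the Nutch source. Only the null-mark rule is stated in this thread. The "-all accepts any marked row" rule follows the description elsewhere in the thread, and the equality rule is inferred from the "different batch id" log message. Strings replace Utf8, and the `ALL` constant and the sample batch ids are hypothetical.

```java
public class BatchMarkSketch {
    // Assumed sentinel for the -all option; the real constant may be named and typed differently.
    public static final String ALL = "-all";

    /**
     * Illustrative stand-in for a shouldProcess(mark, batchId) check.
     * mark is the value left on the row by the previous stage, or null if absent.
     */
    public static boolean shouldProcess(String mark, String batchId) {
        if (mark == null) {
            return false;             // row was never marked by the previous stage
        }
        if (ALL.equals(batchId)) {
            return true;              // -all: any row that carries a mark qualifies
        }
        return mark.equals(batchId);  // otherwise the mark must match the requested batch id
    }

    public static void main(String[] args) {
        System.out.println(shouldProcess(null, ALL));               // unmarked row
        System.out.println(shouldProcess("batch-1", ALL));          // marked row, -all
        System.out.println(shouldProcess("batch-1", "batch-2"));    // marked row, other batch
    }
}
```

In this model, a row with no mark is skipped even under -all, which matches the "Skipping ...; different batch id" messages for urls that were never generated or fetched.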