Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/06/20 23:14:28 UTC

How do I debug why a URL doesn't pass through generate despite being the only one?

Hello,

I've noticed that some URLs don't make it into my index. To debug, I created
a seed file containing only one of them (
http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
and tried to crawl for it on an empty crawldb. However, I notice that already
at the bin/nutch generate stage the script exits, reporting that there are
no URLs to fetch. So it has nothing to do with parsing or fetching (we
don't even reach the host yet). What could it be?
I've tried encoding it as
http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
but that didn't help.

STEPS TO REPRODUCE:

wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
unzip apache-nutch-1.3-src.zip
ant
cat > urls << __EOF__
http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
__EOF__
runtime/local/bin/nutch inject crawl urls
runtime/local/bin/nutch generate crawl crawl/segs -topN 1
# even w/o -topN you will get the same:
# Generator: 0 records selected for fetching, exiting ...
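
One way to narrow this down is to run the URL through the filter chain by
hand. A minimal sketch, assuming the URLFilterChecker tool that ships with
Nutch 1.x (it reads URLs from stdin and prefixes each with "+" if accepted
or "-" if rejected; the exact invocation may vary by version):

echo "http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997" \
  | runtime/local/bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A "-" in the output would show that a URL filter, not generate itself, is
dropping the URL before it ever reaches the crawldb.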

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: How do I debug why a URL doesn't pass through generate despite being the only one?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
I tried that, but my code won't compile anymore. I was convinced the conf
dir was external.

On Tue, Jun 21, 2011 at 8:16 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> Did you rebuild the Nutch job file with the updated configuration?
>
> > You were right, and indeed after fixing that it now works locally.
> > However, trying it on the server, it seems the configuration won't
> > update. I'm not sure why! Where is that documented?
> >
> > On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma
> > <ma...@openindex.io> wrote:
> > > You're the victim of the default regex url filter.
> > >
> > > 31      # skip URLs containing certain characters as probable queries, etc.
> > > 32      -[?*!@=]
> > >
> > > The injector won't inject that URL. This can be tricky indeed as the
> > > filters don't log rejected URLs.
> > >
> > > > Hello,
> > > >
> > > > I've noticed that some URLs don't make it into my index. To debug,
> > > > I created a seed file containing only one of them (
> > > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > > and tried to crawl for it on an empty crawldb. However, I notice
> > > > that already at the bin/nutch generate stage the script exits,
> > > > reporting that there are no URLs to fetch. So it has nothing to do
> > > > with parsing or fetching (we don't even reach the host yet). What
> > > > could it be?
> > > > I've tried encoding it as
> > > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > > but that didn't help.
> > > >
> > > > STEPS TO REPRODUCE:
> > > >
> > > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > > unzip apache-nutch-1.3-src.zip
> > > > ant
> > > > cat > urls << __EOF__
> > > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > > __EOF__
> > > > runtime/local/bin/nutch inject crawl urls
> > > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> > > > # even w/o -topN you will get the same:
> > > > # Generator: 0 records selected for fetching, exiting ...



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: How do I debug why a URL doesn't pass through generate despite being the only one?

Posted by Markus Jelsma <ma...@openindex.io>.
Did you rebuild the Nutch job file with the updated configuration?
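
A minimal sketch of that rebuild-and-redeploy cycle, assuming the stock
Nutch 1.3 ant targets and a hypothetical server path (adjust to your setup):

# repackage conf/ into the job file; in deploy mode the packaged copy is used
ant job
# copy the rebuilt job file to wherever the server-side crawl script expects it
scp build/apache-nutch-1.3.job user@server:/path/to/nutch/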

> You were right, and indeed after fixing that it now works locally.
> However, trying it on the server, it seems the configuration won't
> update. I'm not sure why! Where is that documented?
>
> On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma
> <ma...@openindex.io> wrote:
> > You're the victim of the default regex url filter.
> >
> > 31      # skip URLs containing certain characters as probable queries, etc.
> > 32      -[?*!@=]
> >
> > The injector won't inject that URL. This can be tricky indeed as the
> > filters don't log rejected URLs.
> >
> > > Hello,
> > >
> > > I've noticed that some URLs don't make it into my index. To debug, I
> > > created a seed file containing only one of them (
> > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > > and tried to crawl for it on an empty crawldb. However, I notice that
> > > already at the bin/nutch generate stage the script exits, reporting
> > > that there are no URLs to fetch. So it has nothing to do with parsing
> > > or fetching (we don't even reach the host yet). What could it be?
> > > I've tried encoding it as
> > > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > > but that didn't help.
> > >
> > > STEPS TO REPRODUCE:
> > >
> > > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > > unzip apache-nutch-1.3-src.zip
> > > ant
> > > cat > urls << __EOF__
> > > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > > __EOF__
> > > runtime/local/bin/nutch inject crawl urls
> > > runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> > > # even w/o -topN you will get the same:
> > > # Generator: 0 records selected for fetching, exiting ...

Re: How do I debug why a URL doesn't pass through generate despite being the only one?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.
You were right, and indeed after fixing that it now works locally. However,
trying it on the server, it seems the configuration won't update. I'm not
sure why! Where is that documented?

On Mon, Jun 20, 2011 at 11:22 PM, Markus Jelsma
<ma...@openindex.io> wrote:

> You're the victim of the default regex url filter.
>
> 31      # skip URLs containing certain characters as probable queries, etc.
> 32      -[?*!@=]
>
> The injector won't inject that URL. This can be tricky indeed as the
> filters don't log rejected URLs.
>
> > Hello,
> >
> > I've noticed that some URLs don't make it into my index. To debug, I
> > created a seed file containing only one of them (
> > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> > and tried to crawl for it on an empty crawldb. However, I notice that
> > already at the bin/nutch generate stage the script exits, reporting
> > that there are no URLs to fetch. So it has nothing to do with parsing
> > or fetching (we don't even reach the host yet). What could it be?
> > I've tried encoding it as
> > http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> > but that didn't help.
> >
> > STEPS TO REPRODUCE:
> >
> > wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> > unzip apache-nutch-1.3-src.zip
> > ant
> > cat > urls << __EOF__
> > http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> > __EOF__
> > runtime/local/bin/nutch inject crawl urls
> > runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> > # even w/o -topN you will get the same:
> > # Generator: 0 records selected for fetching, exiting ...
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: How do I debug why a URL doesn't pass through generate despite being the only one?

Posted by Markus Jelsma <ma...@openindex.io>.
You're the victim of the default regex url filter.

31 	# skip URLs containing certain characters as probable queries, etc.
32 	-[?*!@=] 

The injector won't inject that URL. This can be tricky indeed as the filters
don't log rejected URLs.
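
For reference, that rule lives in conf/regex-urlfilter.txt, and rules are
applied top to bottom with the first matching rule deciding. A minimal
sketch of a workaround, assuming you only want to admit this kind of story
URL (the accept pattern is illustrative, not part of the stock file):

# accept abcnews story URLs before the generic skip rule can reject them
+^http://abcnews\.go\.com/.*story\?id=
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]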

> Hello,
>
> I've noticed that some URLs don't make it into my index. To debug, I
> created a seed file containing only one of them (
> http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997)
> and tried to crawl for it on an empty crawldb. However, I notice that
> already at the bin/nutch generate stage the script exits, reporting that
> there are no URLs to fetch. So it has nothing to do with parsing or
> fetching (we don't even reach the host yet). What could it be?
> I've tried encoding it as
> http%3A%2F%2Fabcnews.go.com%2FTechnology%2Fgoogle-chromebook-works-great-long-online%2Fstory%3Fid%3D13850997,
> but that didn't help.
>
> STEPS TO REPRODUCE:
>
> wget http://apache.panu.it//nutch/apache-nutch-1.3-src.zip
> unzip apache-nutch-1.3-src.zip
> ant
> cat > urls << __EOF__
> http://abcnews.go.com/Technology/google-chromebook-works-great-long-online/story?id=13850997
> __EOF__
> runtime/local/bin/nutch inject crawl urls
> runtime/local/bin/nutch generate crawl crawl/segs -topN 1
> # even w/o -topN you will get the same:
> # Generator: 0 records selected for fetching, exiting ...