Posted to user@nutch.apache.org by aceyin <ac...@126.com> on 2011/09/07 11:21:07 UTC

Generator: 0 records selected for fetching, exiting

  Hi:
    I ran into a strange problem when trying to use Nutch-1.3. I have listed what I did below; I hope someone can help me:

1. Operations
A. I tried to use Nutch-1.3 to crawl a web site that is protected by HTTP Basic authentication, but found that Nutch did not crawl anything after it finished running. After checking hadoop.log, I found the messages below:
2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2011-09-07 04:11:37,541 INFO  crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
I tried to find an answer on Google, but found nothing useful (a way to check the crawldb for this is sketched below, after item B).
B. So I changed the URL to a public site (such as www.yahoo.com) and ran the Nutch crawl again. This time Nutch worked well - all pages were crawled and indexed into Solr.
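For reference, a quick way to see what the generator has to work with is to dump the crawldb after injection; a minimal sketch, assuming the crawl directory is named "crawl" as created by bin/nutch crawl (the dump directory name is just an example):

  bin/nutch readdb crawl/crawldb -stats              # per-status counts (db_unfetched, db_fetched, ...)
  bin/nutch readdb crawl/crawldb -dump crawldb-dump  # dump every URL with its status and metadata

If the seed URL does not show up there with status db_unfetched, the generator has nothing to select, which would match the "Generator: 0 records selected for fetching" message.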
2. Configurations - the only difference between the configuration files for the two operations is the value of plugin.includes:
for operation A the value is: protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
for operation B the value is: protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
A. nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>
B. httpclient-auth.xml
<auth-configuration>
<credentials username="user" password="password">
      <default/>
</credentials>
</auth-configuration>
C. regex-urlfilter.txt
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+.
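These rules can be tested against the seed URL directly; a sketch, with the protected URL below standing in as a placeholder seed (as far as I know the urlfilter classes read URLs from stdin when run standalone and echo each one back prefixed with "+" for accepted or "-" for rejected):

  echo "http://xxxx.com/dev/xxxx/" | bin/nutch org.apache.nutch.urlfilter.regex.RegexURLFilter

Note that the -[?*!@=] rule rejects any URL containing ?, *, !, @ or =, so a seed URL with a query string would never make it past the filter.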
Those are all the configurations and operations I used, but for the site protected by HTTP Basic authentication I always get the messages above.
Could someone help me with this?

Thanks a lot ~

//BR

Re: Generator: 0 records selected for fetching, exiting

Posted by Markus Jelsma <ma...@openindex.io>.
So it is fetched. You can also check the parse output by using the tool: bin/nutch org.apache.nutch.parse.ParserChecker <url>. This also shows the outlinks.
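A concrete sketch of both checks (the URL and the segment name below are placeholders, not taken from this thread):

  bin/nutch org.apache.nutch.parse.ParserChecker http://xxxx.com/dev/xxxx/

  # inspect what was actually stored for a fetched segment
  bin/nutch readseg -dump crawl/segments/20110906165538 segdump

readseg writes a plain-text dump of the CrawlDatum, Content, ParseData and ParseText records for every URL in the segment, which shows whether any outlinks were extracted.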


Re:Re: Generator: 0 records selected for fetching, exiting

Posted by aceyin <ac...@126.com>.
Hi Markus,
    many thanks for your response. I'm sure protocol-httpclient is working, based on hadoop.log:
    you can see in the log below that Nutch tried twice to fetch the protected page:
    the first time the crawler got a "401" error, and on the second attempt it got the right result:


    ----- the 1st time / 401 returned ----
    2011-09-06 16:55:38,563 org.apache.commons.httpclient.HttpMethodDirector.executeMethod:194 : DEBUG httpclient.HttpMethodDirector - Retry authentication
    2011-09-06 16:55:38,563 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<title>401 Authorization Required</title>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</head><body>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<h1>Authorization Required</h1>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<p>This server could not verify that you[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "are authorized to access the document[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "requested.  Either you supplied the wrong[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "credentials (e.g., bad password), or your[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "browser doesn't understand how to supply[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "the credentials required.</p>[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<hr>[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<address>Apache/2.2.17 (Fedora) Server at xxxx.com Port 80</address>        [\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</body></html>[\n]"


    ---- try the 2nd time ----
2011-09-06 16:55:38,565 org.apache.commons.httpclient.HttpMethodBase.shouldCloseConnection:1008 : DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
2011-09-06 16:55:38,566 org.apache.commons.httpclient.HttpMethodDirector.authenticateHost:278 : DEBUG httpclient.HttpMethodDirector - Authenticating with BASIC 'xxxx SVN repository'@xxxx.com:80
2011-09-06 16:55:38,566 org.apache.commons.httpclient.params.HttpMethodParams.getCredentialCharset:384 : DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
    ---- got the right page source ----
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "GET http://xxxx.com/dev/xxxx/ HTTP/1.0[\r][\n]"
2011-09-06 16:55:38,815 org.apache.commons.httpclient.HttpMethodBase.addHostRequestHeader:1352 : DEBUG httpclient.HttpMethodBase - Adding Host request header
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "User-Agent: nutch-1.3/Nutch-1.3[\r][\n]"
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept: text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Encoding: x-gzip, gzip, deflate[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Proxy-Connection: Keep-Alive[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Authorization: Basic ZW5pYXlpbjpjaGFuZ2VtZQ==[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Host: xxxx.com[\r][\n]"
2011-09-06 16:55:38,817 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "[\r][\n]"
2011-09-06 16:55:38,848 org.apache.nutch.fetcher.Fetcher.run:1038 : INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Date: Tue, 06 Sep 2011 08:55:39 GMT[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Server: Apache/2.2.17 (Fedora)[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Last-Modified: Thu, 28 Jul 2011 06:05:39 GMT[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "ETag: W/"277655//xxxx/src"[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Accept-Ranges: bytes[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Length: 528[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Type: text/html; charset=UTF-8[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache: MISS from xxxx.com[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache-Lookup: MISS from xxxx.com:3128[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Via: 1.0 xxxx.com:3128 (squid/2.6.STABLE21)[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Proxy-Connection: keep-alive[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "[\r][\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head><title>dev - Revision 280006: /xxxx/src</title></head>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<body>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <h2>dev - Revision 280006: /xxxx/src</h2>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <ul>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="../">..</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="com/">com/</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="commons-logging.properties">commons-logging.properties</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="simplelog.properties">simplelog.properties</a></li>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " </ul>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <hr noshade><em>Powered by <a href="http://subversion.tigris.org/">Subversion</a> version 1.6.15 (r1038135).</em>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:84 : DEBUG wire.content - << "</body></html>"
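For completeness, the stock conf/httpclient-auth.xml also allows credentials to be scoped to a specific host, port and realm instead of being used as the default; a sketch, with the host and realm copied from the "Authenticating with BASIC" line above and the other values left as placeholders:

<auth-configuration>
  <credentials username="user" password="password">
    <authscope host="xxxx.com" port="80" realm="xxxx SVN repository"/>
  </credentials>
</auth-configuration>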






Re: Generator: 0 records selected for fetching, exiting

Posted by Markus Jelsma <ma...@openindex.io>.
I don't know if protocol-httpclient is still working at all. To narrow down the problem, check the HTTP logs of the protected server and your Nutch logs.
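For reference, wire-level HTTP logging like the output quoted elsewhere in this thread can be switched on through commons-httpclient's standard log categories in conf/log4j.properties; a minimal sketch:

  log4j.logger.org.apache.commons.httpclient=DEBUG
  log4j.logger.httpclient.wire.header=DEBUG
  log4j.logger.httpclient.wire.content=DEBUG

With those set, every request and response header and body should show up in hadoop.log, which makes it easy to see whether the 401/authentication exchange succeeds.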


-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350