Posted to dev@nutch.apache.org by Sivakumar_NCS <si...@ncs.com.sg> on 2008/05/21 07:27:39 UTC

Nutch Crawling - Failed for internet crawling

Hi,

I am a newbie to crawling and I am exploring the possibility of crawling
internet websites from my work PC. My work environment uses a proxy to
access the web, so I have configured the proxy information under
<NUTCH_HOME>/conf/ by overriding nutch-site.xml. The XML is included below
for reference.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>http.agent.name</name>
  <value>ABC</value>
  <description>ABC</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Acompany</value>
  <description>A company</description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description></description>
</property>
<property> 
<name>http.agent.email</name>
  <value></value>
  <description></description>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
<property>
  <name>http.proxy.host</name>
  <value>proxy.ABC.COM</value><!--MY WORK PROXY-->
  <description>The proxy hostname.  If empty, no proxy is
used.</description>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>
<property>
  <name>http.proxy.username</name>
  <value>ABCUSER</value><!--MY NETWORK USERID-->
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>
<property>
  <name>http.proxy.password</name>
  <value>XXXXX</value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>
<property> 
 <name>http.proxy.realm</name>
  <value>ABC</value><!--MY NETWORK DOMAIN-->
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>
<property>
  <name>http.agent.host</name>
  <value>xxx.xxx.xxx.xx</value><!--MY LOCAL PC'S IP-->
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>
</configuration>
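
A note on the proxy settings above, in case it helps with the 407 shown
further down: with protocol-httpclient the username and password are
registered against the realm named in http.proxy.realm, and as the property
description itself says, the realm should be left undefined unless the proxy
really requires it (or unless the proxy uses NTLM, in which case the value is
the Windows domain). If the proxy actually uses Basic or Digest
authentication with a realm string other than "ABC", the credentials are
never offered and the proxy keeps answering 407. A minimal sketch, assuming a
non-NTLM proxy, is to leave the realm empty so the credentials apply to any
realm:

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Left empty so the proxy credentials are offered for any realm.
  Set this to the Windows domain name only if the proxy requires NTLM.
  </description>
</property>

It may also be worth keeping only one of protocol-http and protocol-httpclient
in plugin.includes; both handle http URLs, but only protocol-httpclient uses
the proxy username/password settings, so listing just protocol-httpclient
avoids any ambiguity about which plugin performs the fetch.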

My crawl-urlfilter.txt is as follows:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.


My regex-urlfilter.txt is as follows:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/

# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
#+^http://([a-z0-9]*\.)*apache.org/
+^http://([a-z0-9]*\.)*yahoo.com/

# skip everything else
-.
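
A small aside on the accept rule that appears in both filter files: the dots
in yahoo.com are not escaped, so the pattern technically also matches hosts
such as www.yahooxcom. It is unlikely to matter in practice, but the stricter
form of that line (everything else unchanged) would be:

+^http://([a-z0-9]*\.)*yahoo\.com/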

Also, here is the console output / hadoop.log:

Administrator@Siva-ABC /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
tempDir:::/tmp/hadoop-administrator/mapred/temp/inject-temp-1144725146
Injector: Converting injected urls to crawl db entries.
map url: http://www.yahoo.com/
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080521130128
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080521130128
Fetcher: threads: 10
fetching http://www.yahoo.com/
http.proxy.host = proxy.abc.com
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = 65536
http.agent = abc/Nutch-0.9 (Acompany)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
fetch of http://www.yahoo.com/ failed with: Http code=407,
url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080521130128]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080521130140
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080521130140
Fetcher: threads: 10
fetching http://www.yahoo.com/
http.proxy.host = proxy.abc.com
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = 65536
http.agent = ABC/Nutch-0.9 (Acompany)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
fetch of http://www.yahoo.com/ failed with: Http code=407,
url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080521130140]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl/segments/20080521130154
Generator: filtering: false
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: crawl/segments/20080521130154
Fetcher: threads: 10
fetching http://www.yahoo.com/
http.proxy.host = proxy.abc.com
http.proxy.port = 8080
http.timeout = 10000
http.content.limit = 65536
http.agent = ABC/Nutch-0.9 (Acompany)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
fetch of http://www.yahoo.com/ failed with: Http code=407,
url=http://www.yahoo.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080521130154]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080521130128
LinkDb: adding segment: crawl/segments/20080521130140
LinkDb: adding segment: crawl/segments/20080521130154
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080521130128
Indexer: adding segment: crawl/segments/20080521130140
Indexer: adding segment: crawl/segments/20080521130154
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:137)

Administrator@Siva-ABC /cygdrive/d/nutch-0.9-IntranetS-Proxy-C
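
The repeated "Http code=407" above is the proxy answering "Proxy
Authentication Required", so nothing is ever fetched from yahoo.com; the
index therefore stays empty, and the Dedup IOException at the end looks like
a downstream symptom of that rather than a separate problem. A quick way to
check the proxy credentials outside of Nutch is a sketch along these lines,
using the values from the configuration above (add --proxy-ntlm if the proxy
expects NTLM):

curl -I --proxy http://proxy.ABC.COM:8080 --proxy-user 'ABCUSER:XXXXX' http://www.yahoo.com/

If curl also gets a 407 back, the credentials or the authentication scheme
are the problem; if curl succeeds, the Nutch-side settings (realm, plugin
selection) are the place to look.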



Please clarify what the issue with the configuration is, and let me know if
any configuration is missing. Your help will be greatly appreciated. Thanks
in advance.

Regards
Siva

-- 
View this message in context: http://www.nabble.com/Nutch-Crawling---Failed-for-internet-crawling-tp17356187p17356187.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.


RE: Nutch Crawling - Failed for internet crawling

Posted by Sivakumar Sivagnanam NCS <si...@ncs.com.sg>.
Hi,

Please find the files attached as requested. Thanks for the reply.

Thanks & Regards
Siva
65567233

-----Original Message-----
From: All day coders [mailto:rac.nosotros@gmail.com] 
Sent: Saturday, May 24, 2008 11:13 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Nutch Crawling - Failed for internet crawling

Do you mind attaching the configuration files? That way it is more
human-readable. The hadoop.log file will be useful too (if it is too big,
please compress it).


Re: Nutch Crawling - Failed for internet crawling

Posted by All day coders <ra...@gmail.com>.
Do you mind attaching the configuration files? That way it is more
human-readable. The hadoop.log file will be useful too (if it is too big,
please compress it).
