Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2005/09/15 16:49:08 UTC

can't parse https

Hi,

I'm trying to run a crawl on our DMZ intranet, using the root URL
https://planet.abc.com/
However, Nutch says it can't parse https. I have enabled the
protocol-httpclient plugin and set crawl-urlfilter.txt to:

+^https://planet.abc.com/
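
As far as I understand it, that rule should let the root URL through. Here is a minimal Python sketch of how I read the filter semantics, assuming rules are tried in order, the first matching rule decides, "+" includes and "-" excludes. The accepts helper is made up for illustration and is not Nutch code:

```python
import re

# Hypothetical helper, not part of Nutch: applies "+pattern" / "-pattern"
# rules in order and lets the first rule whose pattern matches decide.
def accepts(rules, url):
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is skipped

rules = ["+^https://planet.abc.com/"]
print(accepts(rules, "https://planet.abc.com/"))  # https root URL
print(accepts(rules, "http://planet.abc.com/"))   # plain http variant
```

If that reading is right, the filter is not what is rejecting the page, which fits the log reporting that the fetch itself went okay.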

In nutch-site.xml I have:

<property>
  <name>plugin.includes</name>
  <value>protocol-(httpclient|http|file|ftp|file)|urlfilter-regex|parse-(text|html|js|msword|pdf|rss|ext)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description></description>
</property>

And this is what the crawl.log gives me:

050915 154033 Overall processing: Sorted 1 entries in 0.0090 seconds.
050915 154033 Overall processing: Sorted 0.0090 entries/second
050915 154033 FetchListTool completed
050915 154033 logging at INFO
050915 154034 fetching https://planet.abc.com/
050915 154034 http.proxy.host = null
050915 154034 http.proxy.port = 8080
050915 154034 http.timeout = 10000
050915 154034 http.content.limit = -1
050915 154034 http.agent = NutchCVS/0.7 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
050915 154034 http.auth.ntlm.username =
050915 154034 fetcher.server.delay = 1000
050915 154034 http.max.delays = 100
050915 154035 Configured Client
050915 154042 fetch okay, but can't parse https://planet.abc.com/, reason: failed(2,0): No external command defined for contentType:


Can anyone help me out please?

Thanks,

Ed.