Posted to user@nutch.apache.org by Ted Yu <yu...@gmail.com> on 2009/12/17 20:03:52 UTC

parser not found exception

Hi,
When using bin/nutch, I see the following errors:

fetch of https://commerce.in.reuters.com/profile/pages/newsletter/begin.do failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=https

fetching http://static2.px.yelp.com/bphoto/LWNUe5ydBU-__kJRbXR7Og/ms
Error parsing: http://static2.px.yelp.com/bphoto/LWNUe5ydBU-__kJRbXR7Og/ms: org.apache.nutch.parse.ParseException: parser not found for contentType=image/jpeg url=http://static2.px.yelp.com/bphoto/LWNUe5ydBU-__kJRbXR7Og/ms
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)
fetching http://in.reuters.com/resources/js/article.js
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=223
Error parsing: http://in.reuters.com/resources/js/article.js: org.apache.nutch.parse.ParseException: parser not found for contentType=application/javascript url=http://in.reuters.com/resources/js/article.js
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

How do I configure the https protocol handler?

Also, when I don't want to crawl previously crawled URLs, how do I clear
the crawldb?

Thanks