Posted to user@nutch.apache.org by Jon Shoberg <jo...@shoberg.net> on 2005/09/23 19:26:42 UTC

Parser Policy - Re: No external command defined for contentType:

Following is the output from the fetcher and the headers from the Firefox
Web Developer toolbar.

I'd appreciate any thoughts.  Perhaps something for parser policy.  I've 
traced the source code a bit and nothing jumped out at me...

-j

--

050923 020413 fetch okay, but can't parse 
http://medicalcenter.osu.edu/pdfs/PatientEd/Materials/PDFDocs/procedure/handwsh.pdf, 
reason: failed(2,0): No external command defined for contentType:

Response Headers - 
http://medicalcenter.osu.edu/pdfs/PatientEd/Materials/PDFDocs/procedure/handwsh.pdf

Server: Microsoft-IIS/5.0
X-Powered-By: ASP.NET
Date: Fri, 23 Sep 2005 17:14:19 GMT
Content-Type: application/pdf
Accept-Ranges: bytes
Last-Modified: Mon, 21 Jun 2004 16:10:22 GMT
Etag: "02b341aa57c41:96b"
Content-Length: 85604

200 OK


050923 020507 fetch okay, but can't parse 
http://vet.osu.edu/sa/atcenter/vm522/webweek2/bovhd9.html, reason: 
failed(2,0): No external command defined for contentType:

Response Headers - http://vet.osu.edu/sa/atcenter/vm522/webweek2/bovhd9.html

Date: Fri, 23 Sep 2005 17:20:57 GMT
Server: Apache/1.3.33 (Darwin) PHP/4.3.11
Cache-Control: max-age=60
Expires: Fri, 23 Sep 2005 17:21:57 GMT
Last-Modified: Fri, 15 Apr 2005 15:49:06 GMT
Etag: "31dd9-1c0-425fe272"
Accept-Ranges: bytes
Content-Length: 448
Connection: close
Content-Type: text/html

200 OK


050923 021427 fetch okay, but can't parse 
http://felix.us.ohio-state.edu/search/o?SEARCH=21305366, reason: 
failed(2,0): No external command defined for contentType:

Response Headers - http://felix.us.ohio-state.edu/search/o?SEARCH=1755564

Server: III 100
Pragma: no-cache
Expires: 0
Date: Fri Sep 23 17:25:05 2005 GMT
MIME-version: 1.0
Set-Cookie: SESSION_ID=1127496305.29650; path=/
Content-Type: text/html; charset=UTF-8

200 OK





Vanderdray, Jake wrote:
> 	What's the URL?  I think someone else had a similar problem and
> it turned out that the URL produced a redirect to a URL containing a
> query string.  Since Nutch was configured not to fetch URLs with query
> strings, it just failed.
> 
> Jake.
> 
> -----Original Message-----
> From: Jon Shoberg [mailto:jon@shoberg.net] 
> Sent: Friday, September 23, 2005 12:27 PM
> To: nutch-user@lucene.apache.org
> Subject: No external command defined for contentType: 
> 
> Anyone else get the message "No external command defined for 
> contentType:" without any sort of MIME content type declaration?
> 
> I can see HTML, PDF, and other documents getting fetched but failing on 
> the parse with the above message.  When I go directly to the server and 
> manually get the document I see a valid MIME header for content type 
> returned in the HTTP response header.
> 
> Anyone else seen this?  I'm fetching content but not parsing it
> reliably. 
> 
> -j




Re: Parser Policy - Re: No external command defined for contentType:

Posted by Jérôme Charron <je...@gmail.com>.
> "should be ok" ... as in content will be parsed correctly or that we
> will not see the error message.

This is a workaround, not a way of hiding the error.
So yes, parsing should be ok!

> Lack of an error message does not mean
> things are ok. :)

Thanks for this great geek lesson!

> Pasted below is the file. This is from the release-0.7 build with
> patches, as 0.7.1 is being prepared.

OK, everything seems to be fine in the file.
Thanks in advance for giving us feedback on the workaround.

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Parser Policy - Re: No external command defined for contentType:

Posted by Jon Shoberg <jo...@shoberg.net>.
Jérôme Charron wrote:
> Hello Jon, and sorry for the late response,
> 
> 
>> I'd appreciate any thoughts. Perhaps something for parser policy. I've
>> traced the source code a bit and nothing jumped out at me...
> 
> 
> There are some currently identified issues with the parser policy (i.e.
> ParserFactory), and we are actively working on them.
> I don't understand why the parse-ext plugin is called in your case, when
> it should be the parse-pdf or parse-html plugin.
> Here's a workaround: if you don't need parse-ext (the plugin used
> to perform parsing with external commands), simply remove it and all
> should be ok.
> Could you please send me your /usr/local/nutch/plugins/parse-ext/plugin.xml
> file so that I can check whether something is wrong in it?
> 
> Regards
> 
> Jérôme
> 
> --
> http://motrech.free.fr/
> http://www.frutch.org/
> 


"should be ok" ... as in content will be parsed correctly, or that we
will not see the error message?  Lack of an error message does not mean
things are ok. :)

Pasted below is the file.  This is from the release-0.7 build with
patches, as 0.7.1 is being prepared.

<?xml version="1.0" encoding="UTF-8"?>
<plugin
    id="parse-ext"
    name="External Parser Plug-in"
    version="1.0.0"
    provider-name="nutch.org">



    <runtime>
       <library name="parse-ext.jar">
          <export name="*"/>
       </library>
    </runtime>

    <extension id="org.apache.nutch.parse.ext"
               name="ExtParse"
               point="org.apache.nutch.parse.Parser">

       <implementation id="ExtParser"
                       class="org.apache.nutch.parse.ext.ExtParser"
                       contentType="application/vnd.nutch.example.cat"
                       pathSuffix=""
                       command="./build/plugins/parse-ext/command"
                       timeout="10"/>

       <implementation id="ExtParser"
                       class="org.apache.nutch.parse.ext.ExtParser"
                       contentType="application/vnd.nutch.example.md5sum"
                       pathSuffix=""
                       command="./build/plugins/parse-ext/command"
                       timeout="20"/>

    </extension>

</plugin>


Re: Parser Policy - Re: No external command defined for contentType:

Posted by Jérôme Charron <je...@gmail.com>.
Hello Jon, and sorry for the late response,

> I'd appreciate any thoughts. Perhaps something for parser policy. I've
> traced the source code a bit and nothing jumped out at me...

There are some currently identified issues with the parser policy (i.e.
ParserFactory), and we are actively working on them.
I don't understand why the parse-ext plugin is called in your case, when
it should be the parse-pdf or parse-html plugin.
Here's a workaround: if you don't need parse-ext (the plugin used
to perform parsing with external commands), simply remove it and all
should be ok.
Could you please send me your /usr/local/nutch/plugins/parse-ext/plugin.xml
file so that I can check whether something is wrong in it?
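
A minimal sketch of that workaround, assuming the plugin.includes value
Jon quotes elsewhere in this thread and assuming the override is placed in
conf/nutch-site.xml: dropping "ext" from the parse-(...) group should keep
the parse-ext plugin from being loaded at all. Whether this also resolves
the underlying ParserFactory issue is not confirmed here.

```xml
<!-- conf/nutch-site.xml (assumed location for the override) -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as quoted in this thread, with "ext" removed from the
       parse-(...) group so that parse-ext no longer matches. -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss)|index-basic|query-(basic|site|url)</value>
</property>
```

With this in place, the startup log should report parse-ext on a
"not including:" line rather than a "parsing:" line, in the same way it
already does for other excluded plugins such as protocol-ftp.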

Regards

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: Parser Policy - Re: No external command defined for contentType:

Posted by Jon Shoberg <jo...@shoberg.net>.
Jérôme Charron wrote:
> 
>     Following is the output from the fetcher and the headers from the Firefox
>     Web Developer toolbar.
> 
>     I'd appreciate any thoughts.  Perhaps something for parser policy.  I've
>     traced the source code a bit and nothing jumped out at me...
> 
> Could you provide your plugin configuration and the Nutch startup logs?
> 
> Jérôme

Jerome,

   See below.

--

<property>
   <name>plugin.includes</name>
   <value>protocol-httpclient|urlfilter-regex|parse-(text|html|pdf|msword|rss|ext)|index-basic|query-(basic|site|url)</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.  By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins.
   </description>
</property>


--

050923 020323 parsing file:/usr/local/nutch/conf/nutch-default.xml
050923 020323 parsing file:/usr/local/nutch/conf/nutch-site.xml
050923 020323 No FS indicated, using default:local
050923 020323 Plugins: looking in: /usr/local/nutch/plugins
050923 020323 not including: /usr/local/nutch/plugins/protocol-ftp
050923 020323 not including: /usr/local/nutch/plugins/urlfilter-prefix
050923 020323 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.text.TextParser
050923 020323 not including: /usr/local/nutch/plugins/ontology
050923 020323 parsing: /usr/local/nutch/plugins/parse-ext/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.ext.ExtParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-rss/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.rss.RSSParser
050923 020323 parsing: 
/usr/local/nutch/plugins/protocol-httpclient/plugin.xml
050923 020323 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 impl: point=org.apache.nutch.protocol.Protocol 
class=org.apache.nutch.protocol.httpclient.Http
050923 020323 parsing: /usr/local/nutch/plugins/parse-pdf/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.pdf.PdfParser
050923 020323 not including: /usr/local/nutch/plugins/creativecommons
050923 020323 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.html.HtmlParser
050923 020323 parsing: /usr/local/nutch/plugins/parse-msword/plugin.xml
050923 020323 impl: point=org.apache.nutch.parse.Parser 
class=org.apache.nutch.parse.msword.MSWordParser
050923 020323 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.basic.BasicQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/protocol-http
050923 020323 not including: /usr/local/nutch/plugins/index-more
050923 020323 not including: /usr/local/nutch/plugins/query-more
050923 020323 not including: /usr/local/nutch/plugins/parse-js
050923 020323 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050923 020323 impl: point=org.apache.nutch.indexer.IndexingFilter 
class=org.apache.nutch.indexer.basic.BasicIndexingFilter
050923 020323 not including: /usr/local/nutch/plugins/language-identifier
050923 020323 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.site.SiteQueryFilter
050923 020323 not including: /usr/local/nutch/plugins/clustering-carrot2
050923 020323 not including: /usr/local/nutch/plugins/protocol-file
050923 020323 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050923 020323 impl: point=org.apache.nutch.net.URLFilter 
class=org.apache.nutch.net.RegexURLFilter
050923 020323 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050923 020323 impl: point=org.apache.nutch.searcher.QueryFilter 
class=org.apache.nutch.searcher.url.URLQueryFilter
050923 020323 logging at INFO
050923 020323 fetching 
http://vet.osu.edu/assets/courses/vm602/quotes/quote46.html
050923 020323 fetching 
http://vet.osu.edu/assets/courses/vm562/muir/sedatives.pdf
050923 020323 http.proxy.host = null
050923 020323 http.proxy.port = 8080
050923 020323 http.timeout = 10000
050923 020323 http.content.limit = 7168000
050923 020323 http.agent = Nutch/0.7 ( nutch; http://xxxxxxx, 
xxxxxx@xxxxxxxx)
050923 020323 http.auth.ntlm.username =
050923 020323 fetcher.server.delay = 3000
050923 020323 http.max.delays = 10
050923 020324 Configured Client

Re: Parser Policy - Re: No external command defined for contentType:

Posted by Jérôme Charron <je...@gmail.com>.
> Following is the output from the fetcher and the headers from the Firefox
> Web Developer toolbar.
>
> I'd appreciate any thoughts. Perhaps something for parser policy. I've
> traced the source code a bit and nothing jumped out at me...

Could you provide your plugin configuration and the Nutch startup logs?

Jérôme


--
http://motrech.free.fr/
http://www.frutch.org/