
Posted to dev@nutch.apache.org by Taichi Ho <he...@gmail.com> on 2015/10/03 19:47:33 UTC

Tika parsing

I keep getting these Tika errors when I use Nutch.

Can't retrieve Tika parser for mime-type text/css
Can't retrieve Tika parser for mime-type application/javascript
Can't retrieve Tika parser for mime-type text/x-php
Can't retrieve Tika parser for mime-type text/aspdotnet

I haven't actually done any particular configuration. Everything is at its defaults.

parse-plugins.xml
<parse-plugins>

  
	<mimeType name="*">
	  <plugin id="parse-tika" />
	</mimeType>
 
	<mimeType name="application/rss+xml">
	    <plugin id="parse-tika" />
	    <plugin id="feed" />
	</mimeType>

	<mimeType name="application/x-bzip2">
		
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-gzip">
		
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-javascript">
		<plugin id="parse-js" />
	</mimeType>

	<mimeType name="application/x-shockwave-flash">
		<plugin id="parse-swf" />
	</mimeType>

	<mimeType name="application/zip">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="text/html">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="application/xhtml+xml">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="text/xml">
		<plugin id="parse-tika" />
		<plugin id="feed" />
	</mimeType>

       

	<mimeType name="application/vnd.nutch.example.cat">
		<plugin id="parse-ext" />
	</mimeType>

	<mimeType name="application/vnd.nutch.example.md5sum">
		<plugin id="parse-ext" />
	</mimeType>

	
	<aliases>
		<alias name="parse-tika" 
			extension-id="org.apache.nutch.parse.tika.TikaParser" />
		<alias name="parse-ext" extension-id="ExtParser" />
		<alias name="parse-html"
			extension-id="org.apache.nutch.parse.html.HtmlParser" />
		<alias name="parse-js" extension-id="JSParser" />
		<alias name="feed"
			extension-id="org.apache.nutch.parse.feed.FeedParser" />
		<alias name="parse-swf"
			extension-id="org.apache.nutch.parse.swf.SWFParser" />
		<alias name="parse-zip"
			extension-id="org.apache.nutch.parse.zip.ZipParser" />
	</aliases>
	
</parse-plugins>

nutch-site.xml

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin.
  By default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please
  enable protocol-httpclient, but be aware of possible intermittent
  problems with the underlying commons-httpclient library.
  </description>
</property>




--
View this message in context: http://lucene.472066.n3.nabble.com/Tika-parsing-tp4232582.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

Re: Tika parsing

Posted by Sebastian Nagel <wa...@googlemail.com>.

Hi,

this is not necessarily a problem.
It may be the case that Tika does not (yet)
provide parsers for these document types.
Unless you really want to "read" these
documents, it does not matter; it's just
a warning.
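
To see which of the plugins named in parse-plugins.xml are actually active, the plugin.includes expression can be checked against each plugin id. The Python sketch below is only an illustration: it assumes the expression has to match the whole plugin id, and the helper name enabled_plugins is made up for this example and is not part of Nutch.

```python
import re

# plugin.includes value copied from the nutch-site.xml quoted in this thread
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
    "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)"
)

# plugin ids referenced in the parse-plugins.xml quoted in this thread
PARSE_PLUGINS = [
    "parse-tika", "feed", "parse-zip", "parse-js",
    "parse-swf", "parse-html", "parse-ext",
]


def enabled_plugins(includes, plugin_ids):
    """Return the ids that the include expression matches in full.

    Assumption: the expression is applied as a whole-string match
    against each plugin id.
    """
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


if __name__ == "__main__":
    active = enabled_plugins(PLUGIN_INCLUDES, PARSE_PLUGINS)
    for pid in PARSE_PLUGINS:
        state = "enabled" if pid in active else "not enabled"
        print(f"{pid}: {state}")
```

Under that assumption only parse-tika and parse-html are active with this configuration, so a type such as application/javascript, which has no entry of its own in parse-plugins.xml, ends up at the "*" mapping and is handed to parse-tika.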

Sebastian

