Posted to user@nutch.apache.org by forwardswing <wa...@sohu.com> on 2012/05/14 05:24:29 UTC

Can't retrieve Tika parser for mime-type text/javascript

When I use Nutch 1.2, the following error always occurs:
dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type
text/javascript
main.js: failed(2,0): Can't retrieve Tika parser for mime-type
text/javascript
Progress.js: failed(2,0): Can't retrieve Tika parser for mime-type
text/javascript

My parse-plugins.xml is:
	<mimeType name="text/html">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="application/xhtml+xml">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="application/rss+xml">
		<plugin id="parse-rss" />
		<plugin id="feed" />
	</mimeType>

	<mimeType name="application/x-bzip2">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-gzip">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-javascript">
		<plugin id="parse-js" />
	</mimeType>

	<mimeType name="application/x-shockwave-flash">
		<plugin id="parse-swf" />
	</mimeType>

	<mimeType name="application/zip">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="text/xml">
		<plugin id="parse-html" />
		<plugin id="parse-rss" />
		<plugin id="feed" />
	</mimeType>

	<mimeType name="application/vnd.nutch.example.cat">
		<plugin id="parse-ext" />
	</mimeType>

	<mimeType name="application/vnd.nutch.example.md5sum">
		<plugin id="parse-ext" />
	</mimeType>

	<mimeType name="application/javascript">
		<plugin id="parse-tika" />
	</mimeType>
	<mimeType name="text/javascript">
		<plugin id="parse-tika" />
	</mimeType>

	
	<aliases>
		<alias name="parse-tika"
			extension-id="org.apache.nutch.parse.tika.Parser" />
		<alias name="parse-ext" extension-id="ExtParser" />
		<alias name="parse-html"
			extension-id="org.apache.nutch.parse.html.HtmlParser" />
		<alias name="parse-js" extension-id="JSParser" />
		<alias name="parse-msexcel"
			extension-id="org.apache.nutch.parse.msexcel.MSExcelParser" />
		<alias name="parse-mspowerpoint"
			extension-id="org.apache.nutch.parse.mspowerpoint.MSPowerPointParser" />
		<alias name="parse-msword"
			extension-id="org.apache.nutch.parse.msword.MSWordParser" />
		<alias name="parse-oo"
			extension-id="org.apache.nutch.parse.oo.OpenDocument.Text" />
		<alias name="parse-pdf"
			extension-id="org.apache.nutch.parse.pdf.PdfParser" />
		<alias name="parse-rss"
			extension-id="org.apache.nutch.parse.rss.RSSParser" />
		<alias name="feed"
			extension-id="org.apache.nutch.parse.feed.FeedParser" />
		<alias name="parse-swf"
			extension-id="org.apache.nutch.parse.swf.SWFParser" />
		<alias name="parse-text"
			extension-id="org.apache.nutch.parse.text.TextParser" />
		<alias name="parse-zip"
			extension-id="org.apache.nutch.parse.zip.ZipParser" />
	</aliases>


and nutch-site.xml is:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>



Can anyone help me?

--
View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by forwardswing <wa...@sohu.com>.
I have a page that is driven mainly by JavaScript and Ajax,
so I need to parse it.

Thanks a lot.


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by Lewis John Mcgibbney <le...@gmail.com>.
One final point which I forgot.
The point of the parse-js plugin is to extract outlinks from JS pages.
The page you supplied contained only one outlink, to a page which no
longer exists, so depending on what your purposes are you may not find
the parse-js plugin of much help.
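
To make "extract outlinks from JS" concrete, here is a minimal sketch of the general idea. It is not the parse-js plugin's actual algorithm. The class name, the regular expression, and the sample script are all made up for illustration, and the sketch only picks up absolute http(s) URLs that appear inside quoted string literals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not code from Nutch.
final class JsOutlinkSketch {
    // Assumption: an outlink is an absolute http(s) URL inside a
    // single- or double-quoted string literal.
    private static final Pattern QUOTED_URL =
            Pattern.compile("[\"'](https?://[^\"'\\s]+)[\"']");

    // Collect every quoted URL found in the script source, in order.
    static List<String> extractOutlinks(String jsSource) {
        List<String> links = new ArrayList<>();
        Matcher m = QUOTED_URL.matcher(jsSource);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var home = 'http://example.com/index.html';\n"
                + "function go() { window.location = \"https://example.org/a\"; }\n"
                + "var n = 42;";
        // Prints the two URLs found in the sample script.
        System.out.println(extractOutlinks(js));
    }
}
```

A script that builds its URLs dynamically, or that contains no URLs at all, yields an empty list, which is why a JS parser of this kind adds little for pages like that.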

Lewis

On Fri, May 18, 2012 at 11:09 AM, Lewis John Mcgibbney
<le...@gmail.com> wrote:
> I tried configuring my instance to fetch and parse your page
> [...]



-- 
Lewis

Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I tried configuring my instance to fetch and parse your page, with the
following result:

lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local/bin$ ./nutch parsechecker http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
fetching: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
parsing: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
contentType: application/javascript
signature: 4bf7aa15c0e79cb2330bc80c417f0a55
---------
Url
---------------
http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
---------
ParseData
---------
Version: 5
Status: UNKNOWN!(-53,0): Content not JavaScript: 'application/javascript'
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
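
A possible explanation for the odd UNKNOWN!(-53,0) status above, offered as a sketch rather than a confirmed diagnosis: the getParse code quoted further down in this message passes ParseStatus.FAILED_INVALID_FORMAT as the first constructor argument. Assuming that argument is treated as the major status code, that it is stored in a byte, and that FAILED_INVALID_FORMAT has the value 203 (none of which is confirmed in this thread), the value wraps around to -53. If that code is then used to index a table of status names, it would also account for the ArrayIndexOutOfBoundsException: -53 reported elsewhere in this thread. The class and the name table below are hypothetical.

```java
// Hypothetical illustration; not code from Nutch.
final class ParseStatusCodeSketch {
    // Assumed value of the FAILED_INVALID_FORMAT code (not confirmed here).
    static final int FAILED_INVALID_FORMAT = 203;

    // Narrow an int to a byte, as a byte-typed status field would.
    static byte asMajorCode(int code) {
        return (byte) code;
    }

    public static void main(String[] args) {
        // Illustrative name table with three entries, indexed by major code.
        String[] majorCodes = {"notparsed", "success", "failed"};

        byte major = asMajorCode(FAILED_INVALID_FORMAT);
        System.out.println(major); // 203 - 256 = -53

        try {
            System.out.println(majorCodes[major]);
        } catch (ArrayIndexOutOfBoundsException e) {
            // A negative index can never be valid, so the lookup fails.
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

The narrowing conversion keeps only the low eight bits, so any code above 127 becomes negative when stored in a byte.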

So I tried a small experiment to see if I could hack a solution, but
unfortunately I only got as far as finding that, beginning on line 152
of the JSParseFilter class, we see:

public ParseResult getParse(Content c) {
    String type = c.getContentType();
    // Reject any non-empty content type that does not start with
    // "application/x-javascript".
    if (type != null && !type.trim().equals("")
        && !type.toLowerCase().startsWith("application/x-javascript"))
      return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
              "Content not JavaScript: '" + type + "'")
          .getEmptyParseResult(c.getUrl(), getConf());

It appears from the ParserChecker output that the parse fails with the
FAILED_INVALID_FORMAT status built in this check, which is the
"Content not JavaScript" message we get. If you are going to focus
on getting the plugin to actually parse your files, I would begin
there. However, I wouldn't expect miracles from the parser if it is
geared specifically to the MIME type application/x-javascript.
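
As a starting point for that experiment, a relaxed version of the content-type test might look like the sketch below. This is an untested idea, not a patch to the plugin. The class name, the method name, and the list of extra prefixes are assumptions, and whether the rest of the parser copes with such content is a separate question. Like the quoted check, it lets null or blank content types through. Unlike the quoted check, it trims the type before comparing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not code from Nutch.
final class JsContentTypeCheck {
    // Only the first prefix appears in the quoted check; the other two
    // are assumptions added for this sketch.
    private static final List<String> JS_PREFIXES = Arrays.asList(
            "application/x-javascript",
            "application/javascript",
            "text/javascript");

    // Null or blank types pass, as in the quoted condition; any other
    // type must start with one of the known JavaScript prefixes.
    static boolean isAccepted(String type) {
        if (type == null || type.trim().isEmpty()) {
            return true;
        }
        String normalized = type.trim().toLowerCase(Locale.ROOT);
        for (String prefix : JS_PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("application/javascript"));
        System.out.println(isAccepted("text/javascript; charset=UTF-8"));
        System.out.println(isAccepted("text/html"));
    }
}
```

Matching on prefixes rather than exact strings keeps parameters such as a charset suffix from causing a rejection.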

hth

Lewis

On Fri, May 18, 2012 at 6:12 AM, forwardswing <wa...@sohu.com> wrote:
> [...]
> There is still an error:
> dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type
> text/javascript
> [...]



-- 
Lewis

Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by forwardswing <wa...@sohu.com>.
First of all, thank you very much for your reply.

I followed your suggestion and made the following modification:

<mimeType name="application/javascript">
	<plugin id="parse-js" />
</mimeType>
<mimeType name="text/javascript">
	<plugin id="parse-js" />
</mimeType>

<alias name="parse-js" extension-id="org.apache.nutch.parse.js.JSParser" />

There is still an error:
dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type
text/javascript
Here is the JS file to be parsed; could you please try it in your
environment?

http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by Lewis John Mcgibbney <le...@gmail.com>.
I see some problems from the thread.

1) Please ensure both of the following are mapped to parse-js, as
Markus suggested:

<mimeType name="application/javascript">
        <plugin id="parse-js" />
</mimeType>
<mimeType name="text/javascript">
        <plugin id="parse-js" />
</mimeType>

2) Your alias for the parse-js plugin class is incorrect. You can find
the correct path here [0].

3) Please ensure that your regex-urlfilter configuration does NOT skip
URLs ending in .js or .JS.

4) I tried fetching and parsing one of the links you provided in your
thread... which did not work. Is there maybe something else at play
here?

[0] http://svn.apache.org/repos/asf/nutch/tags/release-1.2/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/
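
On point 3 above, a quick way to see whether a suffix-exclusion rule would drop a script URL is to try the expression directly. The two patterns below are made up for illustration and are not copied from any shipped regex-urlfilter.txt; substitute the expression from your own exclusion line. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not code from Nutch.
final class UrlSuffixFilterSketch {
    // Illustrative exclusion rule that still lists js and JS.
    static final Pattern SKIP_WITH_JS =
            Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS|js|JS)$");

    // The same rule with the js and JS alternatives removed.
    static final Pattern SKIP_WITHOUT_JS =
            Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS)$");

    // A URL is skipped when the exclusion rule matches anywhere in it.
    static boolean isSkipped(Pattern rule, String url) {
        return rule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://10.31.8.29:8080/AWIsys/dtree.js";
        System.out.println(isSkipped(SKIP_WITH_JS, url));    // true
        System.out.println(isSkipped(SKIP_WITHOUT_JS, url)); // false
    }
}
```

If the first case matches your configuration, the script URLs are filtered out before any parser is chosen, so the parse-plugins.xml mappings never come into play for them.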

On Wed, May 16, 2012 at 3:15 PM, forwardswing <wa...@sohu.com> wrote:
> Is there a way to resolve this ?



-- 
Lewis

Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by forwardswing <wa...@sohu.com>.
Is there a way to resolve this?


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by Markus Jelsma <ma...@openindex.io>.
I see, it doesn't work. The JSParser is known not to work very well, or
at all. Why do you want to parse JS anyway? It's not a very common
practice.

On Monday 14 May 2012 01:35:01 forwardswing wrote:
> [...]
> Error parsing: http://10.31.8.29:8080/AWIsys/dtree.js: UNKNOWN!(-53,0):
> Content not JavaScript: 'text/javascript'
> [...]
> What's the meaning of "-53"
-- 
Markus Jelsma - CTO - Openindex


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by forwardswing <wa...@sohu.com>.
I am sincerely waiting for your reply.


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by forwardswing <wa...@sohu.com>.
I modified the parse-plugins.xml snippet from:
<mimeType name="text/javascript">
	<plugin id="parse-tika" />
</mimeType>

to:
<mimeType name="text/javascript">
	<plugin id="parse-js" />
</mimeType>

but now a different error occurs:
Error parsing: http://10.31.8.29:8080/AWIsys/dtree.js: UNKNOWN!(-53,0):
Content not JavaScript: 'text/javascript'
 fetch of http://10.31.8.29:8080/AWIsys/dtree.js failed with:
java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/main.js: UNKNOWN!(-53,0):
Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/main.js failed with:
java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/Progress.js: UNKNOWN!(-53,0):
Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/Progress.js failed with:
java.lang.ArrayIndexOutOfBoundsException: -53

Error parsing: http://10.31.8.29:8080/AWIsys/table_sorter_script.js:
UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/table_sorter_script.js failed with:
java.lang.ArrayIndexOutOfBoundsException: -53


What does "-53" mean?

If necessary, I can provide the JS files.

Thank you for your help.


Re: Can't retrieve Tika parser for mime-type text/javascript

Posted by Markus Jelsma <ma...@openindex.io>.
 You have text/javascript mapped to Tika, but Tika does not have a parser
 for this MIME type. Remove the parse-tika mappings and map the type to
 parse-js instead. That should work, in the sense that the proper parser
 will be invoked.
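
 To double-check which plugins a given MIME type is mapped to, a small standalone lookup over the XML can help. This is only a sketch using the JDK's DOM parser, not how Nutch itself reads the file. The class name, the method name, and the wrapping parse-plugins root element in the sample string are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration; not code from Nutch.
final class ParsePluginsLookup {
    // Return the plugin ids listed under the mimeType element whose
    // name attribute equals the requested MIME type, in document order.
    static List<String> pluginsFor(String xml, String mimeType) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        List<String> ids = new ArrayList<>();
        NodeList types = doc.getElementsByTagName("mimeType");
        for (int i = 0; i < types.getLength(); i++) {
            Element type = (Element) types.item(i);
            if (!mimeType.equals(type.getAttribute("name"))) {
                continue;
            }
            NodeList plugins = type.getElementsByTagName("plugin");
            for (int j = 0; j < plugins.getLength(); j++) {
                ids.add(((Element) plugins.item(j)).getAttribute("id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Assumed root element name; the mappings mirror the posted config.
        String xml = "<parse-plugins>"
                + "<mimeType name=\"application/x-javascript\"><plugin id=\"parse-js\"/></mimeType>"
                + "<mimeType name=\"text/javascript\"><plugin id=\"parse-tika\"/></mimeType>"
                + "</parse-plugins>";
        System.out.println(pluginsFor(xml, "text/javascript")); // [parse-tika]
    }
}
```

 With the posted configuration the lookup for text/javascript returns parse-tika, which matches the error message. After the change it should return parse-js.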

 On Sun, 13 May 2012 20:24:29 -0700 (PDT), forwardswing 
 <wa...@sohu.com> wrote:
> [...]
> dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type
> text/javascript
> [...]

-- 
 Markus Jelsma - CTO - Openindex