Posted to user@nutch.apache.org by forwardswing <wa...@sohu.com> on 2012/05/14 05:24:29 UTC
Can't retrieve Tika parser for mime-type text/javascript
When I use Nutch 1.2, the following error always occurs:
dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
main.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
Progress.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
My parse-plugins.xml is:
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
<mimeType name="application/xhtml+xml">
<plugin id="parse-html" />
</mimeType>
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/x-bzip2">
<plugin id="parse-zip" />
</mimeType>
<mimeType name="application/x-gzip">
<plugin id="parse-zip" />
</mimeType>
<mimeType name="application/x-javascript">
<plugin id="parse-js" />
</mimeType>
<mimeType name="application/x-shockwave-flash">
<plugin id="parse-swf" />
</mimeType>
<mimeType name="application/zip">
<plugin id="parse-zip" />
</mimeType>
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/vnd.nutch.example.cat">
<plugin id="parse-ext" />
</mimeType>
<mimeType name="application/vnd.nutch.example.md5sum">
<plugin id="parse-ext" />
</mimeType>
<mimeType name="application/javascript">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="text/javascript">
<plugin id="parse-tika" />
</mimeType>
<aliases>
<alias name="parse-tika"
extension-id="org.apache.nutch.parse.tika.Parser" />
<alias name="parse-ext" extension-id="ExtParser" />
<alias name="parse-html"
extension-id="org.apache.nutch.parse.html.HtmlParser" />
<alias name="parse-js" extension-id="JSParser" />
<alias name="parse-msexcel"
extension-id="org.apache.nutch.parse.msexcel.MSExcelParser" />
<alias name="parse-mspowerpoint"
extension-id="org.apache.nutch.parse.mspowerpoint.MSPowerPointParser" />
<alias name="parse-msword"
extension-id="org.apache.nutch.parse.msword.MSWordParser" />
<alias name="parse-oo"
extension-id="org.apache.nutch.parse.oo.OpenDocument.Text" />
<alias name="parse-pdf"
extension-id="org.apache.nutch.parse.pdf.PdfParser" />
<alias name="parse-rss"
extension-id="org.apache.nutch.parse.rss.RSSParser" />
<alias name="feed"
extension-id="org.apache.nutch.parse.feed.FeedParser" />
<alias name="parse-swf"
extension-id="org.apache.nutch.parse.swf.SWFParser" />
<alias name="parse-text"
extension-id="org.apache.nutch.parse.text.TextParser" />
<alias name="parse-zip"
extension-id="org.apache.nutch.parse.zip.ZipParser" />
</aliases>
And my nutch-site.xml is:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Can anyone help me?
--
View this message in context: http://lucene.472066.n3.nabble.com/Can-t-retrieve-Tika-parser-for-mime-type-text-javascript-tp3983599.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by forwardswing <wa...@sohu.com>.
I have a page that is mainly driven by JavaScript and Ajax,
so I need to parse it.
Thanks a lot.
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by Lewis John Mcgibbney <le...@gmail.com>.
One final point, which I forgot to mention.
The point of the parse-js plugin is to extract outlinks from JS pages.
The page you supplied contained only one outlink, to a page which no
longer exists, so depending on your purposes you may not find
the parse-js plugin of much help.
Lewis
--
Lewis
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by Lewis John Mcgibbney <le...@gmail.com>.
I tried configuring my instance to fetch and parse your page, with the
following result:
lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local/bin$ ./nutch parsechecker http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
fetching: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
parsing: http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
contentType: application/javascript
signature: 4bf7aa15c0e79cb2330bc80c417f0a55
---------
Url
---------------
http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
---------
ParseData
---------
Version: 5
Status: UNKNOWN!(-53,0): Content not JavaScript: 'application/javascript'
Title:
Outlinks: 0
Content Metadata:
Parse Metadata:
So I tried a small experiment to see if I could hack a solution, but
unfortunately the furthest I got was finding that, beginning on line 152
of the JSParseFilter class, we see
  public ParseResult getParse(Content c) {
    String type = c.getContentType();
    if (type != null && !type.trim().equals("") &&
        !type.toLowerCase().startsWith("application/x-javascript"))
      return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
          "Content not JavaScript: '" + type +
          "'").getEmptyParseResult(c.getUrl(), getConf());
It appears from the ParserChecker output that this check is what returns the
FAILED_INVALID_FORMAT status we get. If you are going to focus
on getting the plugin to actually parse your files, I would begin
there; however, I wouldn't expect miracles from the parser if it is
geared specifically to the MIME type application/x-javascript.
hth
Lewis
--
Lewis
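The content-type check quoted in this message can be sketched in isolation. The following is a minimal, hypothetical Java sketch and not Nutch code: the class JsContentTypeCheck and both method names are invented for illustration. It mirrors the quoted condition and shows one possible relaxed variant that would also accept the application/javascript and text/javascript types reported in this thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class JsContentTypeCheck {
    // JavaScript MIME types that appear in this thread.
    private static final Set<String> JS_TYPES = new HashSet<>(Arrays.asList(
            "application/x-javascript", "application/javascript", "text/javascript"));

    /** Mirrors the quoted condition: true when that check would let the type through. */
    static boolean acceptedByQuotedCheck(String type) {
        boolean rejected = type != null && !type.trim().equals("")
                && !type.toLowerCase(Locale.ROOT).startsWith("application/x-javascript");
        return !rejected;
    }

    /** Relaxed variant (an assumption, not a tested fix): accepts all three types. */
    static boolean acceptedByRelaxedCheck(String type) {
        if (type == null || type.trim().isEmpty()) {
            return true; // same behaviour as the quoted check for a missing type
        }
        String base = type.toLowerCase(Locale.ROOT).trim();
        int semicolon = base.indexOf(';');
        if (semicolon >= 0) {
            base = base.substring(0, semicolon).trim(); // drop parameters such as charset
        }
        return JS_TYPES.contains(base);
    }

    public static void main(String[] args) {
        String[] types = {"application/x-javascript", "application/javascript",
                "text/javascript", "text/html"};
        for (String t : types) {
            System.out.println(t + " quoted=" + acceptedByQuotedCheck(t)
                    + " relaxed=" + acceptedByRelaxedCheck(t));
        }
    }
}
```

Passing the type check only gets the content as far as the parser; whether the parser then extracts anything useful is a separate question.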
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by forwardswing <wa...@sohu.com>.
First of all, thank you very much for your reply.
I have followed your suggestion and made the following modification:
<mimeType name="application/javascript">
<plugin id="parse-js" />
</mimeType>
<mimeType name="text/javascript">
<plugin id="parse-js" />
</mimeType>
<alias name="parse-js" extension-id="org.apache.nutch.parse.js.JSParser" />
There is still an error:
dtree.js: failed(2,0): Can't retrieve Tika parser for mime-type text/javascript
Here is the JS file to be parsed. Could you please try it in your
environment?
http://lucene.472066.n3.nabble.com/file/n3984604/dtree.js
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by Lewis John Mcgibbney <le...@gmail.com>.
I see some problems from the thread.
1) Please ensure both of the following MIME types are mapped to parse-js
rather than parse-tika, as Markus suggested. At the moment you have:
<mimeType name="application/javascript">
<plugin id="parse-tika" />
</mimeType>
<mimeType name="text/javascript">
<plugin id="parse-tika" />
</mimeType>
2) Your alias for the parse-js plugin class is incorrect. You can find
the correct path here [0]
3) Please ensure that your regex-urlfilter configuration does NOT skip
URLs with the JS and js file suffixes
4) I tried fetching and parsing one of the links you provided in your
thread... which did not work. Is there maybe something else at play
here?
[0] http://svn.apache.org/repos/asf/nutch/tags/release-1.2/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/
--
Lewis
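Point 3 in this message can be checked along the following lines. This is only a sketch: the two patterns are hypothetical examples written in the style of an exclusion rule in regex-urlfilter.txt (where a leading '-' marks a rule as an exclusion), and the class and method names are invented for illustration. Compare them with the rules in your own file.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Nutch.
final class UrlSuffixFilterCheck {
    // Example exclusion pattern that still lists the js/JS suffixes.
    static final Pattern SKIP_WITH_JS =
            Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS|js|JS)$");
    // The same example pattern with the js/JS suffixes removed.
    static final Pattern SKIP_WITHOUT_JS =
            Pattern.compile("\\.(gif|GIF|jpg|JPG|css|CSS)$");

    /** True when the URL matches the exclusion pattern, i.e. it would be skipped. */
    static boolean isSkipped(Pattern skipRule, String url) {
        return skipRule.matcher(url).find();
    }

    public static void main(String[] args) {
        String url = "http://10.31.8.29:8080/AWIsys/dtree.js";
        System.out.println("with js suffix listed:    " + isSkipped(SKIP_WITH_JS, url));
        System.out.println("without js suffix listed: " + isSkipped(SKIP_WITHOUT_JS, url));
    }
}
```

If a rule like the first pattern is active, .js URLs are filtered out before any parser is consulted, so the parse-plugins.xml mapping never comes into play for them.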
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by forwardswing <wa...@sohu.com>.
Is there a way to resolve this ?
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by Markus Jelsma <ma...@openindex.io>.
I see, it doesn't work. The JSParser is known not to work very well, or work
at all. Why do you want to parse JS anyway? It's not a very common practice
to do so.
--
Markus Jelsma - CTO - Openindex
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by forwardswing <wa...@sohu.com>.
I am sincerely waiting for your reply.
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by forwardswing <wa...@sohu.com>.
I changed the parse-plugins.xml snippet from:
<mimeType name="text/javascript">
<plugin id="parse-tika" />
</mimeType>
to:
<mimeType name="text/javascript">
<plugin id="parse-js" />
</mimeType>
but now another error occurs:
Error parsing: http://10.31.8.29:8080/AWIsys/dtree.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/dtree.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53
Error parsing: http://10.31.8.29:8080/AWIsys/main.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/main.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53
Error parsing: http://10.31.8.29:8080/AWIsys/Progress.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/Progress.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53
Error parsing: http://10.31.8.29:8080/AWIsys/table_sorter_script.js: UNKNOWN!(-53,0): Content not JavaScript: 'text/javascript'
fetch of http://10.31.8.29:8080/AWIsys/table_sorter_script.js failed with: java.lang.ArrayIndexOutOfBoundsException: -53
What does "-53" mean?
If necessary, I can provide the JS files.
Thank you for your help.
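One possible reading of the "-53", offered as a guess rather than something confirmed in this thread: if the FAILED_INVALID_FORMAT constant has the value 203 and the ParseStatus constructor stores it in a signed byte as the major status code, the stored value wraps around to -53. Using that value as an array index would then fail in the way the log shows. The sketch below only demonstrates the arithmetic; the constant's value, the class name and the lookup table are all assumptions made for illustration.

```java
// Hypothetical demo for illustration only; it is not part of Nutch.
final class ParseStatusCodeDemo {
    // Assumption: the numeric value of the "invalid format" status constant.
    static final int ASSUMED_FAILED_INVALID_FORMAT = 203;

    /** Narrows an int status code to a signed byte, as a byte-typed field would store it. */
    static byte narrow(int code) {
        return (byte) code;
    }

    public static void main(String[] args) {
        byte stored = narrow(ASSUMED_FAILED_INVALID_FORMAT);
        System.out.println("stored major code: " + stored); // 203 - 256 = -53

        // Hypothetical name table indexed by major code.
        String[] names = {"notparsed", "success", "failed"};
        try {
            System.out.println(names[stored]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

Under these assumptions the number carries no meaning of its own; it is simply 203 after overflowing a signed byte.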
Re: Can't retrieve Tika parser for mime-type text/javascript
Posted by Markus Jelsma <ma...@openindex.io>.
You have text/javascript mapped to Tika, but Tika does not have a parser
for this MIME type. Remove the Tika mappings and map the type to parse-js
instead. That should work; that is, the proper parser should be invoked.
--
Markus Jelsma - CTO - Openindex