Posted to user@nutch.apache.org by W <wi...@gmail.com> on 2009/01/21 05:27:09 UTC
problems when crawling mp3 files ...
Hello Guys,
I am trying to crawl mp3 files on a local filesystem and I get lots of
errors like this:
Error parsing: file:/home/wildan/personal/Musik/Indonesia/Ebit G Ade/06 rembulan menangis.mp3: org.apache.nutch.parse.ParseException: parser not found for contentType=application/octet-stream url=file:/home/wildan/personal/Musik/Indonesia/Ebit G Ade/06 rembulan menangis.mp3
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)
Has anyone here successfully parsed mp3 files?
The following is my nutch-site.xml file:
--cut-----
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- HTTP properties -->
<property>
<name>http.agent.name</name>
<value>tobeThink!</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.agent.description</name>
<value>Openthink Distributed Search</value>
<description>Openthink Distributed Search</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://wildanm.wordpress.com </value>
<description>Wildan M Blog</description>
</property>
<property>
<name>http.agent.email</name>
<value>MyEmail</value>
<description>wildan.m@gmail.com</description>
</property>
<!-- plugin properties -->
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
<name>plugin.auto-activation</name>
<value>true</value>
<description>Defines if some plugins that are not activated regarding
the plugin.includes and plugin.excludes properties must be automatically
activated if they are needed by some activated plugins.
</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(file|http)|urlfilter-regex|parse-(html|js|mp3|msexcel|mspowerpoint|msword|oo|pdf|)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|clustering-carrot2</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
<property>
<name>plugin.excludes</name>
<value></value>
<description>Regular expression naming plugin directory names to exclude.
</description>
</property>
<!-- file properties -->
<property>
<name>file.content.limit</name>
<value>-1</value>
</property>
<!-- clustering extension properties, carrot2 related -->
<property>
<name>extension.clustering.hits-to-cluster</name>
<value>100</value>
<description>Number of snippets retrieved for the clustering extension
if clustering extension is available and user requested results
to be clustered.</description>
</property>
<property>
<name>extension.clustering.extension-name</name>
<value></value>
<description>Use the specified online clustering extension. If empty,
the first available extension will be used. The "name" here refers to an 'id'
attribute of the 'implementation' element in the plugin descriptor XML
file.</description>
</property>
</configuration>
--cut----
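As a sanity check on the plugin.includes value above, here is a minimal Python sketch of how I assume the expression is applied: each plugin directory name has to match the whole regular expression. The helper is_included and the sample plugin names are only for illustration, and Python's re.fullmatch stands in for whatever matching Nutch (Java) really does, so treat it as an approximation.

```python
import re

# The plugin.includes value from the nutch-site.xml above, copied verbatim
# (split across lines only for readability).
PLUGIN_INCLUDES = (
    "protocol-(file|http)|urlfilter-regex|"
    "parse-(html|js|mp3|msexcel|mspowerpoint|msword|oo|pdf|)|"
    "index-(basic|more)|query-(basic|site|url|more)|summary-basic|"
    "scoring-opic|urlnormalizer-(pass|regex|basic)|clustering-carrot2"
)


def is_included(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the expression.

    Assumption: Nutch requires a full-string match, not a substring match.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Hypothetical sample of plugin directory names to try.
    for name in ("parse-mp3", "parse-text", "protocol-file",
                 "nutch-extensionpoints", "parse-"):
        print(name, is_included(name))
```

Under that assumption parse-mp3 is matched by the expression, so the include list itself does not seem to be what keeps the mp3 parser from being used. The empty alternative after "pdf|" only makes the bare name "parse-" match as well.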
Best Regards,
Wildan
--
---
tobeThink!
www.tobethink.com
Aligning IT and Education
>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana