Posted to user@nutch.apache.org by W <wi...@gmail.com> on 2009/01/21 05:27:09 UTC

problems when crawling mp3 files ...

Hello Guys,

I am trying to crawl MP3 files on a local filesystem and I get lots of
errors like this:

Error parsing: file:/home/wildan/personal/Musik/Indonesia/Ebit G
Ade/06 rembulan menangis.mp3: org.apache.nutch.parse.ParseException:
parser not found for contentType=application/octet-stream
url=file:/home/wildan/personal/Musik/Indonesia/Ebit G Ade/06 rembulan
menangis.mp3
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:337)
	at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:179)

Has anyone here successfully parsed MP3 files?

The following is my nutch-site.xml file:

--cut-----
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>tobeThink!</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>


<property>
  <name>http.agent.description</name>
  <value>Openthink Distributed Search</value>
  <description>Openthink Distributed Search</description>
</property>


<property>
  <name>http.agent.url</name>
  <value>http://wildanm.wordpress.com </value>
  <description>Wildan M Blog</description>
</property>



<property>
  <name>http.agent.email</name>
  <value>MyEmail</value>
  <description>wildan.m@gmail.com</description>
</property>

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines whether plugins that are not activated by
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.
  </description>
</property>

<property>
  <name>plugin.includes</name>

<value>protocol-(file|http)|urlfilter-regex|parse-(html|js|mp3|msexcel|mspowerpoint|msword|oo|pdf|)|index-(basic|more)|query-(basic|site|url|more)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|clustering-carrot2</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need to include at least the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.
  </description>
</property>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>

<!-- clustering extension properties, carrot2 related -->

<property>
  <name>extension.clustering.hits-to-cluster</name>
  <value>100</value>
  <description>Number of snippets retrieved for the clustering extension
  if clustering extension is available and user requested results
  to be clustered.</description>
</property>

<property>
  <name>extension.clustering.extension-name</name>
  <value></value>
  <description>Use the specified online clustering extension. If empty,
  the first available extension will be used. The "name" here refers to an 'id'
  attribute of the 'implementation' element in the plugin descriptor XML
  file.</description>
</property>



</configuration>

--cut----
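
To rule out the plugin.includes value itself as the reason parse-mp3 is not
picked up, here is a minimal sketch that tests the expression from the file
above against a few plugin directory names. It is written in Python rather
than Java only for brevity. The helper name is_included is hypothetical, and
the whole-name matching is an assumption about how Nutch applies the
expression, not something taken from the Nutch source.

```python
import re

# Value of plugin.includes, copied verbatim from the nutch-site.xml above.
PLUGIN_INCLUDES = (
    r"protocol-(file|http)|urlfilter-regex|"
    r"parse-(html|js|mp3|msexcel|mspowerpoint|msword|oo|pdf|)|"
    r"index-(basic|more)|query-(basic|site|url|more)|summary-basic|"
    r"scoring-opic|urlnormalizer-(pass|regex|basic)|clustering-carrot2"
)


def is_included(plugin_id: str) -> bool:
    """Hypothetical helper: True if the plugin directory name is selected.

    Assumption: the expression has to match the whole name, not a substring.
    """
    return re.fullmatch(PLUGIN_INCLUDES, plugin_id) is not None


if __name__ == "__main__":
    # Print which of a few sample plugin names the expression selects.
    for plugin_id in ("parse-mp3", "parse-text", "protocol-file", "parse-"):
        print(plugin_id, is_included(plugin_id))
```

Under that assumption parse-mp3 is selected by the expression, so the
"parser not found for contentType=application/octet-stream" error may have
more to do with the detected content type than with the plugin list. The
empty alternative after "pdf|" also lets the bare name "parse-" match, which
looks unintended but should be harmless.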


Best Regards,
Wildan

-- 
---
tobeThink!
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana