Posted to user@nutch.apache.org by Ensheng Wang <nu...@yahoo.com.cn> on 2006/03/19 13:48:33 UTC

crawl by contentType and don't store data only build index

For example, I only want to crawl .mp3 files on the internet, store each file's description and URL, and index those. I don't want to store the mp3 file data itself.
  How can I do that?
  Thanks!


RE: crawl by contentType and don't store data only build index

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

There are a few stages:

1. In nutch-site.xml, set the property fetcher.store.content to false.
2. Write a parse filter that sets some metadata variables, such as the description, during the parse stage.
3. Write an index filter that adds your description variable to the index (or replaces the content field in the doc with your variable).
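
For stage 1, the override in nutch-site.xml would look roughly like the fragment below. This is a sketch that assumes the usual Hadoop-style property layout of the Nutch configuration files; only the property name and value come from the step above.

```xml
<!-- nutch-site.xml: ask the fetcher not to keep the raw fetched content -->
<property>
  <name>fetcher.store.content</name>
  <value>false</value>
</property>
```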

If you have many fields, you will also have to add a query filter.
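
The logic behind stages 2 and 3 can be sketched in plain Java, independent of Nutch. Everything here is hypothetical and for illustration only: the class Mp3IndexSketch and its methods are not part of Nutch's plugin API, the check uses the URL extension as a stand-in for a real content-type test, and the description is assumed to be supplied by the caller (for example from link text). A real implementation would put the same logic into a parse filter and an index filter plugin.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Standalone illustration; names are hypothetical, not Nutch's plugin API. */
public class Mp3IndexSketch {

    /** Decide whether a URL points at an mp3 file, judged by its extension only. */
    static boolean isMp3Url(String url) {
        String path = url;
        // Drop the query string and fragment before looking at the extension.
        int cut = path.indexOf('?');
        if (cut >= 0) path = path.substring(0, cut);
        cut = path.indexOf('#');
        if (cut >= 0) path = path.substring(0, cut);
        return path.toLowerCase(Locale.ROOT).endsWith(".mp3");
    }

    /** Parse stage: record the description as metadata instead of keeping file data. */
    static Map<String, String> parseMetadata(String description) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("description", description == null ? "" : description.trim());
        return meta;
    }

    /** Index stage: build the indexed document from the URL and description only. */
    static Map<String, String> toIndexDocument(String url, Map<String, String> meta) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("description", meta.getOrDefault("description", ""));
        return doc;
    }

    public static void main(String[] args) {
        String url = "http://example.com/music/song.mp3";
        if (isMp3Url(url)) {
            Map<String, String> meta = parseMetadata("  Live recording ");
            System.out.println(toIndexDocument(url, meta));
        }
    }
}
```

The indexed document carries only the url and description fields, which matches the goal of building an index without storing the mp3 data.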

Gal.
