Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/07/17 02:46:28 UTC

[Nutch Wiki] Trivial Update of "bin/nutch prune" by RobPettengill

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_prune

------------------------------------------------------------------------------

This tool prunes unwanted content from existing Nutch indexes. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more queries from a list of Lucene queries read from a file (defined in the standard config file, or explicitly overridden from the command line). Segments should already be indexed; any segments that are missing indexes will be skipped.

- NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
+ NOTE 1: Queries are expressed in Lucene's !QueryParser syntax, so a knowledge of available Lucene document fields is required. This can be obtained by reading sources of index-basic and index-more plugins, or using tools like Luke. During query parsing a !WhitespaceAnalyzer is used - this choice has been made to minimize side effects of Analyzer on the final set of query terms. You can use link net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
- If additional level of control is required, an instance of {@link PruneChecker} can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - PrintFieldsChecker prints the values of selected index fields, StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.
+ If additional level of control is required, an instance of !PruneChecker can be provided to check each document before it's deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided - !PrintFieldsChecker prints the values of selected index fields, !StoreUrlsChecker stores the URLs of deleted documents to a file. Any of them can be activated by providing respective command-line options.

Typical Usage: bin/nutch net.nutch.tools.!PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title[[BR]]
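
The AND-ed checker chain described above can be illustrated with a minimal, self-contained Java sketch. It is not the tool's actual code: the Checker interface, its isPrunable(int) signature, and the class name PruneChainSketch are hypothetical stand-ins chosen for illustration, and the real PruneChecker interface may look different. The sketch only shows the veto semantics, where a document is deleted only if every checker in the chain agrees.

```java
import java.util.Arrays;
import java.util.List;

public class PruneChainSketch {

    /** Hypothetical stand-in for a checker: returns true if the document may be deleted. */
    interface Checker {
        boolean isPrunable(int docNum);
    }

    /**
     * Logical AND over all checkers: the first checker that returns false
     * vetoes the deletion. Stopping at the first veto is an assumption of
     * this sketch, not a documented behavior of the tool.
     */
    static boolean allAgree(List<Checker> checkers, int docNum) {
        for (Checker c : checkers) {
            if (!c.isPrunable(docNum)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Checker alwaysOk = docNum -> true;            // never vetoes
        Checker vetoEven = docNum -> docNum % 2 != 0; // vetoes even document numbers
        List<Checker> chain = Arrays.asList(alwaysOk, vetoEven);

        for (int doc = 0; doc < 4; doc++) {
            System.out.println("doc " + doc + " prunable: " + allAgree(chain, doc));
        }
    }
}
```

With this chain, only odd-numbered documents are reported as prunable, because the second checker vetoes the even ones. An empty chain approves every document, which matches the idea that checkers are an optional extra level of control.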