Posted to java-commits@lucene.apache.org by yo...@apache.org on 2008/11/11 03:35:52 UTC
svn commit: r712922 [4/9] - in /lucene/java/trunk:
contrib/analyzers/src/java/org/apache/lucene/analysis/el/
contrib/analyzers/src/java/org/apache/lucene/analysis/fr/
contrib/analyzers/src/java/org/apache/lucene/analysis/miscellaneous/
contrib/analyzer...
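
The class comment in MoreLikeThis.java quotes a mail that proposes pruning candidate terms by term frequency and word length, then ranking the survivors by tf*idf. A minimal, Lucene-free sketch of that idea follows. The class name InterestingTermsSketch, the topTerms method, and the docFreqs map (a stand-in for IndexReader.docFreq()) are hypothetical, and the log-based idf formula is an assumption chosen for illustration, not necessarily what the committed code computes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the pruning heuristics; not part of Lucene. */
class InterestingTermsSketch {

    /**
     * Returns at most maxTerms words from tokens, highest tf*idf first.
     * docFreqs maps a word to the number of documents containing it and
     * numDocs is the total document count (both are assumed inputs).
     */
    static List<String> topTerms(List<String> tokens, Map<String, Integer> docFreqs,
                                 int numDocs, int minTermFreq, int minWordLen,
                                 int maxTerms) {
        // Count term frequencies, dropping words shorter than minWordLen.
        Map<String, Integer> termFreqs = new HashMap<>();
        for (String token : tokens) {
            if (token.length() < minWordLen) {
                continue;
            }
            termFreqs.merge(token, 1, Integer::sum);
        }

        // Score only the words that pass the tf threshold and occur in the index.
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int tf = e.getValue();
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (tf < minTermFreq || df == 0) {
                continue;
            }
            // Assumed idf variant: log(numDocs / (df + 1)) + 1.
            double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
            scores.put(e.getKey(), tf * idf);
        }

        // Sort by descending score; break ties alphabetically for stable output.
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ranked.size() > maxTerms
                ? new ArrayList<>(ranked.subList(0, maxTerms))
                : ranked;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("lucene", 10);
        docFreqs.put("index", 50);
        List<String> tokens = Arrays.asList("lucene", "index", "lucene", "index", "the");
        System.out.println(topTerms(tokens, docFreqs, 100, 2, 4, 25));
    }
}
```

The thresholds are applied before any document-frequency lookup, which is the point of the heuristic: most words are discarded cheaply, so only a small set needs the more expensive index statistics.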
Modified: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java?rev=712922&r1=712921&r2=712922&view=diff
==============================================================================
--- lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java (original)
+++ lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java Mon Nov 10 18:35:46 2008
@@ -1,956 +1,956 @@
-/**
- * Copyright 2004-2005 The Apache Software Foundation.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.lucene.search.similar;
-
-import org.apache.lucene.util.PriorityQueue;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.TermFreqVector;
-import org.apache.lucene.search.BooleanClause;
-import org.apache.lucene.search.DefaultSimilarity;
-import org.apache.lucene.search.Similarity;
-import org.apache.lucene.search.TermQuery;
-import org.apache.lucene.search.BooleanQuery;
-import org.apache.lucene.search.IndexSearcher;
-import org.apache.lucene.search.Query;
-import org.apache.lucene.search.Hits;
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.Token;
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.analysis.standard.StandardAnalyzer;
-import org.apache.lucene.document.Document;
-
-import java.util.Set;
-import java.util.HashMap;
-import java.util.Map;
-import java.util.Collection;
-import java.util.Iterator;
-import java.io.IOException;
-import java.io.Reader;
-import java.io.File;
-import java.io.PrintStream;
-import java.io.StringReader;
-import java.io.FileReader;
-import java.io.InputStreamReader;
-import java.net.URL;
-import java.util.ArrayList;
-
-
-/**
- * Generate "more like this" similarity queries.
- * Based on this mail:
- * <code><pre>
- * Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
- * Term frequencies can be computed by re-tokenizing the text, which, for a single document,
- * is usually fast enough. But looking up the docFreq() of every term in the document is
- * probably too slow.
- *
- * You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
- * or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
- * in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
- * reduce the number of terms under consideration. Another heuristic is that terms with a
- * high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
- * number of characters, not selecting anything less than, e.g., six or seven characters.
- * With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
- * that do a pretty good job of characterizing a document.
- *
- * It all depends on what you're trying to do. If you're trying to eek out that last percent
- * of precision and recall regardless of computational difficulty so that you can win a TREC
- * competition, then the techniques I mention above are useless. But if you're trying to
- * provide a "more like this" button on a search results page that does a decent job and has
- * good performance, such techniques might be useful.
- *
- * An efficient, effective "more-like-this" query generator would be a great contribution, if
- * anyone's interested. I'd imagine that it would take a Reader or a String (the document's
- * text), analyzer Analyzer, and return a set of representative terms using heuristics like those
- * above. The frequency and length thresholds could be parameters, etc.
- *
- * Doug
- * </pre></code>
- *
- *
- * <p>
- * <h3>Initial Usage</h3>
- *
- * This class has lots of options to try to make it efficient and flexible.
- * See the body of {@link #main main()} below in the source for real code, or
- * if you want pseudo code, the simpliest possible usage is as follows. The bold
- * fragment is specific to this class.
- *
- * <code><pre>
- *
- * IndexReader ir = ...
- * IndexSearcher is = ...
- * <b>
- * MoreLikeThis mlt = new MoreLikeThis(ir);
- * Reader target = ... </b><em>// orig source of doc you want to find similarities to</em><b>
- * Query query = mlt.like( target);
- * </b>
- * Hits hits = is.search(query);
- * <em>// now the usual iteration thru 'hits' - the only thing to watch for is to make sure
- * you ignore the doc if it matches your 'target' document, as it should be similar to itself </em>
- *
- * </pre></code>
- *
- * Thus you:
- * <ol>
- * <li> do your normal, Lucene setup for searching,
- * <li> create a MoreLikeThis,
- * <li> get the text of the doc you want to find similaries to
- * <li> then call one of the like() calls to generate a similarity query
- * <li> call the searcher to find the similar docs
- * </ol>
- *
- * <h3>More Advanced Usage</h3>
- *
- * You may want to use {@link #setFieldNames setFieldNames(...)} so you can examine
- * multiple fields (e.g. body and title) for similarity.
- * <p>
- *
- * Depending on the size of your index and the size and makeup of your documents you
- * may want to call the other set methods to control how the similarity queries are
- * generated:
- * <ul>
- * <li> {@link #setMinTermFreq setMinTermFreq(...)}
- * <li> {@link #setMinDocFreq setMinDocFreq(...)}
- * <li> {@link #setMinWordLen setMinWordLen(...)}
- * <li> {@link #setMaxWordLen setMaxWordLen(...)}
- * <li> {@link #setMaxQueryTerms setMaxQueryTerms(...)}
- * <li> {@link #setMaxNumTokensParsed setMaxNumTokensParsed(...)}
- * <li> {@link #setStopWords setStopWord(...)}
- * </ul>
- *
- * <hr>
- * <pre>
- * Changes: Mark Harwood 29/02/04
- * Some bugfixing, some refactoring, some optimisation.
- * - bugfix: retrieveTerms(int docNum) was not working for indexes without a termvector -added missing code
- * - bugfix: No significant terms being created for fields with a termvector - because
- * was only counting one occurence per term/field pair in calculations(ie not including frequency info from TermVector)
- * - refactor: moved common code into isNoiseWord()
- * - optimise: when no termvector support available - used maxNumTermsParsed to limit amount of tokenization
- * </pre>
- *
- */
-public final class MoreLikeThis {
-
- /**
- * Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
- * @see #getMaxNumTokensParsed
- */
- public static final int DEFAULT_MAX_NUM_TOKENS_PARSED=5000;
-
-
- /**
- * Default analyzer to parse source doc with.
- * @see #getAnalyzer
- */
- public static final Analyzer DEFAULT_ANALYZER = new StandardAnalyzer();
-
- /**
- * Ignore terms with less than this frequency in the source doc.
- * @see #getMinTermFreq
- * @see #setMinTermFreq
- */
- public static final int DEFAULT_MIN_TERM_FREQ = 2;
-
- /**
- * Ignore words which do not occur in at least this many docs.
- * @see #getMinDocFreq
- * @see #setMinDocFreq
- */
- public static final int DEFAULT_MIN_DOC_FREQ = 5;
-
- /**
- * Boost terms in query based on score.
- * @see #isBoost
- * @see #setBoost
- */
- public static final boolean DEFAULT_BOOST = false;
-
- /**
- * Default field names. Null is used to specify that the field names should be looked
- * up at runtime from the provided reader.
- */
- public static final String[] DEFAULT_FIELD_NAMES = new String[] { "contents"};
-
- /**
- * Ignore words less than this length or if 0 then this has no effect.
- * @see #getMinWordLen
- * @see #setMinWordLen
- */
- public static final int DEFAULT_MIN_WORD_LENGTH = 0;
-
- /**
- * Ignore words greater than this length or if 0 then this has no effect.
- * @see #getMaxWordLen
- * @see #setMaxWordLen
- */
- public static final int DEFAULT_MAX_WORD_LENGTH = 0;
-
- /**
- * Default set of stopwords.
- * If null means to allow stop words.
- *
- * @see #setStopWords
- * @see #getStopWords
- */
- public static final Set DEFAULT_STOP_WORDS = null;
-
- /**
- * Current set of stop words.
- */
- private Set stopWords = DEFAULT_STOP_WORDS;
-
- /**
- * Return a Query with no more than this many terms.
- *
- * @see BooleanQuery#getMaxClauseCount
- * @see #getMaxQueryTerms
- * @see #setMaxQueryTerms
- */
- public static final int DEFAULT_MAX_QUERY_TERMS = 25;
-
- /**
- * Analyzer that will be used to parse the doc.
- */
- private Analyzer analyzer = DEFAULT_ANALYZER;
-
- /**
- * Ignore words less freqent that this.
- */
- private int minTermFreq = DEFAULT_MIN_TERM_FREQ;
-
- /**
- * Ignore words which do not occur in at least this many docs.
- */
- private int minDocFreq = DEFAULT_MIN_DOC_FREQ;
-
- /**
- * Should we apply a boost to the Query based on the scores?
- */
- private boolean boost = DEFAULT_BOOST;
-
- /**
- * Field name we'll analyze.
- */
- private String[] fieldNames = DEFAULT_FIELD_NAMES;
-
- /**
- * The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
- */
- private int maxNumTokensParsed=DEFAULT_MAX_NUM_TOKENS_PARSED;
-
-
-
- /**
- * Ignore words if less than this len.
- */
- private int minWordLen = DEFAULT_MIN_WORD_LENGTH;
-
- /**
- * Ignore words if greater than this len.
- */
- private int maxWordLen = DEFAULT_MAX_WORD_LENGTH;
-
- /**
- * Don't return a query longer than this.
- */
- private int maxQueryTerms = DEFAULT_MAX_QUERY_TERMS;
-
- /**
- * For idf() calculations.
- */
- private Similarity similarity;// = new DefaultSimilarity();
-
- /**
- * IndexReader to use
- */
- private final IndexReader ir;
-
- /**
- * Constructor requiring an IndexReader.
- */
- public MoreLikeThis(IndexReader ir) {
- this(ir, new DefaultSimilarity());
- }
-
- public MoreLikeThis(IndexReader ir, Similarity sim){
- this.ir = ir;
- this.similarity = sim;
- }
-
-
- public Similarity getSimilarity() {
- return similarity;
- }
-
- public void setSimilarity(Similarity similarity) {
- this.similarity = similarity;
- }
-
- /**
- * Returns an analyzer that will be used to parse source doc with. The default analyzer
- * is the {@link #DEFAULT_ANALYZER}.
- *
- * @return the analyzer that will be used to parse source doc with.
- * @see #DEFAULT_ANALYZER
- */
- public Analyzer getAnalyzer() {
- return analyzer;
- }
-
- /**
- * Sets the analyzer to use. An analyzer is not required for generating a query with the
- * {@link #like(int)} method, all other 'like' methods require an analyzer.
- *
- * @param analyzer the analyzer to use to tokenize text.
- */
- public void setAnalyzer(Analyzer analyzer) {
- this.analyzer = analyzer;
- }
-
- /**
- * Returns the frequency below which terms will be ignored in the source doc. The default
- * frequency is the {@link #DEFAULT_MIN_TERM_FREQ}.
- *
- * @return the frequency below which terms will be ignored in the source doc.
- */
- public int getMinTermFreq() {
- return minTermFreq;
- }
-
- /**
- * Sets the frequency below which terms will be ignored in the source doc.
- *
- * @param minTermFreq the frequency below which terms will be ignored in the source doc.
- */
- public void setMinTermFreq(int minTermFreq) {
- this.minTermFreq = minTermFreq;
- }
-
- /**
- * Returns the frequency at which words will be ignored which do not occur in at least this
- * many docs. The default frequency is {@link #DEFAULT_MIN_DOC_FREQ}.
- *
- * @return the frequency at which words will be ignored which do not occur in at least this
- * many docs.
- */
- public int getMinDocFreq() {
- return minDocFreq;
- }
-
- /**
- * Sets the frequency at which words will be ignored which do not occur in at least this
- * many docs.
- *
- * @param minDocFreq the frequency at which words will be ignored which do not occur in at
- * least this many docs.
- */
- public void setMinDocFreq(int minDocFreq) {
- this.minDocFreq = minDocFreq;
- }
-
- /**
- * Returns whether to boost terms in query based on "score" or not. The default is
- * {@link #DEFAULT_BOOST}.
- *
- * @return whether to boost terms in query based on "score" or not.
- * @see #setBoost
- */
- public boolean isBoost() {
- return boost;
- }
-
- /**
- * Sets whether to boost terms in query based on "score" or not.
- *
- * @param boost true to boost terms in query based on "score", false otherwise.
- * @see #isBoost
- */
- public void setBoost(boolean boost) {
- this.boost = boost;
- }
-
- /**
- * Returns the field names that will be used when generating the 'More Like This' query.
- * The default field names that will be used is {@link #DEFAULT_FIELD_NAMES}.
- *
- * @return the field names that will be used when generating the 'More Like This' query.
- */
- public String[] getFieldNames() {
- return fieldNames;
- }
-
- /**
- * Sets the field names that will be used when generating the 'More Like This' query.
- * Set this to null for the field names to be determined at runtime from the IndexReader
- * provided in the constructor.
- *
- * @param fieldNames the field names that will be used when generating the 'More Like This'
- * query.
- */
- public void setFieldNames(String[] fieldNames) {
- this.fieldNames = fieldNames;
- }
-
- /**
- * Returns the minimum word length below which words will be ignored. Set this to 0 for no
- * minimum word length. The default is {@link #DEFAULT_MIN_WORD_LENGTH}.
- *
- * @return the minimum word length below which words will be ignored.
- */
- public int getMinWordLen() {
- return minWordLen;
- }
-
- /**
- * Sets the minimum word length below which words will be ignored.
- *
- * @param minWordLen the minimum word length below which words will be ignored.
- */
- public void setMinWordLen(int minWordLen) {
- this.minWordLen = minWordLen;
- }
-
- /**
- * Returns the maximum word length above which words will be ignored. Set this to 0 for no
- * maximum word length. The default is {@link #DEFAULT_MAX_WORD_LENGTH}.
- *
- * @return the maximum word length above which words will be ignored.
- */
- public int getMaxWordLen() {
- return maxWordLen;
- }
-
- /**
- * Sets the maximum word length above which words will be ignored.
- *
- * @param maxWordLen the maximum word length above which words will be ignored.
- */
- public void setMaxWordLen(int maxWordLen) {
- this.maxWordLen = maxWordLen;
- }
-
- /**
- * Set the set of stopwords.
- * Any word in this set is considered "uninteresting" and ignored.
- * Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as
- * for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
- *
- * @param stopWords set of stopwords, if null it means to allow stop words
- *
- * @see org.apache.lucene.analysis.StopFilter#makeStopSet StopFilter.makeStopSet()
- * @see #getStopWords
- */
- public void setStopWords(Set stopWords) {
- this.stopWords = stopWords;
- }
-
- /**
- * Get the current stop words being used.
- * @see #setStopWords
- */
- public Set getStopWords() {
- return stopWords;
- }
-
-
- /**
- * Returns the maximum number of query terms that will be included in any generated query.
- * The default is {@link #DEFAULT_MAX_QUERY_TERMS}.
- *
- * @return the maximum number of query terms that will be included in any generated query.
- */
- public int getMaxQueryTerms() {
- return maxQueryTerms;
- }
-
- /**
- * Sets the maximum number of query terms that will be included in any generated query.
- *
- * @param maxQueryTerms the maximum number of query terms that will be included in any
- * generated query.
- */
- public void setMaxQueryTerms(int maxQueryTerms) {
- this.maxQueryTerms = maxQueryTerms;
- }
-
- /**
- * @return The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
- * @see #DEFAULT_MAX_NUM_TOKENS_PARSED
- */
- public int getMaxNumTokensParsed()
- {
- return maxNumTokensParsed;
- }
-
- /**
- * @param i The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
- */
- public void setMaxNumTokensParsed(int i)
- {
- maxNumTokensParsed = i;
- }
-
-
-
-
- /**
- * Return a query that will return docs like the passed lucene document ID.
- *
- * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
- * @return a query that will return docs like the passed lucene document ID.
- */
- public Query like(int docNum) throws IOException {
- if (fieldNames == null) {
- // gather list of valid fields from lucene
- Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
- fieldNames = (String[]) fields.toArray(new String[fields.size()]);
- }
-
- return createQuery(retrieveTerms(docNum));
- }
-
- /**
- * Return a query that will return docs like the passed file.
- *
- * @return a query that will return docs like the passed file.
- */
- public Query like(File f) throws IOException {
- if (fieldNames == null) {
- // gather list of valid fields from lucene
- Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
- fieldNames = (String[]) fields.toArray(new String[fields.size()]);
- }
-
- return like(new FileReader(f));
- }
-
- /**
- * Return a query that will return docs like the passed URL.
- *
- * @return a query that will return docs like the passed URL.
- */
- public Query like(URL u) throws IOException {
- return like(new InputStreamReader(u.openConnection().getInputStream()));
- }
-
- /**
- * Return a query that will return docs like the passed stream.
- *
- * @return a query that will return docs like the passed stream.
- */
- public Query like(java.io.InputStream is) throws IOException {
- return like(new InputStreamReader(is));
- }
-
- /**
- * Return a query that will return docs like the passed Reader.
- *
- * @return a query that will return docs like the passed Reader.
- */
- public Query like(Reader r) throws IOException {
- return createQuery(retrieveTerms(r));
- }
-
- /**
- * Create the More like query from a PriorityQueue
- */
- private Query createQuery(PriorityQueue q) {
- BooleanQuery query = new BooleanQuery();
- Object cur;
- int qterms = 0;
- float bestScore = 0;
-
- while (((cur = q.pop()) != null)) {
- Object[] ar = (Object[]) cur;
- TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
-
- if (boost) {
- if (qterms == 0) {
- bestScore = ((Float) ar[2]).floatValue();
- }
- float myScore = ((Float) ar[2]).floatValue();
-
- tq.setBoost(myScore / bestScore);
- }
-
- try {
- query.add(tq, BooleanClause.Occur.SHOULD);
- }
- catch (BooleanQuery.TooManyClauses ignore) {
- break;
- }
-
- qterms++;
- if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
- break;
- }
- }
-
- return query;
- }
-
- /**
- * Create a PriorityQueue from a word->tf map.
- *
- * @param words a map of words keyed on the word(String) with Int objects as the values.
- */
- private PriorityQueue createQueue(Map words) throws IOException {
- // have collected all words in doc and their freqs
- int numDocs = ir.numDocs();
- FreqQ res = new FreqQ(words.size()); // will order words by score
-
- Iterator it = words.keySet().iterator();
- while (it.hasNext()) { // for every word
- String word = (String) it.next();
-
- int tf = ((Int) words.get(word)).x; // term freq in the source doc
- if (minTermFreq > 0 && tf < minTermFreq) {
- continue; // filter out words that don't occur enough times in the source
- }
-
- // go through all the fields and find the largest document frequency
- String topField = fieldNames[0];
- int docFreq = 0;
- for (int i = 0; i < fieldNames.length; i++) {
- int freq = ir.docFreq(new Term(fieldNames[i], word));
- topField = (freq > docFreq) ? fieldNames[i] : topField;
- docFreq = (freq > docFreq) ? freq : docFreq;
- }
-
- if (minDocFreq > 0 && docFreq < minDocFreq) {
- continue; // filter out words that don't occur in enough docs
- }
-
- if (docFreq == 0) {
- continue; // index update problem?
- }
-
- float idf = similarity.idf(docFreq, numDocs);
- float score = tf * idf;
-
- // only really need 1st 3 entries, other ones are for troubleshooting
- res.insert(new Object[]{word, // the word
- topField, // the top field
- new Float(score), // overall score
- new Float(idf), // idf
- new Integer(docFreq), // freq in all docs
- new Integer(tf)
- });
- }
- return res;
- }
-
- /**
- * Describe the parameters that control how the "more like this" query is formed.
- */
- public String describeParams() {
- StringBuffer sb = new StringBuffer();
- sb.append("\t" + "maxQueryTerms : " + maxQueryTerms + "\n");
- sb.append("\t" + "minWordLen : " + minWordLen + "\n");
- sb.append("\t" + "maxWordLen : " + maxWordLen + "\n");
- sb.append("\t" + "fieldNames : ");
- String delim = "";
- for (int i = 0; i < fieldNames.length; i++) {
- String fieldName = fieldNames[i];
- sb.append(delim).append(fieldName);
- delim = ", ";
- }
- sb.append("\n");
- sb.append("\t" + "boost : " + boost + "\n");
- sb.append("\t" + "minTermFreq : " + minTermFreq + "\n");
- sb.append("\t" + "minDocFreq : " + minDocFreq + "\n");
- return sb.toString();
- }
-
- /**
- * Test driver.
- * Pass in "-i INDEX" and then either "-fn FILE" or "-url URL".
- */
- public static void main(String[] a) throws Throwable {
- String indexName = "localhost_index";
- String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
- URL url = null;
- for (int i = 0; i < a.length; i++) {
- if (a[i].equals("-i")) {
- indexName = a[++i];
- }
- else if (a[i].equals("-f")) {
- fn = a[++i];
- }
- else if (a[i].equals("-url")) {
- url = new URL(a[++i]);
- }
- }
-
- PrintStream o = System.out;
- IndexReader r = IndexReader.open(indexName);
- o.println("Open index " + indexName + " which has " + r.numDocs() + " docs");
-
- MoreLikeThis mlt = new MoreLikeThis(r);
-
- o.println("Query generation parameters:");
- o.println(mlt.describeParams());
- o.println();
-
- Query query = null;
- if (url != null) {
- o.println("Parsing URL: " + url);
- query = mlt.like(url);
- }
- else if (fn != null) {
- o.println("Parsing file: " + fn);
- query = mlt.like(new File(fn));
- }
-
- o.println("q: " + query);
- o.println();
- IndexSearcher searcher = new IndexSearcher(indexName);
-
- Hits hits = searcher.search(query);
- int len = hits.length();
- o.println("found: " + len + " documents matching");
- o.println();
- for (int i = 0; i < Math.min(25, len); i++) {
- Document d = hits.doc(i);
- String summary = d.get( "summary");
- o.println("score : " + hits.score(i));
- o.println("url : " + d.get("url"));
- o.println("\ttitle : " + d.get("title"));
- if ( summary != null)
- o.println("\tsummary: " + d.get("summary"));
- o.println();
- }
- }
-
- /**
- * Find words for a more-like-this query former.
- *
- * @param docNum the id of the lucene document from which to find terms
- */
- public PriorityQueue retrieveTerms(int docNum) throws IOException {
- Map termFreqMap = new HashMap();
- for (int i = 0; i < fieldNames.length; i++) {
- String fieldName = fieldNames[i];
- TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);
-
- // field does not store term vector info
- if (vector == null) {
- Document d=ir.document(docNum);
- String text[]=d.getValues(fieldName);
- if(text!=null)
- {
- for (int j = 0; j < text.length; j++) {
- addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
- }
- }
- }
- else {
- addTermFrequencies(termFreqMap, vector);
- }
-
- }
-
- return createQueue(termFreqMap);
- }
-
- /**
- * Adds terms and frequencies found in vector into the Map termFreqMap
- * @param termFreqMap a Map of terms and their frequencies
- * @param vector List of terms and their frequencies for a doc/field
- */
- private void addTermFrequencies(Map termFreqMap, TermFreqVector vector)
- {
- String[] terms = vector.getTerms();
- int freqs[]=vector.getTermFrequencies();
- for (int j = 0; j < terms.length; j++) {
- String term = terms[j];
-
- if(isNoiseWord(term)){
- continue;
- }
- // increment frequency
- Int cnt = (Int) termFreqMap.get(term);
- if (cnt == null) {
- cnt=new Int();
- termFreqMap.put(term, cnt);
- cnt.x=freqs[j];
- }
- else {
- cnt.x+=freqs[j];
- }
- }
- }
- /**
- * Adds term frequencies found by tokenizing text from reader into the Map words
- * @param r a source of text to be tokenized
- * @param termFreqMap a Map of terms and their frequencies
- * @param fieldName Used by analyzer for any special per-field analysis
- */
- private void addTermFrequencies(Reader r, Map termFreqMap, String fieldName)
- throws IOException
- {
- TokenStream ts = analyzer.tokenStream(fieldName, r);
- int tokenCount=0;
- // for every token
- final Token reusableToken = new Token();
- for (Token nextToken = ts.next(reusableToken); nextToken != null; nextToken = ts.next(reusableToken)) {
- String word = nextToken.term();
- tokenCount++;
- if(tokenCount>maxNumTokensParsed)
- {
- break;
- }
- if(isNoiseWord(word)){
- continue;
- }
-
- // increment frequency
- Int cnt = (Int) termFreqMap.get(word);
- if (cnt == null) {
- termFreqMap.put(word, new Int());
- }
- else {
- cnt.x++;
- }
- }
- }
-
-
- /** determines if the passed term is likely to be of interest in "more like" comparisons
- *
- * @param term The word being considered
- * @return true if should be ignored, false if should be used in further analysis
- */
- private boolean isNoiseWord(String term)
- {
- int len = term.length();
- if (minWordLen > 0 && len < minWordLen) {
- return true;
- }
- if (maxWordLen > 0 && len > maxWordLen) {
- return true;
- }
- if (stopWords != null && stopWords.contains( term)) {
- return true;
- }
- return false;
- }
-
-
- /**
- * Find words for a more-like-this query former.
- * The result is a priority queue of arrays with one entry for <b>every word</b> in the document.
- * Each array has 6 elements.
- * The elements are:
- * <ol>
- * <li> The word (String)
- * <li> The top field that this word comes from (String)
- * <li> The score for this word (Float)
- * <li> The IDF value (Float)
- * <li> The frequency of this word in the index (Integer)
- * <li> The frequency of this word in the source document (Integer)
- * </ol>
- * This is a somewhat "advanced" routine, and in general only the 1st entry in the array is of interest.
- * This method is exposed so that you can identify the "interesting words" in a document.
- * For an easier method to call see {@link #retrieveInterestingTerms retrieveInterestingTerms()}.
- *
- * @param r the reader that has the content of the document
- * @return the most interesting words in the document ordered by score, with the highest scoring, or best entry, first
- *
- * @see #retrieveInterestingTerms
- */
- public PriorityQueue retrieveTerms(Reader r) throws IOException {
- Map words = new HashMap();
- for (int i = 0; i < fieldNames.length; i++) {
- String fieldName = fieldNames[i];
- addTermFrequencies(r, words, fieldName);
- }
- return createQueue(words);
- }
-
- /**
- * @see #retrieveInterestingTerms(java.io.Reader)
- */
- public String [] retrieveInterestingTerms(int docNum) throws IOException{
- ArrayList al = new ArrayList( maxQueryTerms);
- PriorityQueue pq = retrieveTerms(docNum);
- Object cur;
- int lim = maxQueryTerms; // have to be careful, retrieveTerms returns all words but that's probably not useful to our caller...
- // we just want to return the top words
- while (((cur = pq.pop()) != null) && lim-- > 0) {
- Object[] ar = (Object[]) cur;
- al.add( ar[ 0]); // the 1st entry is the interesting word
- }
- String[] res = new String[ al.size()];
- return (String[]) al.toArray( res);
- }
-
- /**
- * Convenience routine to make it easy to return the most interesting words in a document.
- * More advanced users will call {@link #retrieveTerms(java.io.Reader) retrieveTerms()} directly.
- * @param r the source document
- * @return the most interesting words in the document
- *
- * @see #retrieveTerms(java.io.Reader)
- * @see #setMaxQueryTerms
- */
- public String[] retrieveInterestingTerms( Reader r) throws IOException {
- ArrayList al = new ArrayList( maxQueryTerms);
- PriorityQueue pq = retrieveTerms( r);
- Object cur;
- int lim = maxQueryTerms; // have to be careful, retrieveTerms returns all words but that's probably not useful to our caller...
- // we just want to return the top words
- while (((cur = pq.pop()) != null) && lim-- > 0) {
- Object[] ar = (Object[]) cur;
- al.add( ar[ 0]); // the 1st entry is the interesting word
- }
- String[] res = new String[ al.size()];
- return (String[]) al.toArray( res);
- }
-
- /**
- * PriorityQueue that orders words by score.
- */
- private static class FreqQ extends PriorityQueue {
- FreqQ (int s) {
- initialize(s);
- }
-
- protected boolean lessThan(Object a, Object b) {
- Object[] aa = (Object[]) a;
- Object[] bb = (Object[]) b;
- Float fa = (Float) aa[2];
- Float fb = (Float) bb[2];
- return fa.floatValue() > fb.floatValue();
- }
- }
-
- /**
- * Use for frequencies and to avoid renewing Integers.
- */
- private static class Int {
- int x;
-
- Int() {
- x = 1;
- }
- }
-
-
-}
+/**
+ * Copyright 2004-2005 The Apache Software Foundation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.similar;
+
+import org.apache.lucene.util.PriorityQueue;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermFreqVector;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.DefaultSimilarity;
+import org.apache.lucene.search.Similarity;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.Hits;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.Token;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.analysis.standard.StandardAnalyzer;
+import org.apache.lucene.document.Document;
+
+import java.util.Set;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Collection;
+import java.util.Iterator;
+import java.io.IOException;
+import java.io.Reader;
+import java.io.File;
+import java.io.PrintStream;
+import java.io.StringReader;
+import java.io.FileReader;
+import java.io.InputStreamReader;
+import java.net.URL;
+import java.util.ArrayList;
+
+
+/**
+ * Generate "more like this" similarity queries.
+ * Based on this mail:
+ * <code><pre>
+ * Lucene does let you access the document frequency of terms, with IndexReader.docFreq().
+ * Term frequencies can be computed by re-tokenizing the text, which, for a single document,
+ * is usually fast enough. But looking up the docFreq() of every term in the document is
+ * probably too slow.
+ *
+ * You can use some heuristics to prune the set of terms, to avoid calling docFreq() too much,
+ * or at all. Since you're trying to maximize a tf*idf score, you're probably most interested
+ * in terms with a high tf. Choosing a tf threshold even as low as two or three will radically
+ * reduce the number of terms under consideration. Another heuristic is that terms with a
+ * high idf (i.e., a low df) tend to be longer. So you could threshold the terms by the
+ * number of characters, not selecting anything less than, e.g., six or seven characters.
+ * With these sorts of heuristics you can usually find small set of, e.g., ten or fewer terms
+ * that do a pretty good job of characterizing a document.
+ *
+ * It all depends on what you're trying to do. If you're trying to eek out that last percent
+ * of precision and recall regardless of computational difficulty so that you can win a TREC
+ * competition, then the techniques I mention above are useless. But if you're trying to
+ * provide a "more like this" button on a search results page that does a decent job and has
+ * good performance, such techniques might be useful.
+ *
+ * An efficient, effective "more-like-this" query generator would be a great contribution, if
+ * anyone's interested. I'd imagine that it would take a Reader or a String (the document's
+ * text), analyzer Analyzer, and return a set of representative terms using heuristics like those
+ * above. The frequency and length thresholds could be parameters, etc.
+ *
+ * Doug
+ * </pre></code>
+ *
+ *
+ * <p>
+ * <h3>Initial Usage</h3>
+ *
+ * This class has lots of options to try to make it efficient and flexible.
+ * See the body of {@link #main main()} below in the source for real code, or
+ * if you want pseudo code, the simplest possible usage is as follows. The bold
+ * fragment is specific to this class.
+ *
+ * <code><pre>
+ *
+ * IndexReader ir = ...
+ * IndexSearcher is = ...
+ * <b>
+ * MoreLikeThis mlt = new MoreLikeThis(ir);
+ * Reader target = ... </b><em>// orig source of doc you want to find similarities to</em><b>
+ * Query query = mlt.like( target);
+ * </b>
+ * Hits hits = is.search(query);
+ * <em>// now the usual iteration thru 'hits' - the only thing to watch for is to make sure
+ * you ignore the doc if it matches your 'target' document, as it should be similar to itself </em>
+ *
+ * </pre></code>
+ *
+ * Thus you:
+ * <ol>
+ * <li> do your normal, Lucene setup for searching,
+ * <li> create a MoreLikeThis,
+ * <li> get the text of the doc you want to find similarities to
+ * <li> then call one of the like() calls to generate a similarity query
+ * <li> call the searcher to find the similar docs
+ * </ol>
+ *
+ * <h3>More Advanced Usage</h3>
+ *
+ * You may want to use {@link #setFieldNames setFieldNames(...)} so you can examine
+ * multiple fields (e.g. body and title) for similarity.
+ * <p>
+ *
+ * Depending on the size of your index and the size and makeup of your documents you
+ * may want to call the other set methods to control how the similarity queries are
+ * generated:
+ * <ul>
+ * <li> {@link #setMinTermFreq setMinTermFreq(...)}
+ * <li> {@link #setMinDocFreq setMinDocFreq(...)}
+ * <li> {@link #setMinWordLen setMinWordLen(...)}
+ * <li> {@link #setMaxWordLen setMaxWordLen(...)}
+ * <li> {@link #setMaxQueryTerms setMaxQueryTerms(...)}
+ * <li> {@link #setMaxNumTokensParsed setMaxNumTokensParsed(...)}
+ * <li> {@link #setStopWords setStopWords(...)}
+ * </ul>
+ *
+ * <hr>
+ * <pre>
+ * Changes: Mark Harwood 29/02/04
+ * Some bugfixing, some refactoring, some optimisation.
+ * - bugfix: retrieveTerms(int docNum) was not working for indexes without a termvector -added missing code
+ * - bugfix: No significant terms being created for fields with a termvector - because
+ * was only counting one occurrence per term/field pair in calculations (i.e. not including frequency info from TermVector)
+ * - refactor: moved common code into isNoiseWord()
+ * - optimise: when no termvector support available - used maxNumTokensParsed to limit amount of tokenization
+ * </pre>
+ *
+ */
+public final class MoreLikeThis {
+
+ /**
+ * Default maximum number of tokens to parse in each example doc field that is not stored with TermVector support.
+ * @see #getMaxNumTokensParsed
+ */
+ public static final int DEFAULT_MAX_NUM_TOKENS_PARSED=5000;
+
+
+ /**
+ * Default analyzer to parse source doc with.
+ * @see #getAnalyzer
+ */
+ public static final Analyzer DEFAULT_ANALYZER = new StandardAnalyzer();
+
+ /**
+ * Ignore terms with less than this frequency in the source doc.
+ * @see #getMinTermFreq
+ * @see #setMinTermFreq
+ */
+ public static final int DEFAULT_MIN_TERM_FREQ = 2;
+
+ /**
+ * Ignore words which do not occur in at least this many docs.
+ * @see #getMinDocFreq
+ * @see #setMinDocFreq
+ */
+ public static final int DEFAULT_MIN_DOC_FREQ = 5;
+
+ /**
+ * Boost terms in query based on score.
+ * @see #isBoost
+ * @see #setBoost
+ */
+ public static final boolean DEFAULT_BOOST = false;
+
+ /**
+ * Default field names. Null is used to specify that the field names should be looked
+ * up at runtime from the provided reader.
+ */
+ public static final String[] DEFAULT_FIELD_NAMES = new String[] { "contents"};
+
+ /**
+ * Ignore words less than this length or if 0 then this has no effect.
+ * @see #getMinWordLen
+ * @see #setMinWordLen
+ */
+ public static final int DEFAULT_MIN_WORD_LENGTH = 0;
+
+ /**
+ * Ignore words greater than this length or if 0 then this has no effect.
+ * @see #getMaxWordLen
+ * @see #setMaxWordLen
+ */
+ public static final int DEFAULT_MAX_WORD_LENGTH = 0;
+
+ /**
+ * Default set of stopwords.
+ * If null means to allow stop words.
+ *
+ * @see #setStopWords
+ * @see #getStopWords
+ */
+ public static final Set DEFAULT_STOP_WORDS = null;
+
+ /**
+ * Current set of stop words.
+ */
+ private Set stopWords = DEFAULT_STOP_WORDS;
+
+ /**
+ * Return a Query with no more than this many terms.
+ *
+ * @see BooleanQuery#getMaxClauseCount
+ * @see #getMaxQueryTerms
+ * @see #setMaxQueryTerms
+ */
+ public static final int DEFAULT_MAX_QUERY_TERMS = 25;
+
+ /**
+ * Analyzer that will be used to parse the doc.
+ */
+ private Analyzer analyzer = DEFAULT_ANALYZER;
+
+ /**
+ * Ignore words less frequent than this.
+ */
+ private int minTermFreq = DEFAULT_MIN_TERM_FREQ;
+
+ /**
+ * Ignore words which do not occur in at least this many docs.
+ */
+ private int minDocFreq = DEFAULT_MIN_DOC_FREQ;
+
+ /**
+ * Should we apply a boost to the Query based on the scores?
+ */
+ private boolean boost = DEFAULT_BOOST;
+
+ /**
+ * Field names we'll analyze.
+ */
+ private String[] fieldNames = DEFAULT_FIELD_NAMES;
+
+ /**
+ * The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
+ */
+ private int maxNumTokensParsed=DEFAULT_MAX_NUM_TOKENS_PARSED;
+
+
+
+ /**
+ * Ignore words if less than this len.
+ */
+ private int minWordLen = DEFAULT_MIN_WORD_LENGTH;
+
+ /**
+ * Ignore words if greater than this len.
+ */
+ private int maxWordLen = DEFAULT_MAX_WORD_LENGTH;
+
+ /**
+ * Don't return a query longer than this.
+ */
+ private int maxQueryTerms = DEFAULT_MAX_QUERY_TERMS;
+
+ /**
+ * For idf() calculations.
+ */
+ private Similarity similarity;// = new DefaultSimilarity();
+
+ /**
+ * IndexReader to use
+ */
+ private final IndexReader ir;
+
+ /**
+ * Constructor requiring an IndexReader.
+ */
+ public MoreLikeThis(IndexReader ir) {
+ this(ir, new DefaultSimilarity());
+ }
+
+ public MoreLikeThis(IndexReader ir, Similarity sim){
+ this.ir = ir;
+ this.similarity = sim;
+ }
+
+
+ public Similarity getSimilarity() {
+ return similarity;
+ }
+
+ public void setSimilarity(Similarity similarity) {
+ this.similarity = similarity;
+ }
+
+ /**
+ * Returns an analyzer that will be used to parse source doc with. The default analyzer
+ * is the {@link #DEFAULT_ANALYZER}.
+ *
+ * @return the analyzer that will be used to parse source doc with.
+ * @see #DEFAULT_ANALYZER
+ */
+ public Analyzer getAnalyzer() {
+ return analyzer;
+ }
+
+ /**
+ * Sets the analyzer to use. An analyzer is not required for generating a query with the
+ * {@link #like(int)} method, all other 'like' methods require an analyzer.
+ *
+ * @param analyzer the analyzer to use to tokenize text.
+ */
+ public void setAnalyzer(Analyzer analyzer) {
+ this.analyzer = analyzer;
+ }
+
+ /**
+ * Returns the frequency below which terms will be ignored in the source doc. The default
+ * frequency is the {@link #DEFAULT_MIN_TERM_FREQ}.
+ *
+ * @return the frequency below which terms will be ignored in the source doc.
+ */
+ public int getMinTermFreq() {
+ return minTermFreq;
+ }
+
+ /**
+ * Sets the frequency below which terms will be ignored in the source doc.
+ *
+ * @param minTermFreq the frequency below which terms will be ignored in the source doc.
+ */
+ public void setMinTermFreq(int minTermFreq) {
+ this.minTermFreq = minTermFreq;
+ }
+
+ /**
+ * Returns the minimum document frequency: words that occur in fewer than this many
+ * docs will be ignored. The default frequency is {@link #DEFAULT_MIN_DOC_FREQ}.
+ *
+ * @return the minimum number of docs a word must occur in; words below this are
+ * ignored.
+ */
+ public int getMinDocFreq() {
+ return minDocFreq;
+ }
+
+ /**
+ * Sets the minimum document frequency: words that occur in fewer than this many
+ * docs will be ignored.
+ *
+ * @param minDocFreq the minimum number of docs a word must occur in; words below this
+ * are ignored.
+ */
+ public void setMinDocFreq(int minDocFreq) {
+ this.minDocFreq = minDocFreq;
+ }
+
+ /**
+ * Returns whether to boost terms in query based on "score" or not. The default is
+ * {@link #DEFAULT_BOOST}.
+ *
+ * @return whether to boost terms in query based on "score" or not.
+ * @see #setBoost
+ */
+ public boolean isBoost() {
+ return boost;
+ }
+
+ /**
+ * Sets whether to boost terms in query based on "score" or not.
+ *
+ * @param boost true to boost terms in query based on "score", false otherwise.
+ * @see #isBoost
+ */
+ public void setBoost(boolean boost) {
+ this.boost = boost;
+ }
+
+ /**
+ * Returns the field names that will be used when generating the 'More Like This' query.
+ * The default field names that will be used is {@link #DEFAULT_FIELD_NAMES}.
+ *
+ * @return the field names that will be used when generating the 'More Like This' query.
+ */
+ public String[] getFieldNames() {
+ return fieldNames;
+ }
+
+ /**
+ * Sets the field names that will be used when generating the 'More Like This' query.
+ * Set this to null for the field names to be determined at runtime from the IndexReader
+ * provided in the constructor.
+ *
+ * @param fieldNames the field names that will be used when generating the 'More Like This'
+ * query.
+ */
+ public void setFieldNames(String[] fieldNames) {
+ this.fieldNames = fieldNames;
+ }
+
+ /**
+ * Returns the minimum word length below which words will be ignored. Set this to 0 for no
+ * minimum word length. The default is {@link #DEFAULT_MIN_WORD_LENGTH}.
+ *
+ * @return the minimum word length below which words will be ignored.
+ */
+ public int getMinWordLen() {
+ return minWordLen;
+ }
+
+ /**
+ * Sets the minimum word length below which words will be ignored.
+ *
+ * @param minWordLen the minimum word length below which words will be ignored.
+ */
+ public void setMinWordLen(int minWordLen) {
+ this.minWordLen = minWordLen;
+ }
+
+ /**
+ * Returns the maximum word length above which words will be ignored. Set this to 0 for no
+ * maximum word length. The default is {@link #DEFAULT_MAX_WORD_LENGTH}.
+ *
+ * @return the maximum word length above which words will be ignored.
+ */
+ public int getMaxWordLen() {
+ return maxWordLen;
+ }
+
+ /**
+ * Sets the maximum word length above which words will be ignored.
+ *
+ * @param maxWordLen the maximum word length above which words will be ignored.
+ */
+ public void setMaxWordLen(int maxWordLen) {
+ this.maxWordLen = maxWordLen;
+ }
+
+ /**
+ * Set the set of stopwords.
+ * Any word in this set is considered "uninteresting" and ignored.
+ * Even if your Analyzer allows stopwords, you might want to tell the MoreLikeThis code to ignore them, as
+ * for the purposes of document similarity it seems reasonable to assume that "a stop word is never interesting".
+ *
+ * @param stopWords set of stopwords, if null it means to allow stop words
+ *
+ * @see org.apache.lucene.analysis.StopFilter#makeStopSet StopFilter.makeStopSet()
+ * @see #getStopWords
+ */
+ public void setStopWords(Set stopWords) {
+ this.stopWords = stopWords;
+ }
+
+ /**
+ * Get the current stop words being used.
+ * @see #setStopWords
+ */
+ public Set getStopWords() {
+ return stopWords;
+ }
+
+
+ /**
+ * Returns the maximum number of query terms that will be included in any generated query.
+ * The default is {@link #DEFAULT_MAX_QUERY_TERMS}.
+ *
+ * @return the maximum number of query terms that will be included in any generated query.
+ */
+ public int getMaxQueryTerms() {
+ return maxQueryTerms;
+ }
+
+ /**
+ * Sets the maximum number of query terms that will be included in any generated query.
+ *
+ * @param maxQueryTerms the maximum number of query terms that will be included in any
+ * generated query.
+ */
+ public void setMaxQueryTerms(int maxQueryTerms) {
+ this.maxQueryTerms = maxQueryTerms;
+ }
+
+ /**
+ * @return The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
+ * @see #DEFAULT_MAX_NUM_TOKENS_PARSED
+ */
+ public int getMaxNumTokensParsed()
+ {
+ return maxNumTokensParsed;
+ }
+
+ /**
+ * @param i The maximum number of tokens to parse in each example doc field that is not stored with TermVector support
+ */
+ public void setMaxNumTokensParsed(int i)
+ {
+ maxNumTokensParsed = i;
+ }
+
+
+
+
+ /**
+ * Return a query that will return docs like the passed lucene document ID.
+ *
+ * @param docNum the documentID of the lucene doc to generate the 'More Like This' query for.
+ * @return a query that will return docs like the passed lucene document ID.
+ */
+ public Query like(int docNum) throws IOException {
+ if (fieldNames == null) {
+ // gather list of valid fields from lucene
+ Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
+ fieldNames = (String[]) fields.toArray(new String[fields.size()]);
+ }
+
+ return createQuery(retrieveTerms(docNum));
+ }
+
+ /**
+ * Return a query that will return docs like the passed file.
+ *
+ * @return a query that will return docs like the passed file.
+ */
+ public Query like(File f) throws IOException {
+ if (fieldNames == null) {
+ // gather list of valid fields from lucene
+ Collection fields = ir.getFieldNames( IndexReader.FieldOption.INDEXED);
+ fieldNames = (String[]) fields.toArray(new String[fields.size()]);
+ }
+
+ return like(new FileReader(f));
+ }
+
+ /**
+ * Return a query that will return docs like the passed URL.
+ *
+ * @return a query that will return docs like the passed URL.
+ */
+ public Query like(URL u) throws IOException {
+ return like(new InputStreamReader(u.openConnection().getInputStream()));
+ }
+
+ /**
+ * Return a query that will return docs like the passed stream.
+ *
+ * @return a query that will return docs like the passed stream.
+ */
+ public Query like(java.io.InputStream is) throws IOException {
+ return like(new InputStreamReader(is));
+ }
+
+ /**
+ * Return a query that will return docs like the passed Reader.
+ *
+ * @return a query that will return docs like the passed Reader.
+ */
+ public Query like(Reader r) throws IOException {
+ return createQuery(retrieveTerms(r));
+ }
+
+ /**
+ * Create the More like query from a PriorityQueue
+ */
+ private Query createQuery(PriorityQueue q) {
+ BooleanQuery query = new BooleanQuery();
+ Object cur;
+ int qterms = 0;
+ float bestScore = 0;
+
+ while (((cur = q.pop()) != null)) {
+ Object[] ar = (Object[]) cur;
+ TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));
+
+ if (boost) {
+ if (qterms == 0) {
+ bestScore = ((Float) ar[2]).floatValue();
+ }
+ float myScore = ((Float) ar[2]).floatValue();
+
+ tq.setBoost(myScore / bestScore);
+ }
+
+ try {
+ query.add(tq, BooleanClause.Occur.SHOULD);
+ }
+ catch (BooleanQuery.TooManyClauses ignore) {
+ break;
+ }
+
+ qterms++;
+ if (maxQueryTerms > 0 && qterms >= maxQueryTerms) {
+ break;
+ }
+ }
+
+ return query;
+ }
+
+ /**
+ * Create a PriorityQueue from a word->tf map.
+ *
+ * @param words a map of words keyed on the word(String) with Int objects as the values.
+ */
+ private PriorityQueue createQueue(Map words) throws IOException {
+ // have collected all words in doc and their freqs
+ int numDocs = ir.numDocs();
+ FreqQ res = new FreqQ(words.size()); // will order words by score
+
+ Iterator it = words.keySet().iterator();
+ while (it.hasNext()) { // for every word
+ String word = (String) it.next();
+
+ int tf = ((Int) words.get(word)).x; // term freq in the source doc
+ if (minTermFreq > 0 && tf < minTermFreq) {
+ continue; // filter out words that don't occur enough times in the source
+ }
+
+ // go through all the fields and find the largest document frequency
+ String topField = fieldNames[0];
+ int docFreq = 0;
+ for (int i = 0; i < fieldNames.length; i++) {
+ int freq = ir.docFreq(new Term(fieldNames[i], word));
+ topField = (freq > docFreq) ? fieldNames[i] : topField;
+ docFreq = (freq > docFreq) ? freq : docFreq;
+ }
+
+ if (minDocFreq > 0 && docFreq < minDocFreq) {
+ continue; // filter out words that don't occur in enough docs
+ }
+
+ if (docFreq == 0) {
+ continue; // index update problem?
+ }
+
+ float idf = similarity.idf(docFreq, numDocs);
+ float score = tf * idf;
+
+ // only really need 1st 3 entries, other ones are for troubleshooting
+ res.insert(new Object[]{word, // the word
+ topField, // the top field
+ new Float(score), // overall score
+ new Float(idf), // idf
+ new Integer(docFreq), // freq in all docs
+ new Integer(tf)
+ });
+ }
+ return res;
+ }
+
+ /**
+ * Describe the parameters that control how the "more like this" query is formed.
+ */
+ public String describeParams() {
+ StringBuffer sb = new StringBuffer();
+ sb.append("\t" + "maxQueryTerms : " + maxQueryTerms + "\n");
+ sb.append("\t" + "minWordLen : " + minWordLen + "\n");
+ sb.append("\t" + "maxWordLen : " + maxWordLen + "\n");
+ sb.append("\t" + "fieldNames : ");
+ String delim = "";
+ for (int i = 0; i < fieldNames.length; i++) {
+ String fieldName = fieldNames[i];
+ sb.append(delim).append(fieldName);
+ delim = ", ";
+ }
+ sb.append("\n");
+ sb.append("\t" + "boost : " + boost + "\n");
+ sb.append("\t" + "minTermFreq : " + minTermFreq + "\n");
+ sb.append("\t" + "minDocFreq : " + minDocFreq + "\n");
+ return sb.toString();
+ }
+
+ /**
+ * Test driver.
+ * Pass in "-i INDEX" and then either "-f FILE" or "-url URL".
+ */
+ public static void main(String[] a) throws Throwable {
+ String indexName = "localhost_index";
+ String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
+ URL url = null;
+ for (int i = 0; i < a.length; i++) {
+ if (a[i].equals("-i")) {
+ indexName = a[++i];
+ }
+ else if (a[i].equals("-f")) {
+ fn = a[++i];
+ }
+ else if (a[i].equals("-url")) {
+ url = new URL(a[++i]);
+ }
+ }
+
+ PrintStream o = System.out;
+ IndexReader r = IndexReader.open(indexName);
+ o.println("Open index " + indexName + " which has " + r.numDocs() + " docs");
+
+ MoreLikeThis mlt = new MoreLikeThis(r);
+
+ o.println("Query generation parameters:");
+ o.println(mlt.describeParams());
+ o.println();
+
+ Query query = null;
+ if (url != null) {
+ o.println("Parsing URL: " + url);
+ query = mlt.like(url);
+ }
+ else if (fn != null) {
+ o.println("Parsing file: " + fn);
+ query = mlt.like(new File(fn));
+ }
+
+ o.println("q: " + query);
+ o.println();
+ IndexSearcher searcher = new IndexSearcher(indexName);
+
+ Hits hits = searcher.search(query);
+ int len = hits.length();
+ o.println("found: " + len + " documents matching");
+ o.println();
+ for (int i = 0; i < Math.min(25, len); i++) {
+ Document d = hits.doc(i);
+ String summary = d.get( "summary");
+ o.println("score : " + hits.score(i));
+ o.println("url : " + d.get("url"));
+ o.println("\ttitle : " + d.get("title"));
+ if ( summary != null)
+ o.println("\tsummary: " + d.get("summary"));
+ o.println();
+ }
+ }
+
+ /**
+ * Find words for a more-like-this query former.
+ *
+ * @param docNum the id of the lucene document from which to find terms
+ */
+ public PriorityQueue retrieveTerms(int docNum) throws IOException {
+ Map termFreqMap = new HashMap();
+ for (int i = 0; i < fieldNames.length; i++) {
+ String fieldName = fieldNames[i];
+ TermFreqVector vector = ir.getTermFreqVector(docNum, fieldName);
+
+ // field does not store term vector info
+ if (vector == null) {
+ Document d=ir.document(docNum);
+ String text[]=d.getValues(fieldName);
+ if(text!=null)
+ {
+ for (int j = 0; j < text.length; j++) {
+ addTermFrequencies(new StringReader(text[j]), termFreqMap, fieldName);
+ }
+ }
+ }
+ else {
+ addTermFrequencies(termFreqMap, vector);
+ }
+
+ }
+
+ return createQueue(termFreqMap);
+ }
+
+ /**
+ * Adds terms and frequencies found in vector into the Map termFreqMap
+ * @param termFreqMap a Map of terms and their frequencies
+ * @param vector List of terms and their frequencies for a doc/field
+ */
+ private void addTermFrequencies(Map termFreqMap, TermFreqVector vector)
+ {
+ String[] terms = vector.getTerms();
+ int freqs[]=vector.getTermFrequencies();
+ for (int j = 0; j < terms.length; j++) {
+ String term = terms[j];
+
+ if(isNoiseWord(term)){
+ continue;
+ }
+ // increment frequency
+ Int cnt = (Int) termFreqMap.get(term);
+ if (cnt == null) {
+ cnt=new Int();
+ termFreqMap.put(term, cnt);
+ cnt.x=freqs[j];
+ }
+ else {
+ cnt.x+=freqs[j];
+ }
+ }
+ }
+ /**
+ * Adds term frequencies found by tokenizing text from reader into the Map termFreqMap
+ * @param r a source of text to be tokenized
+ * @param termFreqMap a Map of terms and their frequencies
+ * @param fieldName Used by analyzer for any special per-field analysis
+ */
+ private void addTermFrequencies(Reader r, Map termFreqMap, String fieldName)
+ throws IOException
+ {
+ TokenStream ts = analyzer.tokenStream(fieldName, r);
+ int tokenCount=0;
+ // for every token
+ final Token reusableToken = new Token();
+ for (Token nextToken = ts.next(reusableToken); nextToken != null; nextToken = ts.next(reusableToken)) {
+ String word = nextToken.term();
+ tokenCount++;
+ if(tokenCount>maxNumTokensParsed)
+ {
+ break;
+ }
+ if(isNoiseWord(word)){
+ continue;
+ }
+
+ // increment frequency
+ Int cnt = (Int) termFreqMap.get(word);
+ if (cnt == null) {
+ termFreqMap.put(word, new Int());
+ }
+ else {
+ cnt.x++;
+ }
+ }
+ }
+
+
+ /** determines if the passed term is likely to be of interest in "more like" comparisons
+ *
+ * @param term The word being considered
+ * @return true if should be ignored, false if should be used in further analysis
+ */
+ private boolean isNoiseWord(String term)
+ {
+ int len = term.length();
+ if (minWordLen > 0 && len < minWordLen) {
+ return true;
+ }
+ if (maxWordLen > 0 && len > maxWordLen) {
+ return true;
+ }
+ if (stopWords != null && stopWords.contains( term)) {
+ return true;
+ }
+ return false;
+ }
+
+
+ /**
+ * Find words for a more-like-this query former.
+ * The result is a priority queue of arrays with one entry for <b>every word</b> in the document.
+ * Each array has 6 elements.
+ * The elements are:
+ * <ol>
+ * <li> The word (String)
+ * <li> The top field that this word comes from (String)
+ * <li> The score for this word (Float)
+ * <li> The IDF value (Float)
+ * <li> The frequency of this word in the index (Integer)
+ * <li> The frequency of this word in the source document (Integer)
+ * </ol>
+ * This is a somewhat "advanced" routine, and in general only the 1st entry in the array is of interest.
+ * This method is exposed so that you can identify the "interesting words" in a document.
+ * For an easier method to call see {@link #retrieveInterestingTerms retrieveInterestingTerms()}.
+ *
+ * @param r the reader that has the content of the document
+ * @return the most interesting words in the document ordered by score, with the highest scoring, or best entry, first
+ *
+ * @see #retrieveInterestingTerms
+ */
+ public PriorityQueue retrieveTerms(Reader r) throws IOException {
+ Map words = new HashMap();
+ for (int i = 0; i < fieldNames.length; i++) {
+ String fieldName = fieldNames[i];
+ addTermFrequencies(r, words, fieldName);
+ }
+ return createQueue(words);
+ }
+
+ /**
+ * @see #retrieveInterestingTerms(java.io.Reader)
+ */
+ public String [] retrieveInterestingTerms(int docNum) throws IOException{
+ ArrayList al = new ArrayList( maxQueryTerms);
+ PriorityQueue pq = retrieveTerms(docNum);
+ Object cur;
+ int lim = maxQueryTerms; // have to be careful, retrieveTerms returns all words but that's probably not useful to our caller...
+ // we just want to return the top words
+ while (((cur = pq.pop()) != null) && lim-- > 0) {
+ Object[] ar = (Object[]) cur;
+ al.add( ar[ 0]); // the 1st entry is the interesting word
+ }
+ String[] res = new String[ al.size()];
+ return (String[]) al.toArray( res);
+ }
+
+ /**
+ * Convenience routine to make it easy to return the most interesting words in a document.
+ * More advanced users will call {@link #retrieveTerms(java.io.Reader) retrieveTerms()} directly.
+ * @param r the source document
+ * @return the most interesting words in the document
+ *
+ * @see #retrieveTerms(java.io.Reader)
+ * @see #setMaxQueryTerms
+ */
+ public String[] retrieveInterestingTerms( Reader r) throws IOException {
+ ArrayList al = new ArrayList( maxQueryTerms);
+ PriorityQueue pq = retrieveTerms( r);
+ Object cur;
+ int lim = maxQueryTerms; // have to be careful, retrieveTerms returns all words but that's probably not useful to our caller...
+ // we just want to return the top words
+ while (((cur = pq.pop()) != null) && lim-- > 0) {
+ Object[] ar = (Object[]) cur;
+ al.add( ar[ 0]); // the 1st entry is the interesting word
+ }
+ String[] res = new String[ al.size()];
+ return (String[]) al.toArray( res);
+ }
+
+ /**
+ * PriorityQueue that orders words by score.
+ */
+ private static class FreqQ extends PriorityQueue {
+ FreqQ (int s) {
+ initialize(s);
+ }
+
+ protected boolean lessThan(Object a, Object b) {
+ Object[] aa = (Object[]) a;
+ Object[] bb = (Object[]) b;
+ Float fa = (Float) aa[2];
+ Float fb = (Float) bb[2];
+ return fa.floatValue() > fb.floatValue();
+ }
+ }
+
+ /**
+ * Use for frequencies and to avoid renewing Integers.
+ */
+ private static class Int {
+ int x;
+
+ Int() {
+ x = 1;
+ }
+ }
+
+
+}
Propchange: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThis.java
------------------------------------------------------------------------------
svn:eol-style = native
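
For readers following the diff, the term selection that createQueue() performs in
MoreLikeThis.java above can be sketched without Lucene. This is a minimal, hypothetical
illustration and not the committed code: the class MltTermSelectionSketch, its method
names, and the docFreqs map (standing in for IndexReader.docFreq()) are assumptions made
for the example, and the idf formula is assumed to match DefaultSimilarity.idf() rather
than taken from this commit. It also looks at a single field only, whereas the real code
picks the field with the largest document frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Lucene-free sketch of the tf*idf term selection described above. */
final class MltTermSelectionSketch {

    /** A candidate term together with its tf*idf score. */
    static final class ScoredTerm {
        final String word;
        final float score;

        ScoredTerm(String word, float score) {
            this.word = word;
            this.score = score;
        }
    }

    /** Assumed to mirror DefaultSimilarity.idf(): log(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Counts how often each token occurs in the source document. */
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    /**
     * Applies the minTermFreq and minDocFreq thresholds, scores the survivors by
     * tf*idf, and returns at most maxQueryTerms words, best first
     * (maxQueryTerms <= 0 means no limit).
     */
    static List<String> topTerms(Map<String, Integer> tf, Map<String, Integer> docFreqs,
                                 int numDocs, int minTermFreq, int minDocFreq,
                                 int maxQueryTerms) {
        List<ScoredTerm> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int freq = e.getValue();
            if (minTermFreq > 0 && freq < minTermFreq) {
                continue; // too rare in the source document
            }
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            if (minDocFreq > 0 && df < minDocFreq) {
                continue; // occurs in too few indexed documents
            }
            if (df == 0) {
                continue; // term is unknown to the index
            }
            scored.add(new ScoredTerm(e.getKey(), freq * idf(df, numDocs)));
        }
        // Highest score first; ties are broken alphabetically so the order is stable.
        scored.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.word.compareTo(b.word);
        });
        List<String> out = new ArrayList<>();
        for (ScoredTerm s : scored) {
            if (maxQueryTerms > 0 && out.size() >= maxQueryTerms) {
                break;
            }
            out.add(s.word);
        }
        return out;
    }
}
```

In this sketch a frequent but very common word (high tf, high df) tends to score below a
moderately frequent but rare one, which is the behaviour the tf and df thresholds in
MoreLikeThis are meant to encourage.
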
Modified: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java?rev=712922&r1=712921&r2=712922&view=diff
==============================================================================
--- lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java (original)
+++ lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java Mon Nov 10 18:35:46 2008
@@ -1,164 +1,164 @@
-/*
- * Created on 25-Jan-2006
- */
-package org.apache.lucene.search.similar;
-
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.util.Set;
-
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.search.BooleanClause;
-import org.apache.lucene.search.BooleanQuery;
-import org.apache.lucene.search.Query;
-import org.apache.lucene.search.similar.MoreLikeThis;
-
-/**
- * A simple wrapper for MoreLikeThis for use in scenarios where a Query object is required eg
- * in custom QueryParser extensions. At query.rewrite() time the reader is used to construct the
- * actual MoreLikeThis object and obtain the real Query object.
- */
-public class MoreLikeThisQuery extends Query
-{
-
-
- private String likeText;
- private String[] moreLikeFields;
- private Analyzer analyzer;
- float percentTermsToMatch=0.3f;
- int minTermFrequency=1;
- int maxQueryTerms=5;
- Set stopWords=null;
- int minDocFreq=-1;
-
-
- /**
- * @param moreLikeFields
- */
- public MoreLikeThisQuery(String likeText, String[] moreLikeFields, Analyzer analyzer)
- {
- this.likeText=likeText;
- this.moreLikeFields=moreLikeFields;
- this.analyzer=analyzer;
- }
-
- public Query rewrite(IndexReader reader) throws IOException
- {
- MoreLikeThis mlt=new MoreLikeThis(reader);
-
- mlt.setFieldNames(moreLikeFields);
- mlt.setAnalyzer(analyzer);
- mlt.setMinTermFreq(minTermFrequency);
- if(minDocFreq>=0)
- {
- mlt.setMinDocFreq(minDocFreq);
- }
- mlt.setMaxQueryTerms(maxQueryTerms);
- mlt.setStopWords(stopWords);
- BooleanQuery bq= (BooleanQuery) mlt.like(new ByteArrayInputStream(likeText.getBytes()));
- BooleanClause[] clauses = bq.getClauses();
- //make at least half the terms match
- bq.setMinimumNumberShouldMatch((int)(clauses.length*percentTermsToMatch));
- return bq;
- }
- /* (non-Javadoc)
- * @see org.apache.lucene.search.Query#toString(java.lang.String)
- */
- public String toString(String field)
- {
- return "like:"+likeText;
- }
-
- public float getPercentTermsToMatch() {
- return percentTermsToMatch;
- }
- public void setPercentTermsToMatch(float percentTermsToMatch) {
- this.percentTermsToMatch = percentTermsToMatch;
- }
-
- public Analyzer getAnalyzer()
- {
- return analyzer;
- }
-
- public void setAnalyzer(Analyzer analyzer)
- {
- this.analyzer = analyzer;
- }
-
- public String getLikeText()
- {
- return likeText;
- }
-
- public void setLikeText(String likeText)
- {
- this.likeText = likeText;
- }
-
- public int getMaxQueryTerms()
- {
- return maxQueryTerms;
- }
-
- public void setMaxQueryTerms(int maxQueryTerms)
- {
- this.maxQueryTerms = maxQueryTerms;
- }
-
- public int getMinTermFrequency()
- {
- return minTermFrequency;
- }
-
- public void setMinTermFrequency(int minTermFrequency)
- {
- this.minTermFrequency = minTermFrequency;
- }
-
- public String[] getMoreLikeFields()
- {
- return moreLikeFields;
- }
-
- public void setMoreLikeFields(String[] moreLikeFields)
- {
- this.moreLikeFields = moreLikeFields;
- }
- public Set getStopWords()
- {
- return stopWords;
- }
- public void setStopWords(Set stopWords)
- {
- this.stopWords = stopWords;
- }
-
- public int getMinDocFreq()
- {
- return minDocFreq;
- }
-
- public void setMinDocFreq(int minDocFreq)
- {
- this.minDocFreq = minDocFreq;
- }
-}
+/*
+ * Created on 25-Jan-2006
+ */
+package org.apache.lucene.search.similar;
+
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.ByteArrayInputStream;
+import java.io.IOException;
+import java.util.Set;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.similar.MoreLikeThis;
+
+/**
+ * A simple wrapper for MoreLikeThis for use in scenarios where a Query object is required eg
+ * in custom QueryParser extensions. At query.rewrite() time the reader is used to construct the
+ * actual MoreLikeThis object and obtain the real Query object.
+ */
+public class MoreLikeThisQuery extends Query
+{
+
+
+ private String likeText;
+ private String[] moreLikeFields;
+ private Analyzer analyzer;
+ float percentTermsToMatch=0.3f;
+ int minTermFrequency=1;
+ int maxQueryTerms=5;
+ Set stopWords=null;
+ int minDocFreq=-1;
+
+
+ /**
+ * @param moreLikeFields
+ */
+ public MoreLikeThisQuery(String likeText, String[] moreLikeFields, Analyzer analyzer)
+ {
+ this.likeText=likeText;
+ this.moreLikeFields=moreLikeFields;
+ this.analyzer=analyzer;
+ }
+
+ public Query rewrite(IndexReader reader) throws IOException
+ {
+ MoreLikeThis mlt=new MoreLikeThis(reader);
+
+ mlt.setFieldNames(moreLikeFields);
+ mlt.setAnalyzer(analyzer);
+ mlt.setMinTermFreq(minTermFrequency);
+ if(minDocFreq>=0)
+ {
+ mlt.setMinDocFreq(minDocFreq);
+ }
+ mlt.setMaxQueryTerms(maxQueryTerms);
+ mlt.setStopWords(stopWords);
+ BooleanQuery bq= (BooleanQuery) mlt.like(new ByteArrayInputStream(likeText.getBytes()));
+ BooleanClause[] clauses = bq.getClauses();
+ //require at least percentTermsToMatch (default 0.3) of the clauses to match
+ bq.setMinimumNumberShouldMatch((int)(clauses.length*percentTermsToMatch));
+ return bq;
+ }
+ /* (non-Javadoc)
+ * @see org.apache.lucene.search.Query#toString(java.lang.String)
+ */
+ public String toString(String field)
+ {
+ return "like:"+likeText;
+ }
+
+ public float getPercentTermsToMatch() {
+ return percentTermsToMatch;
+ }
+ public void setPercentTermsToMatch(float percentTermsToMatch) {
+ this.percentTermsToMatch = percentTermsToMatch;
+ }
+
+ public Analyzer getAnalyzer()
+ {
+ return analyzer;
+ }
+
+ public void setAnalyzer(Analyzer analyzer)
+ {
+ this.analyzer = analyzer;
+ }
+
+ public String getLikeText()
+ {
+ return likeText;
+ }
+
+ public void setLikeText(String likeText)
+ {
+ this.likeText = likeText;
+ }
+
+ public int getMaxQueryTerms()
+ {
+ return maxQueryTerms;
+ }
+
+ public void setMaxQueryTerms(int maxQueryTerms)
+ {
+ this.maxQueryTerms = maxQueryTerms;
+ }
+
+ public int getMinTermFrequency()
+ {
+ return minTermFrequency;
+ }
+
+ public void setMinTermFrequency(int minTermFrequency)
+ {
+ this.minTermFrequency = minTermFrequency;
+ }
+
+ public String[] getMoreLikeFields()
+ {
+ return moreLikeFields;
+ }
+
+ public void setMoreLikeFields(String[] moreLikeFields)
+ {
+ this.moreLikeFields = moreLikeFields;
+ }
+ public Set getStopWords()
+ {
+ return stopWords;
+ }
+ public void setStopWords(Set stopWords)
+ {
+ this.stopWords = stopWords;
+ }
+
+ public int getMinDocFreq()
+ {
+ return minDocFreq;
+ }
+
+ public void setMinDocFreq(int minDocFreq)
+ {
+ this.minDocFreq = minDocFreq;
+ }
+}
Propchange: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/MoreLikeThisQuery.java
------------------------------------------------------------------------------
svn:eol-style = native
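
Note on the rewrite() threshold above: the value handed to setMinimumNumberShouldMatch is a truncating int cast of clauses.length*percentTermsToMatch. The following is a minimal, Lucene-free sketch of just that arithmetic. The class name MinShouldMatchSketch and the method minShouldMatch are hypothetical and are not part of this commit.

```java
public class MinShouldMatchSketch {

    // Mirrors the expression used in MoreLikeThisQuery.rewrite():
    // (int)(clauses.length*percentTermsToMatch). The cast truncates toward zero.
    static int minShouldMatch(int clauseCount, float percentTermsToMatch) {
        return (int) (clauseCount * percentTermsToMatch);
    }

    public static void main(String[] args) {
        // With the default of 0.3f, five clauses require one match...
        System.out.println(minShouldMatch(5, 0.3f));
        // ...while two clauses truncate to zero required matches.
        System.out.println(minShouldMatch(2, 0.3f));
    }
}
```

Because of the truncation, a query with few clauses can end up with a minimum of zero, in which case the usual BooleanQuery rule (at least one optional clause must match) is the only constraint.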
Modified: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/SimilarityQueries.java
URL: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/SimilarityQueries.java?rev=712922&r1=712921&r2=712922&view=diff
==============================================================================
--- lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/SimilarityQueries.java (original)
+++ lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/SimilarityQueries.java Mon Nov 10 18:35:46 2008
@@ -1,113 +1,113 @@
-/**
- * Copyright 2004 The Apache Software Foundation.
- *
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.lucene.search.similar;
-
-import java.io.IOException;
-import java.io.StringReader;
-import java.util.HashSet;
-import java.util.Set;
-
-import org.apache.lucene.analysis.Analyzer;
-import org.apache.lucene.analysis.Token;
-import org.apache.lucene.analysis.TokenStream;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.search.BooleanClause;
-import org.apache.lucene.search.BooleanQuery;
-import org.apache.lucene.search.IndexSearcher;
-import org.apache.lucene.search.Query;
-import org.apache.lucene.search.TermQuery;
-
-/**
- * Simple similarity measures.
- *
- * @see MoreLikeThis
- */
-public final class SimilarityQueries
-{
- /**
- *
- */
- private SimilarityQueries()
- {
- }
-
- /**
- * Simple similarity query generators.
- * Takes every unique word and forms a boolean query where all words are optional.
- * After you get this you'll use it to query your {@link IndexSearcher} for similar docs.
- * The only caveat is the first hit returned <b>should be</b> your source document - you'll
- * need to then ignore that.
- *
- * <p>
- * So, if you have a code fragment like this:
- * <br>
- * <code>
- * Query q = formSimilarQuery( "I use Lucene to search fast. Fast searchers are good", new StandardAnalyzer(), "contents", null);
- * </code>
- *
- * <p>
- * The query returned, in string form, will be <code>'(i use lucene to search fast searchers are good)'</code>.
- *
- * <p>
- * The philosophy behind this method is "two documents are similar if they share lots of words".
- * Note that behind the scenes, Lucene's scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.
- *
- * <P>
- * This method is fail-safe in that if a long 'body' is passed in and
- * {@link BooleanQuery#add BooleanQuery.add()} (used internally)
- * throws
- * {@link org.apache.lucene.search.BooleanQuery.TooManyClauses BooleanQuery.TooManyClauses}, the
- * query as it is will be returned.
- *
- * @param body the body of the document you want to find similar documents to
- * @param a the analyzer to use to parse the body
- * @param field the field you want to search on, probably something like "contents" or "body"
- * @param stop optional set of stop words to ignore
- * @return a query with all unique words in 'body'
- * @throws IOException this can't happen...
- */
- public static Query formSimilarQuery( String body,
- Analyzer a,
- String field,
- Set stop)
- throws IOException
- {
- TokenStream ts = a.tokenStream( field, new StringReader( body));
- BooleanQuery tmp = new BooleanQuery();
- Set already = new HashSet(); // ignore dups
- final Token reusableToken = new Token();
- for (Token nextToken = ts.next(reusableToken); nextToken != null; nextToken = ts.next(reusableToken)) {
- String word = nextToken.term();
- // ignore opt stop words
- if ( stop != null &&
- stop.contains( word)) continue;
- // ignore dups
- if ( ! already.add( word)) continue;
- // add to query
- TermQuery tq = new TermQuery( new Term( field, word));
- try
- {
- tmp.add( tq, BooleanClause.Occur.SHOULD);
- }
- catch( BooleanQuery.TooManyClauses too)
- {
- // fail-safe, just return what we have, not the end of the world
- break;
- }
- }
- return tmp;
- }
-}
+/**
+ * Copyright 2004 The Apache Software Foundation.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.similar;
+
+import java.io.IOException;
+import java.io.StringReader;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.Token;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.BooleanClause;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.TermQuery;
+
+/**
+ * Simple similarity measures.
+ *
+ * @see MoreLikeThis
+ */
+public final class SimilarityQueries
+{
+ /**
+ *
+ */
+ private SimilarityQueries()
+ {
+ }
+
+ /**
+ * Simple similarity query generators.
+ * Takes every unique word and forms a boolean query where all words are optional.
+ * After you get this you'll use it to query your {@link IndexSearcher} for similar docs.
+ * The only caveat is the first hit returned <b>should be</b> your source document - you'll
+ * need to then ignore that.
+ *
+ * <p>
+ * So, if you have a code fragment like this:
+ * <br>
+ * <code>
+ * Query q = formSimilarQuery( "I use Lucene to search fast. Fast searchers are good", new StandardAnalyzer(), "contents", null);
+ * </code>
+ *
+ * <p>
+ * The query returned, in string form, will be <code>'(i use lucene to search fast searchers are good)'</code>.
+ *
+ * <p>
+ * The philosophy behind this method is "two documents are similar if they share lots of words".
+ * Note that behind the scenes, Lucene's scoring algorithm will tend to give two documents a higher similarity score if they share more uncommon words.
+ *
+ * <P>
+ * This method is fail-safe in that if a long 'body' is passed in and
+ * {@link BooleanQuery#add BooleanQuery.add()} (used internally)
+ * throws
+ * {@link org.apache.lucene.search.BooleanQuery.TooManyClauses BooleanQuery.TooManyClauses}, the
+ * query as it is will be returned.
+ *
+ * @param body the body of the document you want to find similar documents to
+ * @param a the analyzer to use to parse the body
+ * @param field the field you want to search on, probably something like "contents" or "body"
+ * @param stop optional set of stop words to ignore
+ * @return a query with all unique words in 'body'
+ * @throws IOException this can't happen...
+ */
+ public static Query formSimilarQuery( String body,
+ Analyzer a,
+ String field,
+ Set stop)
+ throws IOException
+ {
+ TokenStream ts = a.tokenStream( field, new StringReader( body));
+ BooleanQuery tmp = new BooleanQuery();
+ Set already = new HashSet(); // ignore dups
+ final Token reusableToken = new Token();
+ for (Token nextToken = ts.next(reusableToken); nextToken != null; nextToken = ts.next(reusableToken)) {
+ String word = nextToken.term();
+ // ignore opt stop words
+ if ( stop != null &&
+ stop.contains( word)) continue;
+ // ignore dups
+ if ( ! already.add( word)) continue;
+ // add to query
+ TermQuery tq = new TermQuery( new Term( field, word));
+ try
+ {
+ tmp.add( tq, BooleanClause.Occur.SHOULD);
+ }
+ catch( BooleanQuery.TooManyClauses too)
+ {
+ // fail-safe, just return what we have, not the end of the world
+ break;
+ }
+ }
+ return tmp;
+ }
+}
Propchange: lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/similar/SimilarityQueries.java
------------------------------------------------------------------------------
svn:eol-style = native
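
Note on formSimilarQuery above: the loop skips stop words, skips duplicates, and adds each remaining word as an optional clause. Below is a rough, Lucene-free sketch of only the filtering part. It assumes a simple lowercasing split on non-alphanumeric characters as a stand-in for the Analyzer, which is not what a real analyzer does in general. The class name UniqueTermsSketch and the method uniqueTerms are hypothetical and are not part of this commit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class UniqueTermsSketch {

    // Returns the unique, non-stop words of 'body' in first-seen order.
    // Assumption: lowercasing plus a split on non-alphanumerics stands in
    // for the Analyzer/TokenStream used by formSimilarQuery.
    static List<String> uniqueTerms(String body, Set<String> stop) {
        Set<String> already = new HashSet<String>(); // ignore dups
        List<String> terms = new ArrayList<String>();
        for (String raw : body.split("[^A-Za-z0-9]+")) {
            if (raw.isEmpty()) continue;
            String word = raw.toLowerCase(Locale.ROOT);
            // ignore optional stop words
            if (stop != null && stop.contains(word)) continue;
            // ignore duplicates: Set.add returns false if already present
            if (!already.add(word)) continue;
            terms.add(word);
        }
        return terms;
    }

    public static void main(String[] args) {
        String body = "I use Lucene to search fast. Fast searchers are good";
        // prints [i, use, lucene, to, search, fast, searchers, are, good]
        System.out.println(uniqueTerms(body, null));
    }
}
```

In the committed method each surviving word becomes a TermQuery added with BooleanClause.Occur.SHOULD, and the loop stops early if BooleanQuery.TooManyClauses is thrown.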