Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/02/12 02:51:25 UTC

code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Doug Cutting wrote:

> Karl Koch wrote:
>
>> Do you know good papers about strategies for how
>> to select keywords effectively, beyond the scope of stopword lists and 
>> stemming?
>>
>> Using term frequencies of the document is not really possible, since 
>> lucene
>> does not provide access to a document vector, does it?
>
>
> Lucene does let you access the document frequency of terms, with 
> IndexReader.docFreq().  Term frequencies can be computed by 
> re-tokenizing the text, which, for a single document, is usually fast 
> enough.  But looking up the docFreq() of every term in the document is 
> probably too slow.
>
> You can use some heuristics to prune the set of terms, to avoid 
> calling docFreq() too much, or at all.  Since you're trying to 
> maximize a tf*idf score, you're probably most interested in terms with 
> a high tf. Choosing a tf threshold even as low as two or three will 
> radically reduce the number of terms under consideration.  Another 
> heuristic is that terms with a high idf (i.e., a low df) tend to be 
> longer.  So you could threshold the terms by the number of characters, 
> not selecting anything less than, e.g., six or seven characters.  With 
> these sorts of heuristics you can usually find a small set of, e.g., 
> ten or fewer terms that do a pretty good job of characterizing a document.
>
> It all depends on what you're trying to do.  If you're trying to eke 
> out that last percent of precision and recall regardless of 
> computational difficulty so that you can win a TREC competition, then 
> the techniques I mention above are useless.  But if you're trying to 
> provide a "more like this" button on a search results page that does a 
> decent job and has good performance, such techniques might be useful.
>
> An efficient, effective "more-like-this" query generator would be a 
> great contribution, if anyone's interested.  I'd imagine that it would 
> take a Reader or a String (the document's text), an Analyzer, and 
> return a set of representative terms using heuristics like those 
> above.  The frequency and length thresholds could be parameters, etc.


Well, I've done a prelim impl of the above. Maybe someone could proofread 
my code. The code is hot off the presses and seems to work...

Questions are:
[a] is the code right?
[b] are any more (or fewer) params needed to properly "genericize" the 
algorithm? e.g. a max number of "words" to return?
[c] I can tweak the code to be a little more usable... does it make sense 
to return, say, a Query? (a sketch follows the code below)
[d] then the eternal question - I think these things are interesting, but 
my theory is that Query impls (is-a Query) which are not hooked into the 
QueryParser will never really be used....

Anyway:

There are two parts - the main() quick test I did, which is not set up to 
run on another system right now but shows how the mlt routine 
(mlt -> MoreLikeThis) is called:



    public static void main( String[] a)
        throws Throwable
    {
        // stop words to skip when gathering candidate terms
        Hashtable stopTable = StopFilter.makeStopTable( StopAnalyzer.ENGLISH_STOP_WORDS);
        String fn = "c:/Program Files/Apache Group/Apache/htdocs/manual/vhosts/index.html.en";
        PrintStream o = System.out;
        final IndexReader r = IndexReader.open( "localhost_index");

        // extract the plain text of the "source" document
        String body = new com.tropo.html.HTMLTextMuncher( new FileInputStream( fn)).getText();
        // minFreq=2, no min/max length thresholds
        PriorityQueue q = mlt( new StringReader( body), getDefAnalyzer(), r, "contents", 2, stopTable, 0, 0);

        o.println( "res..." + q.size());
        o.println();

        // pop terms off the queue, highest tf*idf score first
        Object cur;
        while ( (cur = q.pop()) != null)
        {
            Object[] ar = (Object[]) cur;
            o.println( ar[ 0] + " = " + ar[ 1]);
        }
    }




And the implementation, which will compile with the appropriate imports:

import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.search.*;
import org.apache.lucene.index.*;
import org.apache.lucene.store.*;
import org.apache.lucene.util.*;


    /**
     * Find words for a more-like-this query former.
     *
     * @param r the reader that has the content of the document
     * @param a the analyzer to parse the reader with
     * @param ir the index reader used to look up document frequencies
     * @param field the field of interest in the document
     * @param minFreq filter out terms that occur less than this many times in the document
     * @param stop a table of stopwords to ignore
     * @param minLen ignore words shorter than this length, or pass in 0 to not use this
     * @param maxLen ignore words longer than this length, or pass in 0 to not use this
     * @return a priority queue ordered by terms with the largest score (tf*idf)
     */
    public static PriorityQueue mlt( Reader r,
                                     Analyzer a,
                                     IndexReader ir,
                                     String field,
                                     int minFreq,
                                     Hashtable stop,
                                     int minLen,
                                     int maxLen)
        throws IOException
    {
        Similarity sim = new DefaultSimilarity(); // for idf
        TokenStream ts = a.tokenStream( field, r);
        org.apache.lucene.analysis.Token toke;
        // count term frequencies (tf) by re-tokenizing the document
        Map words = new HashMap();
        while ( (toke = ts.next()) != null)
        {
            String word = toke.termText().toLowerCase();
            if ( stop.contains( word)) continue;
            int len = word.length();
            if ( minLen > 0 && len < minLen) continue;
            if ( maxLen > 0 && len > maxLen) continue;
            if ( words.containsKey( word))
            {
                Integer x = (Integer) words.get( word);
                words.put( word, new Integer( x.intValue()+1));
            }
            else
                words.put( word, new Integer( 1));
        }
        Iterator it = words.keySet().iterator();

        int numDocs = ir.numDocs();
        FreqQ res = new FreqQ( words.size());

        // score each surviving term by tf*idf
        while ( it.hasNext())
        {
            String word = (String) it.next();
            int tf = ((Integer) words.get( word)).intValue();
            if (tf < minFreq) continue; // prune low-frequency terms
            Term t = new Term( field, word);
            int docFreq = ir.docFreq( t);
            float idf = sim.idf( docFreq, numDocs);
            float score = tf * idf;
            res.insert( new Object[] { word, new Float( score),
                                       new Float( idf),
                                       new Integer( docFreq)});
        }
        return res;
    }

    /**
     * PriorityQueue of [word, score, idf, docFreq] arrays, ordered so
     * that the highest-scoring term pops first.
     */
    static class FreqQ
        extends PriorityQueue
    {
        FreqQ( int s)
        {
            initialize( s);
        }

        protected boolean lessThan(Object a, Object b)
        {
            Object[] aa = (Object[]) a;
            Object[] bb = (Object[]) b;
            Float fa = (Float) aa[ 1];
            Float fb = (Float) bb[ 1];
            // inverted comparison so pop() returns the largest score first
            return fa.floatValue() > fb.floatValue();
        }
    }
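
As for question [c], here is a minimal sketch of turning the returned 
queue into a Query. The helper name toQuery and the weighting of each 
clause by its tf*idf score are my assumptions, not settled design:

    // Hedged sketch: convert the queue from mlt() into a BooleanQuery.
    // Consumes the queue; pass the same field used to build it.
    public static Query toQuery( PriorityQueue q, String field)
    {
        BooleanQuery query = new BooleanQuery();
        Object cur;
        while ( (cur = q.pop()) != null)
        {
            Object[] ar = (Object[]) cur;
            TermQuery tq = new TermQuery( new Term( field, (String) ar[ 0]));
            tq.setBoost( ((Float) ar[ 1]).floatValue()); // weight by tf*idf
            query.add( tq, false, false); // optional clause, not required/prohibited
        }
        return query;
    }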


>
> Doug
>




Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by David Spencer <da...@tropo.com>.
Bruce Ritchie wrote:

> David Spencer wrote:
>
>> Code rewritten, automagically chooses lots of defaults, lets you 
>> override
>> the defs thru the static vars at the bottom or the non-static vars 
>> also at the bottom.
>
>
> I've taken the liberty to update this code to handle multiple fields 
> and use the new term vector support in CVS so that retokenizing a 
> document's text isn't necessary if you have a document ID that has 
> indexed and term vector supported fields. I've added the Apache 2.0 
> license to the top; however, if that isn't the licence you want this 
> code to be released under, let me know and I'll change it immediately.

Thank you for doing this - I'm sure the license is fine. I have not 
tested your changes but will try to do so soonish.

 - Dave

>
> Regards,
>
> Bruce Ritchie
> http://www.jivesoftware.com/





Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by Bruce Ritchie <br...@jivesoftware.com>.
David Spencer wrote:

> Code rewritten, automagically chooses lots of defaults, lets you override
> the defs thru the static vars at the bottom or the non-static vars also 
> at the bottom.

I've taken the liberty to update this code to handle multiple fields and use the new term vector 
support in CVS so that retokenizing a document's text isn't necessary if you have a document ID that 
has indexed and term vector supported fields. I've added the Apache 2.0 license to the top; however, 
if that isn't the licence you want this code to be released under, let me know and I'll change it 
immediately.
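
For reference, a minimal sketch of reading term frequencies straight from 
a stored term vector, assuming the field was indexed with term vectors 
(the docId variable and the "contents" field name are placeholders):

    // Given an IndexReader ir and the Lucene doc id of the source doc:
    TermFreqVector tfv = ir.getTermFreqVector( docId, "contents");
    if ( tfv != null) // null if no term vector was stored for this field
    {
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for ( int i = 0; i < terms.length; i++)
            System.out.println( terms[ i] + " tf=" + freqs[ i]);
    }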


Regards,

Bruce Ritchie
http://www.jivesoftware.com/

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by David Spencer <da...@tropo.com>.
Bruce Ritchie wrote:

> David Spencer wrote:
>
>> [c] "interesting words" - uses code from MoreLikeThis to give a table 
>> of all interesting
>> words in the current "source" doc ordered by score.
>> Remember score is idf*tf as per Dougs mail (and as per my
>> hopefully correct understanding of these things). This page is of 
>> course more of a debugging
>> tool that something a normal user would see.  One possible area of 
>> improvement that jumped out at me after reviewing this table is using 
>> stemming, say, allowing more words in the generated query when 2 
>> words have the same stem.
>
>
> Actually, the analyzer should do that, shouldn't it? For example, I 
> have stemming analyzers for a variety of languages that both stem and 
> remove stop words - it seems silly to me to duplicate that 
> functionality when it's so easily provided by the analyzer. Given 
> that, I would suggest removing the stop word functionality from this 
> class as it is not needed and only confuses things.

Actually, I realized this is a tricky and possibly counterintuitive issue.

In theory one might want the MoreLikeThis logic to use a *larger* stop 
word list than the Analyzer uses, even in the case where the Analyzer 
does not use any stop word list at all.

Reasoning is:
-- maybe you don't want the Analyzer to have any stop words (so a user 
can find the classic "to be or not to be" phrase), and the search index 
compression won't (in theory?) be affected by frequent stop words anyway
-- the stop words used by MoreLikeThis are a heuristic with 2 points 
behind them: the obvious one (stop words are not interesting for 
similarity), and the fact that they minimize the expensive 
IndexReader.docFreq() calls - so more stop words mean fewer docFreq() 
calls and a faster query generator (a sketch of such a merged table follows)
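
A minimal sketch of such a merged table - the extra words here are just 
made-up stand-ins for a larger list like SMART's:

    // Start from the analyzer's stop words, then add extras used only
    // by the query generator (keeps docFreq() calls down).
    Hashtable mltStop = StopFilter.makeStopTable( StopAnalyzer.ENGLISH_STOP_WORDS);
    String[] extra = { "however", "therefore", "thus" }; // hypothetical additions
    for ( int i = 0; i < extra.length; i++)
        mltStop.put( extra[ i], extra[ i]); // same key=value convention as makeStopTable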

As an aside, I sometimes use a list of ~500 English stop words from 
"SMART" (sorry, can't easily find the ref, though this might be close: 
http://citeseer.nj.nec.com/context/45797/0 ). I can contribute these if 
wanted.

>
> Regards,
>
> Bruce Ritchie
> http://www.jivesoftware.com/





Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by Bruce Ritchie <br...@jivesoftware.com>.
David Spencer wrote:

> [c] "interesting words" - uses code from MoreLikeThis to give a table of 
> all interesting
> words in the current "source" doc ordered by score.
> Remember score is idf*tf as per Dougs mail (and as per my
> hopefully correct understanding of these things). This page is of course 
> more of a debugging
> tool that something a normal user would see.  One possible area of 
> improvement that jumped out at me after reviewing this table is using 
> stemming, say, allowing more words in the generated query when 2 words 
> have the same stem.

Actually, the analyzer should do that, shouldn't it? For example, I have stemming analyzers for a 
variety of languages that both stem and remove stop words - it seems silly to me to duplicate that 
functionality when it's so easily provided by the analyzer. Given that, I would suggest removing the 
stop word functionality from this class as it is not needed and only confuses things.


Regards,

Bruce Ritchie
http://www.jivesoftware.com/

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by Bruce Ritchie <br...@jivesoftware.com>.
David Spencer wrote:

> I'd appreciate it if someone could proofread MoreLikeThis.like(Reader) 
> and mlt(Reader).
> 
> At a glance it seems to return reasonable results on my site.

One thing that I would find extremely useful is updating the code to handle multiple fields since 
many (most?) indexes do not use just 1 field. I'm in the process of doing just that as well as 
making some other changes to the code and will contribute it back if someone doesn't beat me to it 
first.
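
A hedged sketch of one possible design - run the existing mlt() once per 
field and OR the per-field terms together (the field names and the 
variables body, analyzer, ir, and stopTable are assumptions carried over 
from the earlier code):

    String[] fields = { "title", "contents" }; // assumed index fields
    BooleanQuery combined = new BooleanQuery();
    for ( int i = 0; i < fields.length; i++)
    {
        PriorityQueue pq = mlt( new StringReader( body), analyzer, ir,
                                fields[ i], 2, stopTable, 0, 0);
        Object cur;
        while ( (cur = pq.pop()) != null)
        {
            Object[] ar = (Object[]) cur;
            combined.add( new TermQuery( new Term( fields[ i], (String) ar[ 0])),
                          false, false); // optional clause
        }
    }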


Regards,

Bruce Ritchie
http://www.jivesoftware.com/

Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
>
>> Code rewritten, automagically chooses lots of defaults, lets you 
>> override
>> the defs thru the static vars at the bottom or the non-static vars 
>> also at the bottom.
>
>
> Has anyone used this?  Was it useful?

I've put it up on my "demo" site (rfc::search), where I have a humble 
index of approx. 3500 RFCs.

This is the site:

http://www.hostmon.com/rfc/index.jsp

A typical search takes you here:

http://www.hostmon.com/rfc/search.jsp?s=LDAP+Security&x=33&y=9



Then clicking on a match takes you to a page viewing an RFC, like this 
one, where things start to get interesting.

http://www.hostmon.com/rfc/get.jsp?id=1823&s=LDAP%20Security

There are 3 links of interest now at the top/middle of the page in the 
brownish background.

[a] "show similar" - forms a query from *all* words in the doc - no 
heuristics wrt idf(), etc.

[b] "more like this" - uses the MoreLikeThis code I wrote with the 
default settings.

[c] "interesting words" - uses code from MoreLikeThis to give a table of 
all interesting
words in the current "source" doc ordered by score.
Remember score is idf*tf as per Dougs mail (and as per my
hopefully correct understanding of these things). This page is of course 
more of a debugging
tool that something a normal user would see.  One possible area of 
improvement that jumped out at me after reviewing this table is using 
stemming, say, allowing more words in the generated query when 2 words 
have the same stem.

Note - [a] uses no code from [b] and [c]. It is just there for comparison.

> Should we add it to the sandbox?

I'd appreciate it if someone could proofread MoreLikeThis.like(Reader) 
and mlt(Reader).

At a glance it seems to return reasonable results on my site.

-- Dave

>
> Doug
>





Re: MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Code rewritten, automagically chooses lots of defaults, lets you override
> the defs thru the static vars at the bottom or the non-static vars also 
> at the bottom.

Has anyone used this?  Was it useful?  Should we add it to the sandbox?

Doug



MoreLikeThis Query generator - Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by David Spencer <da...@tropo.com>.
Otis Gospodnetic wrote:

>Lots of params in that mlt method, but it seems flexible.
>
Code rewritten - automagically chooses lots of defaults, lets you override
the defs thru the static vars at the bottom or the non-static vars also 
at the bottom.

Quickest way to use it, choosing all defaults, is:

MoreLikeThis mlt = new MoreLikeThis( r); // pass in an IndexReader
Query q = mlt.like( x); // x is a URL/InputStream/File "source doc"
// then pass q to an IndexSearcher
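
For completeness, a hedged sketch of that last step - the "title" field 
is an assumption about the index:

IndexSearcher searcher = new IndexSearcher( r);
Hits hits = searcher.search( q);
for ( int i = 0; i < Math.min( 25, hits.length()); i++) // up to 25 "similar" docs
    System.out.println( hits.score( i) + "\t" + hits.doc( i).get( "title"));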


To see what parameters it's using, display the return value of 
mlt.describeParams().

The main driver takes args of "-i INDEX" and then "-f FILE" or "-url URL", 
and shows the params and up to 25 "similar" docs. HTML docs won't work 
right, as I'm just blindly using StandardAnalyzer on the stream. Trying 
to keep everything simple so it fits in one file....


>I'll try it.
>
>Small optimization suggestion: use int[] with a single element for that
>words Map, instead of creating lots of Integer()s.  
>
Done sort of - put in my own mutable int to avoid object creation.
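
A minimal sketch of that mutable-int pattern (the class name Int is an 
assumption; the rewritten code may differ):

static class Int { int x = 1; } // mutable counter, avoids Integer churn

// in the counting loop:
Int cnt = (Int) words.get( word);
if ( cnt == null)
    words.put( word, new Int());
else
    cnt.x++;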

>Actually, maybe JVMs are smart and don't allocate new objects for the 
>same int wrapped in Integer.... I don't know for sure.

Dubious that they do..

>Otis


Re: code for "more like this" query "expansion" - was - Re: setMaxClauseCount ??

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Lots of params in that mlt method, but it seems flexible.
I'll try it.

Small optimization suggestion: use int[] with a single element for that
words Map, instead of creating lots of Integer()s.  Actually, maybe
JVMs are smart and don't allocate new objects for the same int wrapped
in Integer.... I don't know for sure.

Otis

