Posted to java-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2005/07/14 12:45:17 UTC

SIPs and CAPs

Has anyone developed code to extract SIPs (statistically improbable  
phrases) and CAPs (capitalized phrases) from a Lucene index, such as  
Amazon does with its books, as shown here?

     <http://www.amazon.com/exec/obidos/tg/detail/-/0764526413/ref=sip_top_dp/102-8573693-0514548?%5Fencoding=UTF8&v=glance>

I'm curious as it is something I'd like to do with some of my work.   
Of course CAPs would be impossible to extract from an index that used  
a lowercasing analyzer, so that is a special case that would require  
work during indexing.  But SIPs could be extracted from an existing  
index.

Thanks,
     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: SIPs and CAPs

Posted by mark harwood <ma...@yahoo.co.uk>.
> Do you just do this with terms or do you also
> extract phrases?   

The scheme involves these phases:
1) Identify top terms (using algo described)
2) Identify all term "runs" in original text.
3) Identify sensible phrases from large list of term
runs
4) Provide shortlist of top scoring terms AND phrases

Step 1 is done as described in my earlier post.
Step 2 I currently do by re-running an Analyzer on the
original text. It might be possible to do this using
the RAMDirectory from Step 1 and SpanQueries or some
such, but I have found it is important to go back to
the original text to get sensible terms/phrases.
If your indexed content used stemming and stop word
removal and you *didn't* look at the original text you
would identify phrases like "united state america"
instead of "United States of America".
Step 3 is needed to consolidate all of the learning
about term usage. For example, the code may choose to
collapse the run "United States Of America invades"
into the shorter "United States" run because the
longer run occurs much less often and all of the
shorter run's terms are contained in it.
Step 4 ranks the phrases and terms to produce a
shortlist consisting of both. Some terms are always
used in phrases (so will not be selected as a single
term). Some terms *never* appear in a phrase so are
considered for shortlisting.

There are probably a number of ways in which these
different phases could be implemented, but I've found
them all to be necessary if you want to present the
findings in a readable form to end-users.
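Phases 2-3 above can be sketched roughly as follows. This is just an illustrative standalone version with invented names (TermRuns, findRuns, a tiny stop list): it scans the original, unstemmed text for maximal runs of consecutive top-scoring terms, letting stop words bridge two top terms so that a phrase like "United States of America" survives intact. A real implementation would reuse the index's Analyzer for tokenizing rather than the naive whitespace split used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TermRuns {
    // Tiny illustrative stop list; a real one would match the Analyzer's.
    private static final Set<String> STOP = Set.of("a", "an", "and", "of", "the");

    public static List<String> findRuns(String text, Set<String> topTerms) {
        List<String> runs = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int lastTop = -1; // index within current of the last top term seen
        for (String raw : text.split("\\s+")) {
            String word = raw.replaceAll("\\W", "");
            String norm = word.toLowerCase(Locale.ROOT);
            if (topTerms.contains(norm)) {
                current.add(word);
                lastTop = current.size() - 1;
            } else if (!current.isEmpty() && STOP.contains(norm)) {
                current.add(word); // a stop word may bridge two top terms
            } else {
                flush(runs, current, lastTop);
                lastTop = -1;
            }
        }
        flush(runs, current, lastTop);
        return runs;
    }

    // Emit the run up to and including the last top term (dropping any
    // trailing stop words), then reset the buffer.
    private static void flush(List<String> runs, List<String> current, int lastTop) {
        if (lastTop >= 0) {
            runs.add(String.join(" ", current.subList(0, lastTop + 1)));
        }
        current.clear();
    }
}
```

Because the matching is case-insensitive against the (lowercased) top-term list but the emitted run keeps the original tokens, the output preserves the readable surface form rather than the indexed form.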







Re: SIPs and CAPs

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 14, 2005, at 7:17 AM, mark harwood wrote:
> I've done this by comparing term frequency in a subset
> (in Amazon's case a single book) and looking for a
> significant "uplift" in term popularity vs that of the
> general corpus popularity. Practically speaking, in
> the Amazon case you can treat each page in the example
> book as a Lucene document, create a RAMDirectory and
> then use its TermEnum to get the docFreqs for all
> words and compare them with the corpus docFreqs.
>
> The "uplift" score for each term is
> (subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs)
>
> Take the top "n" terms scored by the above then
> analyze the text of the subset looking for runs of
> these terms.
>
> I have some code for this that I have wanted to
> package up as a contribution for some time.

Nice!

Do you just do this with terms or do you also extract phrases?   
Phrases would be more intensive to deal with since positional  
information is needed as well as some rules to decide on minimum/ 
maximum length of phrases and such.  Perhaps the technique you  
describe would be useful in locating spots to dig into for phrases?

As for CAPs, perhaps a specialized TokenFilter could be used to do  
this during the indexing analysis step - I don't think it would be  
difficult.
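The detection itself is simple either way. Here's a hypothetical standalone sketch (class and method names invented) that spots runs of capitalized tokens in raw text; a production version would instead be a Lucene TokenFilter placed in the analysis chain before any lowercasing filter, emitting the joined phrase as an extra token.

```java
import java.util.ArrayList;
import java.util.List;

public class Caps {
    public static List<String> extract(String text) {
        List<String> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (token.matches("[A-Z][a-zA-Z]*[.,;:]?")) {
                current.add(token.replaceAll("[.,;:]$", ""));
            } else {
                // Require 2+ words to skip ordinary sentence-initial capitals.
                if (current.size() >= 2) {
                    phrases.add(String.join(" ", current));
                }
                current.clear();
            }
        }
        if (current.size() >= 2) {
            phrases.add(String.join(" ", current));
        }
        return phrases;
    }
}
```

The two-word minimum is an arbitrary choice here; single capitalized words are mostly sentence starts, so a real filter would likely also track position offsets to tell the two apart.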

     Erik




Re: SIPs and CAPs

Posted by mark harwood <ma...@yahoo.co.uk>.
I've done this by comparing term frequency in a subset
(in Amazon's case a single book) and looking for a
significant "uplift" in term popularity vs that of the
general corpus popularity. Practically speaking, in
the Amazon case you can treat each page in the example
book as a Lucene document, create a RAMDirectory and
then use its TermEnum to get the docFreqs for all
words and compare them with the corpus docFreqs.

The "uplift" score for each term is
(subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs)

Take the top "n" terms scored by the above then
analyze the text of the subset looking for runs of
these terms.

I have some code for this that I have wanted to
package up as a contribution for some time.
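The uplift score above can be sketched minimally like this (class and method names are invented). In the real scheme the two docFreq maps would come from the subset's RAMDirectory TermEnum and the main index respectively; here they are plain maps so the arithmetic stands alone.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Uplift {
    public static List<String> topTerms(Map<String, Integer> subsetDocFreq, int subsetNumDocs,
                                        Map<String, Integer> corpusDocFreq, int corpusNumDocs,
                                        int n) {
        final Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : subsetDocFreq.entrySet()) {
            double subsetRate = e.getValue() / (double) subsetNumDocs;
            double corpusRate = corpusDocFreq.getOrDefault(e.getKey(), 0) / (double) corpusNumDocs;
            // uplift = (subsetDocFreq/subsetNumDocs) - (corpusDocFreq/corpusNumDocs)
            score.put(e.getKey(), subsetRate - corpusRate);
        }
        List<String> terms = new ArrayList<>(score.keySet());
        terms.sort((a, b) -> Double.compare(score.get(b), score.get(a)));
        return terms.subList(0, Math.min(n, terms.size()));
    }
}
```

For example, a term appearing on 8 of 10 pages of one book but in only 50 of 10,000 corpus documents scores 0.8 - 0.005 = 0.795, while a common word like "the" scores near zero because its subset and corpus rates roughly cancel.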

