Posted to java-user@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2005/07/14 12:45:17 UTC
SIPs and CAPs
Has anyone developed code to extract SIPs (statistically improbable
phrases) and CAPs (capitalized phrases) from a Lucene index, such as
Amazon does with its books, as shown here?
<http://www.amazon.com/exec/obidos/tg/detail/-/0764526413/
ref=sip_top_dp/102-8573693-0514548?%5Fencoding=UTF8&v=glance>
I'm curious as it is something I'd like to do with some of my work.
Of course CAPs would be impossible to extract from an index that used
a lowercasing analyzer, so that is a special case that would require
work during indexing. But SIPs could be extracted from an existing
index.
Thanks,
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: SIPs and CAPs
Posted by mark harwood <ma...@yahoo.co.uk>.
> Do you just do this with terms or do you also
> extract phrases?
The scheme involves these phases:
1) Identify top terms (using algo described)
2) Identify all term "runs" in original text.
3) Identify sensible phrases from large list of term
runs
4) Provide shortlist of top scoring terms AND phrases
Step 1 is done as described in my earlier post.
Step 2 I currently do by re-running an Analyzer on the
original text. It is possible that this could be done
using the RAMDirectory used in Step 1 and SpanQueries
or some such but I have found it is important to
resort to the original text to get sensible
terms/phrases.
If your indexed content used stemming and stop word
removal and you *didn't* look at the original text you
would identify phrases like "united state america"
instead of "United States of America".
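That step might be sketched like this in plain Java (no Lucene dependency, so it reads in isolation; the method names, the single-stop-word bridging rule, and the two-term minimum run length are my own assumptions, not Mark's actual code):

```java
import java.util.*;

// Hypothetical sketch of step 2: scan the tokens of the *original* text and
// collect "runs" -- stretches of consecutive top-scoring terms, optionally
// bridged by a single stop word so that a phrase like "United States of
// America" survives intact. The bridging rule is an assumption on my part.
public class TermRuns {

    static List<String> findRuns(List<String> tokens,
                                 Set<String> topTerms, Set<String> stopWords) {
        List<String> runs = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            if (topTerms.contains(t.toLowerCase())) {
                run.add(t);
            } else if (!run.isEmpty() && stopWords.contains(t.toLowerCase())
                       && i + 1 < tokens.size()
                       && topTerms.contains(tokens.get(i + 1).toLowerCase())) {
                run.add(t); // bridge one stop word between two top terms
            } else {
                if (run.size() >= 2) runs.add(String.join(" ", run));
                run.clear();
            }
        }
        if (run.size() >= 2) runs.add(String.join(" ", run));
        return runs;
    }
}
```

Matching case-insensitively against the top terms while emitting the original-case tokens is what keeps "United States of America" readable rather than "united state america".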
Step 3 is needed to consolidate all of the learning
about term usage. For example, the code may choose to
collapse the run "United States Of America invades"
into the shorter "United States" run because the
longer run occurs much less often and all of the
shorter run's terms are contained in the longer one.
Step 4 ranks the phrases and terms to produce a
shortlist consisting of both. Some terms are always
used in phrases (so will not be selected as a single
term). Some terms *never* appear in a phrase so are
considered for shortlisting.
There are probably a number of ways in which these
different phases can be implemented but I've found
them all to be necessary if you want to present the
findings in a readable form to end-users.
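As a rough illustration of the step 3 collapsing rule described above (the 5x frequency threshold and all names here are invented for the sketch; Mark's actual heuristics may well differ):

```java
import java.util.*;

// Sketch of step 3: fold a rare longer run into a much more frequent
// shorter run whose terms it wholly contains, e.g. "united states of
// america invades" (3 occurrences) into "united states" (40 occurrences).
public class RunCollapser {

    // The "much less" test is modelled here as a 5x frequency ratio --
    // an assumed threshold, not a documented one.
    static Map<List<String>, Integer> collapse(Map<List<String>, Integer> runCounts) {
        Map<List<String>, Integer> result = new HashMap<>();
        for (Map.Entry<List<String>, Integer> run : runCounts.entrySet()) {
            List<String> target = run.getKey();
            for (Map.Entry<List<String>, Integer> other : runCounts.entrySet()) {
                boolean shorter = other.getKey().size() < target.size();
                boolean covered = new HashSet<>(run.getKey()).containsAll(other.getKey());
                boolean muchMoreCommon = other.getValue() >= 5 * run.getValue();
                if (shorter && covered && muchMoreCommon) {
                    target = other.getKey(); // collapse into the shorter run
                    break;
                }
            }
            result.merge(target, run.getValue(), Integer::sum);
        }
        return result;
    }
}
```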
Re: SIPs and CAPs
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jul 14, 2005, at 7:17 AM, mark harwood wrote:
> I've done this by comparing term frequency in a subset
> (in Amazon's case a single book) and looking for a
> significant "uplift" in term popularity vs that of the
> general corpus popularity. Practically speaking, in
> the Amazon case you can treat each page in the example
> book as a Lucene document, create a RAMDirectory and
> then use its TermEnum to get the docFreqs for all
> words and compare them with the corpus docFreqs.
>
> The "uplift" score for each term is
> (subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs)
>
> Take the top "n" terms scored by the above then
> analyze the text of the subset looking for runs of
> these terms.
>
> I have some code for this that I have wanted to
> package up as a contribution for some time.
Nice!
Do you just do this with terms or do you also extract phrases?
Phrases would be more intensive to deal with since positional
information is needed as well as some rules to decide on minimum/
maximum length of phrases and such. Perhaps the technique you
describe would be useful in locating spots to dig into for phrases?
As for CAPs, perhaps a specialized TokenFilter could be used to do
this during the indexing analysis step - I don't think it would be
difficult.
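A standalone sketch of that idea (plain Java rather than an actual TokenFilter subclass, so it can be read in isolation; the two-word minimum is an assumption of mine to filter out ordinary sentence-initial capitals):

```java
import java.util.*;

// Sketch of capitalized-phrase (CAP) extraction: scan tokens *before* any
// lowercasing filter runs and emit maximal runs of two or more capitalized
// words. A real Lucene TokenFilter would apply the same test per token
// inside its token-stream hook and would need to sit ahead of
// LowerCaseFilter in the analyzer chain.
public class CapPhrases {

    static boolean isCapitalized(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    // Collect runs of two or more consecutive capitalized tokens.
    static List<String> extract(List<String> tokens) {
        List<String> phrases = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (String t : tokens) {
            if (isCapitalized(t)) {
                run.add(t);
            } else {
                if (run.size() >= 2) phrases.add(String.join(" ", run));
                run.clear();
            }
        }
        if (run.size() >= 2) phrases.add(String.join(" ", run));
        return phrases;
    }
}
```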
Erik
Re: SIPs and CAPs
Posted by mark harwood <ma...@yahoo.co.uk>.
I've done this by comparing term frequency in a subset
(in Amazon's case a single book) and looking for a
significant "uplift" in term popularity vs that of the
general corpus popularity. Practically speaking, in
the Amazon case you can treat each page in the example
book as a Lucene document, create a RAMDirectory and
then use its TermEnum to get the docFreqs for all
words and compare them with the corpus docFreqs.
The "uplift" score for each term is
(subsetDocFreq/subsetNumDocs)-(corpusDocFreq/corpusNumDocs)
Take the top "n" terms scored by the above then
analyze the text of the subset looking for runs of
these terms.
I have some code for this that I have wanted to
package up as a contribution for some time.
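In plain Java, with the Lucene plumbing omitted (assume the per-term docFreqs have already been read out of the per-book RAMDirectory and the main index; the class and method names are mine), the scoring could look like:

```java
import java.util.*;

// Sketch of the "uplift" scoring described above, independent of the
// Lucene API: doc frequencies are assumed to have been pulled from a
// per-book RAMDirectory (subset) and the main index (corpus) beforehand.
public class UpliftScorer {

    // uplift = (subsetDocFreq/subsetNumDocs) - (corpusDocFreq/corpusNumDocs)
    static double uplift(int subsetDocFreq, int subsetNumDocs,
                         int corpusDocFreq, int corpusNumDocs) {
        return (double) subsetDocFreq / subsetNumDocs
             - (double) corpusDocFreq / corpusNumDocs;
    }

    // Return the top-n subset terms ranked by uplift, highest first.
    // Terms absent from the corpus get a corpus docFreq of zero.
    static List<String> topTerms(Map<String, Integer> subsetDf, int subsetNumDocs,
                                 Map<String, Integer> corpusDf, int corpusNumDocs,
                                 int n) {
        List<String> terms = new ArrayList<>(subsetDf.keySet());
        terms.sort((a, b) -> Double.compare(
            uplift(subsetDf.get(b), subsetNumDocs,
                   corpusDf.getOrDefault(b, 0), corpusNumDocs),
            uplift(subsetDf.get(a), subsetNumDocs,
                   corpusDf.getOrDefault(a, 0), corpusNumDocs)));
        return terms.subList(0, Math.min(n, terms.size()));
    }
}
```

Note how the normalization by document counts is what suppresses corpus-common words: a term in every page of the book but also in most of the corpus scores near zero, while a book-specific term keeps nearly its full subset ratio.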