Posted to java-user@lucene.apache.org by sai hariharan <sa...@gmail.com> on 2007/04/13 10:50:35 UTC

Help regarding an Algo.

Hi to all,

I have an algorithm, given below; can anybody help me implement it?
Any sort of suggestion will be appreciated. I've finished removing stop
words and calculating term frequencies with Lucene. The rest is not quite
clear to me. I'm working only on the English part.

3.2.3 Extracting Representative Words
After preprocessing the page, the system retains the most
characteristic parts of the contents. Association rules are
used to collect Chinese and English phrases as
representative words. The approach can be outlined as
follows:

   - Scan the documents to retrieve each Chinese and English word, so that
     each individual Chinese word and English word can be distinguished.


   - Discard all stopwords, both Chinese and English. Stopwords are
     high-frequency words that are found in all texts and that do not
     contribute to the distinctive meaning of the text, for example
     prepositions and articles such as "in", "on" and "the".


   - Calculate the frequency of occurrence for each remaining word.



   - If the frequency of a word is less than two, it is discarded. Next,
     the association rule is used to compute the confidence value:

     Confidence(A ⇒ B) = P(A∩B) / P(A)

     If P(A) is 1 and the co-occurrence, P(A∩B), is also 1, then the
     resulting confidence value must be 1, and is obviously greater than
     the threshold. Clearly, P(A) cannot be 0, as this would result in a
     division by zero [13]. For example, if "operating" occurs 4 times and
     is followed by "system" in 3 of those occurrences, then
     Confidence(operating ⇒ system) = 3/4 = 0.75, which passes a 0.7
     threshold (see the sketch after this list).


   - Assign the frequencies of the remaining words to each word in the
     document. All words in the document that have already been eliminated
     are assigned a frequency of 0.


   - After all the inverse term frequencies of the document are identified
     and terms with a frequency less than 2 are discarded, the document is
     scanned again to calculate the confidence value of the co-occurrence.
     A word is not incorporated into the current document's phrase array
     if: 1) the confidence value is below the threshold, 2) the same word
     has already been assigned to the document's phrase array, or 3) the
     subclass word is already assigned to the current document's phrase
     array. For example, assume the current document's phrase array
     includes the phrase "operating system". If the system subsequently
     processes the related word "operating", then the latter is not added
     to the current document's phrase array.
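
As a sketch of the confidence computation above (plain Java; the class and
method names are mine, not the paper's, and the token list is assumed to be
already extracted and stopword-filtered):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Computes Confidence(A => B) = count(A followed by B) / count(A)
// over a tokenized, stopword-filtered document.
public class PairConfidence {

    public static Map<String, Double> confidences(List<String> tokens) {
        Map<String, Integer> wordCount = new HashMap<String, Integer>();
        Map<String, Integer> pairCount = new HashMap<String, Integer>();

        for (int i = 0; i < tokens.size(); i++) {
            String a = tokens.get(i);
            increment(wordCount, a);
            if (i + 1 < tokens.size()) {
                increment(pairCount, a + " " + tokens.get(i + 1));
            }
        }

        Map<String, Double> conf = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
            String first = e.getKey().split(" ")[0];
            int countA = wordCount.get(first);
            if (countA >= 2) {  // words with frequency < 2 are discarded
                conf.put(e.getKey(), e.getValue() / (double) countA);
            }
        }
        return conf;
    }

    private static void increment(Map<String, Integer> m, String key) {
        Integer c = m.get(key);
        m.put(key, c == null ? 1 : c + 1);
    }
}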

The steps needed to compute the confidence value of each
word are as follows.

(1) Fetch the next two words, skipping any whose term
frequency is equal to 0.

(2) Calculate the confidence value of this term and the
next word. This step is discussed further in a later
section.

(3) When the confidence value is less than the threshold
(0.7), the system checks whether the number of words
currently being computed is two. If it is, the pair is
skipped and ignored; otherwise, the phrase, excluding the
current word, is saved into the phrase array and scanning
continues with the next word. A check is performed for
each word to determine whether it already exists in the
phrase array.

(4) If the confidence value of the phrase is larger than the
threshold, the next word is fetched. If its term
frequency is 0, then the current phrase and its frequency
are saved; otherwise, steps 2 to 4 are repeated until
the end of the document is reached.
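
And a rough sketch of steps (1)-(4) as a single scan over the document
(again, the names are made up; it assumes the term frequencies and pair
confidences are already computed, e.g. as in the sketch above):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Builds the per-document phrase array by extending a phrase while the
// pair confidence stays at or above the 0.7 threshold.
public class PhraseScanner {

    private static final double THRESHOLD = 0.7;

    public static List<String> scan(List<String> tokens,
                                    Map<String, Integer> termFreq,
                                    Map<String, Double> confidence) {
        List<String> phraseArray = new ArrayList<String>();
        List<String> current = new ArrayList<String>();

        for (String word : tokens) {
            Integer tf = termFreq.get(word);
            if (tf == null || tf == 0) {        // eliminated word: phrase boundary
                flush(current, phraseArray);
                continue;
            }
            if (current.isEmpty()) {
                current.add(word);
                continue;
            }
            String last = current.get(current.size() - 1);
            Double conf = confidence.get(last + " " + word);
            if (conf != null && conf >= THRESHOLD) {
                current.add(word);               // extend the phrase (step 4)
            } else {
                flush(current, phraseArray);     // save phrase without current word (step 3)
                current.add(word);
            }
        }
        flush(current, phraseArray);
        return phraseArray;
    }

    // Save multi-word runs only (a lone word is skipped, per step 3) and
    // check that the phrase is not already in the phrase array.
    private static void flush(List<String> current, List<String> phrases) {
        if (current.size() >= 2) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < current.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(current.get(i));
            }
            String phrase = sb.toString();
            if (!phrases.contains(phrase)) {
                phrases.add(phrase);
            }
        }
        current.clear();
    }
}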

Thanx in Advance,
-- 
சாய் Hari

Re: Help regarding an Algo.

Posted by Doron Cohen <DO...@il.ibm.com>.
"sai hariharan" <sa...@gmail.com> wrote on 13/04/2007 01:50:35:

> Hi to all,
>
> I have an algorithm, given below; can anybody help me implement it?
> Any sort of suggestion will be appreciated. I've finished removing stop
> words and calculating term frequencies with Lucene. The rest is not quite
> clear to me. I'm working only on the English part.
> ....
> ....
>    - If the frequency of a word is less than two, it is discarded. Next,
>    the association rule is used to compute the confidence value:
>
>    Confidence(A ⇒ B) = P(A∩B) / P(A)
>
> If P(A) is 1 and the co-occurrence, P(A∩B), is also 1, then the resulting
> confidence value must be 1, and is obviously greater than the threshold.
> Clearly, P(A) cannot be 0, as this would result in a division by zero [13].

This seems to be part of "Using Association Rules for Expanding Search
Engine Recommendation Keywords in English and Chinese Queries (2005)" -
Y.P. Huang, C.-A. Tsai (Taiwan), and F.E. Sandnes (Norway) -
http://www.actapress.com/PaperInfo.aspx?PaperID=22954 - and it would be
hard to help without reading the paper (which is not freely available...).

Anyhow, from the abstract it seems this step attempts to infer whether two
consecutive words A B in a document should be treated as a phrase, based on
the probability of B appearing after A in (I think) the entire collection,
or more likely in some training collection.

So for P(A) one could use the total occurrence count of A = sum of
termFreq(A) over all documents containing A (again, in the training
collection). One can divide by a collection size factor (total word count)
to make this a [0..1] probability. And for P(A∩B) one could similarly use
the total number of occurrences of B right after A in the entire training
collection. It seems both values should be computed in advance, in a
preprocessing step, for all the candidate words - actually for all pairs of
non-stopwords A B (each appearing at least twice) in the entire training
collection.
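
For example, something along these lines (untested; Lucene 2.x API, and the
field name is just a placeholder) could gather the raw collection count for
P(A):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Total occurrence count of word A over the whole training collection,
// i.e. the sum of termFreq(A) over all documents containing A.
public class CollectionCounts {

    public static long totalTermFreq(IndexReader reader, String field, String word)
            throws IOException {
        long total = 0;
        TermDocs termDocs = reader.termDocs(new Term(field, word));
        try {
            while (termDocs.next()) {
                total += termDocs.freq();  // occurrences of A in this document
            }
        } finally {
            termDocs.close();
        }
        return total;
    }

    // e.g. totalTermFreq(reader, "contents", "operating") divided by the
    // total word count would give the [0..1] probability mentioned above.
}

For P(A∩B) there is no direct counterpart in the API; one could walk
termPositions() for A and B in each document and count adjacent positions,
or simply re-tokenize the stored text and count bigrams in a preprocessing
pass.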

I *guess* the probabilities are later used to expand user queries, either
automatically for improved recall or just to suggest an expanded query via
the UI, but this is mostly *guessing* as I don't know that paper... Wouldn't
it be best to contact the authors?