Posted to java-user@lucene.apache.org by "Asbjørn A. Fellinghaug" <as...@fellinghaug.com> on 2008/07/03 15:52:21 UTC

Enhancing phrase searching in Lucene

Hi.

I've just finished my master's thesis on how to improve phrase
searching in search engines. The thesis experiments with a new approach
that focuses on pairs of words (bigrams). It can be freely downloaded
here [1].

What I've specifically experimented with is bigrams based on stopwords
and their characteristics. For the experiment I created an Analyzer
that produces bigram Tokens, each composed of a pair of words. We start
with a predefined list of stopwords and then examine each token in the
Analyzer. Whenever a stopword token is identified, we create two new
bigram tokens:
    1) previous token + stopword token
    2) stopword token + next token

The identified stopword token itself is discarded, since it would
otherwise produce a huge posting list in the inverted index.
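
To make the two rules above concrete, here is a minimal sketch in plain
Java. It is not the thesis code and does not use the Lucene Analyzer
API. The class name, the method name and the "_" separator are made up
for illustration, and the handling of stopwords at the start or end of
the text (the missing bigram is simply skipped) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopwordBigramSketch {

    // Hypothetical separator; the post does not say how the two words are joined.
    private static final String SEPARATOR = "_";

    // Returns the token stream with every stopword replaced by up to two bigrams.
    public static List<String> bigramTokens(List<String> tokens, Set<String> stopwords) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (!stopwords.contains(token)) {
                out.add(token); // ordinary tokens pass through unchanged
                continue;
            }
            // Rule 1: previous token + stopword token (skipped at the start, an assumption).
            if (i > 0) {
                out.add(tokens.get(i - 1) + SEPARATOR + token);
            }
            // Rule 2: stopword token + next token (skipped at the end, an assumption).
            if (i + 1 < tokens.size()) {
                out.add(token + SEPARATOR + tokens.get(i + 1));
            }
            // The stopword itself is never added, so it gets no posting list of its own.
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "a"));
        List<String> tokens = Arrays.asList("jump", "over", "the", "dog");
        // prints [jump, over, over_the, the_dog, dog]
        System.out.println(bigramTokens(tokens, stopwords));
    }
}
```

Note that in this sketch two consecutive stopwords produce the same
bigram twice; the description above does not say how that case should
be handled.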

The main goal is to drastically reduce the lengths of the posting
lists, and thereby save I/O and processing in Apache Lucene. In the
experiments I performed, this new approach to phrase searching in
Lucene showed some performance gains.
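
As a toy illustration of why the posting lists get shorter, the sketch
below builds a tiny in-memory inverted index twice: once over plain
tokens and once over bigram tokens. The three documents, the class name
and the "_" separator are invented for this example. It is not Lucene's
index format and says nothing about the actual measurements in the
thesis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PostingListSketch {

    // Builds a toy inverted index: term -> ids of the documents containing it.
    public static Map<String, List<Integer>> index(List<List<String>> docs) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            // De-duplicate terms so each document appears once per posting list.
            for (String term : new LinkedHashSet<>(docs.get(docId))) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // Made-up documents, tokenized as plain words.
        List<List<String>> plain = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog", "ran"),
            Arrays.asList("the", "cat", "ran"));
        // The same documents with the stopword "the" replaced by bigrams.
        List<List<String>> bigrams = Arrays.asList(
            Arrays.asList("the_cat", "cat", "sat"),
            Arrays.asList("the_dog", "dog", "ran"),
            Arrays.asList("the_cat", "cat", "ran"));

        // The stopword alone matches every document ...
        System.out.println("the     -> " + index(plain).get("the"));
        // ... while the bigram matches only the documents with that word pair.
        System.out.println("the_cat -> " + index(bigrams).get("the_cat"));
    }
}
```

In this example the list for "the" holds all three documents, while
the list for "the_cat" holds only two, so a phrase query for "the cat"
has less to read.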

The code written for the experiment will be made available shortly. I
just need to write some Javadoc and tidy it up a bit. There is nothing
revolutionary in the code; I've noticed on this mailing list that
others have also looked into this subject.

I hope someone finds some of the aspects discussed in my master's
thesis useful. I have also, to some extent, tried to describe Apache
Lucene and how it works.

[1] http://asbjorn.fellinghaug.com/filer/master/Master_thesis.pdf

-- 
Asbjørn A. Fellinghaug
asbjorn@fellinghaug.com
