Posted to dev@lucene.apache.org by jian chen <ch...@gmail.com> on 2007/03/23 18:37:07 UTC

(LUCENE-835) An IndexReader with run-time support for synonyms

Hi, Mark,

Thanks for providing this original approach for synonyms. I read through
your code and think maybe this could be extended to handle the word stemming
problem as well.

Here is my thought.

1) Before indexing, create a Map<String, ArrayList<String>> stemmedWordMap,
where the key is the stemmed word.

2) At indexing time, we still index each word as it is, but we also stem the
word (using PorterStemmer) and insert into or update the stemmedWordMap to
add the mapping: stemmedWord => word.

For example, "lighting" and "lighted" will both be stored in the ArrayList
under the key "light".

3) At query time, when someone searches on "lighting", we stem the word to
"light", then look up the other forms of this word in the stemmedWordMap. In
this case, we find "lighted". Then we perform the search using the synonym
search.

This way, we can combine both the synonyms and the stemmed words together.
The nice part of this is that we only need to store the index with the
original words, which saves disk space as well as indexing time.
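
To make the idea concrete, here is a minimal sketch of the map and the
query-time lookup. The class name StemmedWordMap and its methods are my own
invention for illustration, and stem() is only a toy suffix stripper standing
in for PorterStemmer so that the example has no Lucene dependency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: maps a stem to the original word forms seen at indexing time. */
class StemmedWordMap {
    private final Map<String, List<String>> stemToWords = new HashMap<>();

    /** Toy stand-in for PorterStemmer (assumption): strips a few English suffixes. */
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    /** Called per token at indexing time; the token itself is indexed unchanged. */
    void add(String word) {
        String w = word.toLowerCase();
        List<String> forms = stemToWords.computeIfAbsent(stem(w), k -> new ArrayList<>());
        if (!forms.contains(w)) {
            forms.add(w);
        }
    }

    /** Called at query time: the other indexed forms that share the query word's stem. */
    List<String> variants(String queryWord) {
        List<String> result = new ArrayList<>(
                stemToWords.getOrDefault(stem(queryWord), new ArrayList<>()));
        result.remove(queryWord.toLowerCase());
        return result;
    }
}
```

After add("lighting") and add("lighted"), variants("lighting") returns a list
containing only "lighted", which would then be handed to the synonym search.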

However, I do have the following concerns:

1) As documents could be removed from the index, the stemmedWordMap needs to
be kept up to date somehow. Perhaps this could be done by periodically
rebuilding the stemmedWordMap?
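
A periodic rebuild could be as simple as the sketch below. It assumes the
terms still present in the index can be enumerated somehow (here they are just
passed in as an Iterable), and that the stemmer is supplied as a function. The
name StemMapRebuilder is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch only: rebuilds the stem -> words map from the terms currently in the index. */
class StemMapRebuilder {
    static Map<String, List<String>> rebuild(Iterable<String> liveTerms,
                                             Function<String, String> stemmer) {
        // Start from an empty map so forms that no longer occur are dropped.
        Map<String, List<String>> fresh = new HashMap<>();
        for (String term : liveTerms) {
            fresh.computeIfAbsent(stemmer.apply(term), k -> new ArrayList<>()).add(term);
        }
        return fresh;
    }
}
```

The caller would build the fresh map in the background and then swap it in for
the old one, so that searches never see a half-built map.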

2) Typically, people would like to see their exact match first. So, the
synonym search could be enhanced to take advantage of position-level
boosting (payload for position), so that the search results for "lighting"
rank the documents with "lighting" higher than documents with "lighted".
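
As a simpler alternative to payloads, the same preference could be
approximated at query time by boosting the exact term. This is a different
technique from position-level boosting, and the sketch below only builds a
query string in the Lucene query parser's term^boost syntax. The class name
and the boost value are assumptions for illustration.

```java
import java.util.List;
import java.util.Locale;

/** Sketch only: expands a query word into "exact^boost variant1 variant2 ...". */
class BoostedExpansion {
    static String expand(String exact, List<String> variants, float exactBoost) {
        StringBuilder sb = new StringBuilder();
        // The word the user typed gets the boost; the variants keep the default weight.
        sb.append(exact).append('^').append(String.format(Locale.ROOT, "%.1f", exactBoost));
        for (String variant : variants) {
            sb.append(' ').append(variant);
        }
        return sb.toString();
    }
}
```

For "lighting" with the single variant "lighted" and a boost of 2.0, this
produces the query string: lighting^2.0 lighted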

3) I am still not sure if this is the best approach in general. Does it make
sense to keep two indexes, one with the original words indexed and the other
with all words stemmed? Then searches would be run against both indexes.

4) How does Google perform this type of search? I guess the web search
engines take a different approach. There may be no need to use a stemmer at
all.

First, the collection of web documents is huge, so searching for "lighting"
will bring up enough results. Who cares about also bringing back results with
"lighted"?

Second, the anchor texts that point to a web page of interest would contain
all the variants (synonyms and stemmed words), so they don't need to worry
about search results being incomplete. For example, search for "rectangular"
in Google, http://www.google.com/search?hl=en&q=rectangular&btnG=Search, and
the Wikipedia page comes up first. The page itself only contains "Rectangle".
However, if you click on the Cached link, you will see that "rectangular" is
contained in the anchor text that points to this page.

My ultimate question: if I want to build a search engine, as a general rule,
what's the best way to handle this?

Mark, could you shed some light?

Thanks,

Jian

On 3/18/07, Mark Harwood (JIRA) <ji...@apache.org> wrote:
>
>
>      [
> https://issues.apache.org/jira/browse/LUCENE-835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Mark Harwood updated LUCENE-835:
> --------------------------------
>
>     Attachment: TestSynonymIndexReader.java
>
> > An IndexReader with run-time support for synonyms
> > -------------------------------------------------
> >
> >                 Key: LUCENE-835
> >                 URL: https://issues.apache.org/jira/browse/LUCENE-835
> >             Project: Lucene - Java
> >          Issue Type: New Feature
> >          Components: Index
> >    Affects Versions: 2.1
> >            Reporter: Mark Harwood
> >         Assigned To: Mark Harwood
> >         Attachments: Synonym.java, SynonymIndexReader.java,
> SynonymSet.java, TestSynonymIndexReader.java
> >
> >
> > These classes provide support for enabling the use of synonyms for terms
> in an existing index.
> > While Analyzers can be used at Query-parse time or Index-time to inject
> synonyms these are not always satisfactory means of providing support for
> synonyms:
> > * Index-time injection of synonyms is less flexible because changing the
> lists of synonyms requires an index rebuild.
> > * Query-parse-time injection is awkward because special support is
> required in the parser/query logic  to recognise and cater for the tokens
> that appear in the same position. Additionally, any statistical analysis of
> the index content via TermEnum/TermDocs etc does not consider the synonyms
> unless specific code is added.
> > What is perhaps more useful is a transparent wrapper for the IndexReader
> that provides a synonym-ized view of the index without requiring specialised
> support in the calling code. All of the TermEnum/TermDocs interfaces remain
> the same but behind the scenes synonyms are being considered/applied
> silently.
> > The classes supplied here provide this "virtual" view of the index and
> all queries or other code that examines this index using the special reader
> benefit from this view without requiring specialized code. A Junit test
> illustrates this code in action.
>