You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Aaron Galea <ag...@nextgen.net.mt> on 2002/11/10 13:19:23 UTC

Indexing synonyms

Hi everyone,

I need to create a filter that extends a tokenfilter whose purpose is to generate some synonyms for words in the document using Wordnet. Well searching for synonyms using wordnet is not that problematic but I need to add the synonym words to Lucene tokenstream before they are passed for indexing. However TokenStream class does not support any add method. Did anyone ever needed to do this? Can someone suggest an alternative of how to add some synonym words to the index?

Thanks
Aaron

Re: Indexing synonyms

Posted by Joshua O'Madadhain <jm...@ics.uci.edu>.

On Sun, 10 Nov 2002, Aaron Galea wrote:

> I need to create a filter that extends a tokenfilter whose purpose is
> to generate some synonyms for words in the document using Wordnet.
> Well searching for synonyms using wordnet is not that problematic but
> I need to add the synonym words to Lucene tokenstream before they are
> passed for indexing. However TokenStream class does not support any
> add method. Did anyone ever needed to do this? Can someone suggest an
> alternative of how to add some synonym words to the index?

I have done some related work with Lucene involving query expansion
(although not with Wordnet), but it's not clear to me what you're trying
to accomplish with synonyms.  It sounds to me as though you want to store
all the synonyms of a word with the word in the index, in such a way that 
if a query involves any of them, they all respond.

If you want to set it up so that all documents that contain synonyms of at
least 1 query term to have positive scores, then it may be easier to do
term expansion via post-processing of the query: at query time, get
whatever related words you want (e.g., all synonyms of each term) from
Wordnet, and add them to the query.  I would guess that this would be fast
enough for your purposes, is more flexible (in case you want to expand or
contract your notion of a synonym), and requires no additional index
space.

If you have something else in mind, what is it?

Regards,

Joshua O'Madadhain

(Incidentally, I'd be interested to hear--off-list--of your experiences
with using Wordnet.)

 jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
  Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
 It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Indexing synonyms

Posted by Aaron Galea <ag...@nextgen.net.mt>.

Thanks for all your replies,

Well I will start of with an idea of what I am trying to achieve. I am
building a question answer system and one of its modules is an FAQ Module.
Since the QA system is concerned with education, users can concentrate their
question on a particular subject reducing the set of question/answer pair to
consider. Since there is this hierarchical indexing the index files are not
that big so I can store synonyms for each word in a question in the index.
Query expansion will solve the problem and eliminating the need to store
synonyms in the index but this will slow things as there is no depth limit
to consider for term expansion. It is not my intension to build something
similar to the FAQFinder system but I want to further reduce the subset of
questions to consider on which a question reformulation algorithm would be
applied. Therefore the idea is get an faq file dealing with one education
subject, index all of its questions and expand each term in the question.
Using lucene I will retrieve the questions that are likely to be similar to
a user question, select say the top 5 and apply a query reformulation
algorithm. If this succeeds fine and I return the answer to user, otherwise
submit the question to an answer extraction module. The most important thing
is speed so putting term expansion in the index hopefully should improve
things. Obviously problems arise with this method as there is no word sense
disambiguation but the query reformulation algorithm will solve this.
However it is slow so I must reduce the number of questions it is applied
on. It is a tradeoff!!!

Well I managed to solve this by overriding the next() method and when it
gets to an EOS I start returning the new expanded terms that I accumulated
in a list.

Thanks everyone for your reply!!!!

Aaron

NB : And yep I am a Malteser Otis ! :)

----- Original Message -----
From: "Alex Murzaku" <li...@lissus.com>
To: "'Lucene Users List'" <lu...@jakarta.apache.org>
Sent: Monday, November 11, 2002 12:17 AM
Subject: RE: Indexing synonyms

> You could also do something with org.apache.lucene.analyzer.Token which
> includes the following self-explanatory note:
>
>   /** Set the position increment.  This determines the position of this
> token
>    * relative to the previous Token in a {@link TokenStream}, used in
> phrase
>    * searching.
>    *
>    * <p>The default value is one.
>    *
>    * <p>Some common uses for this are:<ul>
>    *
>    * <li>Set it to zero to put multiple terms in the same position.
> This is
>    * useful if, e.g., a word has multiple stems.  Searches for phrases
>    * including either stem will match.  In this case, all but the first
> stem's
>    * increment should be set to zero: the increment of the first
> instance
>    * should be one.  Repeating a token with an increment of zero can
> also be
>    * used to boost the scores of matches on that token.
>    *
>    * <li>Set it to values greater than one to inhibit exact phrase
> matches.
>    * If, for example, one does not want phrases to match across removed
> stop
>    * words, then one could build a stop word filter that removes stop
> words and
>    * also sets the increment to the number of stop words removed before
> each
>    * non-stop word.  Then exact phrase queries will only match when the
> terms
>    * occur with no intervening stop words.
>    *
>    * </ul>
>    * @see TermPositions
>    */
>   public void setPositionIncrement(int positionIncrement) {
>     if (positionIncrement < 0)
>       throw new IllegalArgumentException
>         ("Increment must be positive: " + positionIncrement);
>     this.positionIncrement = positionIncrement;
>   }
>
>
> --
> Alex Murzaku
> ___________________________________________
>  alex(at)lissus.com  http://www.lissus.com
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
> Sent: Sunday, November 10, 2002 1:30 PM
> To: Lucene Users List
> Subject: Re: Indexing synonyms
>
>
> .mt?  Malta?  That's rare! :)
>
> A person called Clemens Marschner just submitted diffs for query
> rewriting to lucene-dev list 1-2 weeks ago.  The diffs are not in CVS
> yet, and they are a bit old now becase the code they were made against
> has changed since they were made. You could either try applying them
> yourself, of waiting until they get applied and then you could get a
> nightly build.
>
> Otis
>
> --- Aaron Galea <ag...@nextgen.net.mt> wrote:
> > Hi everyone,
> >
> > I need to create a filter that extends a tokenfilter whose purpose is
> > to generate some synonyms for words in the document using Wordnet.
> > Well searching for synonyms using wordnet is not that problematic but
> > I need to add the synonym words to Lucene tokenstream before they are
> > passed for indexing. However TokenStream class does not support any
> > add method. Did anyone ever needed to do this? Can someone suggest an
> > alternative of how to add some synonym words to the index?
> >
> > Thanks
> > Aaron
> >
>
>
> __________________________________________________
> Do you Yahoo!?
> U2 on LAUNCH - Exclusive greatest hits videos http://launch.yahoo.com/u2
>
> --
> To unsubscribe, e-mail:
> <ma...@jakarta.apache.org>
> For additional commands, e-mail:
> <ma...@jakarta.apache.org>
>
>
> --
> To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
> For additional commands, e-mail:
<ma...@jakarta.apache.org>
>
> ---
> [This E-mail was scanned for spam and viruses by NextGen.net.]
>
>
>

---
[This E-mail was scanned for spam and viruses by NextGen.net.]

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

RE: Indexing synonyms

Posted by Alex Murzaku <li...@lissus.com>.

You could also do something with org.apache.lucene.analyzer.Token which
includes the following self-explanatory note:

  /** Set the position increment.  This determines the position of this
token
   * relative to the previous Token in a {@link TokenStream}, used in
phrase
   * searching.
   *
   * <p>The default value is one.
   *
   * <p>Some common uses for this are:<ul>
   *
   * <li>Set it to zero to put multiple terms in the same position.
This is
   * useful if, e.g., a word has multiple stems.  Searches for phrases
   * including either stem will match.  In this case, all but the first
stem's
   * increment should be set to zero: the increment of the first
instance
   * should be one.  Repeating a token with an increment of zero can
also be
   * used to boost the scores of matches on that token.
   *
   * <li>Set it to values greater than one to inhibit exact phrase
matches.
   * If, for example, one does not want phrases to match across removed
stop
   * words, then one could build a stop word filter that removes stop
words and
   * also sets the increment to the number of stop words removed before
each
   * non-stop word.  Then exact phrase queries will only match when the
terms
   * occur with no intervening stop words.
   *
   * </ul>
   * @see TermPositions
   */
  public void setPositionIncrement(int positionIncrement) {
    if (positionIncrement < 0)
      throw new IllegalArgumentException
        ("Increment must be positive: " + positionIncrement);
    this.positionIncrement = positionIncrement;
  }


-- 
Alex Murzaku
___________________________________________
 alex(at)lissus.com  http://www.lissus.com            

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: Sunday, November 10, 2002 1:30 PM
To: Lucene Users List
Subject: Re: Indexing synonyms


.mt?  Malta?  That's rare! :)

A person called Clemens Marschner just submitted diffs for query
rewriting to lucene-dev list 1-2 weeks ago.  The diffs are not in CVS
yet, and they are a bit old now becase the code they were made against
has changed since they were made. You could either try applying them
yourself, of waiting until they get applied and then you could get a
nightly build.

Otis

--- Aaron Galea <ag...@nextgen.net.mt> wrote:
> Hi everyone,
> 
> I need to create a filter that extends a tokenfilter whose purpose is 
> to generate some synonyms for words in the document using Wordnet. 
> Well searching for synonyms using wordnet is not that problematic but 
> I need to add the synonym words to Lucene tokenstream before they are 
> passed for indexing. However TokenStream class does not support any 
> add method. Did anyone ever needed to do this? Can someone suggest an 
> alternative of how to add some synonym words to the index?
> 
> Thanks
> Aaron
> 


__________________________________________________
Do you Yahoo!?
U2 on LAUNCH - Exclusive greatest hits videos http://launch.yahoo.com/u2

--
To unsubscribe, e-mail:
<ma...@jakarta.apache.org>
For additional commands, e-mail:
<ma...@jakarta.apache.org>


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

Re: Indexing synonyms

Posted by Otis Gospodnetic <ot...@yahoo.com>.

.mt?  Malta?  That's rare! :)

A person called Clemens Marschner just submitted diffs for query
rewriting to lucene-dev list 1-2 weeks ago.  The diffs are not in CVS
yet, and they are a bit old now becase the code they were made against
has changed since they were made.
You could either try applying them yourself, of waiting until they get
applied and then you could get a nightly build.

Otis

--- Aaron Galea <ag...@nextgen.net.mt> wrote:
> Hi everyone,
> 
> I need to create a filter that extends a tokenfilter whose purpose is
> to generate some synonyms for words in the document using Wordnet.
> Well searching for synonyms using wordnet is not that problematic but
> I need to add the synonym words to Lucene tokenstream before they are
> passed for indexing. However TokenStream class does not support any
> add method. Did anyone ever needed to do this? Can someone suggest an
> alternative of how to add some synonym words to the index?
> 
> Thanks
> Aaron
> 

__________________________________________________
Do you Yahoo!?
U2 on LAUNCH - Exclusive greatest hits videos
http://launch.yahoo.com/u2

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>