Posted to solr-user@lucene.apache.org by Lajos <la...@protulae.com> on 2009/08/31 16:32:59 UTC

Help! Issue with tokens in custom synonym filter

Hi all,

I've been writing some custom synonym filters and have run into an issue 
with returning a list of tokens. I have a synonym filter that uses the 
WordNet database to extract synonyms. My problem is how to define the 
offsets and position increments in the new Tokens I'm returning.

For an input token, I get a list of synonyms from the WordNet database. 
I then create a List<Token> of those results. Each Token is created with 
the same startOffset, endOffset and positionIncrement of the input 
Token. Is this correct? My understanding from looking at the Lucene 
codebase is that the startOffset/endOffset should be the same, as we are 
referring to the same term in the original text. However, I don't quite 
get the positionIncrement. I understand that it is relative to the 
previous term ... does this mean all my synonyms should have a 
positionIncrement of 0? But whether I use 0 or the positionIncrement of 
the original input Token, Solr seems to ignore the returned tokens ...

This is a summary of what is in my filter:

*************************************************

private Iterator<Token> output;
private ArrayList<Token> synonyms = null;

public Token next(Token in) throws IOException {
   if (output != null) {
     // Here we are just outputting matched synonyms
     // that we previously created from the input token
     // The input token has already been returned
     if (output.hasNext()) {
       return output.next();
     } else {
       return null;
     }
   }

   synonyms = new ArrayList<Token>();

   Token t = input.next(in);
   if (t == null) return null;

   String value = new String(t.termBuffer(), 0,
     t.termLength()).toLowerCase();

   // Get list of WordNet synonyms (code removed)
   // Iterate thru WordNet synonyms
   for (String wordNetSyn : wordNetSyns) {
     Token synonym = new Token(t.startOffset(), t.endOffset(),
       t.type());
     synonym.setPositionIncrement(t.getPositionIncrement());
     synonym.setTermBuffer(wordNetSyn.toCharArray(), 0,
       wordNetSyn.length());
     synonyms.add(synonym);
   }

   output = synonyms.iterator();

   // Return the original word, we want it
   return t;
}
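
A note on the summary above, offered as a reading of the posted code rather than a confirmed diagnosis: once `output` is non-null and exhausted, `next` returns null, which signals end of stream, so only the first input token and its synonyms would ever be emitted. The sketch below shows one way to keep pulling input tokens after the queued synonyms are drained, and gives each synonym a position increment of 0 so it shares the original token's position. `Tok`, `SynonymExpander`, and the map-based lookup are hypothetical stand-ins so the example runs without Lucene or WordNet; they are not the Lucene API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for Lucene's Token, so the sketch runs without Lucene.
final class Tok {
    final String term;
    final int startOffset;
    final int endOffset;
    final int positionIncrement;

    Tok(String term, int startOffset, int endOffset, int positionIncrement) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.positionIncrement = positionIncrement;
    }
}

// Hypothetical filter: emits each input token, then that token's synonyms.
final class SynonymExpander {
    private final Iterator<Tok> input;
    // Assumption: a plain map stands in for the WordNet lookup.
    private final Map<String, List<String>> synonyms;
    private final Deque<Tok> pending = new ArrayDeque<Tok>();

    SynonymExpander(Iterator<Tok> input, Map<String, List<String>> synonyms) {
        this.input = input;
        this.synonyms = synonyms;
    }

    // Returns the next token, or null only when the input itself is exhausted.
    Tok next() {
        // Drain synonyms queued for the previously returned token.
        if (!pending.isEmpty()) {
            return pending.poll();
        }
        // Queue is empty: move on to the next input token.
        if (!input.hasNext()) {
            return null;
        }
        Tok t = input.next();
        List<String> found = synonyms.get(t.term.toLowerCase(Locale.ROOT));
        if (found != null) {
            for (String syn : found) {
                // Same offsets as the original; increment 0 keeps the
                // synonym at the original token's position.
                pending.add(new Tok(syn, t.startOffset, t.endOffset, 0));
            }
        }
        return t;
    }

    // Collects the whole output stream into a list.
    static List<Tok> drain(SynonymExpander filter) {
        List<Tok> out = new ArrayList<Tok>();
        for (Tok t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }
}

final class SynonymExpanderDemo {
    public static void main(String[] args) {
        Map<String, List<String>> syns = new HashMap<String, List<String>>();
        syns.put("quick", Arrays.asList("fast", "speedy"));
        syns.put("dog", Arrays.asList("hound"));
        List<Tok> in = Arrays.asList(new Tok("quick", 0, 5, 1), new Tok("dog", 6, 9, 1));
        for (Tok t : SynonymExpander.drain(new SynonymExpander(in.iterator(), syns))) {
            System.out.println(t.term + " posInc=" + t.positionIncrement);
        }
    }
}
```

The queue replaces the `output` iterator from the post: an empty queue means "read the next input token", not "end of stream".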

Re: Help! Issue with tokens in custom synonym filter

Posted by Lajos <la...@protulae.com>.

Hi David & Ahmet,

I hadn't seen the SynonymTokenFilter from Lucene, so that helped. 
Ultimately, however, it seems I was pretty much doing the right thing, 
although my token type might have been wrong.

Unfortunately, while the tokens are being returned properly (AFAIK), 
when I do a query using one of the synonyms, I can't get any results. 
This is not the case if I just put the synonym directly into the 
synonyms file used by the standard Solr synonym filter.

So I'll have to keep on hacking away ;)

Regarding generating the file from WordNet, we'd considered that but our 
requirements essentially mean we have to do the heavy lifting within the 
filter itself. Not that I'm opposed, it is just that I'm apparently 
missing something simple still.

Thanks for the replies.

Lajos


Smiley, David W. wrote:
> Although this is not a direct answer to your question, you may want to consider generating a synonyms file from WordNet.  Then you can use the standard synonym filter in Solr.  The only downside is that the synonym file might be pretty large, but you've probably got some large file for WordNet data anyway.
> 
> ~ David Smiley
>  Author: http://www.packtpub.com/solr-1-4-enterprise-search-server

Re: Help! Issue with tokens in custom synonym filter

Posted by "Smiley, David W." <ds...@mitre.org>.

Although this is not a direct answer to your question, you may want to consider generating a synonyms file from WordNet.  Then you can use the standard synonym filter in Solr.  The only downside is that the synonym file might be pretty large, but you've probably got some large file for WordNet data anyway.

~ David Smiley
 Author: http://www.packtpub.com/solr-1-4-enterprise-search-server
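
As a rough illustration of the synonyms-file suggestion: the standard Solr synonym filter reads a text file in which a comma-separated line lists equivalent terms. The sketch below formats such lines from a word-to-synonyms map. The class name `SynonymsFileWriter` and the map input are hypothetical; how the map gets filled from WordNet is left out, and terms containing commas or multiple words would need extra handling that is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a word -> synonyms map into synonyms-file lines.
final class SynonymsFileWriter {

    // One line per word that has synonyms: "word, syn1, syn2".
    // Words without synonyms are skipped, since they add nothing to the file.
    static List<String> toLines(Map<String, List<String>> synonymsByWord) {
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<String, List<String>> entry : synonymsByWord.entrySet()) {
            if (entry.getValue().isEmpty()) {
                continue;
            }
            StringBuilder line = new StringBuilder(entry.getKey());
            for (String synonym : entry.getValue()) {
                line.append(", ").append(synonym);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumption: sample data standing in for a WordNet export.
        Map<String, List<String>> sample = new LinkedHashMap<String, List<String>>();
        sample.put("quick", Arrays.asList("fast", "speedy"));
        sample.put("dog", Arrays.asList("hound"));
        for (String line : toLines(sample)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines could be written to the file named in the synonym filter's configuration, for example with `java.nio.file.Files.write`.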




Re: Help! Issue with tokens in custom synonym filter

Posted by AHMET ARSLAN <io...@yahoo.com>.

> I've been writing some custom synonym filters and have run
> into an issue with returning a list of tokens. I have a
> synonym filter that uses the WordNet database to extract
> synonyms. My problem is how to define the offsets and
> position increments in the new Tokens I'm returning.
> 
> For an input token, I get a list of synonyms from the
> WordNet database. I then create a List<Token> of those
> results. Each Token is created with the same startOffset,
> endOffset and positionIncrement of the input Token. Is this
> correct? My understanding from looking at the Lucene
> codebase is that the startOffset/endOffset should be the
> same, as we are referring to the same term in the original
> text. However, I don't quite get the positionIncrement. I
> understand that it is relative to the previous term ... does
> this mean all my synonyms should have a positionIncrement of
> 0? But whether I use 0 or the positionIncrement of the
> original input Token, Solr seems to ignore the returned
> tokens ...

You can look at the source code of SynonymTokenFilter[1] and SynonymMap[2] in Lucene.

[1] http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymTokenFilter.html
[2] http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/memory/SynonymMap.html
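
On the positionIncrement question quoted above, a small worked example may help. In Lucene a token's position is the running sum of the increments, with the first token (increment 1) landing at position 0. The sketch below, using a hypothetical `PositionDemo` class and no Lucene dependency, compares synonyms emitted with increment 0 against synonyms that copy the original token's increment.

```java
import java.util.Arrays;

// Hypothetical demo: derives absolute positions from position increments.
final class PositionDemo {

    // Position starts at -1, so a first token with increment 1 sits at 0.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Token order: quick, fast (synonym), speedy (synonym), dog.
        int[] stacked = {1, 0, 0, 1}; // synonyms use increment 0
        int[] copied = {1, 1, 1, 1};  // synonyms copy the original's increment
        System.out.println(Arrays.toString(absolutePositions(stacked))); // [0, 0, 0, 1]
        System.out.println(Arrays.toString(absolutePositions(copied)));  // [0, 1, 2, 3]
    }
}
```

With increment 0 the synonyms occupy the same position as "quick" and "dog" stays adjacent to it; with copied increments "dog" is pushed two positions away, which would affect phrase and proximity matching.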