Posted to solr-user@lucene.apache.org by Jason Brown <Ja...@sjp.co.uk> on 2010/11/24 10:43:35 UTC

Synonym processing at index time

Good Morning - I will explain my current config/functionality.

I have 4 fields in my index...

1) Doc Title - a text field
2) Keyword Phrase, e.g. fund manager, a text field (with some edge n gram functionality at index time)
3) Keyword Phrase, e.g. fund manager, a string field (for faceting)
4) Content Field, i.e. my full document text, a text field

I have a nice bit of auto-complete functionality in my UI, which works as follows.

user searches -> fund ma

and my service layer asks Solr to find all docs that contain both fund and ma. My search results are fine. In the same query I also ask for facets and counts on field (3) above, so I can use them in my auto-complete.

This allows me to use the facets and counts to show a nice auto-complete each time a user hits a key.
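A minimal sketch of the request described above, for illustration only. The field name keyword_phrase_facet and the helper autocomplete_params are hypothetical placeholders, not taken from the actual schema; q, facet, facet.field, facet.mincount, facet.limit and wt are standard Solr request parameters.

```python
from urllib.parse import urlencode


def autocomplete_params(user_input, facet_field="keyword_phrase_facet", limit=10):
    """Build Solr request parameters for one autocomplete keystroke.

    facet_field is a hypothetical name standing in for field (3),
    the string copy of the keyword phrase.
    """
    terms = user_input.split()
    # Require every typed term, e.g. "fund ma" -> "fund AND ma".
    query = " AND ".join(terms)
    return {
        "q": query,
        "facet": "true",              # return facet counts with the results
        "facet.field": facet_field,   # facet on the untokenised phrase field
        "facet.mincount": 1,          # skip phrases that match no documents
        "facet.limit": limit,         # cap the number of suggestions
        "wt": "json",
    }


params = autocomplete_params("fund ma")
query_string = urlencode(params)
```

The resulting query_string would be appended to the select handler URL of whichever Solr core is in use; the facet values and counts in the response then feed the suggestion list.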

Ok so far. I have a nice auto-complete based upon business domain Keyword Phrases.

Now, on to synonyms. For example, fund manager and fund lead are the same thing in my business domain.

I was planning on simply adding the synonyms as normal entries into fields 2 and 3 (both multi-valued fields) so that they would be inserted into the index and be available for my auto-complete. This would be OK and, to clarify, has nothing to do with the synonyms.txt file at this point.

However, as Solr has synonym processing I should take advantage of it. (Also, with my original plan, the synonym fund lead would not find its way into field 4, the full text of the document, where fund manager appears in the content.)

So I believe I should do something like...

fund manager, fund lead 

...in my synonym file, which I only want to process at index time (so it appears in my autocomplete) with expansion on. Wherever fund manager or fund lead is found, I want the index to contain both fund manager and fund lead.
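To spell out what "expansion on" means for a comma-separated rule like the one above: every phrase in the rule maps to all phrases in the rule, itself included. The sketch below is only a model of that semantics in plain Python; expand_rule is a hypothetical helper, not part of Solr.

```python
def expand_rule(line):
    """Map every phrase in a comma-separated rule to all phrases in the rule."""
    phrases = [p.strip() for p in line.split(",") if p.strip()]
    # With expansion, each phrase is replaced by the full list of equivalents.
    return {phrase: list(phrases) for phrase in phrases}


rules = expand_rule("fund manager, fund lead")
# rules["fund lead"] is ["fund manager", "fund lead"], and likewise
# for rules["fund manager"].
```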

As I have expansion on and have multi-word synonyms (phrases as both source and target), using the synonym file at index time seems best.

However, I am very confused at this point.

I can see how the synonym file would be processed correctly for field 3 (a string field), and both terms fund manager and fund lead should go into the index OK.

But I can't see how it would work for the text fields (2 and 4).

My index-time filter chain has synonym processing as per the default text field processing (after whitespace tokenisation), so I can't see how my terms fund manager and fund lead can be found by the synonym filter.

I've looked in the book by Eric Pugh, and they say that for multi-word synonyms to work you must use synonyms at index time and with expansion. They say you can't do synonym processing at query time, as synonym phrases aren't recognised after whitespace parsing. But my index chain (and the default Solr config for text fields) also whitespace parses.

It would be great to take advantage of synonym processing by Solr instead of my original plan, but I am confused about how multi-word synonyms can be recognised at index time and added to the index. Am I missing something about index-time processing of synonyms here?

Many Thanks for any help/advice.

Jason.


Re: Synonym processing at index time

Posted by Lance Norskog <go...@gmail.com>.
I gave up trying to completely master the analyzer classes.

solr/admin/analysis.jsp lets you see exactly how your analysis
stack processes text, including what it does with synonyms at both
index and query time. This is the easiest way to set up and maintain
this kind of feature, because a later change to the analysis stack
might break the synonym handling.
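
For anyone without a running instance to hand, the kind of output the analysis page shows for index-time synonyms can be approximated in a few lines. This is a simplified model, not Solr's implementation: it assumes the synonym filter runs on the token stream after whitespace tokenisation and matches sequences of tokens, which is why a phrase rule can still fire at index time. The function apply_synonyms and the sample sentence are made up for illustration.

```python
def apply_synonyms(text, rules):
    """Return one list of terms per token position after synonym expansion.

    rules maps a tuple of tokens to the list of equivalent token tuples.
    """
    tokens = text.lower().split()  # stand-in for a whitespace tokenizer
    max_len = max((len(key) for key in rules), default=1)
    positions = []
    i = 0
    while i < len(tokens):
        match = None
        # Prefer the longest token sequence that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in rules:
                match = candidate
                break
        if match is None:
            positions.append([tokens[i]])
            i += 1
            continue
        # Stack the tokens of every alternative on the matched positions.
        alternatives = rules[match]
        width = max(len(alt) for alt in alternatives)
        for offset in range(width):
            terms = []
            for alt in alternatives:
                if offset < len(alt) and alt[offset] not in terms:
                    terms.append(alt[offset])
            positions.append(terms)
        i += len(match)
    return positions


phrases = [("fund", "manager"), ("fund", "lead")]
rules = {phrase: phrases for phrase in phrases}
result = apply_synonyms("Our fund manager retired", rules)
# result == [["our"], ["fund"], ["manager", "lead"], ["retired"]]
```

In this model "manager" and "lead" share a position, so a phrase search for either "fund manager" or "fund lead" would match the document. Whether your own chain behaves this way is exactly what the analysis page will confirm.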

-- 
Lance Norskog
goksron@gmail.com