Posted to java-user@lucene.apache.org by Ahmet Arslan <io...@yahoo.com> on 2010/07/06 15:40:54 UTC

Re: multi-term synonym expansion

> My custom SKOSAnalyzer already performs synonym expansion
> based on the labels defined in a given SKOS model. But now I
> have the problem that real-world thesauri often define
> multi-term synonyms for multi-term words. Here is an
> example that defines the abbreviation "UN" as a synonym for
> "United Nations":
> 
> <skos:Concept rdf:about="http://www.cs.univie.ac.at/thesaurus/concept/6">
>   <skos:prefLabel>United Nations</skos:prefLabel>
>   <skos:altLabel>UN</skos:altLabel>
> </skos:Concept>
> 
> At the end the analyzer should add the term UN at the right
> position in the index. Taking the example above, a sentence
> "I work for the United Nations" should appear in the index
> as 
> 
> 2: [work: 2-> 6]
> 5: [united nations: 15->29] [un: 15->29]
> 
> ...so that a query "I work for the UN" also matches the
> document.
> 
> What is the best way to implement that? With a
> TokenFilter I can work through the sentence token by token
> (using incrementToken()) and check if there is a synonym
> available. How can I analyze token sequences in a given
> text? Do I need to implement a custom tokenizer that
> recognizes entities based on a given dictionary?
> 
> I am grateful for any suggestions or advice.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory can handle multi-word synonyms. This may help.
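To illustrate the lookahead that multi-word matching needs, here is a minimal sketch in plain Java. It is not the Lucene or Solr API and makes no claim about how SynonymFilter is implemented; the class MultiWordSynonymSketch and its methods add() and expand() are hypothetical names chosen for this example. It stores phrases in a trie keyed by whole tokens, reads ahead from each token to find the longest phrase, and then emits the joined phrase and its synonyms at a single position, which is the layout asked for in the question. Stop-word removal and character offsets are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: plain Java, not the Lucene/Solr API. */
public class MultiWordSynonymSketch {

    /** One node of a trie keyed by whole tokens rather than characters. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
        List<String> synonyms; // non-null only where a complete phrase ends
    }

    private final Node root = new Node();

    /** Registers a phrase such as "united nations" with synonyms such as "un". */
    public void add(String phrase, String... synonyms) {
        Node node = root;
        for (String token : phrase.toLowerCase().split("\\s+")) {
            node = node.children.computeIfAbsent(token, k -> new Node());
        }
        node.synonyms = Arrays.asList(synonyms);
    }

    /**
     * Returns one list of terms per index position. A matched phrase collapses
     * into a single position holding the joined phrase plus its synonyms.
     */
    public List<List<String>> expand(List<String> tokens) {
        List<List<String>> positions = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            // Look ahead from token i and remember the longest complete phrase.
            Node node = root;
            Node bestNode = null;
            int bestEnd = -1;
            for (int j = i; j < tokens.size(); j++) {
                node = node.children.get(tokens.get(j).toLowerCase());
                if (node == null) {
                    break;
                }
                if (node.synonyms != null) {
                    bestNode = node;
                    bestEnd = j + 1;
                }
            }
            List<String> terms = new ArrayList<>();
            if (bestNode != null) {
                // Emit the whole phrase and its synonyms at the same position.
                terms.add(String.join(" ", tokens.subList(i, bestEnd)).toLowerCase());
                terms.addAll(bestNode.synonyms);
                i = bestEnd;
            } else {
                // No phrase starts here: pass the single token through.
                terms.add(tokens.get(i).toLowerCase());
                i++;
            }
            positions.add(terms);
        }
        return positions;
    }

    public static void main(String[] args) {
        MultiWordSynonymSketch sketch = new MultiWordSynonymSketch();
        sketch.add("united nations", "un");
        System.out.println(
            sketch.expand(Arrays.asList("I", "work", "for", "the", "United", "Nations")));
    }
}
```

In a real TokenFilter the same idea would mean buffering tokens read via incrementToken() until the lookahead either completes a phrase or fails, and then replaying the buffered tokens.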

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: multi-term synonym expansion

Posted by da...@ontrenet.com.
How does the synonym filter work internally? I configured it with a very
large synonym file (90,000 lines), running Solr in GlassFish. It started
fine, but when I queried, it hung and ran out of memory. The file wasn't big
enough to exhaust the heap. I was never able to get it to run smoothly.

On Tue, 6 Jul 2010 06:40:54 -0700 (PDT), Ahmet Arslan <io...@yahoo.com>
wrote:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> can handle multi-word synonyms. This may help.