Posted to java-user@lucene.apache.org by Van Nguyen <vn...@ur.com> on 2006/12/01 02:25:18 UTC

Any ideas on this type of analyzer?

I've been brainstorming on this but can't figure out how to go about
it.

Let's say I'm searching for "batman". I want results that include:

batman
bat man
bat-man
etc.

Or, if I search for "screwdriver", I would want results to include:

screwdriver
screw drivers
etc.

I've tried the SnowballAnalyzer. I've also thought about creating a
"SynonymAnalyzer" as described in the Lucene in Action book, but that
would mean knowing all the synonyms for each word I need to index, and
at this point I do not. Any suggestions on how to go about this?

Van


Re: Any ideas on this type of analyzer?

Posted by Soeren Pekrul <so...@gmx.de>.
Hello Van,

this looks like a case of splitting compound words. The topic was
discussed in the thread "Analysis/tokenization of compound words"
(http://www.gossamer-threads.com/lists/lucene/java-user/40164?do=post_view_threaded).

The main idea is as follows:
You have a corpus (lexicon/dictionary) and a word you want to split.
Moving through the word letter by letter, cut it into two parts and look
both parts up in the corpus. If you find them, you can split the word at
that position. Doing this recursively finds all sub-parts. This method
finds the smallest possible words. [1]
If your corpus contains no compound words, you can get better quality by
finding the largest possible words instead: look up the whole word first
and split it only if it is not found. Do this recursively as well. [2]
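
The two strategies above can be sketched in plain Java as follows. This
is only an illustration under stated assumptions: CompoundSplitter and
its method names are hypothetical (nothing here is a Lucene API), the
lexicon is a small in-memory Set, and input is assumed to be lower-cased
already. An empty list means the word could not be decomposed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class CompoundSplitter {

    // Smallest-parts variant [1]: try every two-part cut and recurse into
    // both halves; fall back to the whole word only if no cut works.
    static List<String> splitSmallest(String word, Set<String> lexicon) {
        for (int i = 1; i < word.length(); i++) {
            List<String> head = splitSmallest(word.substring(0, i), lexicon);
            if (head.isEmpty()) continue;
            List<String> tail = splitSmallest(word.substring(i), lexicon);
            if (tail.isEmpty()) continue;
            List<String> parts = new ArrayList<>(head);
            parts.addAll(tail);
            return parts;
        }
        return lexicon.contains(word) ? List.of(word) : List.of();
    }

    // Largest-parts variant [2]: look the whole word up first and split
    // only if it is unknown, preferring the longest possible head.
    static List<String> splitLargestFirst(String word, Set<String> lexicon) {
        if (lexicon.contains(word)) {
            return List.of(word);
        }
        for (int i = word.length() - 1; i >= 1; i--) {
            List<String> head = splitLargestFirst(word.substring(0, i), lexicon);
            if (head.isEmpty()) continue;
            List<String> tail = splitLargestFirst(word.substring(i), lexicon);
            if (tail.isEmpty()) continue;
            List<String> parts = new ArrayList<>(head);
            parts.addAll(tail);
            return parts;
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Assumed toy lexicon without compound words.
        Set<String> lexicon = Set.of("bat", "man", "screw", "driver");
        System.out.println(splitSmallest("batman", lexicon));
        System.out.println(splitLargestFirst("screwdriver", lexicon));
    }
}
```

Indexing the parts next to the original token would let "batman" match
"bat man". The naive recursion revisits the same substrings many times,
so a real implementation would probably want to cache results.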

Warning: The results are usually not linguistically correct. However,
the quality should be good enough for searching.

Sören

[1] C. Monz and M. De Rijke: Shallow Morphological Analysis in 
Monolingual Information Retrieval for Dutch, German and Italian, 
Language & Inference Technology, University of Amsterdam.

[2] Pasqualino Imbemba: A Splitter for German Compound Words, Free 
University Of Bolzano, Bolzano, 2006


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Any ideas on this type of analyzer?

Posted by Dennis Watson <de...@guba.com>.
NGramAnalyzer should do this.  I think there is one in the contribs area or in 
LUA.
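
To show why n-grams help here, below is a minimal sketch in plain Java.
It is not the analyzer mentioned above: CharNGrams and its methods are
hypothetical names, and the tokenization (lower-case, split on anything
that is not an ASCII letter or digit) is an assumption made for the
example. Tokens shorter than n produce no n-grams.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene.
class CharNGrams {

    // Distinct character n-grams of every token in the text.
    static Set<String> ngrams(String text, int n) {
        Set<String> grams = new LinkedHashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            for (int i = 0; i + n <= token.length(); i++) {
                grams.add(token.substring(i, i + n));
            }
        }
        return grams;
    }

    // N-grams that occur in both sets.
    static Set<String> shared(Set<String> a, Set<String> b) {
        Set<String> common = new LinkedHashSet<>(a);
        common.retainAll(b);
        return common;
    }

    public static void main(String[] args) {
        System.out.println(shared(ngrams("batman", 3), ngrams("bat-man", 3)));
        System.out.println(shared(ngrams("screwdriver", 3), ngrams("screw drivers", 3)));
    }
}
```

With trigrams, "batman" and "bat-man" have "bat" and "man" in common,
so a query analyzed into n-grams can still match the split spelling.
The cost is a larger index and some false matches on unrelated words
that happen to share n-grams.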


Dennis


