Posted to user@lucenenet.apache.org by Thomas Rankin <to...@tomrankin.net> on 2011/04/04 06:15:50 UTC

[Lucene.Net] English Language Concatenated Word Tokenizer

Hey everyone, I'm trying to figure out the best way to get Lucene to detect
concatenated words in a body of copy or a URL.

I've got a few scenarios I'm trying to handle.  Many times in source code
and URLs, several words are concatenated together to create a meaningful
string, e.g. UserRegistrationService.cs and www.worldbestwebsites.com.  I
would like to index these as User Registration Service cs and www world best
websites com, etc.  I'm not expecting an easy answer, but would like to know
how the community at large is dealing with these types of scenarios.

Thanks,

Thomas
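
[Editor's sketch] The splitting described above can be illustrated in two
steps: break on punctuation and case boundaries, then segment any remaining
lowercase run against a word list. This is a minimal, language-neutral
sketch in Python, not Lucene.Net code; the word list WORDS and the function
names are hypothetical, and a real index would need a far larger dictionary.

```python
import re

# Hypothetical, tiny word list used only for this illustration.
WORDS = {"world", "best", "web", "websites", "sites"}

def split_identifier(text):
    """Split on non-alphanumerics, then on case boundaries (e.g. CamelCase)."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", text):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return tokens

def segment(word, words=WORDS):
    """Return dictionary words that concatenate to `word`, or None if none do."""
    if not word:
        return []
    for end in range(len(word), 0, -1):  # try the longest prefix first
        prefix = word[:end]
        if prefix in words:
            rest = segment(word[end:], words)
            if rest is not None:
                return [prefix] + rest
    return None

def tokenize(text):
    """Case/punctuation split first; dictionary segmentation where it succeeds."""
    tokens = []
    for token in split_identifier(text):
        pieces = segment(token.lower())
        tokens.extend(pieces if pieces else [token])  # keep unknown tokens whole
    return tokens

print(tokenize("UserRegistrationService.cs"))
print(tokenize("www.worldbestwebsites.com"))
```

With this word list, the two examples from the message come out as the
token sequences asked for. The dictionary step is the hard part in practice:
results depend entirely on the word list, and ambiguous strings can segment
in more than one way.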

Re: [Lucene.Net] English Language Concatenated Word Tokenizer

Posted by digy digy <di...@gmail.com>.
Instead of splitting the token into meaningful words, you may want to try
the SingleCharTokenAnalyzer in contrib.
It allows %text% (substring) searches.
(http://svn.apache.org/viewvc/incubator/lucene.net/trunk/C%23/contrib/Contrib.Net/Contrib.Net/Analysis/Ext/Analysis.Ext.cs)

Given "www.worldbestwebsites.com", you can search for "world", "best", "web",
"website", "bestweb" or even "stwebsi" :), etc.

DIGY
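
[Editor's sketch] To see why single-character tokens allow %text% searches,
the idea can be modeled outside Lucene.Net: if every character becomes its
own token, any substring query is just a run of consecutive tokens. The
Python below is an illustration under that assumption only; it is not the
contrib implementation, char_tokens and phrase_match are hypothetical names,
and dropping punctuation is a simplification that may differ from what
SingleCharTokenAnalyzer actually does.

```python
def char_tokens(text):
    """One lowercase token per letter or digit (assumption: punctuation dropped)."""
    return [c.lower() for c in text if c.isalnum()]

def phrase_match(doc_tokens, query_tokens):
    """True if query_tokens occur consecutively somewhere in doc_tokens."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )

doc = char_tokens("www.worldbestwebsites.com")
for query in ["world", "best", "web", "website", "bestweb", "stwebsi"]:
    print(query, phrase_match(doc, char_tokens(query)))
```

The trade-off is that nonsense fragments match just as readily as real
words, and the index holds one position per character rather than per word.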
