Posted to user@lucenenet.apache.org by Thomas Skovsende <IT...@gyldendal.dk> on 2007/10/19 11:47:00 UTC

Problem with words that have been split

Hi guys,
 
You were a great help last time, and now I have another small problem.
 
In my current setup, let's say I have a text I want to find:
"SomethingElse"
 
but my user happens to search for "something else".
 
Now - what I do to modify their standard search a bit is that I actually
rewrite their search to:
 
"(something OR something~0.6) (else OR else~0.6)"
 
The problem is that this does not hit on SomethingElse. It is a
particularly nasty problem because a lot of words in Danish are put
together to form one word, but a lot of people split them up into several
words due to influence from English.
 
Best regards,
  Thomas
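
The rewrite described above, and a plausible reason it misses, can be
sketched in a few lines of Python. This is only an illustration, not
Lucene.Net code: rewrite_query, edit_distance and fuzzy_similarity are
hypothetical helper names, the indexed token is assumed to be the single
lowercased term "somethingelse", and the similarity formula (1 minus the
edit distance divided by the length of the shorter term) is an assumption
about how fuzzy terms are scored in the version in use.

```python
def rewrite_query(user_query: str, min_similarity: float = 0.6) -> str:
    """Rewrite every word w of the query as (w OR w~min_similarity)."""
    words = user_query.lower().split()
    return " ".join(f"({w} OR {w}~{min_similarity})" for w in words)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_similarity(query_term: str, indexed_term: str) -> float:
    """ASSUMED scoring: 1 - distance / length of the shorter term."""
    shorter = min(len(query_term), len(indexed_term))
    return 1.0 - edit_distance(query_term, indexed_term) / shorter


if __name__ == "__main__":
    print(rewrite_query("something else"))
    for term in ("something", "else"):
        print(term, fuzzy_similarity(term, "somethingelse"))
```

Under that assumed formula, "something" is four edits away from
"somethingelse", which gives 1 - 4/9, or about 0.56, just under the 0.6
threshold, and "else" is nine edits away, which gives a negative score. So
neither fuzzy clause would be expected to match the joined word.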

RE: Problem with words that have been split

Posted by DIGY <di...@gmail.com>.
1- Searching with a trailing wildcard, like something*, is the easiest way.

2- I don't know whether you have a Danish stemmer that can split compound
words, but the best way would be to write an analyzer that can do this
job.

xxxyyyzzz ----> analyzer ----> xxx yyy zzz.

If you use that analyzer for both indexing and searching, then you don't
have to do anything else.

Ex.
When indexing xxxyyyzzz, the tokens will be xxx, yyy and zzz.
When searching for xxxyyyzzz, the tokens will again be "xxx yyy zzz".
Besides that, searching for xxx or yyy (or zzz) will also return results.
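
The analyzer idea above can be sketched roughly as follows. This is a
minimal Python illustration, not a Lucene.Net Analyzer: it assumes a known
word list is available (the hypothetical VOCABULARY below), and
split_compound, analyze, build_index and search are made-up names. The
point is only that the same splitting step runs at index time and at
search time.

```python
# Hypothetical word list; a real one would come from a Danish dictionary.
VOCABULARY = {"something", "else"}


def split_compound(token, vocabulary):
    """Split a token into known words, trying the longest prefix first
    and backtracking. Returns [token] unchanged if no full split exists."""
    token = token.lower()

    def helper(rest):
        if not rest:
            return []
        for end in range(len(rest), 0, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None  # no way to cover 'rest' with known words

    parts = helper(token)
    return parts if parts else [token]


def analyze(text, vocabulary):
    """Whitespace-tokenize, lowercase, and split compound words."""
    tokens = []
    for raw in text.split():
        tokens.extend(split_compound(raw, vocabulary))
    return tokens


def build_index(documents, vocabulary):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in analyze(text, vocabulary):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, vocabulary):
    """Return ids of documents that contain every analyzed query token."""
    tokens = analyze(query, vocabulary)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    docs = {1: "SomethingElse", 2: "nothing here"}
    index = build_index(docs, VOCABULARY)
    print(search(index, "something else", VOCABULARY))
```

Because both sides go through analyze, the queries "SomethingElse",
"something else" and plain "else" all end up looking for the same tokens
that were stored for document 1.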


DIGY.
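
Suggestion 1 in the reply above (searching with something*) can be
pictured with a small Python sketch. It only imitates prefix matching over
a set of already lowercased index terms and is not the Lucene.Net query
API; prefix_search and the sample terms are hypothetical.

```python
def prefix_search(indexed_terms, query):
    """Return the indexed terms that start with any word of the query,
    i.e. treat every query word w as the wildcard term w*."""
    prefixes = [word.lower() for word in query.split()]
    return sorted(term for term in indexed_terms
                  if any(term.startswith(prefix) for prefix in prefixes))


if __name__ == "__main__":
    terms = {"somethingelse", "elsewhere", "other"}
    print(prefix_search(terms, "something else"))
```

Note the limitation: only the first part of a compound can match this way.
The word "something" finds "somethingelse" as a prefix, but "else" on its
own does not, because it sits at the end of the joined word.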

