Posted to solr-user@lucene.apache.org by rubdabadub <ru...@gmail.com> on 2007/02/19 10:22:38 UTC
common words not stop words?? how to ??
Hi:
I was wondering how you guys are dealing with "common words". What I
mean by common words is the ones that fall outside the "stop words"
category. Of course, "stop words" is subjective, i.e. it's up to the
implementor. What I would like to know is how to increase or decrease
the boost value based on such "common words". Should I have two fields,
"Common_Words_Plus" and "Common_Words_Minus"? Plus for words that
need to be boosted up and minus for words that get boosted
down? No?
The above sounds like a not-so-professional quick fix. Does anyone
have a better solution? How are you dealing with this?
Regards
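One way to picture the plus/minus idea without extra fields is query-time boosting. Lucene query syntax accepts a `term^boost` suffix, and a boost below 1 lowers a term's weight. The following is only a minimal sketch, assuming the two word lists are maintained by hand. The function name `build_boosted_query`, the word lists, and the boost values 2.0 and 0.5 are made up for illustration and are not part of Solr.

```python
def build_boosted_query(terms, plus_words, minus_words, up=2.0, down=0.5):
    """Return a Lucene-style query string with a per-term boost suffix.

    plus_words / minus_words are hypothetical, hand-maintained sets of
    lowercase "common words" to push up or down in the ranking.
    """
    parts = []
    for term in terms:
        key = term.lower()
        if key in plus_words:
            parts.append(f"{term}^{up}")      # raise the weight of this term
        elif key in minus_words:
            parts.append(f"{term}^{down}")    # lower it (boost between 0 and 1)
        else:
            parts.append(term)                # default boost of 1
    return " ".join(parts)


query = build_boosted_query(
    ["london", "weather", "minister"],
    plus_words={"minister"},
    minus_words={"weather"},
)
print(query)  # london weather^0.5 minister^2.0
```

The resulting string would be passed as the query parameter. Whether fixed boost values like these help at all depends on the data, so treat them as something to tune.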
Re: common words not stop words?? how to ??
Posted by rubdabadub <ru...@gmail.com>.
Walter:
Thanks for the feedback.
On 2/19/07, Walter Underwood <wu...@netflix.com> wrote:
> Lucene/Solr does this automatically. That is how a tf.idf
> engine works, it boosts rare words.
>
> Do you have examples of problems or are you worrying about
> something that might happen?
Actually, my use case is the following: let's say, hypothetically, you
have a field with 100 sentence-long titles. If you read those titles
you can pretty much group them into 5 subject areas. A hypothetical
example is below (the total number of titles is 125; 25 of them cannot
be grouped):
22 titles are about = How good is Person X
14 titles are about = How bad is Product Y
10 titles are about = London weather
36 titles are about = How cool is the movie Z
18 titles are about = The next big MS virus
What I am trying to achieve is this:
I would like to weed out "London weather" as a group because it is not
interesting in my use case. Let's say it is noise, not signal. So I
thought I could use some "common words" for that. Furthermore, I was
thinking that with common words I could boost certain fields, i.e. if
Person X is a known person, for example a "Prime minister" or a "movie
star", then having a certain word attached to another known word means
it is important. Maybe I defined my problem wrongly. I hope the above
gives you an overview.
Regards
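The use case above can be sketched outside the engine as a per-title weight: drop titles that match a noise phrase, and raise titles that mention a known role. This is only an illustration under stated assumptions. The phrase lists, the name `title_weight`, and the weights 0.0, 1.0 and 2.0 are hypothetical. Plain substring matching stands in for whatever analysis the real index would do.

```python
# Hypothetical, hand-maintained phrase lists (lowercase).
NOISE_PHRASES = ("london weather",)
IMPORTANT_PHRASES = ("prime minister", "movie star")


def title_weight(title, base=1.0, boost=2.0):
    """Return 0.0 for noise titles, base * boost for important ones, else base."""
    # Lowercase and collapse runs of whitespace so phrases match reliably.
    text = " ".join(title.lower().split())
    if any(phrase in text for phrase in NOISE_PHRASES):
        return 0.0            # treated as noise: weed it out
    if any(phrase in text for phrase in IMPORTANT_PHRASES):
        return base * boost   # a known role is attached: boost it
    return base


titles = [
    "London weather this weekend",
    "How good is the Prime Minister",
    "How bad is Product Y",
]
kept = [(t, title_weight(t)) for t in titles if title_weight(t) > 0.0]
```

A weight like this could be computed at index time and stored with the document, or the noise phrases could simply be excluded in the query. Which fits better depends on how often the lists change.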
Re: common words not stop words?? how to ??
Posted by Walter Underwood <wu...@netflix.com>.
Lucene/Solr does this automatically. That is how a tf.idf
engine works, it boosts rare words.
Do you have examples of problems or are you worrying about
something that might happen?
wunder
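To make "it boosts rare words" concrete, here is a small sketch of the idf part of tf.idf. It uses a smoothed form, 1 + ln(numDocs / (docFreq + 1)), and a square-root term frequency, in the style of Lucene's classic similarity. Treat the exact formula as an assumption rather than a statement about what any particular Solr version computes. The document counts come from the hypothetical example in this thread (125 titles, 10 about London weather, 36 about movie Z).

```python
import math


def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency: the rarer the term, the larger the value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def tf_idf(term_count, num_docs, doc_freq):
    """Weight of a term in one document: dampened term frequency times idf."""
    return math.sqrt(term_count) * idf(num_docs, doc_freq)


NUM_TITLES = 125                        # from the example in this thread
idf_weather = idf(NUM_TITLES, 10)       # term found in 10 titles
idf_movie = idf(NUM_TITLES, 36)         # term found in 36 titles

# The rarer term gets the larger weight.
print(idf_weather > idf_movie)  # True
```

Note that in this example the "London weather" terms are the rarer ones, so idf alone would weight them higher, not lower. idf measures rarity, not whether a topic is interesting, which is why an explicit boost or filter may still be needed for the noise case.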