
Posted to solr-user@lucene.apache.org by akash jayaweera <ak...@gmail.com> on 2019/06/23 03:14:37 UTC

Identify stopwords using TF-IDF

Hello All,
I'm trying to identify stopwords for a non-English corpus using TF-IDF
scores. I calculated the score for each unique term in the corpus, but my
question is how to select stopwords using that score.
For example, in a corpus about football, the term "football" gets the lowest
TF-IDF score, but for my requirement I don't want to identify "football" as
a stopword.
How can I clearly identify stopwords? Is there any other simple method to
identify stopwords besides the TF-IDF score?
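
To illustrate the problem, here is a minimal Python sketch. It is only an
illustration under assumptions, not a recommended pipeline: tokens are split
on whitespace, the toy documents are in English for readability, and the
function names, the min_doc_ratio threshold, and the keep-list are all
hypothetical. A term that occurs in every document gets an IDF of zero, so a
domain term like "football" is indistinguishable from a true function word
by score alone. One simple workaround is to take high-document-frequency
terms as candidates and exclude a hand-maintained keep-list of domain terms.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        # Assumption: whitespace tokenization; a real corpus needs a proper tokenizer.
        df.update(set(doc.lower().split()))
    return df


def idf(df, n_docs):
    """Plain IDF, log(N / df). A term present in every document scores 0.0."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


def stopword_candidates(docs, min_doc_ratio=0.8, keep=()):
    """Terms found in at least min_doc_ratio of the documents, minus a keep-list.

    Both min_doc_ratio and keep are hypothetical knobs chosen for this sketch.
    """
    df = document_frequency(docs)
    n_docs = len(docs)
    keep = set(keep)
    return sorted(
        term
        for term, count in df.items()
        if count / n_docs >= min_doc_ratio and term not in keep
    )


docs = [
    "the football match was won by the home team",
    "the football club signed a new striker",
    "a football fan watched the final",
    "the coach praised the football players",
]

scores = idf(document_frequency(docs), len(docs))
# "the" and "football" both appear in all four documents, so both score 0.0.
print(scores["the"], scores["football"])

# Without a keep-list, "football" is flagged together with "the".
print(stopword_candidates(docs, min_doc_ratio=0.75))

# With the domain term protected, only "the" remains a candidate.
print(stopword_candidates(docs, min_doc_ratio=0.75, keep={"football"}))
```

The candidate list still needs manual review; the threshold only narrows
down which terms are worth looking at.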

Regards,
*Akash Jayaweera.*


E akash.jayaweera@gmail.com
M + 94 77 2472635

Re: Identify stopwords using TF-IDF

Posted by Walter Underwood <wu...@wunderwood.org>.

I haven’t removed stopwords since 1996, when I joined Infoseek. What is your special case where you must remove them?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Re: Identify stopwords using TF-IDF

Posted by akash jayaweera <ak...@gmail.com>.

Hello Walter,

Thank you for the reply.
But for some of my use cases I need to identify stopwords, so I need a better
way to identify domain-specific stopwords. I used TF-IDF to identify
stopwords, but it has the issue I mentioned above.

Regards,
*Akash Jayaweera.*


E akash.jayaweera@gmail.com
M + 94 77 2472635



Re: Identify stopwords using TF-IDF

Posted by Walter Underwood <wu...@wunderwood.org>.

Don’t remove stopwords. That was a useful hack when we were running search engines on 16-bit machines. These days, it causes more problems than it solves.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
