Posted to java-user@lucene.apache.org by mpermar <mp...@gmail.com> on 2008/07/22 11:48:10 UTC

Opposite to StopFilter. Anything already implemented out there?

Hi All, 

I want to index some incoming text. In this case what I want to do is just
detect keywords in that text. Therefore I want to discard everything that is
not in the keywords set. This sounds to me pretty much like the reverse of
using stop words; that is, I want to use a set of "accepted" words.

So I planned to create a new filter that just checks that incoming words are
in the "acceptable set" and discards them otherwise. Are you aware of any
analyzer/filter out there that uses this approach? Is there a better way
to do this?
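
A minimal sketch of the "accepted set" idea described above, in plain Java rather than against Lucene's TokenFilter API. The class name KeepWordsSketch and the method keepAccepted are hypothetical, and the lower-casing step is an assumption about how the accepted set is stored; a real filter would apply the same membership test to each token in the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Lucene or Solr. */
final class KeepWordsSketch {

    /** Returns only the tokens found in the accepted set, preserving their order. */
    static List<String> keepAccepted(List<String> tokens, Set<String> accepted) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Assumption: the accepted set holds lower-case words, so compare case-insensitively.
            if (accepted.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> accepted = new HashSet<>(Arrays.asList("lucene", "solr"));
        List<String> tokens = Arrays.asList("I", "index", "text", "with", "Lucene", "and", "Solr");
        System.out.println(keepAccepted(tokens, accepted));
    }
}
```

This is the mirror image of a stop-word check: a stop filter drops a token when the set contains it, while this sketch drops a token when the set does not.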

Best Regards,
Martin
-- 
View this message in context: http://www.nabble.com/Opposite-to-StopFilter.-Anything-already-implemented-out-there--tp18585878p18585878.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Opposite to StopFilter. Anything already implemented out there?

Posted by mpermar <mp...@gmail.com>.
Absolutely!

Thanks Steven. 

Best Regards,
Martin


Steven A Rowe wrote:
> 
> Hi Martin,
> 
> On 07/22/2008 at 5:48 AM, mpermar wrote:
>> I want to index some incoming text. In this case what I want
>> to do is just detect keywords in that text. Therefore I want
>> to discard everything that is not in the keywords set. This
>> sounds to me pretty much like the reverse of using stop words,
>> that is, I want to use a set of "accepted" words.
>> 
>> So I planned to create a new filter that just checks that
>> incoming words are in the "acceptable set" and discards them
>> otherwise. Are you aware of any analyzer/filter out there that
>> uses this approach? Is there any other better way to do this?
> 
> Solr has KeepWordFilter - it sounds exactly like what you want: 
> 
> Javadoc:
> <http://lucene.apache.org/solr/api/org/apache/solr/analysis/KeepWordFilter.html>
> Source:
> <http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/KeepWordFilter.java?view=markup>
> 
> Depending on your requirements and the nature of your keywords list, you
> might consider applying this filter only to queries, rather than at index
> time.  That way, the keyword list can change without having to re-index.
> 
> Steve
> 



RE: Opposite to StopFilter. Anything already implemented out there?

Posted by Steven A Rowe <sa...@syr.edu>.
Hi Martin,

On 07/22/2008 at 5:48 AM, mpermar wrote:
> I want to index some incoming text. In this case what I want
> to do is just detect keywords in that text. Therefore I want
> to discard everything that is not in the keywords set. This
> sounds to me pretty much like the reverse of using stop words,
> that is, I want to use a set of "accepted" words.
> 
> So I planned to create a new filter that just checks that
> incoming words are in the "acceptable set" and discards them
> otherwise. Are you aware of any analyzer/filter out there that
> uses this approach? Is there any other better way to do this?

Solr has KeepWordFilter - it sounds exactly like what you want: 

Javadoc: <http://lucene.apache.org/solr/api/org/apache/solr/analysis/KeepWordFilter.html>
Source: <http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/KeepWordFilter.java?view=markup>

Depending on your requirements and the nature of your keywords list, you might consider applying this filter only to queries, rather than at index time.  That way, the keyword list can change without having to re-index.
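
To illustrate the query-time suggestion above, here is a rough sketch in plain Java with a toy in-memory inverted index; it does not use Lucene or Solr classes. The class QueryTimeKeepSketch and its methods add and search are hypothetical names, and the whitespace tokenization and OR-style matching are simplifying assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index for illustration; not Lucene's or Solr's API. */
final class QueryTimeKeepSketch {

    // term -> ids of documents containing it; built from ALL tokens, unfiltered
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Indexes every whitespace-separated term of the text under the given id. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks up only those query terms that are in the current accepted set. */
    Set<Integer> search(String query, Set<String> accepted) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : query.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (accepted.contains(term)) {
                hits.addAll(index.getOrDefault(term, Collections.emptySet()));
            }
        }
        return hits;
    }
}
```

Because the index keeps every term, passing a different accepted set to search changes which query terms count without touching the indexed data, which is the trade-off described above.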

Steve
