Posted to solr-user@lucene.apache.org by Mike Perham <mp...@onespot.com> on 2010/01/28 22:46:54 UTC

implementing profanity detector

We'd like to implement a profanity detector for documents during indexing.
 That is, given a file of profane words, we'd like to be able to mark a
document as safe or not safe if it contains any of those words so that we
can have something similar to google's safe search.

I'm trying to figure out how best to implement this with Solr 1.4:

- An UpdateRequestProcessor would allow me to dynamically populate a "safe"
boolean field but requires me to pull out the content, tokenize it and run
each token through my set of profanities, essentially running the analysis
pipeline again.  That's a lot of overhead, AFAIK.

- A TokenFilter would allow me to tap into the existing analysis pipeline so
I get the tokens for free but I can't access the document.

Any suggestions on how to best implement this?

Thanks in advance,
mike

Re: implementing profanity detector

Posted by Alexey Serba <as...@gmail.com>.
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham <mp...@onespot.com> wrote:
> We'd like to implement a profanity detector for documents during indexing.
>  That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overhead, AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
>
> Any suggestions on how to best implement this?
>
> Thanks in advance,
> mike
>

Re: implementing profanity detector

Posted by Lance Norskog <go...@gmail.com>.
You could have a synonym file that, for each dirty word, changes the
word into an "impossible word": for example, xyzzy. Then, a search for
clean contents is:

(user search) AND NOT xyzzy

A synonym filter that included payloads would be cool.
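
For example, a rough sketch using Solr's stock SynonymFilterFactory (the file
name and words below are made up). Keeping the original word as well, instead
of replacing it outright, leaves the profane term itself searchable while
still adding the marker token:

  # profanity-synonyms.txt
  badword1 => badword1, xyzzy
  badword2 => badword2, xyzzy

  <!-- index-time filter in the text field's analyzer (sketch) -->
  <filter class="solr.SynonymFilterFactory" synonyms="profanity-synonyms.txt"
          ignoreCase="true"/>

and the clean search is then (user search) AND NOT xyzzy, or a negated filter
query on xyzzy.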

On Thu, Jan 28, 2010 at 2:31 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> How about this crazy idea - a custom TokenFilter that stores the safe flag in ThreadLocal?
>
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> ----- Original Message ----
>> From: Mike Perham <mp...@onespot.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thu, January 28, 2010 4:46:54 PM
>> Subject: implementing profanity detector
>>
>> We'd like to implement a profanity detector for documents during indexing.
>> That is, given a file of profane words, we'd like to be able to mark a
>> document as safe or not safe if it contains any of those words so that we
>> can have something similar to google's safe search.
>>
>> I'm trying to figure out how best to implement this with Solr 1.4:
>>
>> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
>> boolean field but requires me to pull out the content, tokenize it and run
>> each token through my set of profanities, essentially running the analysis
>> pipeline again.  That's a lot of overhead, AFAIK.
>>
>> - A TokenFilter would allow me to tap into the existing analysis pipeline so
>> I get the tokens for free but I can't access the document.
>>
>> Any suggestions on how to best implement this?
>>
>> Thanks in advance,
>> mike
>
>



-- 
Lance Norskog
goksron@gmail.com

implementing profanity detector

Posted by Mike Perham <mp...@onespot.com>.
FYI, this does not work.  It appears that the update runs on a
different thread from the analysis, perhaps because the update is
applied when the commit happens?  I'm sending the document XML with
commitWithin="60000".

I would appreciate any other ideas.  I'm drawing a blank on how to
implement this efficiently with Lucene/Solr.

mike

On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> How about this crazy idea - a custom TokenFilter that stores the safe flag in ThreadLocal?
>
>
>
> ----- Original Message ----
> > From: Mike Perham <mp...@onespot.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thu, January 28, 2010 4:46:54 PM
> > Subject: implementing profanity detector
> >
> > We'd like to implement a profanity detector for documents during indexing.
> > That is, given a file of profane words, we'd like to be able to mark a
> > document as safe or not safe if it contains any of those words so that we
> > can have something similar to google's safe search.
> >
> > I'm trying to figure out how best to implement this with Solr 1.4:
> >
> > - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> > boolean field but requires me to pull out the content, tokenize it and run
> > each token through my set of profanities, essentially running the analysis
> > pipeline again.  That's a lot of overhead, AFAIK.
> >
> > - A TokenFilter would allow me to tap into the existing analysis pipeline so
> > I get the tokens for free but I can't access the document.
> >
> > Any suggestions on how to best implement this?
> >
> > Thanks in advance,
> > mike
>

Re: implementing profanity detector

Posted by Otis Gospodnetic <ot...@yahoo.com>.
How about this crazy idea - a custom TokenFilter that stores the safe flag in ThreadLocal?
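
For the record, a minimal sketch of that idea (class, field, and list names
are made up; it assumes the analysis chain and whatever reads the flag run on
the same thread, which Mike reports elsewhere in this thread does not seem to
hold):

  import java.io.IOException;
  import org.apache.lucene.analysis.CharArraySet;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  // Sketch only: flags the current thread as "unsafe" when a profanity passes through.
  public final class ProfanityFlagFilter extends TokenFilter {
    public static final ThreadLocal<Boolean> SAFE = new ThreadLocal<Boolean>();

    private final CharArraySet profanities;
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);

    public ProfanityFlagFilter(TokenStream input, CharArraySet profanities) {
      super(input);
      this.profanities = profanities;
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false;
      }
      if (profanities.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
        SAFE.set(Boolean.FALSE);  // whatever builds the document would read and reset this
      }
      return true;  // pass every token through unchanged
    }
  }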


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Mike Perham <mp...@onespot.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, January 28, 2010 4:46:54 PM
> Subject: implementing profanity detector
> 
> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
> 
> I'm trying to figure out how best to implement this with Solr 1.4:
> 
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overhead, AFAIK.
> 
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
> 
> Any suggestions on how to best implement this?
> 
> Thanks in advance,
> mike


Re: implementing profanity detector

Posted by Mike Perham <mp...@onespot.com>.
On Thu, Feb 11, 2010 at 10:49 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
> Otherwise, I'd do it via copy fields.  Your first field is your main field and is analyzed as before.  Your second field does the profanity detection and simply outputs a single token at the end, safe/unsafe.
>
> How long are your documents?  The extra copy field is extra work, but in this case it should be fast as you should be able to create a pretty streamlined analyzer chain for the second task.
>

The documents are web page text, so they shouldn't be more than 10-20k
generally.  Would something like this do the trick?

  private boolean done = false;

  @Override
  public boolean incrementToken() throws IOException {
    if (done) return false;
    done = true;
    while (input.incrementToken()) {
      if (profanities.contains(termAtt.termBuffer(), 0, termAtt.termLength())) {
        termAtt.setTermBuffer("y", 0, 1);  // profanity found: emit a single "y" token
        return true;
      }
    }
    termAtt.setTermBuffer("n", 0, 1);  // nothing found: emit a single "n" token
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }

mike

Re: implementing profanity detector

Posted by Lance Norskog <go...@gmail.com>.
A problem is that your profanity list will not stop growing, and with
each new word you will want to rescrub the index.

We had a thousand-word NOT clause in every query (a filter query would
be true for 99% of the index) until we switched to another
arrangement.

Another small problem was that I knew of many more perversions than my
co-workers, but did not wish to display my vast erudition in the
seamier side of life :)

On Fri, Feb 12, 2010 at 4:26 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : Otherwise, I'd do it via copy fields.  Your first field is your main
> : field and is analyzed as before.  Your second field does the profanity
> : detection and simply outputs a single token at the end, safe/unsafe.
>
> you don't even need custom code for this ... copyField all your text into
> a 'has_profanity' field where you use a suitable Tokenizer followed by a
> KeepWordFilter that only keeps profane words and then a
> PatternReplaceFilter that matches .* and replaces it with "HELL_YEA"
> ... now a search for "has_profanity:HELL_YEA" finds all profane docs, with
> the added bonus that the scores are based on how many profane words occur
> in the doc.
>
> it could be used as a filter query (probably negated) as needed.
>
>
>
> -Hoss
>
>



-- 
Lance Norskog
goksron@gmail.com

Re: implementing profanity detector

Posted by Chris Hostetter <ho...@fucit.org>.
: Otherwise, I'd do it via copy fields.  Your first field is your main 
: field and is analyzed as before.  Your second field does the profanity 
: detection and simply outputs a single token at the end, safe/unsafe.

you don't even need custom code for this ... copyField all your text into
a 'has_profanity' field where you use a suitable Tokenizer followed by a
KeepWordFilter that only keeps profane words and then a
PatternReplaceFilter that matches .* and replaces it with "HELL_YEA"
... now a search for "has_profanity:HELL_YEA" finds all profane docs, with
the added bonus that the scores are based on how many profane words occur 
in the doc.

it could be used as a filter query (probably negated) as needed.
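
for instance, something along these lines in schema.xml (field, type, and
file names are made up; profanities.txt would be your word list):

  <fieldType name="profanity_marker" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- keep only the profane words -->
      <filter class="solr.KeepWordFilterFactory" words="profanities.txt" ignoreCase="true"/>
      <!-- collapse every surviving token to the single marker value -->
      <filter class="solr.PatternReplaceFilterFactory" pattern=".*" replacement="HELL_YEA" replace="all"/>
    </analyzer>
  </fieldType>

  <field name="has_profanity" type="profanity_marker" indexed="true" stored="false"/>
  <copyField source="text" dest="has_profanity"/>

with the safe-search case being e.g. fq=-has_profanity:HELL_YEA.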



-Hoss


Re: implementing profanity detector

Posted by Grant Ingersoll <gs...@apache.org>.
On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:

> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
> 
> I'm trying to figure out how best to implement this with Solr 1.4:
> 
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overhead, AFAIK.
> 
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
> 
> Any suggestions on how to best implement this?
> 


TeeSinkTokenFilter (Lucene only) would do the trick if you're up for some hardcoding, because it isn't supported all that well in Solr (patch welcome).  A one-off solution shouldn't be too hard to wedge in, but it will involve hardcoding some field names in your analyzer, I think.
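
Roughly, at the Lucene level it looks like this (field names made up;
'profanities' is assumed to be a CharArraySet of your word list, and wiring
this into Solr's analysis chain is where the hardcoding comes in):

  TeeSinkTokenFilter.SinkFilter profaneOnly = new TeeSinkTokenFilter.SinkFilter() {
    @Override
    public boolean accept(AttributeSource source) {
      TermAttribute term = source.getAttribute(TermAttribute.class);
      return profanities.contains(term.termBuffer(), 0, term.termLength());
    }
  };

  TokenStream text = analyzer.tokenStream("text", new StringReader(content));
  TeeSinkTokenFilter tee = new TeeSinkTokenFilter(text);
  TeeSinkTokenFilter.SinkTokenStream profane = tee.newSinkTokenStream(profaneOnly);

  Document doc = new Document();
  doc.add(new Field("text", tee));               // the tee field must be consumed first
  doc.add(new Field("has_profanity", profane));  // the sink only sees accepted tokens
  writer.addDocument(doc);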

Otherwise, I'd do it via copy fields.  Your first field is your main field and is analyzed as before.  Your second field does the profanity detection and simply outputs a single token at the end, safe/unsafe.

How long are your documents?  The extra copy field is extra work, but in this case it should be fast as you should be able to create a pretty streamlined analyzer chain for the second task.

Short term, I'd do the copy field approach while maybe, depending on its importance to you, working on the first approach.

-Grant