Posted to solr-user@lucene.apache.org by Mike Perham <mp...@onespot.com> on 2010/02/10 23:22:13 UTC

implementing profanity detector

FYI, this does not work.  The update appears to run on a different
thread than the analysis, perhaps because the update is deferred
until the commit happens?  I'm sending the document XML with
commitWithin="60000".

I would appreciate any other ideas.  I'm drawing a blank on how to
implement this efficiently with Lucene/Solr.
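To make the thread problem concrete, here is a minimal, self-contained sketch of the ThreadLocal flag pattern (no Lucene dependency; ProfanityFlagFilter and the token iterator are hypothetical stand-ins for a real TokenFilter, not Solr API).  The flag is only visible on the thread that ran the analysis, which is exactly why it breaks when the update happens on a different thread:

```java
import java.util.Iterator;
import java.util.Set;

public class ProfanityFlagFilter implements Iterator<String> {
    // Per-thread "safe" flag, reset before each document is analyzed.
    // Because it is a ThreadLocal, a reader on any OTHER thread sees
    // only the initial TRUE value, never the flag set during analysis.
    private static final ThreadLocal<Boolean> SAFE =
        ThreadLocal.withInitial(() -> Boolean.TRUE);

    private final Iterator<String> tokens;
    private final Set<String> profanities;

    public ProfanityFlagFilter(Iterator<String> tokens, Set<String> profanities) {
        this.tokens = tokens;
        this.profanities = profanities;
        SAFE.set(Boolean.TRUE);  // reset for the new document
    }

    @Override public boolean hasNext() { return tokens.hasNext(); }

    @Override public String next() {
        String tok = tokens.next();
        if (profanities.contains(tok)) {
            SAFE.set(Boolean.FALSE);  // remember we saw a profane token
        }
        return tok;                   // pass the token through unchanged
    }

    // Read back the flag -- only valid on the thread that consumed the tokens.
    public static boolean isSafe() { return SAFE.get(); }
}
```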

mike

On Thu, Jan 28, 2010 at 4:31 PM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
>
> How about this crazy idea - a custom TokenFilter that stores the safe flag in ThreadLocal?
>
>
>
> ----- Original Message ----
> > From: Mike Perham <mp...@onespot.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thu, January 28, 2010 4:46:54 PM
> > Subject: implementing profanity detector
> >
> > We'd like to implement a profanity detector for documents during indexing.
> > That is, given a file of profane words, we'd like to be able to mark a
> > document as safe or not safe if it contains any of those words so that we
> > can have something similar to google's safe search.
> >
> > I'm trying to figure out how best to implement this with Solr 1.4:
> >
> > - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> > boolean field, but it requires me to pull out the content, tokenize it, and
> > run each token through my set of profanities, essentially running the
> > analysis pipeline again.  That's a lot of overhead, AFAIK.
> >
> > - A TokenFilter would allow me to tap into the existing analysis pipeline, so
> > I get the tokens for free, but I can't access the document.
> >
> > Any suggestions on how to best implement this?
> >
> > Thanks in advance,
> > mike
>
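For the archive: the UpdateRequestProcessor option discussed above boils down to re-tokenizing the content field and checking each token against the profanity set.  A minimal, Lucene-free sketch of that check (the class, method, and tokenizer here are illustrative stand-ins, not Solr API; a real processor would reuse the field's configured analyzer rather than a regex split):

```java
import java.util.Locale;
import java.util.Set;

public class SafeSearchCheck {
    // Crude whitespace/punctuation tokenizer standing in for the real
    // analysis chain.  Lowercases to match a case-insensitive word list.
    static boolean isSafe(String content, Set<String> profanities) {
        for (String tok : content.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (profanities.contains(tok)) {
                return false;  // first profane token marks the doc unsafe
            }
        }
        return true;
    }
}
```

The processor would then set the document's "safe" boolean field from this result before the document reaches the index, which is the duplicated-analysis cost Mike describes.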