Posted to java-user@lucene.apache.org by Jeff Thorne <je...@yahoo.com> on 2006/02/06 04:56:21 UTC

Inappropriate content detection

I am trying to figure out whether Lucene is an appropriate solution for a
problem that our site faces. Our site allows users to post their opinions on
various topics. Due to legislation in various countries, our management would
like us to scan each user's post against keywords that would indicate
inappropriate content. We are looking for racial slurs, profanity and attacks
against sexual orientation. Each user's post is generally no more than a few
paragraphs.

I would like to analyze each user's post for various words and expressions
before publishing the post to the DB. I am reading through the Lucene in
Action book, and it looks as if I cannot analyze a string without first
indexing it. If this is true, will indexing each post be a performance hit to
the site? I was wondering if someone could shed some light on the best way
to tackle this problem with Lucene, or with another API if doing so makes
more sense.

 

Thanks,

Jeff

Re: Inappropriate content detection

Posted by Jeff Rodenburg <je...@gmail.com>.

You can generate a token stream for a block of text without having to index
it. Take a look at the highlighter code; it does this very thing.




Re: Inappropriate content detection

Posted by Daniel Noll <da...@nuix.com.au>.

Jeff Thorne wrote:
> I am trying to figure out whether or not Lucene is an appropriate solution
> for a problem that our site faces.
<cut>
> I would like to analyze each users post for various words and expressions
> before publishing their post to the DB. I am reading through the Lucene in
> action book and it looks as if I cannot analyze a string without first
> indexing it. If this is true will indexing each post be a performance hit to
> the site? I was wondering if someone could shed some light on the best way
> to tackle this problem with Lucene or another api if doing so makes more
> sense?

You can definitely use Lucene's analyser classes without indexing.  Our 
own application does this when it needs to do things like highlighting 
text on the screen.

The idea would be that you keep a list of terms which are considered 
nasty; every new document gets analysed, and you look through the 
terms returned from the analyser for the suspicious ones.
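
As a rough illustration of that idea, here is a minimal sketch in plain
Java. It is not Lucene code: the lowercase-and-split step only stands in
for whatever analyser you would really use, and the class name
BlocklistScanner, its methods and the sample terms are all made up for
the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: flags posts containing any term from a blocklist. */
class BlocklistScanner {
    private final Set<String> blockedTerms = new HashSet<>();

    BlocklistScanner(Set<String> terms) {
        // Normalise the blocklist the same way post tokens are normalised below.
        for (String term : terms) {
            blockedTerms.add(term.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the blocked terms found in the post, in order of first appearance. */
    Set<String> findBlockedTerms(String post) {
        Set<String> hits = new LinkedHashSet<>();
        // Stand-in for an analyser: lowercase, then split on anything
        // that is not a letter or a digit.
        for (String token : post.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (blockedTerms.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        BlocklistScanner scanner =
                new BlocklistScanner(new HashSet<>(Arrays.asList("darn", "heck")));
        // prints [darn, heck]
        System.out.println(scanner.findBlockedTerms("Well, DARN it. What the heck?"));
    }
}
```

Nothing is indexed here; the post is tokenised once and each token is
checked against an in-memory set.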

But no, it certainly isn't something that Lucene as a whole is very good 
at solving.  Lucene is fast for executing a single query against 
multiple documents, but what you really need is something fast for 
executing multiple queries against a single document.
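
One way to picture "multiple queries against a single document" is to put
all the watched expressions in a hash set and slide a window over the
post's words, so the cost depends on the length of the post rather than
on the number of expressions. The following is only a sketch under that
assumption, in plain Java with no Lucene involved; ExpressionMatcher and
the sample expressions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: matches many watched expressions against one post in one pass. */
class ExpressionMatcher {
    private final Set<String> expressions = new HashSet<>();
    private int maxWords = 1; // length in words of the longest watched expression

    ExpressionMatcher(List<String> watched) {
        for (String expression : watched) {
            List<String> words = tokenize(expression);
            if (words.isEmpty()) {
                continue;
            }
            expressions.add(String.join(" ", words));
            maxWords = Math.max(maxWords, words.size());
        }
    }

    /** Lowercases and splits on anything that is not a letter or a digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns each watched expression that occurs as a run of consecutive words. */
    List<String> match(String post) {
        List<String> tokens = tokenize(post);
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder window = new StringBuilder();
            // Grow the window one word at a time, up to the longest expression.
            for (int len = 1; len <= maxWords && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    window.append(' ');
                }
                window.append(tokens.get(start + len - 1));
                String candidate = window.toString();
                if (expressions.contains(candidate) && !hits.contains(candidate)) {
                    hits.add(candidate);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ExpressionMatcher matcher =
                new ExpressionMatcher(Arrays.asList("go away", "nitwit"));
        // prints [go away, nitwit]
        System.out.println(matcher.match("Oh, GO   away, you nitwit!"));
    }
}
```

Each post costs roughly (number of words) x (longest expression length)
set lookups, however many expressions are on the list.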

Daniel

-- 
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Inappropriate content detection

Posted by Daniel Noll <da...@nuix.com.au>.

Jason Polites wrote:
> There is also an open source Java anti-spam API which does a Bayesian
> scan of email content (plus other stuff).
> 
> You could retro-fit to work with raw text.

There is also Classifier4J, which is more geared toward pure 
classification (it comes with a Bayesian classifier, but others can be 
implemented).  Perhaps it's better than retro-fitting something more 
powerful, perhaps not.

http://classifier4j.sourceforge.net/

Daniel


Re: Inappropriate content detection

Posted by Jason Polites <ja...@tpg.com.au>.

There is also an open source Java anti-spam API which does a Bayesian scan
of email content (plus other stuff).

You could retrofit it to work with raw text.

www.jasen.org

(get the latest HEAD from CVS as the current release is a bit old... new
version imminent)

----- Original Message ----- 
From: "Gwyn Carwardine" <gw...@carwardine.net>
To: <ja...@lucene.apache.org>
Sent: Tuesday, February 07, 2006 12:58 AM
Subject: RE: Inappropriate content detection


> The good bit about Bayesian is that it continuously learns.
>
> The downside is that you have to teach it.
>
> Not quite as simple as a list of rude words.
>
> There's an open source Bayesian mail filter called spambayes
> (http://spambayes.sourceforge.net) which may lead you to interesting
> places.
>
> -Gwyn

RE: Inappropriate content detection

Posted by Gwyn Carwardine <gw...@carwardine.net>.

The good bit about Bayesian is that it continuously learns.

The downside is that you have to teach it.

Not quite as simple as a list of rude words. 

There's an open source Bayesian mail filter called spambayes
(http://spambayes.sourceforge.net) which may lead you to interesting places.
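
To make the "you have to teach it" point concrete, here is a minimal
sketch of a two-class naive Bayes filter in plain Java. It is a generic
textbook version with add-one smoothing, not the algorithm used by
spambayes or any other library; NaiveBayesFilter, its methods and the
training sentences are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical two-class naive Bayes text filter with add-one smoothing. */
class NaiveBayesFilter {
    private final Map<String, Integer> badCounts = new HashMap<>();
    private final Map<String, Integer> okCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int badTokens, okTokens, badPosts, okPosts;

    /** The "teaching" step: record word counts from a post a human has labelled. */
    void train(String post, boolean inappropriate) {
        Map<String, Integer> counts = inappropriate ? badCounts : okCounts;
        for (String token : tokenize(post)) {
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
            if (inappropriate) badTokens++; else okTokens++;
        }
        if (inappropriate) badPosts++; else okPosts++;
    }

    /** Returns true when the post scores higher under the "inappropriate" model. */
    boolean isInappropriate(String post) {
        // Smoothed class priors, compared in log space to avoid underflow.
        double bad = Math.log((badPosts + 1.0) / (badPosts + okPosts + 2.0));
        double ok = Math.log((okPosts + 1.0) / (badPosts + okPosts + 2.0));
        // One extra vocabulary slot stands for words never seen in training.
        int v = vocabulary.size() + 1;
        for (String token : tokenize(post)) {
            bad += Math.log((badCounts.getOrDefault(token, 0) + 1.0) / (badTokens + v));
            ok += Math.log((okCounts.getOrDefault(token, 0) + 1.0) / (okTokens + v));
        }
        return bad > ok;
    }

    /** Lowercases and splits on anything that is not a letter or a digit. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        NaiveBayesFilter filter = new NaiveBayesFilter();
        filter.train("you are a nitwit", true);
        filter.train("what a nitwit you are", true);
        filter.train("thanks for the helpful answer", false);
        filter.train("you are very helpful", false);
        System.out.println(filter.isInappropriate("such a nitwit"));    // prints true
        System.out.println(filter.isInappropriate("a helpful answer")); // prints false
    }
}
```

Every call to train() is a post somebody had to label by hand, which is
the cost compared with a plain word list.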

-Gwyn

-----Original Message-----
From: Jeff Thorne [mailto:jeff_thorne@yahoo.com] 
Sent: 06 February 2006 13:30
To: java-user@lucene.apache.org
Subject: RE: Inappropriate content detection

The site will have a million+ posts. I am not familiar with Bayesian
algorithms. Is there an off-the-shelf API that can provide this type of
capability? As for performance, would Bayesian be the way to go over Lucene?

Thanks for the help,
Jeff


RE: Inappropriate content detection

Posted by Jeff Thorne <je...@yahoo.com>.

The site will have a million+ posts. I am not familiar with Bayesian
algorithms. Is there an off-the-shelf API that can provide this type of
capability? As for performance, would Bayesian be the way to go over Lucene?

Thanks for the help,
Jeff

-----Original Message-----
From: gekkokid [mailto:me@gekkokid.org.uk] 
Sent: Sunday, February 05, 2006 8:40 PM
To: java-user@lucene.apache.org
Subject: Re: Inappropriate content detection

Hi, what scale is this website? Millions of posts or under?

Wouldn't it be easier to use a Bayesian algorithm to scan each new post 
before it is posted to detect whether it is acceptable or not? Just a quick 
idea off the top of my head.



_gk


Re: Inappropriate content detection

Posted by gekkokid <me...@gekkokid.org.uk>.

Hi, what scale is this website? Millions of posts or under?

Wouldn't it be easier to use a Bayesian algorithm to scan each new post 
before it is posted to detect whether it is acceptable or not? Just a quick 
idea off the top of my head.



_gk
