You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@jspwiki.apache.org by Andrew Jaquith <an...@gmail.com> on 2009/09/25 19:54:55 UTC

Spam package redesign

** Warning: long post **

After some fooling around and some actual work, I've finished my first
pass at refactoring on the anti-spam code. I'm proposing a new
package, org.apache.wiki.content.inspect, which contains a
general-purpose content-inspection capability, of which spam is just
one potential application. Here is a draft of the package javadocs.

The "inspect" package provides facilities that allow content to be
inspected and scored based on various criteria such as whether a wiki
page change contains spam. Content may be scored by initializing an
Inspection object and then calling the {@link Inspection#inspect()}
method. The {@code inspect} method, in turn, iterates through a
variety of previously-configured {@link Inspector} objects and calls
the {@link Inspector#inspect(String,String)} method for each one. Each
of the configured inspectors is free to perform whatever evaluations
it wishes, and can increase, decrease or reset the "scores" for any
score category, for example the {@link Score#SPAM} category.

Callers can also add to the Inspection "limit" or thresholds for
categories so that inspection stops when one of these limits is
reached. For example, a caller can specify that the Inspection should
stop immediately when Score.SPAM reaches 3. An Inspector can
arbitrarily halt processing by throwing an {@link
InspectionInterruptedException}. The choice of Inspectors to use for
the Inspection, the types of Scores that are modified, and what limits
are placed on scores, are completely configurable.

How to use the classes in this package

The key classes in this package are: Inspection, InspectionContext,
Score, and Inspector.

* The Inspection is the core class in this package. It implements the
Gang-of-Four "strategy" pattern and is itself quite lightweight. It is
initialized by constructing a new Inspection object and supplying the
current WikiContext, InspectionContext, and an array of Inspector
objects that conduct the inspection. The Inspection maintains an
internal running integer total for each Score category, for example
{@link Score#SPAM}. A Score category can be incremented, decremented,
or reset to zero by calling {@link Inspection#changeScore(Score)}. The
integer value for a Score is retrieved at any time by calling {@link
Inspection getScore(Score.Type)}. The initial value is always 0. If a
Score category exceeds its threshold, {@link Inspection#isFailed()}
will return true.

* InspectionContext keeps references to the WikiEngine and other
shared-state objects needed by Inspections and Inspectors. The
InspectionContext persists between HTTP requests and keeps a reference
to the {@link BanList}. It also tracks the IP addresses of hosts that
have modified content recently, along with the changes they have made.
Callers (such as Inspectors) can add a host to the list of recent
modifications by calling
InspectionContext.addModifier(HttpServletRequest,Change). The
InspectionContext is normally initialized just once, as part of the
WikiEngine startup.

* Score objects supply instructions to the parent Inspection object to
increment, decrement or reset the score for a particular category.
Each Score object is constructed with a category (for example,
Score.SPAM), an integer indicating how much to change the score, and
an optional String message that provides context for the change. For
example, a Score that increments the spam score by 1 could be
constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
numbers can be supplied also to decrease the score. For convenience,
{@link Score#INCREMENT_SCORE} means "add 1", {@link
Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
means "reset to zero."

* The Inspector interface specifies how to implement a particular
strategy for inspecting content. The Inspector has just one primary
method: {@link Inspector#inspect(Inspection, String, Change)}.
Inspectors do just about anything. The Inspection parameter can be
used to obtain the HttpServletRequest and WikiContext. Scores can be
changed by calling {@link Inspection#changeScore(Score)}. Inspectors
that need to terminate processing can throw an {@link
InterruptedInspectionException}. Inspectors are intended to be
instantiated just once, and are initialized via the method {{@link
Inspector#initialize(InspectionContext)}. The InspectionContext
parameter can be used to obtain references to the BanList,
WikiEngine,and WikiEngine Properties. Inspectors are meant to be
re-used and should not keep state between invocations of inspect.
Examples of Inspector implementations include the ChangeRateInspector
(which determines whether the current user has made too many recent
changes) and the LinkCountInspector (which counts links).

Here is an example of how to create and execute an Inspection, modeled
after how SpamFilter does it. SpamFilter's initialize method
constructs the InspectionContext, which will be shared by multiple
Inspections. Notice how the Inspector objects are instantiated and
initialized at this time also.

private InspectionContext m_config = null;

public void initialize( WikiEngine engine, Properties properties )
{
  ...
  m_config = new InspectionContext( engine, properties );
  ...
  m_inspectors = new Inspector[] { new UserInspector(), new
BanListInspector(), ... };
  for( Inspector inspector : m_inspectors )
  {
    inspector.initialize( m_config );
  }
}

Later, when SpamFilter's preSave method executes, a new lightweight
Inspection object is created. The WikiContext, String and Change
parameters supply all the request, content and change information,
respectively, for conducting the inspection:


public String preSave( WikiContext context, String content ) throws
RedirectException
{
  Change change = getChange( context, content );

  // Run the Inspection
  Inspection inspection = new Inspection( context, m_config, m_inspectors );
  inspection.addLimit( Score.SPAM, m_scoreLimit );
  inspection.inspect( content, change );
  int spamScore = inspection.getScore( Score.SPAM );
  context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );

  // Redirect user if score too high
  if( inspection.isFailed() )
  {
    ...
  }
  ...
}
Finally, notice also how the inspection.addLimit method is used to set
an upper limit for spam scoring. If the limit is reached, the
inspection.isFailed() condition will hold, and the redirection code
(unspecified in this example) will execute.

----

So, that's how it's shaping up so far. I'm quite happy with the
design, although I haven't written unit tests to shake out the bugs.
All of the existing spam-checking logic has been re-factored into the
Inspector classes, which include: BanListInspector, BotTrapInspector,
ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
and UserInspector. I also plan to write a CaptchInspector for
validating Captcha responses. Creating additional Inspectors is easy,
and can be plugged into the array of Inspectors passed to an
Inspection.

I can foresee other uses for this too, for example general-purpose
content classification. But that's for another day.

Comments, thoughts? It's going to take some time to get unit tests
done, so I won't be committing this for a little while.

Andrew

Re: Spam package redesign

Posted by Andrew Jaquith <an...@gmail.com>.
I just checked in the org.apache.wiki.content.inspect package. It is
very close to what we agreed on before. I incorporated the listener
strategy Janne suggested. An additional key change was the creation of
the "InspectionPlan" class where the number, order, and weighting for
the Inspector classes are organized. Thus, you could create a "spam"
InspectionPlan with a defined stack of Inspectors. I've provided one
for this purpose that is created as a singleton-per-wiki and
instantiated at startup based on jspwiki.properties.

All of this is well documented in the package I checked in. If you
want to see the package.html javadoc, it is here:

https://svn.apache.org/repos/asf/incubator/jspwiki/trunk/src/java/org/apache/wiki/content/inspect/package.html

There are a few caveats with this initial release. First, I haven't
yet written an Akismet Inspector, or ones for various Captcha
implementations. Both of these are coming next -- it should be
straightforward to refactor the existing code into them. Second, I
haven't yet reworked Comment.jsp or the FCK editor JSP to work with
the new inspection API. Also, I don't yet have default "weights" for
the inspectors in the jspwiki.properties template. All of these things
will come in future checkins.

But in general, I'm fairly pleased with how it turned out. It is very
extensible and flexible, but also very SIMPLE to understand. Also,
I've written some reasonably extensive unit tests that were a lot more
comprehensive for testing the various Inspectors than what we had for
the monolithic SpamFilter.

As always, comments are welcome.

Andrew

On Mon, Sep 28, 2009 at 11:24 AM, Andrew Jaquith
<an...@gmail.com> wrote:
> There's some room for optimization in what you proposed, but your
> p-code is close to what I was thinking. in essence, it is the Strategy
> pattern that calls the inspector logic, combined with a listener for
> scoring changes (aka Visitor pattern) that interested callers would
> subscribe to. That works.
>
> After having originally suggested an "InspectionInterruptedException"
> so that inspectors could abort a running inspection chain, I now think
> it's better to simply return an object from the listener/visited
> method, rather than throw an exception. Exceptions could be perceived
> by an implementer as a little trickier to debug. I'd rather make the
> API simpler.
>
> FYI, the workflow package isn't totally suitable here. When you cited
> the Decision class, I suspect you actually meant the Outcome class,
> no?
>
> Ok. I've got what I need to do this properly. This should work pretty well.
>
> Andrew
>
> On Mon, Sep 28, 2009 at 3:11 AM, Janne Jalkanen
> <Ja...@ecyrd.com> wrote:
>>
>> Yah, it's just that smaller numbers are usually easier to deal with (numbers
>> from 1-20).  We need more resolution at the low end, not at the high end.
>>  That's my preference for using floats.
>>
>> A simple way to do as you suggest would be to have a Score return value,
>> e.g.
>>
>> public Score inspect(...)
>> {
>>    // Detection score goes here
>>
>>    if( isSpam )
>>        return new Score( "mywonderfulspamdetectortest", 0.5f );
>>
>>    return Score.OK; // where score.OK is defined to be 0.0f.
>> }
>>
>> where the first String is a key and the float is a default value, which can
>> be overridden in a config file using the key. This would be pretty close to
>> the SpamAssassin model (and close to the java.util.Properties conceptually,
>> so it would be easy to adapt).
>>
>> (The reason why I'm proposing a default value is so that when you add a new
>> Inspector, you don't have to modify two files.)
>>
>> After each Inspector would be called, then the parent would call the
>> callback I proposed, and that would then determine what to do.  In
>> pseudocode:
>>
>> // JSPWiki internal code.
>> private void inspect( DecisionMaker decisionMaker, Change change, <some
>> other values> )
>> {
>>    ArrayList scores;
>>    for( Inspector inspector : m_inspectorChain )
>>    {
>>        Score s = inspector.inspect( change, <some values> );
>>
>>        scores.add( s );
>>
>>        if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
>>    }
>> }
>>
>> We can then provide something like this
>>
>> class SpamFilter implements PageFilter, DecisionMaker
>> {
>>    ...
>>
>>    // This emulates the 2.8.x SpamFilter functionality.
>>    public Decision visit( List<Score> scores )
>>    {
>>        if( scores.last().getValue() > 0.0 ) return Decision.SPAM;
>>
>>        return Decision.CONTINUE;
>>    }
>> }
>>
>> (At any rate, it might make sense to reuse some classes from the workflow
>> package here; after all, this is a machine workflow in a sense, is it not?)
>>
>> (Another possible pattern would be to have the DecisionMaker throw an
>> exception. Don't quite know what's better.)
>>
>> /Janne
>>
>> On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:
>>
>>> It's only more complicated in the sense that I'd have to change some
>>> recently-written code. :)
>>>
>>> Both integer and floating point -- as you point out -- are really the
>>> same except for the choice of where to put the decimal point. I chose
>>> integer to preserve the same scoring strategy from the previous
>>> design. Basically: add stuff up, and if the total exceeds a theshold,
>>> you've got spam.
>>>
>>> I took a quick look at SpamAssassin. They do use floating point
>>> scores. I like the way they allow weights to be customized for each
>>> rule. That seems like it would be a good enhancement for us, too.
>>>
>>> So, it seems to me that if we go floating-point, we should make it
>>> possible to configure custom weights for each Inspector. That also
>>> implies that it would be better to simply have Inspector.inspect()
>>> simply return a "pass/fail/no result" result and have the parent
>>> Inspection object figure out what to add or subtract from the running
>>> total. This would mean that the Inspectors wouldn't be able to modify
>>> scores directly. And it would be closer to the SpamAssessin model, and
>>> would probably simpler to code for too.
>>>
>>> Reasonable?
>>>
>>> On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
>>> <Ja...@ecyrd.com> wrote:
>>>>
>>>> Heya!
>>>>
>>>> I don't quite understand - why would a floating point value be any more
>>>> complicated than an integer?
>>>>
>>>> Note that at the moment we mark a modification a spam if a single test
>>>> fails, which essentially means doing logical OR more than any sort of an
>>>> addition (=threshold is always 1).  The integer is there more or less as
>>>>
>>>> something-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.
>>>> In fact, I did lament the fact that you can't say that "X is a bit more
>>>> efficient than Y" - with an integer-based design you essentially go for
>>>> fixed point arithmetics (test A gives you 100, test B gives you 150, etc,
>>>> where you just shifted decimal points to the right just for the heck of
>>>> it).
>>>>
>>>> If you take a look at SpamAssassin, it uses floats.
>>>>
>>>> I'd hesitate changing it later, since it will create an API contract.
>>>>
>>>> /Janne
>>>>
>>>> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>>>>
>>>>> Janne --- good critique, and good suggestions. I've killed the
>>>>> limit-setting methods on the Inspection class itself in favor of a
>>>>> listener
>>>>> strategy. I adopteed your other suggestions too, except...
>>>>>
>>>>> ...at the moment I am inclined to leave the scoring scale as
>>>>> integer-based
>>>>> for now. I like it because it is simple, and resembles what we do now.
>>>>> We
>>>>> can change it later if need be.
>>>>>
>>>>> In answer to your first question: the inspector package does not help
>>>>> graduation, although it does help the beta that comes after the diploma.
>>>>> :)
>>>>>
>>>>> Andrew
>>>>>
>>>>> On Sep 27, 2009, at 4:37, Janne Jalkanen <ja...@ecyrd.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> My main concern: does this advance our graduation?
>>>>>>
>>>>>> Others inline.
>>>>>>
>>>>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>>>>
>>>>>>> The "inspect" package provides facilities that allow content to be
>>>>>>> inspected and scored based on various criteria such as whether a wiki
>>>>>>> page change contains spam. Content may be scored by initializing an
>>>>>>> Inspection object and then calling the {@link Inspection#inspect()}
>>>>>>> method. The {@code inspect} method, in turn, iterates through a
>>>>>>> variety of previously-configured {@link Inspector} objects and calls
>>>>>>> the {@link Inspector#inspect(String,String)} method for each one. Each
>>>>>>> of the configured inspectors is free to perform whatever evaluations
>>>>>>> it wishes, and can increase, decrease or reset the "scores" for any
>>>>>>> score category, for example the {@link Score#SPAM} category.
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>>> * Score objects supply instructions to the parent Inspection object to
>>>>>>> increment, decrement or reset the score for a particular category.
>>>>>>> Each Score object is constructed with a category (for example,
>>>>>>> Score.SPAM), an integer indicating how much to change the score, and
>>>>>>> an optional String message that provides context for the change. For
>>>>>>> example, a Score that increments the spam score by 1 could be
>>>>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>>>>>> numbers can be supplied also to decrease the score. For convenience,
>>>>>>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>>>>>>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>>>>>>> means "reset to zero."
>>>>>>
>>>>>> Smells of overdesign to me.  Creating a new operation and methodology
>>>>>> to
>>>>>> support addition sounds like a bad idea. Not to mention the fact that
>>>>>> you
>>>>>> might actually want to use fractions.
>>>>>>
>>>>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more
>>>>>> appealing to me.
>>>>>>
>>>>>>> public String preSave( WikiContext context, String content ) throws
>>>>>>> RedirectException
>>>>>>> {
>>>>>>> Change change = getChange( context, content );
>>>>>>>
>>>>>>> // Run the Inspection
>>>>>>> Inspection inspection = new Inspection( context, m_config,
>>>>>>> m_inspectors
>>>>>>> );
>>>>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>>>>> inspection.inspect( content, change );
>>>>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>>>>
>>>>>>> // Redirect user if score too high
>>>>>>> if( inspection.isFailed() )
>>>>>>> {
>>>>>>>  ...
>>>>>>> }
>>>>>>> ...
>>>>>>> }
>>>>>>
>>>>>> Yes, this looks unnecessarily complicated and limiting to me.  I would
>>>>>> remove automatic limit-rating altogether, and would just concentrate on
>>>>>> refactoring the individual methods from SpamFilter into the
>>>>>> inspect-package
>>>>>> with a light wrapper around them.  I think a callback design is better
>>>>>> than
>>>>>> adding quite limited limit thingies anyway, as this gives full
>>>>>> programmatic
>>>>>> control to the developer without us having to pre-guess the possible
>>>>>> limitations.
>>>>>>
>>>>>> E.g. inspection.inspect( content, change, new
>>>>>> InspectionResultListener()
>>>>>> {
>>>>>>  public boolean visit(Inspection ins)
>>>>>>  {
>>>>>>      // Any possible decision code goes here
>>>>>>      return ins.getScore() > m_maxScore;
>>>>>>  }
>>>>>> });
>>>>>>
>>>>>> Obviously, these listeners would be optional, and you could just let
>>>>>> the
>>>>>> entire chain run without interference. And for simplicity, we can
>>>>>> provide
>>>>>> some default listeners, e.g.
>>>>>>
>>>>>> inspection.inspect( content, change, new StopAtScoreListener(
>>>>>> Score.SPAM,
>>>>>> 5 ) );
>>>>>>
>>>>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more control to
>>>>>> the developer and eases debugging immensely.)
>>>>>>
>>>>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>>>>>>
>>>>>> This should obviously be called AkismetInspector; all the others detect
>>>>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>>>>
>>>>>> /Janne
>>>>
>>>>
>>
>>
>

Re: Spam package redesign

Posted by Andrew Jaquith <an...@gmail.com>.
There's some room for optimization in what you proposed, but your
p-code is close to what I was thinking. in essence, it is the Strategy
pattern that calls the inspector logic, combined with a listener for
scoring changes (aka Visitor pattern) that interested callers would
subscribe to. That works.

After having originally suggested an "InspectionInterruptedException"
so that inspectors could abort a running inspection chain, I now think
it's better to simply return an object from the listener/visited
method, rather than throw an exception. Exceptions could be perceived
by an implementer as a little trickier to debug. I'd rather make the
API simpler.

FYI, the workflow package isn't totally suitable here. When you cited
the Decision class, I suspect you actually meant the Outcome class,
no?

Ok. I've got what I need to do this properly. This should work pretty well.

Andrew

On Mon, Sep 28, 2009 at 3:11 AM, Janne Jalkanen
<Ja...@ecyrd.com> wrote:
>
> Yah, it's just that smaller numbers are usually easier to deal with (numbers
> from 1-20).  We need more resolution at the low end, not at the high end.
>  That's my preference for using floats.
>
> A simple way to do as you suggest would be to have a Score return value,
> e.g.
>
> public Score inspect(...)
> {
>    // Detection score goes here
>
>    if( isSpam )
>        return new Score( "mywonderfulspamdetectortest", 0.5f );
>
>    return Score.OK; // where score.OK is defined to be 0.0f.
> }
>
> where the first String is a key and the float is a default value, which can
> be overridden in a config file using the key. This would be pretty close to
> the SpamAssassin model (and close to the java.util.Properties conceptually,
> so it would be easy to adapt).
>
> (The reason why I'm proposing a default value is so that when you add a new
> Inspector, you don't have to modify two files.)
>
> After each Inspector would be called, then the parent would call the
> callback I proposed, and that would then determine what to do.  In
> pseudocode:
>
> // JSPWiki internal code.
> private void inspect( DecisionMaker decisionMaker, Change change, <some
> other values> )
> {
>    ArrayList scores;
>    for( Inspector inspector : m_inspectorChain )
>    {
>        Score s = inspector.inspect( change, <some values> );
>
>        scores.add( s );
>
>        if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
>    }
> }
>
> We can then provide something like this
>
> class SpamFilter implements PageFilter, DecisionMaker
> {
>    ...
>
>    // This emulates the 2.8.x SpamFilter functionality.
>    public Decision visit( List<Score> scores )
>    {
>        if( scores.last().getValue() > 0.0 ) return Decision.SPAM;
>
>        return Decision.CONTINUE;
>    }
> }
>
> (At any rate, it might make sense to reuse some classes from the workflow
> package here; after all, this is a machine workflow in a sense, is it not?)
>
> (Another possible pattern would be to have the DecisionMaker throw an
> exception. Don't quite know what's better.)
>
> /Janne
>
> On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:
>
>> It's only more complicated in the sense that I'd have to change some
>> recently-written code. :)
>>
>> Both integer and floating point -- as you point out -- are really the
>> same except for the choice of where to put the decimal point. I chose
>> integer to preserve the same scoring strategy from the previous
>> design. Basically: add stuff up, and if the total exceeds a theshold,
>> you've got spam.
>>
>> I took a quick look at SpamAssassin. They do use floating point
>> scores. I like the way they allow weights to be customized for each
>> rule. That seems like it would be a good enhancement for us, too.
>>
>> So, it seems to me that if we go floating-point, we should make it
>> possible to configure custom weights for each Inspector. That also
>> implies that it would be better to simply have Inspector.inspect()
>> simply return a "pass/fail/no result" result and have the parent
>> Inspection object figure out what to add or subtract from the running
>> total. This would mean that the Inspectors wouldn't be able to modify
>> scores directly. And it would be closer to the SpamAssessin model, and
>> would probably simpler to code for too.
>>
>> Reasonable?
>>
>> On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
>> <Ja...@ecyrd.com> wrote:
>>>
>>> Heya!
>>>
>>> I don't quite understand - why would a floating point value be any more
>>> complicated than an integer?
>>>
>>> Note that at the moment we mark a modification a spam if a single test
>>> fails, which essentially means doing logical OR more than any sort of an
>>> addition (=threshold is always 1).  The integer is there more or less as
>>>
>>> something-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.
>>> In fact, I did lament the fact that you can't say that "X is a bit more
>>> efficient than Y" - with an integer-based design you essentially go for
>>> fixed point arithmetics (test A gives you 100, test B gives you 150, etc,
>>> where you just shifted decimal points to the right just for the heck of
>>> it).
>>>
>>> If you take a look at SpamAssassin, it uses floats.
>>>
>>> I'd hesitate changing it later, since it will create an API contract.
>>>
>>> /Janne
>>>
>>> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>>>
>>>> Janne --- good critique, and good suggestions. I've killed the
>>>> limit-setting methods on the Inspection class itself in favor of a
>>>> listener
>>>> strategy. I adopteed your other suggestions too, except...
>>>>
>>>> ...at the moment I am inclined to leave the scoring scale as
>>>> integer-based
>>>> for now. I like it because it is simple, and resembles what we do now.
>>>> We
>>>> can change it later if need be.
>>>>
>>>> In answer to your first question: the inspector package does not help
>>>> graduation, although it does help the beta that comes after the diploma.
>>>> :)
>>>>
>>>> Andrew
>>>>
>>>> On Sep 27, 2009, at 4:37, Janne Jalkanen <ja...@ecyrd.com>
>>>> wrote:
>>>>
>>>>>
>>>>> My main concern: does this advance our graduation?
>>>>>
>>>>> Others inline.
>>>>>
>>>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>>>
>>>>>> The "inspect" package provides facilities that allow content to be
>>>>>> inspected and scored based on various criteria such as whether a wiki
>>>>>> page change contains spam. Content may be scored by initializing an
>>>>>> Inspection object and then calling the {@link Inspection#inspect()}
>>>>>> method. The {@code inspect} method, in turn, iterates through a
>>>>>> variety of previously-configured {@link Inspector} objects and calls
>>>>>> the {@link Inspector#inspect(String,String)} method for each one. Each
>>>>>> of the configured inspectors is free to perform whatever evaluations
>>>>>> it wishes, and can increase, decrease or reset the "scores" for any
>>>>>> score category, for example the {@link Score#SPAM} category.
>>>>>
>>>>> Ok.
>>>>>
>>>>>> * Score objects supply instructions to the parent Inspection object to
>>>>>> increment, decrement or reset the score for a particular category.
>>>>>> Each Score object is constructed with a category (for example,
>>>>>> Score.SPAM), an integer indicating how much to change the score, and
>>>>>> an optional String message that provides context for the change. For
>>>>>> example, a Score that increments the spam score by 1 could be
>>>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>>>>> numbers can be supplied also to decrease the score. For convenience,
>>>>>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>>>>>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>>>>>> means "reset to zero."
>>>>>
>>>>> Smells of overdesign to me.  Creating a new operation and methodology
>>>>> to
>>>>> support addition sounds like a bad idea. Not to mention the fact that
>>>>> you
>>>>> might actually want to use fractions.
>>>>>
>>>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more
>>>>> appealing to me.
>>>>>
>>>>>> public String preSave( WikiContext context, String content ) throws
>>>>>> RedirectException
>>>>>> {
>>>>>> Change change = getChange( context, content );
>>>>>>
>>>>>> // Run the Inspection
>>>>>> Inspection inspection = new Inspection( context, m_config,
>>>>>> m_inspectors
>>>>>> );
>>>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>>>> inspection.inspect( content, change );
>>>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>>>
>>>>>> // Redirect user if score too high
>>>>>> if( inspection.isFailed() )
>>>>>> {
>>>>>>  ...
>>>>>> }
>>>>>> ...
>>>>>> }
>>>>>
>>>>> Yes, this looks unnecessarily complicated and limiting to me.  I would
>>>>> remove automatic limit-rating altogether, and would just concentrate on
>>>>> refactoring the individual methods from SpamFilter into the
>>>>> inspect-package
>>>>> with a light wrapper around them.  I think a callback design is better
>>>>> than
>>>>> adding quite limited limit thingies anyway, as this gives full
>>>>> programmatic
>>>>> control to the developer without us having to pre-guess the possible
>>>>> limitations.
>>>>>
>>>>> E.g. inspection.inspect( content, change, new
>>>>> InspectionResultListener()
>>>>> {
>>>>>  public boolean visit(Inspection ins)
>>>>>  {
>>>>>      // Any possible decision code goes here
>>>>>      return ins.getScore() > m_maxScore;
>>>>>  }
>>>>> });
>>>>>
>>>>> Obviously, these listeners would be optional, and you could just let
>>>>> the
>>>>> entire chain run without interference. And for simplicity, we can
>>>>> provide
>>>>> some default listeners, e.g.
>>>>>
>>>>> inspection.inspect( content, change, new StopAtScoreListener(
>>>>> Score.SPAM,
>>>>> 5 ) );
>>>>>
>>>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more control to
>>>>> the developer and eases debugging immensely.)
>>>>>
>>>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>>>>>
>>>>> This should obviously be called AkismetInspector; all the others detect
>>>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>>>
>>>>> /Janne
>>>
>>>
>
>

Re: Spam package redesign

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
Yah, it's just that smaller numbers are usually easier to deal with  
(numbers from 1-20).  We need more resolution at the low end, not at  
the high end.  That's my preference for using floats.

A simple way to do as you suggest would be to have a Score return  
value, e.g.

public Score inspect(...)
{
     // Detection score goes here

     if( isSpam )
         return new Score( "mywonderfulspamdetectortest", 0.5f );

     return Score.OK; // where score.OK is defined to be 0.0f.
}

where the first String is a key and the float is a default value,  
which can be overridden in a config file using the key. This would be  
pretty close to the SpamAssassin model (and close to the  
java.util.Properties conceptually, so it would be easy to adapt).

(The reason why I'm proposing a default value is so that when you add  
a new Inspector, you don't have to modify two files.)

After each Inspector would be called, then the parent would call the  
callback I proposed, and that would then determine what to do.  In  
pseudocode:

// JSPWiki internal code.
private void inspect( DecisionMaker decisionMaker, Change change,  
<some other values> )
{
     ArrayList scores;
     for( Inspector inspector : m_inspectorChain )
     {
         Score s = inspector.inspect( change, <some values> );

         scores.add( s );

         if( decisionMaker.visit( scores ) == Decision.SPAM ) { ... }
     }
}

We can then provide something like this

class SpamFilter implements PageFilter, DecisionMaker
{
     ...

     // This emulates the 2.8.x SpamFilter functionality.
     public Decision visit( List<Score> scores )
     {
         if( scores.last().getValue() > 0.0 ) return Decision.SPAM;

         return Decision.CONTINUE;
     }
}

(At any rate, it might make sense to reuse some classes from the  
workflow package here; after all, this is a machine workflow in a  
sense, is it not?)

(Another possible pattern would be to have the DecisionMaker throw an  
exception. Don't quite know what's better.)

/Janne

On Sep 28, 2009, at 02:04 , Andrew Jaquith wrote:

> It's only more complicated in the sense that I'd have to change some
> recently-written code. :)
>
> Both integer and floating point -- as you point out -- are really the
> same except for the choice of where to put the decimal point. I chose
> integer to preserve the same scoring strategy from the previous
> design. Basically: add stuff up, and if the total exceeds a theshold,
> you've got spam.
>
> I took a quick look at SpamAssassin. They do use floating point
> scores. I like the way they allow weights to be customized for each
> rule. That seems like it would be a good enhancement for us, too.
>
> So, it seems to me that if we go floating-point, we should make it
> possible to configure custom weights for each Inspector. That also
> implies that it would be better to simply have Inspector.inspect()
> simply return a "pass/fail/no result" result and have the parent
> Inspection object figure out what to add or subtract from the running
> total. This would mean that the Inspectors wouldn't be able to modify
> scores directly. And it would be closer to the SpamAssessin model, and
> would probably simpler to code for too.
>
> Reasonable?
>
> On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
> <Ja...@ecyrd.com> wrote:
>>
>> Heya!
>>
>> I don't quite understand - why would a floating point value be any  
>> more
>> complicated than an integer?
>>
>> Note that at the moment we mark a modification a spam if a single  
>> test
>> fails, which essentially means doing logical OR more than any sort  
>> of an
>> addition (=threshold is always 1).  The integer is there more or  
>> less as
>> something-which-was-never-completed-and-I-am-almost-sure-is-the- 
>> wrong-design.
>> In fact, I did lament the fact that you can't say that "X is a bit  
>> more
>> efficient than Y" - with an integer-based design you essentially go  
>> for
>> fixed point arithmetics (test A gives you 100, test B gives you  
>> 150, etc,
>> where you just shifted decimal points to the right just for the  
>> heck of it).
>>
>> If you take a look at SpamAssassin, it uses floats.
>>
>> I'd hesitate changing it later, since it will create an API contract.
>>
>> /Janne
>>
>> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>>
>>> Janne --- good critique, and good suggestions. I've killed the
>>> limit-setting methods on the Inspection class itself in favor of a  
>>> listener
>>> strategy. I adopteed your other suggestions too, except...
>>>
>>> ...at the moment I am inclined to leave the scoring scale as  
>>> integer-based
>>> for now. I like it because it is simple, and resembles what we do  
>>> now. We
>>> can change it later if need be.
>>>
>>> In answer to your first question: the inspector package does not  
>>> help
>>> graduation, although it does help the beta that comes after the  
>>> diploma. :)
>>>
>>> Andrew
>>>
>>> On Sep 27, 2009, at 4:37, Janne Jalkanen  
>>> <ja...@ecyrd.com> wrote:
>>>
>>>>
>>>> My main concern: does this advance our graduation?
>>>>
>>>> Others inline.
>>>>
>>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>>
>>>>> The "inspect" package provides facilities that allow content to be
>>>>> inspected and scored based on various criteria such as whether a  
>>>>> wiki
>>>>> page change contains spam. Content may be scored by initializing  
>>>>> an
>>>>> Inspection object and then calling the {@link  
>>>>> Inspection#inspect()}
>>>>> method. The {@code inspect} method, in turn, iterates through a
>>>>> variety of previously-configured {@link Inspector} objects and  
>>>>> calls
>>>>> the {@link Inspector#inspect(String,String)} method for each  
>>>>> one. Each
>>>>> of the configured inspectors is free to perform whatever  
>>>>> evaluations
>>>>> it wishes, and can increase, decrease or reset the "scores" for  
>>>>> any
>>>>> score category, for example the {@link Score#SPAM} category.
>>>>
>>>> Ok.
>>>>
>>>>> * Score objects supply instructions to the parent Inspection  
>>>>> object to
>>>>> increment, decrement or reset the score for a particular category.
>>>>> Each Score object is constructed with a category (for example,
>>>>> Score.SPAM), an integer indicating how much to change the score,  
>>>>> and
>>>>> an optional String message that provides context for the change.  
>>>>> For
>>>>> example, a Score that increments the spam score by 1 could be
>>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ).  
>>>>> Negative
>>>>> numbers can be supplied also to decrease the score. For  
>>>>> convenience,
>>>>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>>>>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>>>>> means "reset to zero."
>>>>
>>>> Smells of overdesign to me.  Creating a new operation and  
>>>> methodology to
>>>> support addition sounds like a bad idea. Not to mention the fact  
>>>> that you
>>>> might actually want to use fractions.
>>>>
>>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds  
>>>> more
>>>> appealing to me.
>>>>
>>>>> public String preSave( WikiContext context, String content )  
>>>>> throws
>>>>> RedirectException
>>>>> {
>>>>> Change change = getChange( context, content );
>>>>>
>>>>> // Run the Inspection
>>>>> Inspection inspection = new Inspection( context, m_config,  
>>>>> m_inspectors
>>>>> );
>>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>>> inspection.inspect( content, change );
>>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>>
>>>>> // Redirect user if score too high
>>>>> if( inspection.isFailed() )
>>>>> {
>>>>>  ...
>>>>> }
>>>>> ...
>>>>> }
>>>>
>>>> Yes, this looks unnecessarily complicated and limiting to me.  I  
>>>> would
>>>> remove automatic limit-rating altogether, and would just  
>>>> concentrate on
>>>> refactoring the individual methods from SpamFilter into the  
>>>> inspect-package
>>>> with a light wrapper around them.  I think a callback design is  
>>>> better than
>>>> adding quite limited limit thingies anyway, as this gives full  
>>>> programmatic
>>>> control to the developer without us having to pre-guess the  
>>>> possible
>>>> limitations.
>>>>
>>>> E.g. inspection.inspect( content, change, new  
>>>> InspectionResultListener()
>>>> {
>>>>  public boolean visit(Inspection ins)
>>>>  {
>>>>       // Any possible decision code goes here
>>>>       return ins.getScore() > m_maxScore;
>>>>  }
>>>> });
>>>>
>>>> Obviously, these listeners would be optional, and you could just  
>>>> let the
>>>> entire chain run without interference. And for simplicity, we can  
>>>> provide
>>>> some default listeners, e.g.
>>>>
>>>> inspection.inspect( content, change, new  
>>>> StopAtScoreListener( Score.SPAM,
>>>> 5 ) );
>>>>
>>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more  
>>>> control to
>>>> the developer and eases debugging immensely.)
>>>>
>>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for  
>>>>> Akismet),
>>>>
>>>> This should obviously be called AkismetInspector; all the others  
>>>> detect
>>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>>
>>>> /Janne
>>
>>


Re: Spam package redesign

Posted by Andrew Jaquith <an...@gmail.com>.
It's only more complicated in the sense that I'd have to change some
recently-written code. :)

Both integer and floating point -- as you point out -- are really the
same except for the choice of where to put the decimal point. I chose
integer to preserve the same scoring strategy from the previous
design. Basically: add stuff up, and if the total exceeds a theshold,
you've got spam.

I took a quick look at SpamAssassin. They do use floating point
scores. I like the way they allow weights to be customized for each
rule. That seems like it would be a good enhancement for us, too.

So, it seems to me that if we go floating-point, we should make it
possible to configure custom weights for each Inspector. That also
implies that it would be better to simply have Inspector.inspect()
simply return a "pass/fail/no result" result and have the parent
Inspection object figure out what to add or subtract from the running
total. This would mean that the Inspectors wouldn't be able to modify
scores directly. And it would be closer to the SpamAssessin model, and
would probably simpler to code for too.

Reasonable?

On Sun, Sep 27, 2009 at 4:12 PM, Janne Jalkanen
<Ja...@ecyrd.com> wrote:
>
> Heya!
>
> I don't quite understand - why would a floating point value be any more
> complicated than an integer?
>
> Note that at the moment we mark a modification a spam if a single test
> fails, which essentially means doing logical OR more than any sort of an
> addition (=threshold is always 1).  The integer is there more or less as
> something-which-was-never-completed-and-I-am-almost-sure-is-the-wrong-design.
> In fact, I did lament the fact that you can't say that "X is a bit more
> efficient than Y" - with an integer-based design you essentially go for
> fixed point arithmetics (test A gives you 100, test B gives you 150, etc,
> where you just shifted decimal points to the right just for the heck of it).
>
> If you take a look at SpamAssassin, it uses floats.
>
> I'd hesitate changing it later, since it will create an API contract.
>
> /Janne
>
> On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:
>
>> Janne --- good critique, and good suggestions. I've killed the
>> limit-setting methods on the Inspection class itself in favor of a listener
>> strategy. I adopteed your other suggestions too, except...
>>
>> ...at the moment I am inclined to leave the scoring scale as integer-based
>> for now. I like it because it is simple, and resembles what we do now. We
>> can change it later if need be.
>>
>> In answer to your first question: the inspector package does not help
>> graduation, although it does help the beta that comes after the diploma. :)
>>
>> Andrew
>>
>> On Sep 27, 2009, at 4:37, Janne Jalkanen <ja...@ecyrd.com> wrote:
>>
>>>
>>> My main concern: does this advance our graduation?
>>>
>>> Others inline.
>>>
>>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>>
>>>> The "inspect" package provides facilities that allow content to be
>>>> inspected and scored based on various criteria such as whether a wiki
>>>> page change contains spam. Content may be scored by initializing an
>>>> Inspection object and then calling the {@link Inspection#inspect()}
>>>> method. The {@code inspect} method, in turn, iterates through a
>>>> variety of previously-configured {@link Inspector} objects and calls
>>>> the {@link Inspector#inspect(String,String)} method for each one. Each
>>>> of the configured inspectors is free to perform whatever evaluations
>>>> it wishes, and can increase, decrease or reset the "scores" for any
>>>> score category, for example the {@link Score#SPAM} category.
>>>
>>> Ok.
>>>
>>>> * Score objects supply instructions to the parent Inspection object to
>>>> increment, decrement or reset the score for a particular category.
>>>> Each Score object is constructed with a category (for example,
>>>> Score.SPAM), an integer indicating how much to change the score, and
>>>> an optional String message that provides context for the change. For
>>>> example, a Score that increments the spam score by 1 could be
>>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>>> numbers can be supplied also to decrease the score. For convenience,
>>>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>>>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>>>> means "reset to zero."
>>>
>>> Smells of overdesign to me.  Creating a new operation and methodology to
>>> support addition sounds like a bad idea. Not to mention the fact that you
>>> might actually want to use fractions.
>>>
>>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more
>>> appealing to me.
>>>
>>>> public String preSave( WikiContext context, String content ) throws
>>>> RedirectException
>>>> {
>>>> Change change = getChange( context, content );
>>>>
>>>> // Run the Inspection
>>>> Inspection inspection = new Inspection( context, m_config, m_inspectors
>>>> );
>>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>>> inspection.inspect( content, change );
>>>> int spamScore = inspection.getScore( Score.SPAM );
>>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>>
>>>> // Redirect user if score too high
>>>> if( inspection.isFailed() )
>>>> {
>>>>  ...
>>>> }
>>>> ...
>>>> }
>>>
>>> Yes, this looks unnecessarily complicated and limiting to me.  I would
>>> remove automatic limit-rating altogether, and would just concentrate on
>>> refactoring the individual methods from SpamFilter into the inspect-package
>>> with a light wrapper around them.  I think a callback design is better than
>>> adding quite limited limit thingies anyway, as this gives full programmatic
>>> control to the developer without us having to pre-guess the possible
>>> limitations.
>>>
>>> E.g. inspection.inspect( content, change, new InspectionResultListener()
>>> {
>>>  public boolean visit(Inspection ins)
>>>  {
>>>       // Any possible decision code goes here
>>>       return ins.getScore() > m_maxScore;
>>>  }
>>> });
>>>
>>> Obviously, these listeners would be optional, and you could just let the
>>> entire chain run without interference. And for simplicity, we can provide
>>> some default listeners, e.g.
>>>
>>> inspection.inspect( content, change, new StopAtScoreListener( Score.SPAM,
>>> 5 ) );
>>>
>>> (Yeah, I'm a big fan of this pattern 'cos it gives much more control to
>>> the developer and eases debugging immensely.)
>>>
>>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>>>
>>> This should obviously be called AkismetInspector; all the others detect
>>> spammers too (and Akismet is probably the worst of the lot anyway).
>>>
>>> /Janne
>
>

Re: Spam package redesign

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
Heya!

I don't quite understand - why would a floating point value be any  
more complicated than an integer?

Note that at the moment we mark a modification a spam if a single test  
fails, which essentially means doing logical OR more than any sort of  
an addition (=threshold is always 1).  The integer is there more or  
less as something-which-was-never-completed-and-I-am-almost-sure-is- 
the-wrong-design. In fact, I did lament the fact that you can't say  
that "X is a bit more efficient than Y" - with an integer-based design  
you essentially go for fixed point arithmetics (test A gives you 100,  
test B gives you 150, etc, where you just shifted decimal points to  
the right just for the heck of it).

If you take a look at SpamAssassin, it uses floats.

I'd hesitate changing it later, since it will create an API contract.

/Janne

On Sep 27, 2009, at 22:34 , Andrew Jaquith wrote:

> Janne --- good critique, and good suggestions. I've killed the limit- 
> setting methods on the Inspection class itself in favor of a  
> listener strategy. I adopteed your other suggestions too, except...
>
> ...at the moment I am inclined to leave the scoring scale as integer- 
> based for now. I like it because it is simple, and resembles what we  
> do now. We can change it later if need be.
>
> In answer to your first question: the inspector package does not  
> help graduation, although it does help the beta that comes after the  
> diploma. :)
>
> Andrew
>
> On Sep 27, 2009, at 4:37, Janne Jalkanen <ja...@ecyrd.com>  
> wrote:
>
>>
>> My main concern: does this advance our graduation?
>>
>> Others inline.
>>
>> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>>
>>> The "inspect" package provides facilities that allow content to be
>>> inspected and scored based on various criteria such as whether a  
>>> wiki
>>> page change contains spam. Content may be scored by initializing an
>>> Inspection object and then calling the {@link Inspection#inspect()}
>>> method. The {@code inspect} method, in turn, iterates through a
>>> variety of previously-configured {@link Inspector} objects and calls
>>> the {@link Inspector#inspect(String,String)} method for each one.  
>>> Each
>>> of the configured inspectors is free to perform whatever evaluations
>>> it wishes, and can increase, decrease or reset the "scores" for any
>>> score category, for example the {@link Score#SPAM} category.
>>
>> Ok.
>>
>>> * Score objects supply instructions to the parent Inspection  
>>> object to
>>> increment, decrement or reset the score for a particular category.
>>> Each Score object is constructed with a category (for example,
>>> Score.SPAM), an integer indicating how much to change the score, and
>>> an optional String message that provides context for the change. For
>>> example, a Score that increments the spam score by 1 could be
>>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>>> numbers can be supplied also to decrease the score. For convenience,
>>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>>> means "reset to zero."
>>
>> Smells of overdesign to me.  Creating a new operation and  
>> methodology to support addition sounds like a bad idea. Not to  
>> mention the fact that you might actually want to use fractions.
>>
>> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds  
>> more appealing to me.
>>
>>> public String preSave( WikiContext context, String content ) throws
>>> RedirectException
>>> {
>>> Change change = getChange( context, content );
>>>
>>> // Run the Inspection
>>> Inspection inspection = new Inspection( context, m_config,  
>>> m_inspectors );
>>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>>> inspection.inspect( content, change );
>>> int spamScore = inspection.getScore( Score.SPAM );
>>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>>
>>> // Redirect user if score too high
>>> if( inspection.isFailed() )
>>> {
>>>  ...
>>> }
>>> ...
>>> }
>>
>> Yes, this looks unnecessarily complicated and limiting to me.  I  
>> would remove automatic limit-rating altogether, and would just  
>> concentrate on refactoring the individual methods from SpamFilter  
>> into the inspect-package with a light wrapper around them.  I think  
>> a callback design is better than adding quite limited limit  
>> thingies anyway, as this gives full programmatic control to the  
>> developer without us having to pre-guess the possible limitations.
>>
>> E.g. inspection.inspect( content, change, new  
>> InspectionResultListener() {
>>   public boolean visit(Inspection ins)
>>   {
>>        // Any possible decision code goes here
>>        return ins.getScore() > m_maxScore;
>>   }
>> });
>>
>> Obviously, these listeners would be optional, and you could just  
>> let the entire chain run without interference. And for simplicity,  
>> we can provide some default listeners, e.g.
>>
>> inspection.inspect( content, change, new  
>> StopAtScoreListener( Score.SPAM, 5 ) );
>>
>> (Yeah, I'm a big fan of this pattern 'cos it gives much more  
>> control to the developer and eases debugging immensely.)
>>
>>> ChangeRateInspector, LinkCountInspector, SpamInspector (for  
>>> Akismet),
>>
>> This should obviously be called AkismetInspector; all the others  
>> detect spammers too (and Akismet is probably the worst of the lot  
>> anyway).
>>
>> /Janne


Re: Spam package redesign

Posted by Andrew Jaquith <an...@gmail.com>.
Janne --- good critique, and good suggestions. I've killed the limit- 
setting methods on the Inspection class itself in favor of a listener  
strategy. I adopteed your other suggestions too, except...

...at the moment I am inclined to leave the scoring scale as integer- 
based for now. I like it because it is simple, and resembles what we  
do now. We can change it later if need be.

In answer to your first question: the inspector package does not help  
graduation, although it does help the beta that comes after the  
diploma. :)

Andrew

On Sep 27, 2009, at 4:37, Janne Jalkanen <ja...@ecyrd.com>  
wrote:

>
> My main concern: does this advance our graduation?
>
> Others inline.
>
> On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:
>
>> The "inspect" package provides facilities that allow content to be
>> inspected and scored based on various criteria such as whether a wiki
>> page change contains spam. Content may be scored by initializing an
>> Inspection object and then calling the {@link Inspection#inspect()}
>> method. The {@code inspect} method, in turn, iterates through a
>> variety of previously-configured {@link Inspector} objects and calls
>> the {@link Inspector#inspect(String,String)} method for each one.  
>> Each
>> of the configured inspectors is free to perform whatever evaluations
>> it wishes, and can increase, decrease or reset the "scores" for any
>> score category, for example the {@link Score#SPAM} category.
>
> Ok.
>
>> * Score objects supply instructions to the parent Inspection object  
>> to
>> increment, decrement or reset the score for a particular category.
>> Each Score object is constructed with a category (for example,
>> Score.SPAM), an integer indicating how much to change the score, and
>> an optional String message that provides context for the change. For
>> example, a Score that increments the spam score by 1 could be
>> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
>> numbers can be supplied also to decrease the score. For convenience,
>> {@link Score#INCREMENT_SCORE} means "add 1", {@link
>> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
>> means "reset to zero."
>
> Smells of overdesign to me.  Creating a new operation and  
> methodology to support addition sounds like a bad idea. Not to  
> mention the fact that you might actually want to use fractions.
>
> Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds  
> more appealing to me.
>
>> public String preSave( WikiContext context, String content ) throws
>> RedirectException
>> {
>> Change change = getChange( context, content );
>>
>> // Run the Inspection
>> Inspection inspection = new Inspection( context, m_config,  
>> m_inspectors );
>> inspection.addLimit( Score.SPAM, m_scoreLimit );
>> inspection.inspect( content, change );
>> int spamScore = inspection.getScore( Score.SPAM );
>> context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>>
>> // Redirect user if score too high
>> if( inspection.isFailed() )
>> {
>>   ...
>> }
>> ...
>> }
>
> Yes, this looks unnecessarily complicated and limiting to me.  I  
> would remove automatic limit-rating altogether, and would just  
> concentrate on refactoring the individual methods from SpamFilter  
> into the inspect-package with a light wrapper around them.  I think  
> a callback design is better than adding quite limited limit thingies  
> anyway, as this gives full programmatic control to the developer  
> without us having to pre-guess the possible limitations.
>
> E.g. inspection.inspect( content, change, new  
> InspectionResultListener() {
>    public boolean visit(Inspection ins)
>    {
>         // Any possible decision code goes here
>         return ins.getScore() > m_maxScore;
>    }
> });
>
> Obviously, these listeners would be optional, and you could just let  
> the entire chain run without interference. And for simplicity, we  
> can provide some default listeners, e.g.
>
> inspection.inspect( content, change, new StopAtScoreListener 
> ( Score.SPAM, 5 ) );
>
> (Yeah, I'm a big fan of this pattern 'cos it gives much more control  
> to the developer and eases debugging immensely.)
>
>> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
>
> This should obviously be called AkismetInspector; all the others  
> detect spammers too (and Akismet is probably the worst of the lot  
> anyway).
>
> /Janne

Re: Spam package redesign

Posted by Janne Jalkanen <ja...@ecyrd.com>.
My main concern: does this advance our graduation?

Others inline.

On 25 Sep 2009, at 20:54, Andrew Jaquith wrote:

> The "inspect" package provides facilities that allow content to be
> inspected and scored based on various criteria such as whether a wiki
> page change contains spam. Content may be scored by initializing an
> Inspection object and then calling the {@link Inspection#inspect()}
> method. The {@code inspect} method, in turn, iterates through a
> variety of previously-configured {@link Inspector} objects and calls
> the {@link Inspector#inspect(String,String)} method for each one. Each
> of the configured inspectors is free to perform whatever evaluations
> it wishes, and can increase, decrease or reset the "scores" for any
> score category, for example the {@link Score#SPAM} category.

Ok.

> * Score objects supply instructions to the parent Inspection object to
> increment, decrement or reset the score for a particular category.
> Each Score object is constructed with a category (for example,
> Score.SPAM), an integer indicating how much to change the score, and
> an optional String message that provides context for the change. For
> example, a Score that increments the spam score by 1 could be
> constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
> numbers can be supplied also to decrease the score. For convenience,
> {@link Score#INCREMENT_SCORE} means "add 1", {@link
> Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
> means "reset to zero."

Smells of overdesign to me.  Creating a new operation and methodology  
to support addition sounds like a bad idea. Not to mention the fact  
that you might actually want to use fractions.

Something like inspection.changeScore( Score.SPAM, 0.5f ) sounds more  
appealing to me.

> public String preSave( WikiContext context, String content ) throws
> RedirectException
> {
>  Change change = getChange( context, content );
>
>  // Run the Inspection
>  Inspection inspection = new Inspection( context, m_config,  
> m_inspectors );
>  inspection.addLimit( Score.SPAM, m_scoreLimit );
>  inspection.inspect( content, change );
>  int spamScore = inspection.getScore( Score.SPAM );
>  context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );
>
>  // Redirect user if score too high
>  if( inspection.isFailed() )
>  {
>    ...
>  }
>  ...
> }

Yes, this looks unnecessarily complicated and limiting to me.  I would  
remove automatic limit-rating altogether, and would just concentrate  
on refactoring the individual methods from SpamFilter into the inspect- 
package with a light wrapper around them.  I think a callback design  
is better than adding quite limited limit thingies anyway, as this  
gives full programmatic control to the developer without us having to  
pre-guess the possible limitations.

E.g. inspection.inspect( content, change, new  
InspectionResultListener() {
     public boolean visit(Inspection ins)
     {
          // Any possible decision code goes here
          return ins.getScore() > m_maxScore;
     }
});

Obviously, these listeners would be optional, and you could just let  
the entire chain run without interference. And for simplicity, we can  
provide some default listeners, e.g.

inspection.inspect( content, change, new  
StopAtScoreListener( Score.SPAM, 5 ) );

(Yeah, I'm a big fan of this pattern 'cos it gives much more control  
to the developer and eases debugging immensely.)

> ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),

This should obviously be called AkismetInspector; all the others  
detect spammers too (and Akismet is probably the worst of the lot  
anyway).

/Janne

Re: Spam package redesign

Posted by Andrew Jaquith <an...@gmail.com>.
You could certainly do that with the package as I've described it --
the pseudo-subject or facet is what I called a Score "category," if I
catch your meaning.

I'll take this as a vote of confidence -- when I check it in (in the
next few weeks probably), you'll be able to see the code for yourself.
:)

Andrew

On Sat, Sep 26, 2009 at 12:17 AM, Murray Altheim <mu...@altheim.com> wrote:
> Andrew Jaquith wrote:
>>
>> ** Warning: long post **
>>
>> After some fooling around and some actual work, I've finished my first
>> pass at refactoring on the anti-spam code. I'm proposing a new
>> package, org.apache.wiki.content.inspect, which contains a
>> general-purpose content-inspection capability, of which spam is just
>> one potential application. Here is a draft of the package javadocs.
>
> [...]
>>
>> I can foresee other uses for this too, for example general-purpose
>> content classification. But that's for another day.
>>
>> Comments, thoughts? It's going to take some time to get unit tests
>> done, so I won't be committing this for a little while.
>
> Hi Andrew,
>
> This sounds pretty impressive, all in all. With my library hat on, my
> interest was piqued by the idea of using this for non-spam applications,
> so the only comment I have at this point is wondering how you might at
> this point include the hook into Lucene.
>
> The way I'd see this working would be as follows.
>
> I'd not want to overload the Dublin Core Subject, but as a sort of
> informative field that might actually be used to populate the Subject.
> The structure of the result of the inspection would be a map of
> pseudo-subject (facet?) identifiers and a scope for each, e.g.,
>
>  Subject:         Shipping, Shipwrecks, Transportation
>  Pseudo-Subject:  Lusitania                             Score: 0.67
>  Pseudo-Subject:  http://en.wikipedia.org/wiki/Titanic  Score: 0.89
>  Pseudo-Subject:  Storm                                 Score: 0.56
>  Pseudo-Subject:  Mermaid                               Score: 0.24
>
> Where the "pseudo-subject" can be either a string or a URI subject
> identifier. And noting that "pseudo-subject" is not a term of art and
> I'd hope to come up with something more suitable. One could then use
> some mathematically-sensible composite of the scores to obtain the
> overall score for the document. You could even choose subsets of the
> pseudo-subjects to obtain targeted scores. This would still work for
> spam detection but would potentially be very powerful for subject
> classification, especially if it was tied into the search functionality.
>
> Does this make any sense?
>
> Murray
>

Re: Spam package redesign

Posted by Murray Altheim <mu...@altheim.com>.
Andrew Jaquith wrote:
> ** Warning: long post **
> 
> After some fooling around and some actual work, I've finished my first
> pass at refactoring on the anti-spam code. I'm proposing a new
> package, org.apache.wiki.content.inspect, which contains a
> general-purpose content-inspection capability, of which spam is just
> one potential application. Here is a draft of the package javadocs.
[...]
> I can foresee other uses for this too, for example general-purpose
> content classification. But that's for another day.
> 
> Comments, thoughts? It's going to take some time to get unit tests
> done, so I won't be committing this for a little while.

Hi Andrew,

This sounds pretty impressive, all in all. With my library hat on, my
interest was piqued by the idea of using this for non-spam applications,
so the only comment I have at this point is wondering how you might at
this point include the hook into Lucene.

The way I'd see this working would be as follows.

I'd not want to overload the Dublin Core Subject, but as a sort of
informative field that might actually be used to populate the Subject.
The structure of the result of the inspection would be a map of
pseudo-subject (facet?) identifiers and a scope for each, e.g.,

   Subject:         Shipping, Shipwrecks, Transportation
   Pseudo-Subject:  Lusitania                             Score: 0.67
   Pseudo-Subject:  http://en.wikipedia.org/wiki/Titanic  Score: 0.89
   Pseudo-Subject:  Storm                                 Score: 0.56
   Pseudo-Subject:  Mermaid                               Score: 0.24

Where the "pseudo-subject" can be either a string or a URI subject
identifier. And noting that "pseudo-subject" is not a term of art and
I'd hope to come up with something more suitable. One could then use
some mathematically-sensible composite of the scores to obtain the
overall score for the document. You could even choose subsets of the
pseudo-subjects to obtain targeted scores. This would still work for
spam detection but would potentially be very powerful for subject
classification, especially if it was tied into the search functionality.

Does this make any sense?

Murray