Posted to users@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2009/09/26 17:11:05 UTC

iXhash with minimum size

Karsten Bräckelmann wrote:
> > This is a plain RE rule I once wrote, to limit some rule to really short
> > messages only.
> >
> >    rawbody __KB_RAWBODY_200  /^.{0,200}$/s

Warren Togami mused:
> I suspect meta limiting Adam's IXHASH rules with a minimum size subrule
> would eliminate many of the IXHASH false positives.  I was using his
> IXHASH plugin for a while, but stopped because I noticed too many FP's
> on short e-mails.  I wonder if his IXHASH plugin is suitable to put into
> the sandbox for actual statistical testing.

Quick note - iXhash isn't mine.  The project is the brainchild of Dirk
Bonengel, http://dbonengel.users.sourceforge.net/#, who was inspired by
NiX Spam (by Bert Ungerer).  The credits at http://ixhash.sf.net/ don't
actually mention Dirk (Dirk -- take credit!).

I merely wrote that meta rule to link the three of them together rather
than the more common approach of assigning points to each of them.
Combining that with Karsten's rawbody check (though I'm not sure what char
length threshold would be a good one), we'd get (please unwrap meta line):

meta IXHASH_CHECK     __KB_RAWBODY_200 && (GENERIC_IXHASH ||
                      NIXSPAM_IXHASH || CTYME_IXHASH || HOSTEUROPE_IXHASH)
describe IXHASH_CHECK BODY: MD5 checksum matches known spam
score IXHASH_CHECK    0 2 0 2

-- 
Adam Katz
khopesh on irc://irc.freenode.net/#spamassassin
http://khopesh.com/Anti-Spam

Re: iXhash with minimum size

Posted by Per Jessen <pe...@computer.org>.
Henrik K wrote:

> On Mon, Sep 28, 2009 at 12:21:29PM +0200, Per Jessen wrote:
>> Henrik K wrote:
>> 
>> > Current iXhash has many bugs, which I noticed when I worked on my
>> > own version with SA native DNS lookups.
>> > 
>> > One of the bigger problems of iXhash is probably of historical
>> > nature. There is no decoding of messages (base64 etc).
>> > 
>> > Looking at method #1, which is supposed to apply on messages with
>> > 20 spaces and 2 newlines:
>> > 
>> >   if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
>> > 
>> > Since it's buggily OR'd instead of &&
>> 
>> I think you've got some old code there.  My ixhash plugin has this
>> line instead:
>> 
>> if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {
> 
> I only know of http://ixhash.sf.net/ which results in
> iXhash-1.5.5.zip.

I guess I have an older version with the correct '&&'.  Interesting. 


/Per Jessen, Zürich


Re: iXhash with minimum size

Posted by Henrik K <he...@hege.li>.
On Mon, Sep 28, 2009 at 12:21:29PM +0200, Per Jessen wrote:
> Henrik K wrote:
> 
> > Current iXhash has many bugs, which I noticed when I worked on my own
> > version with SA native DNS lookups.
> > 
> > One of the bigger problems of iXhash is probably of historical nature.
> > There is no decoding of messages (base64 etc).
> > 
> > Looking at method #1, which is supposed to apply on messages with 20
> > spaces and 2 newlines:
> > 
> >   if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
> > 
> > Since it's buggily OR'd instead of &&
> 
> I think you've got some old code there.  My ixhash plugin has this line
> instead:
> 
> if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {

I only know of http://ixhash.sf.net/ which results in iXhash-1.5.5.zip.

So it seems you have an "unofficial" version, which really offers very
little improvement. Not only are those regexes extremely slow (I just
benchmarked), but it also happily hashes all base64 messages with 20 lines,
which probably also generates some nice FPs.
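
A rough sketch of that last point, in Python rather than the plugin's Perl.
It assumes the two clauses carry no /s or /m modifiers, as quoted above.
Under that assumption $ can only match at the end of the string (or before a
final newline), so the second clause succeeds on any input and the first
clause alone decides. The names FIRST, SECOND and gets_hashed and the sample
body are made up for illustration.

```python
import re

# Python stand-ins for the two quoted clauses (no /s, no /m).
FIRST = re.compile(r"([\s\t].+?){20}")
SECOND = re.compile(r".*$.*$.*")

def gets_hashed(body):
    # Both clauses must match, as in the "&&" variant.
    return bool(FIRST.search(body)) and bool(SECOND.search(body))

# Hypothetical base64 body: 21 lines and no spaces at all,
# so the only whitespace is the 20 newlines between lines.
b64_body = "\n".join(["QUJDREVGR0hJSktMTU5PUA=="] * 21)

print(bool(SECOND.search("")))          # True: matches even an empty body
print(bool(SECOND.search("one line")))  # True: no newline needed
print(gets_hashed(b64_body))            # True: newlines count as \s
```

Since \s also matches a newline, 20 line breaks followed by base64 text are
enough for the first clause, which is consistent with base64 bodies getting
hashed.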

Cheers,
Henrik

Re: iXhash with minimum size

Posted by Per Jessen <pe...@computer.org>.
Henrik K wrote:

> Current iXhash has many bugs, which I noticed when I worked on my own
> version with SA native DNS lookups.
> 
> One of the bigger problems of iXhash is probably of historical nature.
> There is no decoding of messages (base64 etc).
> 
> Looking at method #1, which is supposed to apply on messages with 20
> spaces and 2 newlines:
> 
>   if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){
> 
> Since it's buggily OR'd instead of &&

I think you've got some old code there.  My ixhash plugin has this line
instead:

if (($body =~ /([\s\t].+?){20}/ ) && ($body =~ /.*$.*$.*/)) {

> When I fixed this, for some reason hash #1 was rarely generated on a
> mail. It seems the /(?>\s.+?){20}/g clause seemed to match only when
> there are 20 whitespaces on the same line, which rarely happens.

On my test-system since 2009-09-28 00:00 I have 

hash#1 - 1529 generated.
hash#2 - 7740 generated
hash#3 - 820 generated


/Per Jessen, Zürich


Re: iXhash with minimum size

Posted by Henrik K <he...@hege.li>.
On Sat, Sep 26, 2009 at 11:11:05AM -0400, Adam Katz wrote:
> Karsten Bräckelmann wrote:
> > > This is a plain RE rule I once wrote, to limit some rule to really short
> > > messages only.
> > >
> > >    rawbody __KB_RAWBODY_200  /^.{0,200}$/s
> 
> Warren Togami mused:
> > I suspect meta limiting Adam's IXHASH rules with a minimum size subrule
> > would eliminate many of the IXHASH false positives.  I was using his
> > IXHASH plugin for a while, but stopped because I noticed too many FP's
> > on short e-mails.  I wonder if his IXHASH plugin is suitable to put into
> > the sandbox for actual statistical testing.
> 
> Quick note - iXhash isn't mine.  The project is the brainchild of Dirk
> Bonengel, http://dbonengel.users.sourceforge.net/#, who was inspired by
> NiX Spam (by Bert Ungerer).  The credits at http://ixhash.sf.net/ don't
> actually mention Dirk (Dirk -- take credit!).

FYI..

Current iXhash has many bugs, which I noticed when I worked on my own
version with SA native DNS lookups.

One of the bigger problems of iXhash is probably of historical nature. There
is no decoding of messages (base64 etc).

Looking at method #1, which is supposed to apply on messages with 20
spaces and 2 newlines:

  if (($body =~ /(?>\s.+?){20}/g) || ( $body =~ /\n.*\n/ ) ){

Since it's buggily OR'd instead of &&, it's enough that the mail has only two
newlines. Short base64 messages in particular are basically hashed from a few
newlines and equal signs, so even completely different contents end up with
the same hash.
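
A minimal sketch of the difference, in Python rather than the plugin's Perl.
It models only the stated intent of the check (at least 20 whitespace
characters and at least 2 newlines), not the plugin's actual regexes. The
function names and the sample body are hypothetical.

```python
import re

def has_20_whitespace(body):
    # Intent of the first clause: at least 20 whitespace characters.
    return len(re.findall(r"\s", body)) >= 20

def has_2_newlines(body):
    # Intent of the second clause: at least 2 newlines.
    return body.count("\n") >= 2

def method1_applies_or(body):
    # Shape of the shipped condition: either clause is enough.
    return has_20_whitespace(body) or has_2_newlines(body)

def method1_applies_and(body):
    # Shape of the intended condition: both clauses are required.
    return has_20_whitespace(body) and has_2_newlines(body)

# Hypothetical short base64 body: two lines, no spaces.
short_b64 = "aGVsbG8gd29ybGQ=\nZm9vYmFy\n"

print(method1_applies_or(short_b64))   # True: two newlines suffice
print(method1_applies_and(short_b64))  # False: only 2 whitespace chars
```

With the OR, any body that has two line breaks qualifies for method #1, no
matter how little text it contains.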

When I fixed this, for some reason hash #1 was rarely generated on a mail.
It seems the /(?>\s.+?){20}/g clause matched only when there are 20
whitespaces on the same line, which rarely happens. Anyway, making it
/(?:\s.+?){20}/s worked, but some foreign mails made the RE hang for tens of
seconds, so I rewrote it in a completely different way.
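
A sketch of how the three forms behave, using Python stand-ins for the Perl
patterns. One assumption is spelled out in the code: inside an atomic group
the lazy .+? takes exactly one character and can never grow, so
(?>\s.+?){20} is written here as (?:\s.){20}. That avoids depending on
atomic-group support in Python's re module. The pattern names and sample
bodies are illustrative only.

```python
import re

# Assumed equivalent of (?>\s.+?){20}: 20 consecutive pairs of
# "one whitespace character, then exactly one other character".
ATOMIC_EQUIV = re.compile(r"(?:\s.){20}")
# Non-atomic form without /s: "." cannot cross a newline.
LAZY = re.compile(r"(?:\s.+?){20}")
# Non-atomic form with /s (re.DOTALL): "." also matches newlines.
LAZY_DOTALL = re.compile(r"(?:\s.+?){20}", re.DOTALL)

def matches(pattern, body):
    return pattern.search(body) is not None

# 24 ordinary words on a single line (23 spaces).
one_line = " ".join(["word"] * 24)
# Six 4-word paragraphs separated by blank lines.
paragraphs = "\n\n".join(" ".join(["word"] * 4) for _ in range(6))
# Twenty one-letter "words", the kind of text the atomic form accepts.
single_chars = " a" * 20

print(matches(ATOMIC_EQUIV, one_line))      # False
print(matches(ATOMIC_EQUIV, single_chars))  # True
print(matches(LAZY, one_line))              # True
print(matches(LAZY, paragraphs))            # False: blank lines block "."
print(matches(LAZY_DOTALL, paragraphs))     # True
```

Under these assumptions the atomic form rejects even a long line of normal
words, and the non-atomic form still needs the /s modifier to get past blank
lines. The sketch does not reproduce the slow matching seen on some mails.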

If someone wants to have a look, here is my unofficial version. All the FPs
I got are practically gone.

http://sa.hege.li/iXhash2.pm
http://sa.hege.li/iXhash2.cf

I've let Dirk know about the bugs, we'll see what the future brings. Maybe a
real iXhash2 that actually does decoding etc. I'm sure there could be many
more enhancements, so I think this is a good time for many eyes to take a
serious look at the REs and methods! These bugs went unnoticed for quite a
long time.


Re: iXhash with minimum size

Posted by John Hardin <jh...@impsec.org>.
On Sat, 26 Sep 2009, Adam Katz wrote:

> Warren Togami mused:
>> I noticed too many FP's on short e-mails.
>
> Combining that with Karsten's rawbody check (though I'm not sure what char
> length threshold would be a good one), we'd get (please unwrap meta line):
>
> meta IXHASH_CHECK     __KB_RAWBODY_200 && (GENERIC_IXHASH ||
>                      NIXSPAM_IXHASH || CTYME_IXHASH || HOSTEUROPE_IXHASH)

Shouldn't that be !__KB_RAWBODY_200 ?
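
A small truth-table sketch of the two variants, in Python. It treats
__KB_RAWBODY_200 as a plain boolean "message is short" and the four list
rules as booleans; how SpamAssassin actually evaluates the rawbody rule is
not modelled. The function name and the sample hits are hypothetical.

```python
def ixhash_check(is_short, list_hits, negate):
    # is_short stands in for __KB_RAWBODY_200; list_hits for the
    # GENERIC/NIXSPAM/CTYME/HOSTEUROPE results.
    gate = (not is_short) if negate else is_short
    return gate and any(list_hits)

hits = [False, True, False, False]  # say only NIXSPAM_IXHASH matched

# Rule as posted: fires only on short messages.
print(ixhash_check(True, hits, negate=False))   # True
print(ixhash_check(False, hits, negate=False))  # False

# Rule with the negation: fires only on messages that are not short.
print(ixhash_check(True, hits, negate=True))    # False
print(ixhash_check(False, hits, negate=True))   # True
```

If the goal is to suppress hits on short mails, the negated form is the one
that matches that goal in this model.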

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   North Korea: the only country in the world where people would risk
   execution to flee to communist China.                  -- Ride Fast
-----------------------------------------------------------------------
  Approximately 8887200 firearms legally purchased in the U.S. this year