Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/04/02 18:44:24 UTC
Re: Fundamental question about spam image processing.
Jeff writes:
> On Apr 2, 2007, at 11:00 AM, Steven W. Orr wrote:
>
> > On Friday I attended the annual Spam Conference at MIT. While
> > there, I spoke with a person who was an employee of Sophos. They
> > are very proud of the proprietary spam filtering they do. We talked
> > about SA and FuzzyOCR and I learned that they do extremely accurate
> > spam analysis on image attachments without OCR. I was very
> > intrigued because FuzzyOCR AFAICT is hugely CPU intensive. I tried
> > running it at home and it worked for me (to a point) but I can't
> > imagine this being viable in an industrial setting.
> >
> > It turns out that the basis for their analysis is to look at the
> > size of the image as well as the number of colors. 99.99% of all
> > spam images have less than 16 colors. Once they found an image with
> > 22 colors. This sounds like a dirt cheap way to get a huge boost in
> > spam recognition. They may have other tricks they do, but I just
> > wanted to report what I learned.
> >
> > Can we do this?
>
> Sounds like a perfect Summer of Code project.
Except, as Chris noted, we already have a ruleset that does this
almost exactly as described -- ImageInfo ;)
Dunno if it yet decodes the CLUT (colour lookup table) header to
examine the number of colours, though. Seems like an easy thing to
try out...
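For illustration, here is a rough sketch of what "decoding the CLUT header" could look like for a GIF attachment. It is written in Python rather than as a SpamAssassin plugin, and it is not how ImageInfo works; the function names and the 16-colour threshold are assumptions made up for this example. It reads only the GIF logical screen descriptor, so the number it returns is the declared size of the global colour table (always a power of two), which is an upper bound on the colours actually used, not an exact count.

```python
import struct

def gif_palette_info(data):
    """Return (width, height, palette_entries) read from a GIF header.

    Returns None if the data does not start with a GIF signature.
    palette_entries is 0 when the file declares no global colour table.
    (Hypothetical helper, for illustration only.)
    """
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Logical screen descriptor: width, height (little-endian 16-bit),
    # then a packed byte holding the colour table flag and size.
    width, height, packed = struct.unpack("<HHB", data[6:11])
    has_global_table = bool(packed & 0x80)
    # The low three bits hold N, and the table has 2^(N+1) entries.
    entries = 2 ** ((packed & 0x07) + 1) if has_global_table else 0
    return width, height, entries

def looks_low_colour(data, max_colours=16):
    """True if the GIF declares a global palette of at most max_colours.

    The threshold of 16 is an assumption taken from the figure quoted
    in this thread, not a tuned value.
    """
    info = gif_palette_info(data)
    return info is not None and 0 < info[2] <= max_colours
```

Since only the first 13 bytes of the attachment are inspected, a check like this costs next to nothing compared with OCR. It would miss GIFs that rely on per-frame local colour tables, and other formats such as PNG would need their own header parsing.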
--j.