Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/04/02 18:44:24 UTC

Re: Fundamental question about spam image processing.

Jeff writes:
> On Apr 2, 2007, at 11:00 AM, Steven W. Orr wrote:
> 
> > On Friday I attended the annual Spam Conference at MIT. While  
> > there, I spoke with a person who was an employee of Sophos. They  
> > are very proud of the proprietary spam filtering they do. We talked  
> > about SA and FuzzyOCR and I learned that they do extremely accurate  
> > spam analysis on image attachments without OCR. I was very  
> > intrigued because FuzzyOCR AFAICT is hugely CPU intensive. I tried  
> > running it at home and it worked for me (to a point) but I can't  
> > imagine this being viable in an industrial setting.
> >
> > It turns out that the basis for their analysis is to look at the  
> > size of the image as well as the number of colors. 99.99% of all  
> > spam images have fewer than 16 colors. Once they found an image with  
> > 22 colors. This sounds like a dirt cheap way to get a huge boost in  
> > spam recognition. They may have other tricks they do, but I just  
> > wanted to report what I learned.
> >
> > Can we do this?
> 
> Sounds like a perfect Summer of Code project.

Except, as Chris noted, we already have a ruleset that does this
almost exactly as described -- ImageInfo ;)

Dunno if it decodes the CLUT (colour look-up table) header yet to
examine the number of colours, though.  Seems like an easy thing to
try out...
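
For the curious, here is a rough sketch of what reading the colour table
size from a GIF header could look like. This is not ImageInfo's code;
the function name gif_palette_info is made up for illustration, and it
only handles the GIF global colour table. Note that it reports how many
palette entries the header declares, which is an upper bound on the
colours actually used, not a count of them.

```python
import struct

def gif_palette_info(data: bytes):
    """Return (width, height, palette_entries) from a GIF header.

    Returns None if the data does not start with a GIF signature.
    palette_entries is 0 when no global colour table is declared.
    """
    # 6-byte signature followed by the 7-byte logical screen descriptor.
    if len(data) < 13 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None

    # Width and height are little-endian 16-bit values; the next byte
    # is a packed field describing the global colour table.
    width, height, packed = struct.unpack("<HHB", data[6:11])

    # Bit 7: global colour table present.
    # Bits 0-2: N, where the table holds 2**(N + 1) entries.
    has_global_table = bool(packed & 0x80)
    entries = 2 ** ((packed & 0x07) + 1) if has_global_table else 0

    return width, height, entries
```

A rule could then compare the declared entries against a small
threshold such as 16, with the threshold being something to tune
against a real corpus rather than take on faith.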

--j.