Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/12/07 16:09:30 UTC

[Bug 4725] New: Add support for extracting terms from gif images for bayes subsystem

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4725

           Summary: Add support for extracting terms from gif images for
                    bayes subsystem
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: spamassassin
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: kaede.news@online.ru


The SpamAssassin Bayes subsystem should probably support extracting terms
from GIF images in emails. These terms could provide much-needed data for emails
that contain nothing but headers and an image attachment.

This idea was first suggested by nico [tbb@hideout.ath.cx] to the author of
the SpamProbe Bayesian filter, and when implemented it showed an improvement
in spam filtering.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


------- Additional Comments From sidney@sidney.com  2005-12-07 20:15 -------
Here are some details I found after digging into it. This is a feature being
tried out in the experimental 1.3x2 release of SpamProbe, as announced at
http://sourceforge.net/mailarchive/message.php?msg_id=14058893

Google didn't find any detailed discussion of the feature.

The source code (file src/parser/GifParser.cc from the 1.3x2 source tarball)
appears to use libungif to extract the following information from GIF files
embedded in an email and to generate tokens from it for the Bayesian filter. I
don't see anything to handle JPEG images right now:

MD5 digest of the image (I think it is a digest of the image bytes as they
appear in the message, not parsed or decompressed to pixels, but I'm not sure)

height, width, left, top, interlaced or not, color map or no color map, bits
per pixel, the red, green and blue values in the color map if there is one, and
the extension code and characters of any GIF extension records.

The "image number" is made part of the tokens in the above paragraph (not the
MD5), which I think may have to do with multiple images in a single GIF object.
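
The kind of token extraction described above can be sketched roughly as
follows. This is only an illustration in Python using the standard library; it
is not SpamProbe's GifParser.cc and does not use libungif. The function names
and the token spellings (gif_md5_..., gif_0_width_..., and so on) are
assumptions made up for the example, and the layout of the GIF blocks follows
the published GIF89a format.

```python
import hashlib
import struct


def _read_sub_blocks(data, pos):
    """Read a chain of GIF data sub-blocks; return (payload, next position)."""
    payload = bytearray()
    while True:
        size = data[pos]
        pos += 1
        if size == 0:  # a zero-length sub-block terminates the chain
            return bytes(payload), pos
        payload += data[pos:pos + size]
        pos += size


def _color_tokens(prefix, table):
    """One token per complete RGB entry in a colour table."""
    return [f"{prefix}rgb_{table[i:i + 3].hex()}"
            for i in range(0, len(table) - 2, 3)]


def gif_tokens(data):
    """Return hypothetical Bayes tokens for the raw bytes of one GIF file."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        return []
    # Digest of the bytes as found in the message, not of decoded pixels.
    tokens = ["gif_md5_" + hashlib.md5(data).hexdigest()]
    try:
        # Logical screen descriptor: width, height, packed flags.
        _screen_w, _screen_h, packed = struct.unpack_from("<HHB", data, 6)
        pos = 13
        global_table = b""
        if packed & 0x80:  # a global colour table follows the descriptor
            size = 3 * (2 << (packed & 0x07))
            global_table = data[pos:pos + size]
            pos += size
        image_no = 0
        while True:
            block = data[pos]
            pos += 1
            if block == 0x2C:  # image descriptor
                left, top, width, height, flags = struct.unpack_from(
                    "<HHHHB", data, pos)
                pos += 9
                prefix = f"gif_{image_no}_"
                table, bits = global_table, (packed & 0x07) + 1
                if flags & 0x80:  # a local colour table overrides the global one
                    size = 3 * (2 << (flags & 0x07))
                    table, bits = data[pos:pos + size], (flags & 0x07) + 1
                    pos += size
                tokens += [
                    f"{prefix}width_{width}", f"{prefix}height_{height}",
                    f"{prefix}left_{left}", f"{prefix}top_{top}",
                    f"{prefix}interlaced_{int(bool(flags & 0x40))}",
                    f"{prefix}colormap_{int(bool(table))}",
                    f"{prefix}bpp_{bits}",
                ]
                tokens += _color_tokens(prefix, table)
                pos += 1  # skip the LZW minimum code size byte
                _, pos = _read_sub_blocks(data, pos)  # skip compressed pixels
                image_no += 1
            elif block == 0x21:  # extension record: label byte plus sub-blocks
                label = data[pos]
                pos += 1
                payload, pos = _read_sub_blocks(data, pos)
                tokens.append(f"gif_{image_no}_ext_{label:02x}_{payload.hex()}")
            else:  # trailer (0x3B) or something unexpected: stop
                break
    except (IndexError, struct.error):
        pass  # truncated file: keep whatever tokens were collected
    return tokens
```

Every token except the MD5 carries the image number, so a GIF object holding
several images yields a separate group of tokens per image.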




felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WORKSFORME




------- Additional Comments From felicity@apache.org  2006-12-31 13:01 -------
In 3.2, plugins can "render" any part, including image/* parts, to text, which
will be used in body rules, the bayes tokenizer, etc.  So I think this is done. :)



------- Additional Comments From sidney@sidney.com  2005-12-08 20:51 -------
> I wonder if stuff like specific colour choices (e.g. "this image contains
> #040400 and #ffff82") would make a good signature?

I'm very skeptical. It's one thing to match on something that is characteristic
of the content of spam (e.g., V!agra or $$$!!!), but there is no reason for spam
to have characteristic image sizes or color maps or use of colors. Those are
easily changed by spammers to arbitrary values if we do start looking for
characteristic values.




------- Additional Comments From jm@jmason.org  2005-12-07 21:00 -------
there are spammers who randomise the colour lookup table, inserting random
values in the unused spaces, and reordering the CLUT in random order, to defeat
MD5 sums -- so that's probably not useful.

I wonder if stuff like specific colour choices (e.g. "this image contains
#040400 and #ffff82") would make a good signature?
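
To illustrate the difference between the two approaches mentioned above, here
is a minimal Python sketch. It assumes the image has already been decoded into
a palette (a list of RGB tuples) and a list of pixel indices; the function
names and the img_color_ token spelling are hypothetical. It shows that
reordering the colour table changes a digest of the raw data while tokens
built from the colours actually used stay the same. It says nothing about
whether such tokens separate spam from ham.

```python
import hashlib


def raw_digest(palette, pixels):
    """MD5 over palette bytes followed by pixel indices (stand-in for file bytes)."""
    flat = bytes(c for rgb in palette for c in rgb) + bytes(pixels)
    return hashlib.md5(flat).hexdigest()


def used_color_tokens(palette, pixels):
    """Tokens for colours that pixels reference, independent of palette order."""
    used = {palette[i] for i in pixels}
    return sorted("img_color_%02x%02x%02x" % rgb for rgb in used)


def reorder_palette(palette, pixels, order):
    """Reorder the palette (order[new] = old) and remap pixels to match."""
    new_index = {old: new for new, old in enumerate(order)}
    return [palette[old] for old in order], [new_index[i] for i in pixels]
```

Because used_color_tokens only looks at entries that some pixel references,
random values placed in unused palette slots do not affect its output either.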



------- Additional Comments From sidney@sidney.com  2005-12-07 22:17 -------
Even if this technique has some success in SpamProbe, which only uses Bayes
(and we don't yet know that, as it is a new experimental feature), that would
still not indicate whether it would do more in SpamAssassin than the HTML_IMAGE*
rules. Of course I would not mind seeing the results if somebody wants to
extract information from spam and ham embedded images and run some tests.




------- Additional Comments From jm@jmason.org  2005-12-08 21:14 -------
'there is no reason for spam
to have characteristic image sizes or color maps or use of colors.'

actually, there was; certain spammers would use certain sizes, colors, etc. in
their campaigns.

'Those are
easily changed by spammers to arbitrary values if we do start looking for
characteristic values'

and there's the rub.  In my recent testing (at $DAYJOB), they've been doing a
lot of this.  I doubt these features are reliable indicators anymore against
current spam. :(



------- Additional Comments From lwilton@earthlink.net  2005-12-09 08:48 -------
Standard text formatting patterns (or, more commonly, standard HTML markup
patterns) are still moderately good spam indicators in many cases.  Since text
formatting patterns can be useful, I see no reason why the generic concept of
image formatting patterns shouldn't be about as useful.

However, I'm talking about "image formatting patterns" as a generality, and not
necessarily the exact colors or image sizes.  This is probably a really good
place for a fuzzy matching algorithm that could do things like determine that
x% of the image is background color, or that there is a pattern change 37% of
the way from top to bottom, or that 22% of the space appears to be text, etc.

I have no idea at all how to produce those statistics in a useful way for 
filtering.
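
As a rough illustration of the first statistic mentioned above (the share of
the image that is background colour), here is a small Python sketch. It
assumes the image is already decoded into rows of pixel values, treats the
most common pixel value as the background, and rounds the result down into
coarse buckets so that slightly different images can share a token. The
function names, the token spelling and the 10% bucket width are all
assumptions for the example, not a tested filtering method.

```python
from collections import Counter


def background_fraction(rows):
    """Fraction of pixels equal to the most common pixel value."""
    counts = Counter(p for row in rows for p in row)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts.most_common(1)[0][1] / total


def bucket_token(name, fraction, step=10):
    """Quantise a 0..1 fraction into a percentage bucket of width `step`."""
    pct = min(int(fraction * 100) // step * step, 100 - step)
    return f"{name}_{pct}"
```

For example, bucket_token("img_bg", background_fraction(rows)) gives one token
per image that changes only when the background share crosses a bucket
boundary.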



