Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2014/12/19 20:36:24 UTC

[Bug 7115] New: Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

            Bug ID: 7115
           Summary: Adding SHA digests of MIME parts as Bayes tokens
                    allows bayes to 'see' non-textual content
           Product: Spamassassin
           Version: 3.4 SVN branch
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Libraries
          Assignee: dev@spamassassin.apache.org
          Reporter: Mark.Martinec@ijs.si

Created attachment 5262
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5262&action=edit
suggested change

As promised, here is an enhancement to bayes token-collecting code
to also take into account message-digests (currently SHA1) of
*each* leaf MIME part, regardless of it being textual or non-textual,
and including all alternative-parts.

The idea is based on a suggestion (made earlier this year, May 2014)
by Andreas Schulze, who experimented with collecting and analyzing
MIME part digests in Amavis, with interesting results.
It seems to me a natural next step is to feed this data to the
existing Bayes classifier in SpamAssassin and let it do its magic.

Besides allowing bayes to also notice non-textual mail content
like attached icons, photos, PDF, Office documents, PowerPoint,
encrypted or compressed parts, it also 'sees' textual parts
'as a whole', including such parts as ASCII-art-only parts, mostly
empty parts, etc.

The code is fairly straightforward, just takes advantage of
existing Base64 and quoted-printable decoding, and existing
Digest::SHA or older Digest::SHA1 module, the same as already
used by the Bayes plugin.

If a caller already has MIME part digests computed, it may
pass them to SpamAssassin and avoid duplicate processing.
This also makes it possible for SpamAssassin's Bayes to notice
digests of *all* MIME parts, even when a message is very large
and only partly passed (truncated) to SpamAssassin.

Early results are encouraging. Observing the top 5 bayes tokens
as reported by macros HAMMYTOKENS and SPAMMYTOKENS, after a day
or two (with Bayes auto-learning enabled) one can start noticing
interesting spammy tokens like empty or mostly-empty text/plain
parts, virus attachments, or hammy tokens like season's greeting
comics being passed around among friends these days, or business
documents.

Btw, initially I used digests directly as bayes tokens, which is
mostly fine, except in the case of empty or mostly empty MIME parts,
where it seems more appropriate to distinguish, for example,
an empty text/plain from an empty text/html and an empty text/xml.
So I ended up with a bayes token consisting of a MIME part digest
concatenated with the Content-Type of the MIME part, which now makes
more sense.
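
A minimal sketch of the token construction described above, written in
Python rather than the Perl used by SpamAssassin. It is not the attached
patch: the function name mimepart_tokens and the ":" separator between
digest and Content-Type are assumptions made for illustration only.

```python
import hashlib
from email import message_from_bytes


def mimepart_tokens(raw_message: bytes) -> list[str]:
    """Return one token per leaf MIME part: SHA1 of the decoded body
    followed by the part's Content-Type (separator is hypothetical)."""
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no content of their own
        # decode=True undoes Base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha1(payload).hexdigest()
        tokens.append(f"{digest}:{part.get_content_type()}")
    return tokens
```

Two parts with identical decoded bytes but different Content-Type values
yield different tokens, which is the distinction wanted for empty parts.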

During testing a couple of inconsistencies were discovered, like
non-compliant QP decoding in MS::Utils (now fixed), mangling
of Content-Type values containing dots (now fixed), and breakage done
intentionally by the MIME parser in MS::Message (like splitting long
lines into multiple lines, deleting sequences of more than 20 empty
lines), which I have not touched but which warrants reconsideration.
Also it seems that complete first sections of delivery reports are
being discarded by the MIME parser; this needs to be investigated.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #6 from Mark Martinec <Ma...@ijs.si> ---
Created attachment 5263
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5263&action=edit
added configurability

> > I'd REALLY like to see this extra tokenizing as a switchable option.
> Will do something along these lines.

Here it comes. Adds a config option, and conditionalizes sources
of input to Bayes. Most of the diff is due to indentation change,
consistency of variable names, and some cosmetics.

This is the added documentation (man Mail::SpamAssassin::Conf):


bayes_token_sources  (default: header visible invisible uri)

  Controls which sources in a mail message can contribute tokens
  (e.g. words, phrases, etc.) to a Bayes classifier. The argument is
  a space-separated list of keywords (header, visible, invisible,
  uri, mimepart), each of which may be prefixed by "no" to indicate
  its exclusion. Additionally two reserved keywords are allowed: all
  and none (or: noall). The list of keywords is processed
  sequentially: a keyword all adds all available keywords to a set
  being built, a none or noall clears the set, other non-negated
  keywords are added to the set, and negated keywords are removed
  from the set. Keywords are case-insensitive.

  The default set is: header visible invisible uri, which is
  equivalent for example to: All NoMIMEpart. The reason why mimepart
  is not currently in a default set is that it is a newer source
  (introduced with SpamAssassin version 3.4.1) and not much
  experience has yet been gathered regarding its usefulness.

  See also option "bayes_ignore_header" for a fine-grained control on
  individual header fields under the umbrella of a more general
  keyword header here.

  Keywords imply the following data sources:

    header - tokens collected from a message header section
    visible - words from visible text (plain or HTML) in a message body
    invisible - hidden/invisible text in HTML parts of a message body
    uri - URIs collected from a message body
    mimepart - digests (hashes) of all MIME parts (textual or non-
      textual) of a message, computed after Base64 and quoted-printable
      decoding, suffixed by their Content-Type
    all - adds all the above keywords to the set being assembled
    none or noall - removes all keywords from the set

  The "bayes_token_sources" directive may appear multiple times, its
  keywords are interpreted sequentially, adding or removing items
  from the final set as they appear in their order in
  "bayes_token_sources" directive(s).

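The sequential keyword processing documented above can be sketched as
follows. This is a Python illustration, not the code committed to
Conf.pm; the function name resolve_token_sources is hypothetical, and
starting from the default set before the first directive is an
assumption based on the documented default.

```python
ALL_SOURCES = ("header", "visible", "invisible", "uri", "mimepart")
DEFAULT_SOURCES = ("header", "visible", "invisible", "uri")


def resolve_token_sources(directives):
    """Apply each directive's keywords in order, starting from the
    default set (assumed), and return the resulting set of sources."""
    selected = set(DEFAULT_SOURCES)
    for directive in directives:
        # keywords are case-insensitive and space-separated
        for keyword in directive.lower().split():
            if keyword == "all":
                selected.update(ALL_SOURCES)
            elif keyword in ("none", "noall"):
                selected.clear()
            elif keyword in ALL_SOURCES:
                selected.add(keyword)
            elif keyword.startswith("no") and keyword[2:] in ALL_SOURCES:
                selected.discard(keyword[2:])  # negated keyword
            else:
                raise ValueError(f"unknown keyword: {keyword}")
    return selected
```

For example, resolve_token_sources(["All NoMIMEpart"]) gives back the
default set, matching the equivalence stated in the documentation.
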
[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #7 from Mark Martinec <Ma...@ijs.si> ---
Bug 7115: Adding SHA digests of MIME parts as Bayes tokens
allows bayes to 'see' non-textual content - added configurability
  Sending lib/Mail/SpamAssassin/Conf.pm
  Sending lib/Mail/SpamAssassin/Plugin/Bayes.pm
Committed revision 1647707.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

John Hardin <jh...@impsec.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhardin@impsec.org

--- Comment #8 from John Hardin <jh...@impsec.org> ---
(In reply to Mark Martinec from comment #6)
>   Keywords imply the following data sources:
> 
>     header - tokens collected from a message header section
>     visible - words from visible text (plain or HTML) in a message body
>     invisible - hidden/invisible text in HTML parts of a message body
>     uri - URIs collected from a message body
>     mimepart - digests (hashes) of all MIME parts (textual or non-
>       textual) of a message, computed after Base64 and quoted-printable
>       decoding, suffixed by their Content-Type

I'd like to ask you to consider adding something like
textattachvis/textattachinvis, which pulls words (visible/hidden) from text
attachments (plain or HTML, detected by MIME type or filename extension). 

One tactic spammers use is to attach a plain text or HTML file and the body of
the message is "please see the attachment", and the attachment is obvious spam
or something like a phishing form. SA doesn't scan that because it's not
strictly "visible message body text".

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #14 from AXB <ax...@gmail.com> ---
now we're getting all these new bayes bells & whistles, wouldn't it be a good
moment to get rid of the BAYES_999 and do the correct BAYES_100 ?

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #11 from Mark Martinec <Ma...@ijs.si> ---
Bug 7115, more informative bayes debugging:
  report number of tokens for each source
Sending lib/Mail/SpamAssassin/Plugin/Bayes.pm
Committed revision 1648372.

This change adds token count to the following debugging report,
e.g.:
  dbg: bayes: tokenized body: 2444 tokens
  dbg: bayes: tokenized uri: 1178 tokens
  dbg: bayes: tokenized invisible: 2 tokens
  dbg: bayes: tokenized mime parts: 2 tokens
  dbg: bayes: tokenized header: 83 tokens

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #12 from Mark Martinec <Ma...@ijs.si> ---
While investigating John's concern, I realized there are three almost
identical code copies in Message.pm, which only differ in the hash key
where they cache result, and in a method being called.
Couldn't resist factoring out common code from
  get_rendered_body_text_array(),
  get_visible_rendered_body_text_array(), and
  get_invisible_rendered_body_text_array()

No functional change.

Bug 7115, factoring out common code in Message.pm
  Sending lib/Mail/SpamAssassin/Message.pm
Committed revision 1648381.
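
The shape of such a refactoring can be sketched as below. This is a
Python illustration only, not the code in Message.pm: the Part class,
the cache key strings, the render_count counter and the helper name
_text_array are all hypothetical. The only point carried over from the
description is that the three accessors differ in the cache key and in
the method being called.

```python
class Part:
    """Hypothetical MIME part that can render its text three ways."""

    def __init__(self, visible, invisible):
        self._visible = list(visible)
        self._invisible = list(invisible)

    def rendered(self):
        return self._visible + self._invisible

    def visible_rendered(self):
        return list(self._visible)

    def invisible_rendered(self):
        return list(self._invisible)


class Message:
    def __init__(self, parts):
        self.parts = parts
        self._cache = {}       # cache key -> list of text lines
        self.render_count = 0  # how many times rendering actually ran

    def _text_array(self, cache_key, method_name):
        """Shared body of the three accessors below."""
        if cache_key not in self._cache:
            self.render_count += 1
            lines = []
            for part in self.parts:
                lines.extend(getattr(part, method_name)())
            self._cache[cache_key] = lines
        return self._cache[cache_key]

    def get_rendered_body_text_array(self):
        return self._text_array("text_rendered", "rendered")

    def get_visible_rendered_body_text_array(self):
        return self._text_array("text_visible_rendered", "visible_rendered")

    def get_invisible_rendered_body_text_array(self):
        return self._text_array("text_invisible_rendered", "invisible_rendered")
```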


Btw, there was a comment at get_visible_rendered_body_text_array()

  # TODO: possibly this should just replace get_rendered_body_text_array().
  # (although watch out, this one doesn't copy {html} to metadata)

which does not match the code: all three routines do/did copy
{html} to metadata the same way (unless it's already there).

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #9 from Mark Martinec <Ma...@ijs.si> ---
(In reply to John Hardin from comment #8)
> I'd like to ask you to consider adding something like
> textattachvis/textattachinvis, which pulls words (visible/hidden) from text
> attachments (plain or HTML, detected by MIME type or filename extension).  
> One tactic spammers use is to attach a plain text or HTML file and the body
> of the message is "please see the attachment", and the attachment is obvious
> spam or something like a phishing form. SA doesn't scan that because it's
> not strictly "visible message body text".

Isn't this how it already works?

I checked tokenization of a test message which was a multipart/mixed,
where the first subtree was multipart/alternative with a text/plain and
text/html parts, followed by one text/plain and one text/html attachment.
Words from all four MIME parts ended up as Bayes tokens.

Please attach a sample message where you find that textual attachments
were not tokenized for Bayes.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #13 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to Mark Martinec from comment #12)
> Couldn't resist factoring out common code from

+1.00000000000

KAM

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #15 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to Kevin A. McGrail from comment #13)
> (In reply to Mark Martinec from comment #12)
> > Couldn't resist factoring out common code from
> 
> +1.00000000000
> 
> KAM

(In reply to AXB from comment #14)
> now we're getting all these new bayes bells & whistles, would't it be a good
> moment to get rid of the BAYES_999 and do the correct BAYES_100 ?

Not a real high priority for me.  bug 7013 had some discussion on this but I
seem to remember there were barriers in code that we found when trending down
this path before.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

RW <rw...@googlemail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rwmaillists@googlemail.com

--- Comment #5 from RW <rw...@googlemail.com> ---
(In reply to Henrik Krohns from comment #2)
> I actually run something similar, tokenizing attachment names
> etc, but overall it make very little difference. I think it actually hurt in
> some cases, but I don't remember the exact figures anymore..

It seems likely that a binary would be reused more commonly than a filename.
IIRC the OCR plugin used to have a caching option which I presume was based on
checksum. 

A single token might not have much effect on the Bayes result, but it might be
very effective to use Bayes to keep track of attachment checksums and have a
separate rule for scoring checksums only seen in spam.
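
One way to picture that idea is the sketch below. It is a Python
illustration and not an existing SpamAssassin rule or plugin: the class
name ChecksumTracker, its methods, and the min_spam threshold are all
assumptions.

```python
from collections import defaultdict


class ChecksumTracker:
    """Hypothetical store of how often each attachment checksum was
    learned as spam or as ham."""

    def __init__(self):
        # checksum -> [spam count, ham count]
        self.counts = defaultdict(lambda: [0, 0])

    def learn(self, checksum, is_spam):
        self.counts[checksum][0 if is_spam else 1] += 1

    def spam_only(self, checksum, min_spam=2):
        """True if the checksum was seen in spam at least min_spam
        times and never in ham; a separate rule could score on this."""
        spam, ham = self.counts.get(checksum, (0, 0))
        return ham == 0 and spam >= min_spam
```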

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #3 from AXB <ax...@gmail.com> ---
Imo, anything involving "non-textual mail content" should be optional.

I'd REALLY like to see this extra tokenizing as a switchable option.

bayes_uses_non_textual = [1-0]

Can you do that, Mark?

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #16 from AXB <ax...@gmail.com> ---
(In reply to Kevin A. McGrail from comment #15)
> (In reply to Kevin A. McGrail from comment #13)
> > (In reply to Mark Martinec from comment #12)
> > > Couldn't resist factoring out common code from
> > 
> > +1.00000000000
> > 
> > KAM
> 
> (In reply to AXB from comment #14)
> > now we're getting all these new bayes bells & whistles, would't it be a good
> > moment to get rid of the BAYES_999 and do the correct BAYES_100 ?
> 
> Not a real high priority for me.  bug 7013 had some discussion on this but I
> seem to remember there were barriers in code that we found when trending
> down this path before.

from what I'm seeing all it needs is refactoring a couple of eval:check_bayes
rules in 23_bayes.cf and adapting the scores.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #10 from John Hardin <jh...@impsec.org> ---
(In reply to Mark Martinec from comment #9)
> (In reply to John Hardin from comment #8)
> > I'd like to ask you to consider adding something like
> > textattachvis/textattachinvis, which pulls words (visible/hidden) from text
> > attachments (plain or HTML, detected by MIME type or filename extension).  
> > One tactic spammers use is to attach a plain text or HTML file and the body
> > of the message is "please see the attachment", and the attachment is obvious
> > spam or something like a phishing form. SA doesn't scan that because it's
> > not strictly "visible message body text".
> 
> Isn't this how it already works?
> 
> I checked tokenization of a test message which was a multipart/mixed,
> where the first subtree was multipart/alternative with a text/plain and
> text/html parts, followed by one text/plain and one text/html attachment.
> Words from all four MIME parts ended up as Bayes tokens.

I did not actually test this before making the suggestion. I was making the
assumption, apparently erroneous, that the Bayes tokenization behavior
paralleled the behavior for BODY rules, where text attachments are not included
because they aren't part of the "visible message body".

Thanks for actually checking, and apologies for the noise.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #17 from Kevin A. McGrail <km...@pccc.com> ---
(In reply to AXB from comment #16)
> (In reply to Kevin A. McGrail from comment #15)
> > (In reply to Kevin A. McGrail from comment #13)
> > > (In reply to Mark Martinec from comment #12)
> > > > Couldn't resist factoring out common code from
> > > 
> > > +1.00000000000
> > > 
> > > KAM
> > 
> > (In reply to AXB from comment #14)
> > > now we're getting all these new bayes bells & whistles, would't it be a good
> > > moment to get rid of the BAYES_999 and do the correct BAYES_100 ?
> > 
> > Not a real high priority for me.  bug 7013 had some discussion on this but I
> > seem to remember there were barriers in code that we found when trending
> > down this path before.
> 
> from what I'm seeing all it needs is refactoring a cou'le of
> eval:check_bayes rules in 23_bayes.cf and adapting the scores.

I seem to remember finding hard coded Bayes items related to the BAYES_99 and
BAYES_999 stuff.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

Henrik Krohns <he...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hege@hege.li

--- Comment #2 from Henrik Krohns <he...@hege.li> ---
To quote Justin, have you run a 10-fold cross validation?? :-D

I don't think just seeing tokens somewhere means anything about overall
efficiency? I actually run something similar, tokenizing attachment names etc,
but overall it makes very little difference. I think it actually hurt in some
cases, but I don't remember the exact figures anymore..

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.4.1

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #1 from Mark Martinec <Ma...@ijs.si> ---
trunk:
  Sending lib/Mail/SpamAssassin/Message.pm
  Sending lib/Mail/SpamAssassin/Plugin/Bayes.pm
Committed revision 1646848.

[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115

--- Comment #4 from Mark Martinec <Ma...@ijs.si> ---
> I'd REALLY like to see this extra tokenizing as a switchable option.

Will do something along these lines.
