Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/02/19 23:52:29 UTC
Re: message metadata (for Bayes etc.)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
>jm@jmason.org (Justin Mason) writes:
>
>> - language detection is called here (if "ok_languages" != "all") and the
>>   language token is added as a metadatum called "X-Language".  (TODO:
>>   this should be conditional, because language rec is a slow process,
>
>Language detection will be faster once/if we add the XS implementation.
>
>>   but is ("ok_languages" != "all") the right way to enable it?)
>
>Yes, but someone might want to set "ok_languages all" and
>still send the detected languages to Bayes.
>
>I'd suggest we add a new option "use_language_detection" to
>enable/disable the test for 3.0.

Yeah.  +1
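
To make the proposal concrete, here is a rough sketch of how the gating
could look.  It is Python purely for illustration, not SpamAssassin
code.  "use_language_detection" is the option proposed above; the
fallback to the existing ("ok_languages" != "all") check when the
option is unset, and the helper names wants_language_detection,
add_language_metadata and detect, are my own assumptions.

```python
def wants_language_detection(conf):
    """Return True if language detection should run for this configuration."""
    # Assumption: an explicit use_language_detection setting always wins.
    explicit = conf.get("use_language_detection")
    if explicit is not None:
        return bool(explicit)
    # Assumption: with the option unset, keep the current behaviour and
    # detect only when ok_languages restricts the accepted languages.
    return conf.get("ok_languages", "all") != "all"


def add_language_metadata(metadata, conf, detect):
    """Store the detected languages as the X-Language metadatum."""
    if wants_language_detection(conf):
        languages = detect()  # the slow step, skipped unless enabled
        if languages:
            metadata["X-Language"] = " ".join(languages)
    return metadata
```

With this shape, someone can set "ok_languages all" together with
"use_language_detection 1" and still get X-Language passed on to Bayes.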

>> - In addition, the MsgMetadata class holds some parsing/rendering code; it
>>   calls the HTML renderer and holds the HTML features hash, and also now
>>   holds the functions that make the "decoded"/"rendered" text arrays.
>>   
>>   Note that there's an open question as to whether rendered data, and
>>   features discovered during that rendering, are really "metadata".  These
>>   may be more appropriate to put in another class, either in the root
>>   MsgContainer or another class off that.
>
>I don't really care, but the data is just used for eval tests.
>
>There may be some utility to attaching some metadata on a per-HTML-part
>basis, but the code isn't really consistent on that point (per-HTML-part
>or per-message) because we used to render everything as one big blob if
>we detected HTML somewhere in the message and now we render HTML parts
>precisely.
> 
>A reasonable guess is that some HTML metadata will want to remain
>per-HTML-part and some will be per-message, but I'm not really sure at
>this point.

Well, the thing is -- as far as I know, the MUA we emulate (Outlook
Express) displays only one HTML part.  We should figure out which part
that is (if we haven't already?) and emulate that behaviour -- just
keep stats and features from that one HTML part.

After all, we don't *care* if an attachment looks spammy, because
that's not what the user will see.

Then we can say "html features for message X" == "html features
for message X part Y", and just plonk it all on the message
object.
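
A minimal sketch of that idea, in Python for illustration only (this is
not the SpamAssassin code).  The selection rule is a guess: it takes
the first text/html part that is not marked as an attachment, which may
well not match what Outlook Express really does.  displayed_html_part,
message_html_features and extract_features are hypothetical names.

```python
from email import message_from_string


def displayed_html_part(msg):
    """Return the HTML part assumed to be the one the MUA shows, or None."""
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        # Assumption: an HTML part sent as an attachment is never rendered.
        if part.get_content_disposition() == "attachment":
            continue
        return part
    return None


def message_html_features(msg, extract_features):
    """Compute HTML features for the one displayed part, as message-level data."""
    part = displayed_html_part(msg)
    if part is None:
        return {}
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "latin-1"
    try:
        html = payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to a decoding that cannot fail.
        html = payload.decode("latin-1")
    return extract_features(html)
```

The caller passes in whatever feature extractor it likes; the result is
stored once per message, and HTML attachments never contribute to it.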

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFANT4sQTcbUG5Y7woRAjpRAJ9qVEfgOkC0cXw3F/zFRgw1hH33UgCfX+dY
KlcIatPlI2NpUfxJkyEtkSE=
=vKUv
-----END PGP SIGNATURE-----