Posted to commits@spamassassin.apache.org by he...@apache.org on 2021/05/30 05:14:19 UTC
svn commit: r1890317 - in /spamassassin/trunk: UPGRADE lib/Mail/SpamAssassin/Conf.pm lib/Mail/SpamAssassin/Util/DependencyInfo.pm

Author: hege
Date: Sun May 30 05:14:19 2021
New Revision: 1890317

URL: http://svn.apache.org/viewvc?rev=1890317&view=rev
Log:
Enable normalize_charset by default (Bug 7656)

Modified:
    spamassassin/trunk/UPGRADE
    spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm
    spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm

Modified: spamassassin/trunk/UPGRADE
URL: http://svn.apache.org/viewvc/spamassassin/trunk/UPGRADE?rev=1890317&r1=1890316&r2=1890317&view=diff
==============================================================================
--- spamassassin/trunk/UPGRADE (original)
+++ spamassassin/trunk/UPGRADE Sun May 30 05:14:19 2021
@@ -2,6 +2,12 @@
 Note for Users Upgrading to SpamAssassin 4.0.0
 ----------------------------------------------
 
+- Setting normalize_charset is now enabled by default.  Note that rules
+  should not expect specific non-UTF8 or UTF8 encoding in body.  Matching is
+  done against raw bytes, which may vary depending on the normalize_charset
+  setting and whether decoding to UTF8 was successful.
+  See: http://wiki.apache.org/spamassassin/WritingRulesAdvanced
+
 - Meta rules no longer use priority values, they are evaluated dynamically
   when the rules they depend on are finished.  (Bug 7735)
 

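Sites whose local rules were written against undecoded body bytes can
presumably keep the previous behaviour by setting the option explicitly.  A
minimal sketch, assuming site-wide settings live in local.cf (the file name
is an assumption; the option and its 0 | 1 values are as documented in
Conf.pm):

```
# local.cf (assumed location of site-wide settings)
# Restore the pre-4.0.0 default: do not recode text parts to UTF-8.
normalize_charset 0
```

Leaving the option unset now has the same effect as "normalize_charset 1".
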
Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm
URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm?rev=1890317&r1=1890316&r2=1890317&view=diff
==============================================================================
--- spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm (original)
+++ spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm Sun May 30 05:14:19 2021
@@ -1250,7 +1250,7 @@ Select the locales to allow from the lis
     type => $CONF_TYPE_STRING,
   });
 
-=item normalize_charset ( 0 | 1 )        (default: 0)
+=item normalize_charset ( 0 | 1 )        (default: 1)
 
 Whether to decode non-UTF-8 and non-ASCII textual parts and recode them
 to UTF-8 before the text is given over to rules processing. The character
@@ -1272,7 +1272,7 @@ it will be used if it is available.
 
   push (@cmds, {
     setting => 'normalize_charset',
-    default => 0,
+    default => 1,
     type => $CONF_TYPE_BOOL,
     code => sub {
 	my ($self, $key, $value, $line) = @_;
@@ -3182,6 +3182,12 @@ non-text MIME parts are stripped, and th
 Quoted-Printable or Base-64-encoded format if necessary.  Parts declared as
 text/html will be rendered from HTML to text.
 
+Body is processed as a raw byte string, which means Unicode-specific regex
+features like \p{} can NOT be used for matching.  The normalize_charset
+setting will also affect how raw bytes are presented.  Rules in .cf files
+should be written portably: to match an "a with umlaut" character, look for
+both LATIN1 and UTF8 raw byte variants: /(?:\xE4|\xC3\xA4)/
+
 All body paragraphs (double-newline-separated blocks of text) are turned into
 a single line with line breaks removed and whitespace normalized.  Any lines longer
 than 2kB are split into shorter separate lines (from a boundary when

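The portable-matching advice in the Conf.pm hunk could look like this in a
.cf file.  This is only a sketch: the rule name LOCAL_A_UMLAUT, the
description and the score are hypothetical; the regex is the one quoted in
the documentation text.

```
# Hypothetical rule; matches "a with umlaut" as a Latin-1 byte (E4)
# or as its two-byte UTF-8 encoding (C3 A4).
body      LOCAL_A_UMLAUT  /(?:\xE4|\xC3\xA4)/
describe  LOCAL_A_UMLAUT  Body contains a-umlaut as Latin-1 or UTF-8 bytes
score     LOCAL_A_UMLAUT  0.001
```

Because matching is byte-based, the single-byte alternative can also hit
unrelated text: 0xE4 is the lead byte of some three-byte UTF-8 sequences, so
a rule like this may need more surrounding context to be useful.
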
Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm
URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm?rev=1890317&r1=1890316&r2=1890317&view=diff
==============================================================================
--- spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm (original)
+++ spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm Sun May 30 05:14:19 2021
@@ -252,11 +252,10 @@ our @OPTIONAL_MODULES = (
 {
   module => 'Encode::Detect::Detector',
   version => 0,
-  desc => 'If you plan to use the normalize_charset config setting to
-  decode message parts from their declared character set into Unicode, and
-  such decoding fails, the Encode::Detect::Detector module (when available)
-  may be consulted to provide an alternative guess on a character set of a
-  problematic message part.',
+  desc => 'If normalize_charset decoding of message parts from their
+  declared character set into Unicode fails, the Encode::Detect::Detector
+  module (when available) may be consulted to provide an alternative guess
+  on a character set of a problematic message part.',
 },
 {
   module => 'Net::Patricia',