Posted to dev@spamassassin.apache.org by Henrik K <he...@hege.li> on 2021/04/30 18:20:10 UTC

header address parser changeset committed

Please note the large changeset and give it a try.  I've been tweaking it all
week, and it should be good for general use.
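
For callers, the main visible change is that get() is now context-sensitive.
Below is a rough, self-contained sketch of that calling pattern.  Note that
get_sketch() and %fake_headers are hypothetical stand-ins made up for this
illustration; they are not SpamAssassin code and only mimic the list/scalar
behaviour described in the log for an :addr query.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: pretend a From header held two addresses.
my %fake_headers = (
  'From:addr' => [ 'example@foo', 'example@bar' ],
);

# Hypothetical stand-in for $pms->get(); NOT the real implementation.
# List context returns every value, scalar context only the first one,
# and a missing header yields an empty list or the legacy '' value.
sub get_sketch {
  my ($request) = @_;
  my $found = $fake_headers{$request} || [];
  if (@$found) {
    return wantarray ? @$found : $found->[0];
  }
  return wantarray ? () : '';
}

# Preferred: list context, all values.
my @all = get_sketch('From:addr');

# Legacy: scalar context, first value only.
my $first = get_sketch('From:addr');

# When a single value is wanted, slice the list explicitly.
my $only = (get_sketch('From:addr'))[0];

# Missing header: empty list in list context, '' in scalar context.
my @none  = get_sketch('Cc:addr');
my $empty = get_sketch('Cc:addr');

print "list:   @all\n";
print "scalar: $first\n";
print "slice:  $only\n";
```

The (...)[0] slice is the same idiom the changeset uses at call sites that
previously relied on the scalar return value.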


On Fri, Apr 30, 2021 at 06:17:51PM -0000, hege@apache.org wrote:
> Author: hege
> Date: Fri Apr 30 18:17:51 2021
> New Revision: 1889337
> 
> URL: http://svn.apache.org/viewvc?rev=1889337&view=rev
> Log:
> - Improved internal header address (From/To/Cc) parser, now also handles
>   multiple addresses.  Optional support for external Email::Address::XS
>   parser, which can handle nested comments and other oddities.
> 
> - Header :addr :name modifiers now returns all addresses.  :first :last
>   select only first (topmost) or last header to process, when there are
>   multiple headers with the same name (:addr and :name may still return
>   multiple values from a single header).
> 
> - API: $pms->get() can and should now be called in list context.  Scalar
>   context continues to return multiple values newline separated, but this
>   should be considered deprecated.
> 
> 
> Added:
>     spamassassin/trunk/t/data/spam/freemail1
>     spamassassin/trunk/t/data/spam/freemail2
>     spamassassin/trunk/t/data/spam/freemail3
> Modified:
>     spamassassin/trunk/MANIFEST
>     spamassassin/trunk/UPGRADE
>     spamassassin/trunk/lib/Mail/SpamAssassin.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/FreeMail.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/SPF.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm
>     spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm
>     spamassassin/trunk/t/SATest.pm
>     spamassassin/trunk/t/data/Dumpheaders.pm
>     spamassassin/trunk/t/data/nice/unicode1
>     spamassassin/trunk/t/freemail.t
>     spamassassin/trunk/t/freemail_welcome_block.t
>     spamassassin/trunk/t/get_all_headers.t
>     spamassassin/trunk/t/get_headers.t   (contents, props changed)
>     spamassassin/trunk/t/header_utf8.t   (contents, props changed)
> 
> Modified: spamassassin/trunk/MANIFEST
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/MANIFEST?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/MANIFEST (original)
> +++ spamassassin/trunk/MANIFEST Fri Apr 30 18:17:51 2021
> @@ -414,6 +414,9 @@ t/data/spam/esp/sendgrid_id.eml
>  t/data/spam/esp/sendgrid_id.txt
>  t/data/spam/extracttext/gtube_jpg.eml
>  t/data/spam/extracttext/gtube_pdf.eml
> +t/data/spam/freemail1
> +t/data/spam/freemail2
> +t/data/spam/freemail3
>  t/data/spam/gtube.eml
>  t/data/spam/gtubedcc.eml
>  t/data/spam/gtubedcc_crlf.eml
> 
> Modified: spamassassin/trunk/UPGRADE
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/UPGRADE?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/UPGRADE (original)
> +++ spamassassin/trunk/UPGRADE Fri Apr 30 18:17:51 2021
> @@ -2,6 +2,19 @@
>  Note for Users Upgrading to SpamAssassin 4.0.0
>  ----------------------------------------------
>  
> +- Improved internal header address (From/To/Cc) parser, now also handles
> +  multiple addresses.  Optional support for external Email::Address::XS
> +  parser, which can handle nested comments and other oddities.
> +
> +- Header :addr :name modifiers now returns all addresses.  :first :last
> +  select only first (topmost) or last header to process, when there are
> +  multiple headers with the same name (:addr and :name may still return
> +  multiple values from a single header).
> +
> +- API: $pms->get() can and should now be called in list context.  Scalar
> +  context continues to return multiple values newline separated, but this
> +  should be considered deprecated.
> +
>  - New ExtractText plugin that extracts text from documents or images and feed it
>    into SpamAssassin
>  
> 
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin.pm Fri Apr 30 18:17:51 2021
> @@ -1064,8 +1064,11 @@ sub add_all_addresses_to_blacklist {
>  
>    my @addrlist;
>    my @hdrs = $mail_obj->get_header('From');
> -  if ($#hdrs >= 0) {
> -    push (@addrlist, $self->find_all_addrs_in_line (join (" ", @hdrs)));
> +  foreach my $hdr (@hdrs) {
> +    my @addrs = Mail::SpamAssassin::Util::parse_header_addresses($hdr);
> +    foreach my $addr (@addrs) {
> +      push @addrlist, $addr->{address} if defined $addr->{address};
> +    }
>    }
>  
>    foreach my $addr (@addrlist) {
> @@ -2244,8 +2247,12 @@ sub find_all_addrs_in_mail {
>    				Errors-To Mail-Followup-To))
>    {
>      my @hdrs = $mail_obj->get_header($header);
> -    if ($#hdrs < 0) { next; }
> -    push (@addrlist, $self->find_all_addrs_in_line(join (" ", @hdrs)));
> +    foreach my $hdr (@hdrs) {
> +      my @addrs = Mail::SpamAssassin::Util::parse_header_addresses($hdr);
> +      foreach my $addr (@addrs) {
> +        push @addrlist, $addr->{address} if defined $addr->{address};
> +      }
> +    }
>    }
>  
>    # find addrs in body, too
> 
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Conf.pm Fri Apr 30 18:17:51 2021
> @@ -5430,6 +5430,7 @@ sub feature_subjprefix { 1 } # add subje
>  sub feature_bayes_stopwords { 1 } # multi language stopwords in Bayes
>  sub feature_get_host { 1 } # $pms->get() :host :domain :ip :revip # was implemented together with AskDNS::has_tag_header # Bug 7734
>  sub feature_blocklist_welcomelist { 1 } # bz 7826
> +sub feature_header_address_parser { 1 } # improved header address parsing using Email::Address::XS, $pms->get() list context
>  sub has_tflags_nosubject { 1 } # tflags nosubject
>  sub has_tflags_nolog { 1 } # tflags nolog
>  sub perl_min_version_5010000 { return $] >= 5.010000 }  # perl version check ("perl_version" not neatly backwards-compatible)
> 
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/PerMsgStatus.pm Fri Apr 30 18:17:51 2021
> @@ -62,7 +62,7 @@ use Mail::SpamAssassin::AsyncLoop;
>  use Mail::SpamAssassin::Conf;
>  use Mail::SpamAssassin::Util qw(untaint_var base64_encode idn_to_ascii
>                                  uri_list_canonicalize reverse_ip_address
> -                                is_fqdn_valid);
> +                                is_fqdn_valid parse_header_addresses);
>  use Mail::SpamAssassin::Timeout;
>  use Mail::SpamAssassin::Logger;
>  
> @@ -1953,21 +1953,24 @@ sub extract_message_metadata {
>    # tags (explicitly required for DMARC, RFC 7489)
>    #
>    { local $1;
> -    my $addr = $self->get('EnvelopeFrom:addr', undef);
> +    my $host = ($self->get('EnvelopeFrom:first:addr:host'))[0];
>      # collect a FQDN, ignoring potential trailing WSP
> -    if (defined $addr && $addr =~ /\@([^@. \t]+\.[^@ \t]+?)[ \t]*\z/s) {
> -      my $d = idn_to_ascii($1);
> +    if (defined $host) {
> +      my $d = idn_to_ascii($host);
>        $self->set_tag('SENDERDOMAIN', $d);
>        $self->{msg}->put_metadata("X-SenderDomain", $d);
>        dbg("metadata: X-SenderDomain: %s", $d);
>      }
> -    # TODO: the get ':addr' only returns the first address; this should be
> -    # augmented to be able to return all addresses in a header field, multiple
> -    # addresses in a From header field are allowed according to RFC 5322
> -    $addr = $self->get('From:addr', undef);
> -    if (defined $addr && $addr =~ /\@([^@. \t]+\.[^@ \t]+?)[ \t]*\z/s) {
> -      my $d = idn_to_ascii($1);
> -      $self->set_tag('AUTHORDOMAIN', $d);
> +    my @from_doms;
> +    my %seen;
> +    foreach ($self->get('From:addr:host')) {
> +      next if $seen{$_}++;
> +      my $d = idn_to_ascii($_);
> +      push @from_doms, $d;
> +    }
> +    if (@from_doms) {
> +      $self->set_tag('AUTHORDOMAIN', @from_doms > 1 ? \@from_doms : $from_doms[0]);
> +      my $d = join(" ", @from_doms);
>        $self->{msg}->put_metadata("X-AuthorDomain", $d);
>        dbg("metadata: X-AuthorDomain: %s", $d);
>      }
> @@ -2031,25 +2034,32 @@ sub get_decoded_stripped_body_text_array
>  
>  =item $status->get (header_name [, default_value])
>  
> -Returns a message header, pseudo-header, real name or address.
> -C<header_name> is the name of a mail header, such as 'Subject', 'To',
> -etc.  If C<default_value> is given, it will be used if the requested
> -C<header_name> does not exist.
> -
> -Appending C<:raw> to the header name will inhibit decoding of quoted-printable
> -or base-64 encoded strings.
> -
> -Appending a modifier C<:addr> to a header field name will cause everything
> -except the first email address to be removed from the header field.  It is
> -mainly applicable to header fields 'From', 'Sender', 'To', 'Cc' along with
> -their 'Resent-*' counterparts, and the 'Return-Path'. For example, all of
> -the following will result in "example@foo":
> +Returns a message header, pseudo-header or a real name, email-address or
> +some other parsed value set by modifiers.  C<header_name> is the name of a
> +mail header, such as 'Subject', 'To', etc.
> +
> +Should be called in list context since 4.0.  Will return list of headers
> +content, or other values when modifiers used.
> +
> +If C<default_value> is given, it will be used if the requested
> +C<header_name> does not exist.  This is mainly useful when called in scalar
> +context to set 'undef' instead of legacy '' return value when header does
> +not exist.
> +
> +Appending C<:raw> modifier to the header name will inhibit decoding of
> +quoted-printable or base-64 encoded strings.
> +
> +Appending C<:addr> modifier to the header name will return all
> +email-addresses found in the header.  It is mainly applicable to header
> +fields 'From', 'Sender', 'To', 'Cc' along with their 'Resent-*'
> +counterparts, and the 'Return-Path'.  For example, all of the following will
> +result in "example@foo" (and "example@bar"):
>  
>  =over 4
>  
>  =item example@foo
>  
> -=item example@foo (Foo Blah)
> +=item example@foo (Foo Blah), <ex...@bar>
>  
>  =item example@foo, example@bar
>  
> @@ -2063,18 +2073,18 @@ the following will result in "example@fo
>  
>  =back
>  
> -Appending a modifier C<:name> to a header field name will cause everything
> -except the first display name to be removed from the header field. It is
> -mainly applicable to header fields containing a single mail address: 'From',
> -'Sender', along with their 'Resent-From' and 'Resent-Sender' counterparts.
> -For example, all of the following will result in "Foo Blah". One level of
> -single quotes is stripped too, as it is often seen.
> +Appending C<:name> modifier to the header name will return all "display
> +names" from the header field.  As with C<:addr>, it is mainly applicable to
> +header fields 'From', 'Sender', 'To', 'Cc' along with their 'Resent-*'
> +counterparts, and the 'Return-Path'.  For example, all of the following will
> +result in "Foo Blah" (and "Bar Baz").  One level of single quotes is
> +stripped too, as it is often seen.
>  
>  =over 4
>  
>  =item example@foo (Foo Blah)
>  
> -=item example@foo (Foo Blah), example@bar
> +=item example@foo (Foo Blah), "Bar Baz" <ex...@bar>
>  
>  =item display: example@foo (Foo Blah), example@bar ;
>  
> @@ -2086,22 +2096,27 @@ single quotes is stripped too, as it is
>  
>  =back
>  
> -Appending a modifier C<:host> to a header field name will return the first
> -hostname-looking string that ends with a valid TLD. First it tries to find a
> -match after @ character (possible email), then from any part of the header.
> -Normal use of this would be for example 'From:addr:host' to return the
> -hostname portion of a From-address.
> -
> -Appending a modifier C<:domain> to a header field name implies C<:host>,
> -but will return only domain part of the hostname, as returned by
> -RegistryBoundaries::trim_domain.
> -
> -Appending a modifier C<:ip> to a header field name, will return the first
> -IPv4 or IPv6 address string found. Could be used for example as
> -'X-Originating-IP:ip'.
> -
> -Appending a modifier C<:revip> to a header field name implies C<:ip>,
> -but will return the found IP in reverse (usually for DNSBL usage).
> +Appending C<:host> to the header name will return the first hostname-looking
> +string that ends with a valid TLD.  First it tries to find a match after @
> +character (possible email), then from any part of the header.  Normal use of
> +this would be for example 'From:addr:host' to return the hostname portion of
> +a From-address.
> +
> +Appending C<:domain> to the header name implies C<:host>, but will return
> +only domain part of the hostname, as returned by
> +RegistryBoundaries::trim_domain().
> +
> +Appending C<:ip> to the header name, will return the first IPv4 or IPv6
> +address string found.  Could be used for example as 'X-Originating-IP:ip'.
> +
> +Appending C<:revip> to the header name implies C<:ip>, but will return the
> +found IP in reverse (usually for DNSBL usage).
> +
> +Appending C<:first> modifier to the header name will return only the first
> +(topmost) header, in case there are multiple ones.  Similarly C<:last> will
> +select the last one.  These affect only the physical header line selection. 
> +If selected header is parsed further with C<:addr> or similar, it may return
> +multiple results, if the selected header contains multiple addresses.
>  
>  There are several special pseudo-headers that can be specified:
>  
> @@ -2143,6 +2158,12 @@ the message has passed through
>  =item C<X-Spam-Relays-Trusted> is the generated metadata of trusted relays
>  the message has passed through
>  
> +=item C<X-Spam-Relays-External> is the generated metadata of external relays
> +the message has passed through
> +
> +=item C<X-Spam-Relays-Internal> is the generated metadata of internal relays
> +the message has passed through
> +
>  =back
>  
>  =cut
> @@ -2151,98 +2172,106 @@ the message has passed through
>  sub _get {
>    my ($self, $request) = @_;
>  
> -  my $result;
> +  my @results;
>    my $getaddr = 0;
>    my $getname = 0;
>    my $getraw = 0;
> +  my $needraw = 0;
>    my $gethost = 0;
>    my $getdomain = 0;
>    my $getip = 0;
>    my $getrevip = 0;
> +  my $getfirst = 0;
> +  my $getlast = 0;
>  
>    # special queries - process and strip modifiers
>    if (index($request,':') >= 0) {  # triage
>      local $1;
>      while ($request =~ s/:([^:]*)//) {
>        if    ($1 eq 'raw')    { $getraw  = 1 }
> -      elsif ($1 eq 'addr')   { $getaddr = $getraw = 1 }
> -      elsif ($1 eq 'name')   { $getname = 1 }
> +      elsif ($1 eq 'addr')   { $getaddr = $needraw = 1 }
> +      elsif ($1 eq 'name')   { $getname = $needraw = 1 }
>        elsif ($1 eq 'host')   { $gethost = 1 }
>        elsif ($1 eq 'domain') { $gethost = $getdomain = 1 }
>        elsif ($1 eq 'ip')     { $getip = 1 }
>        elsif ($1 eq 'revip')  { $getip = $getrevip = 1 }
> +      elsif ($1 eq 'first')  { $getfirst = 1 }
> +      elsif ($1 eq 'last')   { $getlast = 1 }
>      }
>    }
>    my $request_lc = lc $request;
>  
>    # ALL: entire pristine or semi-raw headers
>    if ($request eq 'ALL') {
> -    return ($getraw ? $self->{msg}->get_pristine_header()
> -                    : $self->{msg}->get_all_headers(0));
> +    if ($getraw) {
> +      @results = $self->{msg}->get_pristine_header() =~ /^([^ \t].*?\n)(?![ \t])/smgi;
> +    } else {
> +      @results = $self->{msg}->get_all_headers(0);
> +    }
> +    return \@results;
>    }
>    # ALL-TRUSTED: entire trusted raw headers
>    elsif ($request eq 'ALL-TRUSTED') {
>      # '+1' since we added the received header even though it's not considered
>      # trusted, so we know that those headers can be trusted too
> -    return $self->get_all_hdrs_in_rcvd_index_range(
> +    @results = $self->get_all_hdrs_in_rcvd_index_range(
>  			undef, $self->{last_trusted_relay_index}+1,
>  			undef, undef, $getraw);
> +    return \@results;
>    }
>    # ALL-INTERNAL: entire internal raw headers
>    elsif ($request eq 'ALL-INTERNAL') {
>      # '+1' for the same reason as in ALL-TRUSTED above
> -    return $self->get_all_hdrs_in_rcvd_index_range(
> +    @results = $self->get_all_hdrs_in_rcvd_index_range(
>  			undef, $self->{last_internal_relay_index}+1,
>  			undef, undef, $getraw);
> +    return \@results;
>    }
>    # ALL-UNTRUSTED: entire untrusted raw headers
>    elsif ($request eq 'ALL-UNTRUSTED') {
>      # '+1' for the same reason as in ALL-TRUSTED above
> -    return $self->get_all_hdrs_in_rcvd_index_range(
> +    @results = $self->get_all_hdrs_in_rcvd_index_range(
>  			$self->{last_trusted_relay_index}+1, undef,
>  			undef, undef, $getraw);
> +    return \@results;
>    }
>    # ALL-EXTERNAL: entire external raw headers
>    elsif ($request eq 'ALL-EXTERNAL') {
>      # '+1' for the same reason as in ALL-TRUSTED above
> -    return $self->get_all_hdrs_in_rcvd_index_range(
> +    @results = $self->get_all_hdrs_in_rcvd_index_range(
>  			$self->{last_internal_relay_index}+1, undef,
>  			undef, undef, $getraw);
> +    return \@results;
>    }
>    # EnvelopeFrom: the SMTP MAIL FROM: address
>    elsif ($request_lc eq "\LEnvelopeFrom") {
> -    $result = $self->get_envelope_from();
> +    push @results, $self->get_envelope_from();
>    }
>    # untrusted relays list, as string
>    elsif ($request_lc eq "\LX-Spam-Relays-Untrusted") {
> -    $result = $self->{relays_untrusted_str};
> +    push @results, $self->{relays_untrusted_str};
>    }
>    # trusted relays list, as string
>    elsif ($request_lc eq "\LX-Spam-Relays-Trusted") {
> -    $result = $self->{relays_trusted_str};
> +    push @results, $self->{relays_trusted_str};
>    }
>    # external relays list, as string
>    elsif ($request_lc eq "\LX-Spam-Relays-External") {
> -    $result = $self->{relays_external_str};
> +    push @results, $self->{relays_external_str};
>    }
>    # internal relays list, as string
>    elsif ($request_lc eq "\LX-Spam-Relays-Internal") {
> -    $result = $self->{relays_internal_str};
> +    push @results, $self->{relays_internal_str};
>    }
>    # ToCc: the combined recipients list
>    elsif ($request_lc eq "\LToCc") {
> -    $result = join("\n", $self->{msg}->get_header('To', $getraw));
> -    if ($result ne '') {
> -      chomp $result;
> -      $result .= ", " if $result =~ /\S/;
> -    }
> -    $result .= join("\n", $self->{msg}->get_header('Cc', $getraw));
> -    $result = undef if $result eq '';
> +    push @results, $self->{msg}->get_header('To', $getraw);
> +    push @results, $self->{msg}->get_header('Cc', $getraw);
>    }
>    # MESSAGEID: handle lists which move the real message-id to another
>    # header for resending.
>    elsif ($request eq 'MESSAGEID') {
> -    $result = join("\n", grep { defined($_) && $_ ne '' }
> +    push @results, grep { defined($_) && $_ ne '' } (
>  		   $self->{msg}->get_header('X-Message-Id', $getraw),
>  		   $self->{msg}->get_header('Resent-Message-Id', $getraw),
>  		   $self->{msg}->get_header('X-Original-Message-ID', $getraw),
> @@ -2250,115 +2279,126 @@ sub _get {
>    }
>    # a conventional header
>    else {
> -    my @results = $getraw ? $self->{msg}->raw_header($request)
> -                          : $self->{msg}->get_header($request);
> -  # dbg("message: get(%s)%s = %s",
> -  #     $request, $getraw?'raw':'', join(", ",@results));
> -    if (@results) {
> -      $result = join('', @results);
> -    } else {  # metadata
> -      $result = $self->{msg}->get_metadata($request);
> -    }
> -  }
> -
> -  # special queries
> -  if (defined $result && ($getaddr || $getname)) {
> -    local $1;
> -    $result =~ s/^[^:]+:(.*);\s*$/$1/gs;	# 'undisclosed-recipients: ;'
> -    $result =~ s/\s+/ /g;			# reduce whitespace
> -    $result =~ s/^\s+//;			# leading whitespace
> -    $result =~ s/\s+$//;			# trailing whitespace
> -
> -    if ($getaddr) {
> -      # Get the email address out of the header
> -      # All of these should result in "jm@foo":
> -      # jm@foo
> -      # jm@foo (Foo Blah)
> -      # jm@foo, jm@bar
> -      # display: jm@foo (Foo Blah), jm@bar ;
> -      # Foo Blah <jm...@foo>
> -      # "Foo Blah" <jm...@foo>
> -      # "'Foo Blah'" <jm...@foo>
> -      #
> -      # strip out the (comments)
> -      $result =~ s/\s*\(.*?\)//g;
> -      # strip out the "quoted text", unless it's the only thing in the string
> -      if ($result !~ /^".*"$/) {
> -        $result =~ s/(?<!<)"[^"]*"(?!\@)//g;   #" emacs
> -      }
> -      # Foo Blah <jm...@xxx> or <jm...@xxx>
> -      local $1;
> -      $result =~ s/^[^"<]*?<(.*?)>.*$/$1/;
> -      # multiple addresses on one line? remove all but first
> -      $result =~ s/,.*$//;
> -    }
> -    elsif ($getname) {
> -      # Get the display name out of the header
> -      # All of these should result in "Foo Blah":
> -      #
> -      # jm@foo (Foo Blah)
> -      # (Foo Blah) jm@foo
> -      # jm@foo (Foo Blah), jm@bar
> -      # display: jm@foo (Foo Blah), jm@bar ;
> -      # Foo Blah <jm...@foo>
> -      # "Foo Blah" <jm...@foo>
> -      # "'Foo Blah'" <jm...@foo>
> -      #
> -      local $1;
> -      # does not handle mailbox-list or address-list or quotes well, to be improved
> -      if ($result =~ /^ \s* " (.*?) (?<!\\)" \s* < [^<>]* >/sx ||
> -          $result =~ /^ \s* (.*?) \s* < [^<>]* >/sx) {
> -        $result = $1;  # display-name, RFC 5322
> -        # name-addr    = [display-name] angle-addr
> -        # display-name = phrase
> -        # phrase       = 1*word / obs-phrase
> -        # word         = atom / quoted-string
> -        # obs-phrase   = word *(word / "." / CFWS)
> -        $result =~ s{ " ( (?: [^"\\] | \\. )* ) " }
> -                { my $s=$1; $s=~s{\\(.)}{$1}gs; $s }gsxe;
> -        $result =~ s/\\"/"/gs;
> -      } elsif ($result =~ /^ [^(,]*? \( (.*?) \) /sx) {  # legacy form
> -        # nested comments are not handled, to be improved
> -        $result = $1;
> -      } else {  # no display name
> -        $result = '';
> +    my @res = $getraw||$needraw ? $self->{msg}->raw_header($request)
> +                                : $self->{msg}->get_header($request);
> +    if (!@res) {
> +      if (defined(my $m = $self->{msg}->get_metadata($request))) {
> +        push @res, $m;
> +      }
> +    }
> +    push @results, @res if @res;
> +  }
> +
> +  # Nothing found to process further, bail out quick
> +  if (!@results) {
> +    return \@results;
> +  }
> +
> +  # Continue processing only first (topmost) or last header
> +  if ($getfirst) {
> +    @results = ($results[0]);
> +  } elsif ($getlast) {
> +    @results = ($results[-1]);
> +  }
> +
> +  # special addr/name
> +  if ($getaddr || $getname) {
> +    my @res;
> +    foreach my $line (@results) {
> +      next unless defined $line;
> +      # Note: parse_header_addresses always called with raw undecoded value
> +      # Skip invalid addresses here
> +      my @addrs = parse_header_addresses($line);
> +      if (@addrs) {
> +        if ($getaddr) {
> +          foreach my $addr (@addrs) {
> +            push @res, $addr->{address} if defined $addr->{address};
> +          }
> +        }
> +        elsif ($getname) {
> +          foreach my $addr (@addrs) {
> +            next unless defined $addr->{phrase};
> +            if ($getraw) {
> +              # phrase=name, could also be username or comment unless name found
> +              push @res, $addr->{phrase};
> +            } else {
> +              # If :raw was not specifically asked, decode mimewords
> +              # TODO: silly call to Node module, should probably be in Util
> +              my $decoded = Mail::SpamAssassin::Message::Node::_decode_header(
> +                              $addr->{phrase}, "PMS:get:$request");
> +              # Normalize whitespace, unless it's all white-space
> +              if ($decoded =~ /\S/) {
> +                $decoded =~ s/\s+/ /gs;
> +                $decoded =~ s/^\s+//;
> +                $decoded =~ s/\s+$//;
> +                $decoded =~ s/^'(.*?)'$/$1/; # remove single quotes
> +              }
> +              push @res, $decoded if defined $decoded;
> +            }
> +          }
> +        }
>        }
> -      $result =~ s/^ \s* ' \s* (.*?) \s* ' \s* \z/$1/sx;
>      }
> +    @results = @res;
>    }
>  
>    # special host/domain
> -  if (defined $result && ($gethost || $getdomain || $getip)) {
> -    my $host;
> +  if (@results && ($gethost || $getdomain || $getip)) {
> +    my @res;
>      if ($gethost) {
> +      # TODO: IDN matching needs honing
>        my $tldsRE = $self->{main}->{registryboundaries}->{valid_tlds_re};
> -      my $hostRE = qr/(?<![._-])\b([a-z\d][a-z\d._-]{0,251}\.${tldsRE})\b(?![._-])/i;
> -      # try grabbing email/msgid domain first, because user part might look like
> -      # a valid host..
> -      if ($result =~ /.*\@${hostRE}/i && is_fqdn_valid($1)) {
> -        $host = $1;
> -      } else {
> -        # otherwise try hard to find a valid host
> -        while ($result =~ /${hostRE}/ig) {
> -          if (is_fqdn_valid($1)) {
> +      #my $hostRE = qr/(?<![._-])\b([a-z\d][a-z\d._-]{0,251}\.${tldsRE})\b(?![._-])/i;
> +      my $hostRE = qr/(?<![._-])(\S{1,251}\.${tldsRE})(?![._-])/i;
> +      foreach my $line (@results) {
> +        next unless defined $line;
> +        my $host;
> +        if ($getaddr) {
> +          # If :addr already preparsed the line, just grab domain liberally
> +          if ($line =~ /.*\@(\S+)/) {
>              $host = $1;
> -            last;
>            }
>          }
> -      }
> -      if ($host && $getdomain) {
> -        $host = $self->{main}->{registryboundaries}->trim_domain($host, 1);
> +        else {
> +          # try grabbing email/msgid domain first, because user part might look like
> +          # a valid host..
> +          if ($line =~ /.*\@${hostRE}/i) {
> +            if (is_fqdn_valid(idn_to_ascii($1), 1)) {
> +              $host = $1;
> +            }
> +          }
> +          # otherwise try hard to find a valid host
> +          if (!$host) {
> +            while ($line =~ /${hostRE}/ig) {
> +              if (is_fqdn_valid(idn_to_ascii($1), 1)) {
> +                $host = $1;
> +                last;
> +              }
> +            }
> +          }
> +        }
> +        if ($host) {
> +          if ($getdomain) {
> +            $host = $self->{main}->{registryboundaries}->trim_domain($host, 1);
> +          }
> +          push @res, $host;
> +        }
>        }
>      } else {
>        my $ipRE = qr/(?<!\.)\b(${IP_ADDRESS})\b(?!\.)/;
> -      if ($result =~ $ipRE) {
> -        $host = $getrevip ? reverse_ip_address($1) : $1;
> +      foreach my $line (@results) {
> +        next unless defined $line;
> +        my $host;
> +        if ($line =~ $ipRE) {
> +          $host = $getrevip ? reverse_ip_address($1) : $1;
> +        }
> +        push @res, $host  if defined $host;
>        }
>      }
> -    $result = $host;
> +    @results = @res;
>    }
>  
> -  return $result;
> +  return \@results;
>  }
>  
>  # optimized for speed
> @@ -2367,7 +2407,7 @@ sub _get {
>  # $_[2] is defval
>  sub get {
>    my $cache = $_[0]->{get_cache};
> -  my $found;
> +  my $found = [];
>    if (exists $cache->{$_[1]}) {
>      # return cache entry if it is known
>      # (measured hit/attempts rate on a production mailer is about 47%)
> @@ -2375,13 +2415,34 @@ sub get {
>    } else {
>      # fill in a cache entry
>      $found = _get(@_);
> +    # filter out undefined
> +    @$found = grep { defined } @$found;
>      $cache->{$_[1]} = $found;
>    }
>    # if the requested header wasn't found, we should return a default value
>    # as specified by the caller: if defval argument is present it represents
>    # a default value even if undef; if defval argument is absent a default
>    # value is an empty string for upwards compatibility
> -  return (defined $found ? $found : @_ > 2 ? $_[2] : '');
> +  if (@$found) {
> +    # new list context usage in 4.0, return all values always
> +    if (wantarray) {
> +      return @$found;
> +    }
> +    # legacy scalar context expected only single return value for some
> +    # queries, without a newline
> +    if ($_[1] =~ /:(?:addr|name|host|domain|ip|revip)\b/ ||
> +        $_[1] eq 'EnvelopeFrom') {
> +      my $res = $found->[0];
> +      $res =~ s/\n\z$//;
> +      return $res;
> +    } else {
> +      return join('', @$found);
> +    }
> +  } elsif (@_ > 2) {
> +    return wantarray ? ($_[2]) : $_[2];
> +  } else {
> +    return wantarray ? () : '';
> +  }
>  }
>  
>  ###########################################################################
> @@ -2698,15 +2759,16 @@ sub _process_dkim_uri_list {
>  
>    # Look for the domain in DK/DKIM headers
>    if ($self->{conf}->{parse_dkim_uris}) {
> -    my $dk = join(" ", grep {defined} ( $self->get('DomainKey-Signature',undef ),
> -                                        $self->get('DKIM-Signature',undef) ));
> -    while ($dk =~ /\bd\s*=\s*([^;]+)/g) {
> -      my $d = $1;
> -      $d =~ s/\s+//g;
> -      # prefix with domainkeys: so it doesn't merge with identical keys
> -      $self->add_uri_detail_list("domainkeys:$d",
> -        {'domainkeys'=>1, 'nocanon'=>1, 'noclean'=>1},
> -        'domainkeys', 1);
> +    foreach my $dk ( $self->get('DomainKey-Signature'),
> +                     $self->get('DKIM-Signature') ) {
> +      while ($dk =~ /\bd\s*=\s*([^;]+)/g) {
> +        my $d = $1;
> +        $d =~ s/\s+//g;
> +        # prefix with domainkeys: so it doesn't merge with identical keys
> +        $self->add_uri_detail_list("domainkeys:$d",
> +          {'domainkeys'=>1, 'nocanon'=>1, 'noclean'=>1},
> +          'domainkeys', 1);
> +      }
>      }
>    }
>  }
> @@ -3123,8 +3185,8 @@ sub get_envelope_from {
>    # Assume that because they have configured it, their MTA will always add it.
>    # This will prevent us falling through and picking up inappropriate headers.
>    if (defined $self->{conf}->{envelope_sender_header}) {
> -    # make sure we get the most recent copy - there can be only one EnvelopeSender.
> -    $envf = $self->get($self->{conf}->{envelope_sender_header}.":addr",undef);
> +    # get the most recent (topmost) copy - there can be only one EnvelopeSender.
> +    $envf = ($self->get($self->{conf}->{envelope_sender_header}.":first:addr"))[0];
>      # ok if it contains an "@" sign, or is "" (ie. "<>" without the < and >)
>      if (defined $envf && (index($envf, '@') > 0 || $envf eq '')) {
>        dbg("message: using envelope_sender_header '%s' as EnvelopeFrom: '%s'",
> @@ -3177,17 +3239,19 @@ sub get_envelope_from {
>    # lines, we cannot trust any Envelope-From headers, since they're likely to
>    # be incorrect fetchmail guesses.
>  
> -  if (index($self->get("X-Sender"), '@') != -1) {
> -    my $rcvd = join(' ', $self->get("Received"));
> -    if (index($rcvd, '(fetchmail') != -1) {
> -      dbg("message: X-Sender and fetchmail signatures found, cannot trust envelope-from");
> -      $self->{envelopefrom} = undef;
> -      return;
> +  my $x_sender = ($self->get("X-Sender:first:addr"))[0];
> +  if (defined $x_sender && index($x_sender, '@') != -1) {
> +    foreach ($self->get("Received")) {
> +      if (index($_, '(fetchmail') != -1) {
> +        dbg("message: X-Sender and fetchmail signatures found, cannot trust envelope-from");
> +        $self->{envelopefrom} = undef;
> +        return;
> +      }
>      }
>    }
>  
>    # procmailrc notes this (we now recommend adding it to Received instead)
> -  if (defined($envf = $self->get("X-Envelope-From:addr",undef))) {
> +  if (defined($envf = ($self->get("X-Envelope-From:first:addr"))[0])) {
>      # heuristic: this could have been relayed via a list which then used
>      # a *new* Envelope-from.  check
>      if ($self->get("ALL") =~ /^Received:.*?^X-Envelope-From:/smi) {
> @@ -3202,7 +3266,7 @@ sub get_envelope_from {
>    }
>  
>    # qmail, new-inject(1)
> -  if (defined($envf = $self->get("Envelope-Sender:addr",undef))) {
> +  if (defined($envf = ($self->get("Envelope-Sender:first:addr"))[0])) {
>      # heuristic: this could have been relayed via a list which then used
>      # a *new* Envelope-from.  check
>      if ($self->get("ALL") =~ /^Received:.*?^Envelope-Sender:/smi) {
> @@ -3221,7 +3285,7 @@ sub get_envelope_from {
>    #   data.  This use of return-path is required; mail systems MUST support
>    #   it.  The return-path line preserves the information in the <reverse-
>    #   path> from the MAIL command.
> -  if (defined($envf = $self->get("Return-Path:addr",undef))) {
> +  if (defined($envf = ($self->get("Return-Path:first:addr"))[0])) {
>      # heuristic: this could have been relayed via a list which then used
>      # a *new* Envelope-from.  check
>      if ($self->get("ALL") =~ /^Received:.*?^Return-Path:/smi) {
> @@ -3261,7 +3325,7 @@ sub get_all_hdrs_in_rcvd_index_range {
>    $include_end_rcvd = 1 unless defined $include_end_rcvd;
>  
>    my $cur_rcvd_index = -1;  # none found yet
> -  my $result = '';
> +  my @results;
>  
>    my @hdrs;
>    if ($getraw) {
> @@ -3280,14 +3344,20 @@ sub get_all_hdrs_in_rcvd_index_range {
>      }
>      if ((!defined $start_rcvd || $start_rcvd <= $cur_rcvd_index) &&
>  	(!defined $end_rcvd || $cur_rcvd_index < $end_rcvd)) {
> -      $result .= $hdr;
> +      push @results, $hdr;
>      }
>      elsif (defined $end_rcvd && $cur_rcvd_index == $end_rcvd) {
> -      $result .= $hdr;
> +      push @results, $hdr;
>        last;
>      }
>    }
> -  return ($result eq '' ? undef : $result);
> +
> +  if (wantarray) {
> +    return @results;
> +  } else {
> +    my $result = join('', @results);
> +    return ($result eq '' ? undef : $result);
> +  }
>  }
>  
>  ###########################################################################
> @@ -3377,9 +3447,9 @@ sub all_from_addrs {
>    my @addrs;
>  
>    # Resent- headers take priority, if present. see bug 672
> -  my $resent = $self->get('Resent-From',undef);
> -  if (defined $resent && $resent =~ /\S/) {
> -    @addrs = $self->{main}->find_all_addrs_in_line ($resent);
> +  my @resent = $self->get('Resent-From:first:addr');
> +  if (@resent) {
> +    @addrs = @resent;
>    }
>    else {
>      # bug 2292: Used to use find_all_addrs_in_line() with the same
> @@ -3387,17 +3457,18 @@ sub all_from_addrs {
>      # FNs for things like welcomelist_from (previously whitelist_from).  
>      # Since all of these are From
>      # headers, there should only be 1 address in each anyway (not exactly
> -    # true, RFC 2822 allows multiple addresses in a From header field),
> -    # so use the :addr code...
> +    # true, RFC 2822 allows multiple addresses in a From header field)
> +    # *** since 4.0 all addresses are returned from Header correctly ***
>      # bug 3366: some addresses come in as 'foo@bar...', which is invalid.
>      # so deal with the multiple periods.
> +    # TODO: 4.0 need :first:addr here ? Why check so many headers ?
>      ## no critic
>      @addrs = map { tr/././s; $_ } grep { $_ ne '' }
> -        ($self->get('From:addr'),		# std
> -         $self->get('Envelope-Sender:addr'),	# qmail: new-inject(1)
> -         $self->get('Resent-Sender:addr'),	# procmailrc manpage
> -         $self->get('X-Envelope-From:addr'),	# procmailrc manpage
> -         $self->get('EnvelopeFrom:addr'));	# SMTP envelope
> +      ($self->get('From:addr'),            # std
> +       $self->get('Envelope-Sender:addr'), # qmail: new-inject(1)
> +       $self->get('Resent-Sender:addr'),   # procmailrc manpage
> +       $self->get('X-Envelope-From:addr'), # procmailrc manpage
> +       $self->get('EnvelopeFrom:addr'));   # SMTP envelope
>      # http://www.cs.tut.fi/~jkorpela/headers.html is useful here
>    }
>  
> @@ -3455,47 +3526,52 @@ sub all_to_addrs {
>    my @addrs;
>  
>    # Resent- headers take priority, if present. see bug 672
> -  my $resent = join('', $self->get('Resent-To'), $self->get('Resent-Cc'));
> -  if ($resent =~ /\S/) {
> -    @addrs = $self->{main}->find_all_addrs_in_line($resent);
> +  my @resent = ( $self->get('Resent-To:first:addr'),
> +                 $self->get('Resent-Cc:first:addr') );
> +  if (@resent) {
> +    @addrs = @resent;
>    } else {
>      # OK, a fetchmail trick: try to find the recipient address from
>      # the most recent 3 Received lines.  This is required for sendmail,
>      # since it does not add a helpful header like exim, qmail
>      # or Postfix do.
>      #
> -    my $rcvd = $self->get('Received');
> -    $rcvd =~ s/\n[ \t]+/ /gs;
> -    $rcvd =~ s/\n+/\n/gs;
> -
> -    my @rcvdlines = split(/\n/, $rcvd, 4); pop @rcvdlines; # forget last one
> +    my @rcvd = ($self->get('Received'))[0 .. 2];
>      my @rcvdaddrs;
> -    foreach my $line (@rcvdlines) {
> -      if ($line =~ / for (\S+\@\S+);/) { push (@rcvdaddrs, $1); }
> +    foreach my $line (@rcvd) {
> +      next unless defined $line;
> +      if ($line =~ / for <?(\S+\@(\S+?))>?;/) {
> +        if (is_fqdn_valid(idn_to_ascii($2), 1)) {
> +          push @rcvdaddrs, $1;
> +        }
> +      }
>      }
>  
> -    @addrs = $self->{main}->find_all_addrs_in_line (
> -       join('',
> -	 join(" ", @rcvdaddrs)."\n",
> -         $self->get('To'),			# std
> -  	 $self->get('Apparently-To'),		# sendmail, from envelope
> -  	 $self->get('Delivered-To'),		# Postfix, poss qmail
> -  	 $self->get('Envelope-Recipients'),	# qmail: new-inject(1)
> -  	 $self->get('Apparently-Resent-To'),	# procmailrc manpage
> -  	 $self->get('X-Envelope-To'),		# procmailrc manpage
> -  	 $self->get('Envelope-To'),		# exim
> -	 $self->get('X-Delivered-To'),		# procmail quick start
> -	 $self->get('X-Original-To'),		# procmail quick start
> -	 $self->get('X-Rcpt-To'),		# procmail quick start
> -	 $self->get('X-Real-To'),		# procmail quick start
> -	 $self->get('Cc')));			# std
> +    # TODO: 4.0 use :first:addr ? Why so many headers ?
> +    @addrs = (
> +      @rcvdaddrs,
> +      $self->get('To:addr'),                   # std
> +      $self->get('Apparently-To:addr'),        # sendmail, from envelope
> +      $self->get('Delivered-To:addr'),         # Postfix, poss qmail
> +      $self->get('Envelope-Recipients:addr'),  # qmail: new-inject(1)
> +      $self->get('Apparently-Resent-To:addr'), # procmailrc manpage
> +      $self->get('X-Envelope-To:addr'),        # procmailrc manpage
> +      $self->get('Envelope-To:addr'),          # exim
> +      $self->get('X-Delivered-To:addr'),       # procmail quick start
> +      $self->get('X-Original-To:addr'),        # procmail quick start
> +      $self->get('X-Rcpt-To:addr'),            # procmail quick start
> +      $self->get('X-Real-To:addr'),            # procmail quick start
> +      $self->get('Cc:addr'));                  # std
>      # those are taken from various sources; thanks to Nancy McGough, who
>      # noted some in <http://www.ii.com/internet/robots/procmail/qs/#envelope>
>    }
>  
> -  dbg("eval: all '*To' addrs: " . join(" ", @addrs));
> -  $self->{all_to_addrs} = \@addrs;
> -  return @addrs;
> +  my %seen;
> +  my @result = grep { !$seen{$_}++ } @addrs;
> +
> +  dbg("eval: all '*To' addrs: " . join(" ", @result));
> +  $self->{all_to_addrs} = \@result;
> +  return @result;
>  
>  # http://www.cs.tut.fi/~jkorpela/headers.html is useful here, also
>  # http://www.exim.org/pipermail/exim-users/Week-of-Mon-20001009/021672.html
> 
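The PerMsgStatus.pm hunks above rely on two list-context idioms: a sub that returns a list or a joined scalar depending on wantarray, and a list slice such as (...)[0] or (...)[0 .. 2] to pick values from a list-returning call. A minimal sketch of both follows. get_values() is a hypothetical stand-in written for this example, not the real $pms->get().

```perl
use strict;
use warnings;

# Hypothetical stand-in for a header getter; NOT the real $pms->get().
# List context: one element per value.  Scalar context: values joined
# with newlines, or undef when there are none.
sub get_values {
  my @values = @_;
  return @values if wantarray;
  my $joined = join("\n", @values);
  return $joined eq '' ? undef : $joined;
}

# A list slice forces list context and picks the first value (or undef).
my $first = (get_values('a@example.com', 'b@example.com'))[0];
my $none  = (get_values())[0];

# Asking for three values when only two exist leaves an undef entry,
# which is why the Received loop in the diff checks defined().
my @top3 = (get_values('rcvd1', 'rcvd2'))[0 .. 2];
my @usable = grep { defined } @top3;

# Scalar context still gives the newline-joined form.
my $scalar = get_values('a@example.com', 'b@example.com');
```

Callers that only ever want one value can keep using the (...)[0] form, which behaves the same whether the header is missing, single or repeated.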
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/Bayes.pm Fri Apr 30 18:17:51 2021
> @@ -1561,10 +1561,12 @@ sub _pre_chew_addr_header {
>    my ($self, $val) = @_;
>    local ($_);
>  
> -  my @addrs = $self->{main}->find_all_addrs_in_line ($val);
> +  my @addrs = Mail::SpamAssassin::Util::parse_header_addresses($val);
>    my @toks;
> -  foreach (@addrs) {
> -    push (@toks, $self->_tokenize_mail_addrs ($_));
> +  foreach my $addr (@addrs) {
> +    if (defined $addr->{address}) {
> +      push @toks, $self->_tokenize_mail_addrs($addr->{address});
> +    }
>    }
>    return join (' ', @toks);
>  }
> 
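For plugin authors, the Bayes.pm change shows the calling pattern: the parser returns one hash per entry, and the address key can be undef when only a name was found. A small sketch of consuming such a list follows. The records are hand-written sample data using the key names from the Util.pm part of this commit; nothing here calls the real parser.

```perl
use strict;
use warnings;

# Hand-written sample records; the keys mirror those pushed by the
# parser in this commit, the values are invented for illustration.
my @parsed = (
  { phrase => 'Alice', address => 'alice@example.com',
    user => 'alice', host => 'example.com',
    comment => undef, invalid => 0 },
  { phrase => 'no address here', address => undef,
    user => undef, host => undef,
    comment => undef, invalid => 1 },
);

# Keep only entries that actually carry an address.
my @addresses = map { $_->{address} }
                grep { defined $_->{address} } @parsed;
```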
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/FreeMail.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/FreeMail.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/FreeMail.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/FreeMail.pm Fri Apr 30 18:17:51 2021
> @@ -464,13 +464,13 @@ sub check_freemail_header {
>          $re = $rec;
>      }
>  
> -    my @emails = map (lc, $pms->{main}->find_all_addrs_in_line ($pms->get($header)));
> +    my @emails = map (lc, $pms->get("$header:addr"));
>  
>      if (!scalar (@emails)) {
>           dbg("header $header not found from mail");
>           return 0;
>      }
> -    dbg("addresses from header $header: ".join(';',@emails));
> +    dbg("addresses from header $header: ".join(', ', @emails));
>  
>      foreach my $email (@emails) {    
>          if ($self->_is_freemail($email, $pms)) {
> @@ -592,24 +592,33 @@ sub check_freemail_replyto {
>  
>      # Skip mailing-list etc looking requests, mostly FPs from them
>      if ($pms->{main}->{conf}->{freemail_skip_bulk_envfrom}) {
> -        my $envfrom = lc($pms->get("EnvelopeFrom"));
> -        if ($envfrom =~ $skip_replyto_envfrom) {
> +        my $envfrom = ($pms->get("EnvelopeFrom"))[0];
> +        if (defined $envfrom && $envfrom =~ $skip_replyto_envfrom) {
>              dbg("envelope sender looks bulk, skipping check: $envfrom");
>              return 0;
>          }
>      }
>  
> -    my $from = lc($pms->get("From:addr"));
> -    my $replyto = lc($pms->get("Reply-To:addr"));
> -    my $from_is_fm = $self->_is_freemail($from, $pms);
> -    my $replyto_is_fm = $self->_is_freemail($replyto, $pms);
> +    my @from_addrs = map (lc, $pms->get("From:addr"));
> +    dbg("From address: ".join(", ", @from_addrs)) if @from_addrs;
>  
> -    dbg("From address: $from") if $from ne '';
> -    dbg("Reply-To address: $replyto") if $replyto ne '';
> +    my @replyto_addrs = map (lc, $pms->get("Reply-To:addr"));
> +    dbg("Reply-To address: ".join(", ", @replyto_addrs)) if @replyto_addrs;
>  
> -    if ($from_is_fm and $replyto_is_fm and ($from ne $replyto)) {
> +    my $from_is_fm = grep { $self->_is_freemail($_, $pms) } @from_addrs;
> +    my $replyto_is_fm = grep { $self->_is_freemail($_, $pms) } @replyto_addrs;
> +
> +    my $from_not_in_replyto = 1;
> +    foreach my $from (@from_addrs) {
> +        next unless grep { $_ eq $from } @replyto_addrs;
> +        $from_not_in_replyto = 0;
> +    }
> +
> +    if ($from_is_fm and $replyto_is_fm and $from_not_in_replyto) {
>          dbg("HIT! From and Reply-To are different freemails");
> -        $self->_got_hit($pms, "$from, $replyto", "From and Reply-To are different freemails");
> +        my $from = join(",", @from_addrs);
> +        my $replyto = join(",", @replyto_addrs);
> +        $self->_got_hit($pms, "$from -> $replyto", "From and Reply-To are different freemails");
>          return 0;
>      }
>  
> @@ -620,7 +629,7 @@ sub check_freemail_replyto {
>          }
>      }
>      elsif ($what eq 'reply') {
> -        if ($replyto ne '' and !$replyto_is_fm) {
> +        if (@replyto_addrs and !$replyto_is_fm) {
>              dbg("Reply-To defined and is not freemail, skipping check");
>              return 0;
>          }
> @@ -629,19 +638,21 @@ sub check_freemail_replyto {
>              return 0;
>          }
>      }
> -    my $reply = $replyto_is_fm ? $replyto : $from;
>  
>      return 0 unless $self->_parse_body($pms);
> -    
> +
>      # Compare body to headers
>      if (scalar keys %{$pms->{freemail_cache}{body}}) {
> -        my $check = $what eq 'replyto' ? $replyto : $reply;
> -        dbg("comparing $check to body freemails");
> -        foreach my $email (keys %{$pms->{freemail_cache}{body}}) {
> -            if ($email ne $check) {
> -                dbg("HIT! $check and $email are different freemails");
> -                $self->_got_hit($pms, "$check, $email", "Different freemails in reply header and body");
> -                return 0;
> +        my $reply_addrs = $what eq 'replyto' ? \@replyto_addrs :
> +                              $replyto_is_fm ? \@replyto_addrs : \@from_addrs;
> +        dbg("comparing to body freemails: ".join(", ", @$reply_addrs));
> +        foreach my $body_email (keys %{$pms->{freemail_cache}{body}}) {
> +            foreach my $reply_email (@$reply_addrs) {
> +                if ($body_email ne $reply_email) {
> +                    dbg("HIT! $reply_email (Reply) and $body_email (Body) are different freemails");
> +                    $self->_got_hit($pms, "$reply_email, $body_email", "Different freemails in reply header and body");
> +                    return 0;
> +                }
>              }
>          }
>      }
> 
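The From versus Reply-To comparison in check_freemail_replyto now works on lists. As a rough standalone sketch of that logic: is_freemail() and the domain table below are hypothetical simplifications (the plugin consults its configured freemail domains), and different_freemails() is a name made up for this example.

```perl
use strict;
use warnings;

# Hypothetical freemail test; the real plugin uses configured domains.
my %freemail_domains = map { $_ => 1 } qw(gmail.com yahoo.com);

sub is_freemail {
  my ($addr) = @_;
  return ($addr =~ /\@([^\@]+)\z/ && $freemail_domains{lc $1}) ? 1 : 0;
}

# True when both headers carry at least one freemail address and no
# From address also appears in Reply-To.
sub different_freemails {
  my ($from_addrs, $replyto_addrs) = @_;
  my @from    = map { lc $_ } @$from_addrs;
  my @replyto = map { lc $_ } @$replyto_addrs;

  # grep in scalar context returns the number of matching elements
  my $from_is_fm    = grep { is_freemail($_) } @from;
  my $replyto_is_fm = grep { is_freemail($_) } @replyto;

  my $from_not_in_replyto = 1;
  foreach my $from (@from) {
    $from_not_in_replyto = 0 if grep { $_ eq $from } @replyto;
  }

  return ($from_is_fm && $replyto_is_fm && $from_not_in_replyto) ? 1 : 0;
}
```

Note that a single shared address between the two headers is enough to suppress the hit, even if other addresses differ.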
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/HeaderEval.pm Fri Apr 30 18:17:51 2021
> @@ -89,7 +89,7 @@ sub check_for_fake_aol_relay_in_rcvd {
>    local ($_);
>  
>    $_ = $pms->get('Received');
> -  s/\s/ /gs;
> +  s/\s+/ /gs;
>  
>    # this is the hostname format used by AOL for their relays. Spammers love 
>    # forging it.  Don't make it more specific to match aol.com only, though --
> @@ -125,16 +125,13 @@ sub check_for_faraway_charset_in_headers
>    return 0 if grep { $_ eq "all" } @locales;
>  
>    for my $h (qw(From Subject)) {
> -    my @hdrs = $pms->get("$h:raw");  # ??? get() returns a scalar ???
> -    if ($#hdrs >= 0) {
> -      $hdr = join(" ", @hdrs);
> -    } else {
> -      $hdr = '';
> -    }
> -    while ($hdr =~ /=\?(.+?)\?.\?.*?\?=/g) {
> -      Mail::SpamAssassin::Locales::is_charset_ok_for_locales($1, @locales)
> -	  or return 1;
> -    }
> +    my @hdrs = $pms->get("$h:raw");
> +    foreach my $hdr (@hdrs) {
> +      while ($hdr =~ /=\?(.+?)\?.\?.*?\?=/g) {
> +        Mail::SpamAssassin::Locales::is_charset_ok_for_locales($1, @locales)
> +          or return 1;
> +      }
> +    }
>    }
>    0;
>  }
> @@ -145,35 +142,35 @@ sub check_for_unique_subject_id {
>    $_ = lc $pms->get('Subject');
>  
>    my $id = 0;
> -  if (/[-_\.\s]{7,}([-a-z0-9]{4,})$/
> -	|| /\s{10,}(?:\S\s)?(\S+)$/
> -	|| /\s{3,}[-:\#\(\[]+([-a-z0-9]{4,})[\]\)]+$/
> -	|| /\s{3,}[:\#\(\[]*([a-f0-9]{4,})[\]\)]*$/
> -	|| /\s{3,}[-:\#]([a-z0-9]{5,})$/
> -	|| /[\s._]{3,}([^0\s._]\d{3,})$/
> -	|| /[\s._]{3,}\[(\S+)\]$/
> +  if (/[-_\.\s]{7,}([-a-z0-9]{4,})$/m
> +	|| /\s{10,}(?:\S\s)?(\S+)$/m
> +	|| /\s{3,}[-:\#\(\[]+([-a-z0-9]{4,})[\]\)]+$/m
> +	|| /\s{3,}[:\#\(\[]*([a-f0-9]{4,})[\]\)]*$/m
> +	|| /\s{3,}[-:\#]([a-z0-9]{5,})$/m
> +	|| /[\s._]{3,}([^0\s._]\d{3,})$/m
> +	|| /[\s._]{3,}\[(\S+)\]$/m
>  
>          # (7217vPhZ0-478TLdy5829qicU9-0@26) and similar
> -        || /\(([-\w]{7,}\@\d+)\)$/
> +        || /\(([-\w]{7,}\@\d+)\)$/m
>  
>          # Seven or more digits at the end of a subject is almost certainly a id
> -        || /\b(\d{7,})\s*$/
> +        || /\b(\d{7,})\s*$/m
>  
>          # stuff at end of line after "!" or "?" is usually an id
> -        || /[!\?]\s*(\d{4,}|\w+(-\w+)+)\s*$/
> +        || /[!\?]\s*(\d{4,}|\w+(-\w+)+)\s*$/m
>  
>          # 9095IPZK7-095wsvp8715rJgY8-286-28 and similar
>  	# excluding 'Re:', etc and the first word
> -        || /(?:\w{2,3}:\s)?\w+\s+(\w{7,}-\w{7,}(-\w+)*)\s*$/
> +        || /(?:\w{2,3}:\s)?\w+\s+(\w{7,}-\w{7,}(-\w+)*)\s*$/m
>  
>          # #30D7 and similar
> -        || /\s#\s*([a-f0-9]{4,})\s*$/
> +        || /\s#\s*([a-f0-9]{4,})\s*$/m
>       )
>    {
>      $id = $1;
>      # exempt online purchases
>      if ($id =~ /\d{5,}/
> -	&& /(?:item|invoice|order|number|confirmation).{1,6}\Q$id\E\s*$/)
> +	&& /(?:item|invoice|order|number|confirmation).{1,6}\Q$id\E\s*$/m)
>      {
>        $id = 0;
>      }
> @@ -270,7 +267,7 @@ sub check_illegal_chars {
>  
>    $header .= ":raw" unless $header =~ /:raw$/;
>    my $str = $pms->get($header);
> -  return 0 if !defined $str || $str eq '';
> +  return 0 if !defined $str || $str !~ /\S/;
>  
>    if ($str =~ tr/\x00-\x7F//c && is_valid_utf_8($str)) {
>      # is non-ASCII and is valid UTF-8
> @@ -304,12 +301,12 @@ sub gated_through_received_hdr_remover {
>    my ($self, $pms) = @_;
>  
>    my $txt = $pms->get("Mailing-List",undef);
> -  if (defined $txt && $txt =~ /^contact \S+\@\S+\; run by ezmlm$/) {
> +  if (defined $txt && $txt =~ /^contact \S+\@\S+\; run by ezmlm$/m) {
>      my $dlto = $pms->get("Delivered-To");
>      my $rcvd = $pms->get("Received");
>  
>      # ensure we have other indicative headers too
> -    if ($dlto =~ /^mailing list \S+\@\S+/ &&
> +    if ($dlto =~ /^mailing list \S+\@\S+/m &&
>          $rcvd =~ /qmail \d+ invoked (?:from network|by .{3,20})\); \d+ ... \d+/)
>      {
>        return 1;
> @@ -647,10 +644,9 @@ sub _check_recipients {
>    my @inputs;
>  
>    # ToCc: pseudo-header works best, but sometimes Bcc: is better
> -  for ('ToCc', 'Bcc') {
> -    my $to = $pms->get($_);	# get recipients
> -    $to =~ s/\(.*?\)//g;	# strip out the (comments)
> -    push(@inputs, ($to =~ m/([\w.=-]+\@\w+(?:[\w.-]+\.)+\w+)/g));
> +  for ('ToCc:addr', 'Bcc:addr') {
> +    my @to = $pms->get($_);	# get recipients
> +    push @inputs, @to;
>      last if scalar(@inputs) >= TOCC_SIMILAR_COUNT;
>    }
>  
> 
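The /m flags added in HeaderEval.pm matter once a scalar-context get() can hand back several values joined by newlines: without /m, $ only matches at the end of the whole string. A short demonstration with an invented two-value subject string:

```perl
use strict;
use warnings;

# Two Subject values joined by a newline, as scalar context is
# described to return them.  The text itself is made up.
my $subjects = "first subject 1234567\nsecond subject";

# Without /m, $ matches only at the very end of the string (or before
# a final newline), so the id on the first line is missed.
my ($id_plain) = $subjects =~ /\b(\d{7,})\s*$/;

# With /m, $ also matches before each embedded newline.
my ($id_multi) = $subjects =~ /\b(\d{7,})\s*$/m;
```

Here $id_plain stays undefined, while $id_multi captures the digits at the end of the first line.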
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/SPF.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/SPF.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/SPF.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Plugin/SPF.pm Fri Apr 30 18:17:51 2021
> @@ -381,7 +381,7 @@ sub _check_spf {
>      $scanner->{checked_for_received_spf_header} = 1;
>      dbg("spf: checking to see if the message has a Received-SPF header that we can use");
>  
> -    my @internal_hdrs = split("\n", $scanner->get('ALL-INTERNAL'));
> +    my @internal_hdrs = $scanner->get('ALL-INTERNAL');
>      unless ($scanner->{conf}->{use_newest_received_spf_header}) {
>        # look for the LAST (earliest in time) header, it'll be the most accurate
>        @internal_hdrs = reverse(@internal_hdrs);
> @@ -728,7 +728,7 @@ sub _get_sender {
>        # from the Return-Path, X-Envelope-From, or whatever header.
>        # it's better to get it from Received though, as that is updated
>        # hop-by-hop.
> -      my $sender = $scanner->get("EnvelopeFrom:addr");
> +      my $sender = ($scanner->get("EnvelopeFrom:addr"))[0];
>        if (defined $sender) {
>          dbg("spf: found EnvelopeFrom '$sender' from header");
>          $scanner->{spf_sender} = lc $sender;
> 
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Util.pm Fri Apr 30 18:17:51 2021
> @@ -49,6 +49,7 @@ require 5.008001;  # needs utf8::is_utf8
>  
>  use Mail::SpamAssassin::Logger;
>  
> +use version 0.77;
>  use Exporter ();
>  
>  our @ISA = qw(Exporter);
> @@ -60,7 +61,7 @@ our @EXPORT_OK = qw(&local_tz &base64_de
>                    &secure_tmpdir &uri_list_canonicalize &get_my_locales
>                    &parse_rfc822_date &idn_to_ascii &is_valid_utf_8
>                    &get_user_groups &compile_regexp &qr_to_string
> -                  &is_fqdn_valid);
> +                  &is_fqdn_valid &parse_header_addresses);
>  
>  our $AM_TAINTED;
>  
> @@ -2334,6 +2335,330 @@ sub get_tag_value_for_score {
>  
>  ###########################################################################
>  
> +# RFC 5322 (+IDN?) parsing of addresses and names from To/From/Cc.. headers
> +#
> +# Return array of hashes, containing at minimum name,address,user,host
> +#
> +# Override parser with SA_HEADER_ADDRESS_PARSER environment variable
> +
> +our $header_address_parser;
> +our $email_address_xs;
> +our $email_address_xs_fix_address;
> +BEGIN {
> +  # SA_HEADER_ADDRESS_PARSER=1 only use internal parser
> +  # SA_HEADER_ADDRESS_PARSER=2 only use Email::Address::XS
> +  # By default internal is preferred, will defer for some cases
> +  $header_address_parser = untaint_var($ENV{'SA_HEADER_ADDRESS_PARSER'});
> +  if ((!defined $header_address_parser || $header_address_parser eq '2') &&
> +       eval 'use Email::Address::XS; 1;') {
> +    $email_address_xs = 1;
> +    if (version->parse(Email::Address::XS->VERSION) < version->parse(1.02)) {
> +      $email_address_xs_fix_address = 1;
> +    }
> +  }
> +}
> +
> +# Helper for internal parser
> +our $header_address_mailre = qr/
> +  # user
> +  (?:
> +    # quoted localpart
> +    " (?:|(?:[^"\\]++|\\.)*+) " |
> +    # or un-quoted localpart
> +    [^\@\s\<\>\(\)\[\]\,\:\;]+
> +  )
> +  # domain
> +  \@ (?: [^\"\s\<\>\(\)\[\]\,\:\;]+ | \[ [\d:.]+ \] )
> +/ix;
> +
> +# Very relaxed internal parser
> +# Only handles non-nested comments in some places
> +our $header_address_re = qr/^
> +  \s*
> +  (?:
> +    # optional phrase, quoted or non-quoted
> +    (?:
> +      ( (?: " (?:|(?:[^"\\]++|\\.)*+) " | [^",;<]++ )+ )
> +      \s*
> +    )?
> +    # and enclosed email (or empty)
> +    # ... allow whitespace in localpart
> +    < \s* ( [^>\@]* \S+ | \s* ) \s* >
> +    # some output duplicate enclosures..
> +    (?: \s* < \s* (?: (?: " (?:|(?:[^"\\]++|\\.)*+) " )? \S+ | \s* ) \s* > )*
> +  |
> +    # or standalone email or phrase
> +    (?:
> +      ( $header_address_mailre ) |
> +      ( (?: " (?:|(?:[^"\\]++|\\.)*+) " | [^",;<]++ )+ )
> +    )
> +  )
> +  # possible comment after (no nested support here)
> +  (?: \s* \( ( (?:|(?:[^()\\]++|\\.)*+) ) \) )?
> +  # Followed by comma (semi-colon sometimes) or finish
> +  \s* (?: [,;] | \z )
> +/ix;
> +
> +#
> +# Main public function
> +# expected input is header contents without Header: itself
> +#
> +sub parse_header_addresses {
> +  my ($str) = @_;
> +
> +  return if !defined $str || $str !~ /\S/;
> +
> +  my @results;
> +
> +  # Internal parser
> +  if (!$header_address_parser || $header_address_parser eq '1') {
> +    @results = _parse_header_addresses($str);
> +  }
> +
> +  # Email::Address::XS
> +  if ($email_address_xs) {
> +    if (!$header_address_parser || $header_address_parser eq '2') {
> +      # Only consulted if there are no internal results, or there don't
> +      # seem to be enough results, or possible nested comments ( (
> +      my $maybe_nested = ($str =~ tr/(//) >= 2;
> +      if (!@results || $maybe_nested || @results < scalar($str =~ tr/,//)+1) {
> +        my @results_xs = _parse_header_addresses_xs($str);
> +        # If we have more results than internal, use it, or nested
> +        if (@results_xs > @results || $maybe_nested) {
> +          return @results_xs;
> +        }
> +      }
> +    }
> +  }
> +
> +  return @results;
> +}
> +
> +# Check some basic parsing mistakes
> +sub _valid_parsed_address {
> +  return 0 if !defined $_[0];
> +  return 0 if index($_[0], '""@') == 0;
> +  return 0 if scalar($_[0] =~ tr/"//) == 1;
> +  return 1;
> +}
> +
> +#
> +# v0.1, improved internal parser, no support for comments in strange
> +# places or nested comments, but handled a large corpus at least 99% the
> +# same as Email::Address::XS and in some cases even better (retains some
> +# more name/addr info, even when not fully valid).
> +#
> +sub _parse_header_addresses {
> +  local $_ = shift;
> +  local ($1, $2, $3, $4, $5);
> +
> +  # Clear trailing whitespace
> +  s/\s+\z//s;
> +
> +  # Strip away all escaped backslashes, simplifies processing a lot
> +  s/\\\\//g;
> +
> +  # Reduce group address
> +  s/^[^"()<>]+:\s*(.*?)\s*(?:;.*)?/$1/gs;
> +
> +  # Skip empty
> +  return unless /\S/;
> +
> +  my @results;
> +  while (s/$header_address_re//igs) {
> +    my $phrase = defined $1 ? $1 :
> +                 defined $4 ? $4 : undef;
> +    my $address = defined $2 ? $2 :
> +                defined $3 ? $3 : undef;
> +    my $comment = defined $5 ? $5 : undef;
> +
> +    my ($user, $host, $invalid);
> +
> +    # Check relaxed <> capture
> +    if (defined $2) {
> +      # Remove comments (no nested support here)
> +      $address =~ s/\((?:|(?:[^()\\]++|\\.)*+)\)//gs;
> +      # Validate as somewhat email looking
> +      if ($address !~ /^$header_address_mailre$/) {
> +        $address = undef;
> +      }
> +    }
> +
> +    # Validate some other address oddities
> +    if (!_valid_parsed_address($address)) {
> +      $address = undef;
> +    }
> +
> +    if (defined $phrase) {
> +      my $newphrase;
> +      # Parse phrase as quoted and unquoted parts
> +      while ($phrase =~ /(?:"(|(?:[^"\\]++|\\.)*+)"|([^"]++))/igs) {
> +        my $qs = $1;
> +        my $nqs = $2;
> +        if (defined $qs) {
> +          # Unescape things inside quoted string
> +          $qs =~ s/\\(?!\\)//g;
> +          $qs =~ s/\\\\/\\/g;
> +          #$qs =~ s/\\//g;
> +          $newphrase .= $qs;
> +        } else {
> +          # Remove comments (no nested support here)
> +          $nqs =~ s/\((?:|(?:[^()\\]++|\\.)*+)\)//gs;
> +          $newphrase .= $nqs;
> +        }
> +      }
> +      $phrase = $newphrase;
> +
> +      # If we only have a phrase that looks like an email, swap when valid
> +      # Check all in one if, either swap or don't
> +      if (!defined $address &&
> +          $phrase =~ /^$header_address_mailre$/i &&
> +          _valid_parsed_address($phrase) &&
> +          $phrase =~ /^[^\@]*\@([^\@]*)/ &&
> +          is_fqdn_valid(idn_to_ascii($1), 1)) {
> +        $address = $phrase;
> +        $phrase = undef;
> +      } else {
> +        # Remove redundant phrase==email?
> +        if (defined $address && $phrase eq $address) {
> +          $phrase = undef;
> +        } elsif ($phrase eq '') {
> +          $phrase = undef;
> +        }
> +      }
> +    }
> +
> +    # Copy comment to phrase if not defined
> +    if (!defined $phrase && defined $comment) {
> +      $phrase = $comment;
> +    }
> +
> +    if (defined $address) {
> +      # Unescape quoted localpart
> +      #if ($address =~ /^"(.*?)"\@(.*)/) {
> +      #  $user = $1;
> +      #  $host = $2;
> +      #  $user =~ s/\\//g;
> +      #  $user =~ s/\s+//gs;
> +      #  $address = "$user\@$host";
> +      #}
> +      # Strip sometimes seen quotes
> +      #$address =~ s/^'(.*?)'$/$1/;
> +      $address =~ s/^(([^\@]*)\@([^\@]*)).*/$1/;
> +      ($user, $host) = ($2, $3);
> +    }
> +
> +    $invalid = !defined $host || !is_fqdn_valid(idn_to_ascii($host), 1);
> +    push @results, {
> +      'phrase' => $phrase,
> +      'user' => $user,
> +      'host' => $host,
> +      'address' => $address,
> +      'comment' => $comment,
> +      'invalid' => $invalid
> +    };
> +  }
> +
> +  # Was something left unparsed?
> +  if (index($_, '@') != -1) {
> +    # Last ditch effort, examples:
> +    # =?UTF-8?Q?"Foobar"_<no...@foobar.com>?=
> +    # =?utf-8?Q?"Foobar"?=<in...@mlsend.com>
> +    while (/<($header_address_mailre)>/igs) {
> +      my $address = $1;
> +      next if !_valid_parsed_address($address);
> +      $address =~ s/^(([^\@]*)\@([^\@]*)).*/$1/;
> +      my ($user, $host) = ($2, $3);
> +      my $invalid = !is_fqdn_valid(idn_to_ascii($host), 1);
> +      push @results, {
> +        'phrase' => undef,
> +        'user' => $user,
> +        'host' => $host,
> +        'address' => $address,
> +        'comment' => undef,
> +        'invalid' => $invalid
> +      };
> +    }
> +  }
> +
> +  return if !@results;
> +  return @results;
> +}
> +
> +sub _parse_header_addresses_xs {
> +  my ($str) = @_;
> +
> +  # Strip away all escaped backslashes, simplifies processing a lot
> +  $str =~ s/\\\\//g;
> +
> +  my @results;
> +  my @addrs = Email::Address::XS->parse($str);
> +
> +  local ($1, $2);
> +  foreach my $addr (@addrs) {
> +    my $name = $addr->name;
> +    my $address = $addr->address;
> +    my $user = $addr->user;
> +    my $host = $addr->host;
> +    my $phrase = $addr->phrase;
> +    my $comment = $addr->comment;
> +    my $invalid;
> +
> +    # Workaround Bug 5201 for Email::Address::XS
> +    # From: "joe+foobar@example.com"
> +    # If everything else is missing but phrase looks like
> +    # an email, let's assume it is (hostname verifies)
> +    if (!defined $address && !defined $user &&
> +        !defined $comment && defined $phrase &&
> +        _valid_parsed_address($phrase) &&
> +        $phrase =~ /^([^\s\@]+)\@([^\s\@]+)$/ &&
> +        is_fqdn_valid(idn_to_ascii($2), 1))
> +    {
> +      $user = $1;
> +      $host = $2;
> +      $address = $phrase;
> +      $name = $user;
> +      $invalid = 0;
> +      $phrase = undef;
> +    }
> +    else {
> +      $invalid = !$addr->is_valid;
> +    }
> +
> +    # Version <1.02 borks address if both user+host are UTF-8
> +    if ($email_address_xs_fix_address) {
> +      if (defined $user && defined $host) {
> +        # <"Another User"@foo> loses quotes in user, add back
> +        if (index($user, ' ') != -1 &&
> +            index($user, '"') == -1) {
> +          $user = '"'.$user.'"';
> +        }
> +        $address = $user.'@'.$host;
> +      }
> +    }
> +
> +    # Copy comment to phrase if not defined
> +    if (!defined $phrase && defined $comment) {
> +      $phrase = $comment;
> +    }
> +
> +    # Use input as name if nothing found
> +    if (!defined $phrase && !defined $address) {
> +      $phrase = $str;
> +    }
> +
> +    push @results, {
> +      'phrase' => $phrase,
> +      'user' => $user,
> +      'host' => $host,
> +      'address' => $address,
> +      'comment' => $comment,
> +      'invalid' => $invalid
> +    };
> +  }
> +
> +  return @results;
> +}
>  
>  1;
>  
> 
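The comments in parse_header_addresses() describe when Email::Address::XS is consulted: no internal results, at least two opening parentheses (possible nested comments), or fewer results than the comma count suggests. A standalone sketch of that decision follows. should_try_fallback() is a hypothetical helper named for this example, and counting characters with tr/// is a choice of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a second parser is worth trying,
# given the raw header value and how many results the first one gave.
sub should_try_fallback {
  my ($str, $n_results) = @_;

  # tr/// with an empty replacement list only counts, it does not modify
  my $commas = ($str =~ tr/,//);
  my $parens = ($str =~ tr/(//);

  return 1 if $n_results == 0;           # nothing parsed at all
  return 1 if $parens >= 2;              # possibly nested comments
  return 1 if $n_results < $commas + 1;  # fewer results than list items
  return 0;
}
```

The comma count is only a heuristic, since a quoted display name such as "Doe, John" also contains a comma; an unnecessary second parse is the worst outcome of a false positive.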
> Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm (original)
> +++ spamassassin/trunk/lib/Mail/SpamAssassin/Util/DependencyInfo.pm Fri Apr 30 18:17:51 2021
> @@ -302,6 +302,13 @@ our @OPTIONAL_MODULES = (
>    desc => 'IO::String emulates file interface for in-core strings.
>    It is used by the optional OLEVBMacro Plugin.',
>  },
> +{
> +  module => 'Email::Address::XS',
> +  version => 0,
> +  desc => 'Email::Address::XS is used to parse email addresses from header
> +  fields like To/From/cc, per RFC 5322. If installed, it may additionally
> +  be used by internal parser to process complex lists.',
> +},
>  );
>  
>  our @BINARIES = ();
> 
> Modified: spamassassin/trunk/t/SATest.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/SATest.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/SATest.pm (original)
> +++ spamassassin/trunk/t/SATest.pm Fri Apr 30 18:17:51 2021
> @@ -68,6 +68,7 @@ BEGIN {
>    # Fix INC to point to built SA
>    if (-e 't/test_dir') { unshift(@INC, 'blib/lib'); }
>    elsif (-e 'test_dir') { unshift(@INC, '../blib/lib'); }
> +  else { die "FATAL: not in or below test directory?\n"; }
>  }
>  
>  # Set up for testing. Exports (as global vars):
> 
> Modified: spamassassin/trunk/t/data/Dumpheaders.pm
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/data/Dumpheaders.pm?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/data/Dumpheaders.pm (original)
> +++ spamassassin/trunk/t/data/Dumpheaders.pm Fri Apr 30 18:17:51 2021
> @@ -16,29 +16,81 @@ sub check_end {
>    my ($self, $opts) = @_;
>  
>    local $_;
> -  $_ = $opts->{permsgstatus}->get("ALL:raw");
> -  s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
>  
>    # ignore the M:SpamAssassin:compile() test message
> -  return if /I need to make this message body somewhat long so TextCat preloads/;
> -  print STDOUT "text-all-raw: $_\n";
> +  return if $self->{linting};
> +  #return if /I need to make this message body somewhat long so TextCat preloads/;
> +
> +  ## pre-4.0 scalar context calls
> +
> +  $_ = $opts->{permsgstatus}->get("ALL:raw");
> +  s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
> +  print STDOUT "scalar-text-all-raw: $_"."[END]\n";
>  
>    $_ = $opts->{permsgstatus}->get("ALL");
>    s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
> -  print STDOUT "text-all-noraw: $_\n";
> +  print STDOUT "scalar-text-all-noraw: $_"."[END]\n";
>  
>    $_ = $opts->{permsgstatus}->get("From:raw");
>    s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
> -  print STDOUT "text-from-raw: $_\n";
> +  print STDOUT "scalar-text-from-raw: $_"."[END]\n";
>  
>    $_ = $opts->{permsgstatus}->get("From");
>    s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
> -  print STDOUT "text-from-noraw: $_\n";
> +  print STDOUT "scalar-text-from-noraw: $_"."[END]\n";
>  
>    $_ = $opts->{permsgstatus}->get("From:addr");
>    s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs;
> -  print STDOUT "text-from-addr: $_\n";
> +  print STDOUT "scalar-text-from-addr: $_"."[END]\n";
> +
> +  ## 4.0 list context tests
> +
> +  my @l;
> +  my $s;
> +
> +  @l = $opts->{permsgstatus}->get("ALL:raw");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-all-raw: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("ALL");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-all-noraw: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("From:raw");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-from-raw: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("From");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-from-noraw: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("From:addr");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-from-addr: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("From:first:addr");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-from-first-addr: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("From:last:addr");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-from-last-addr: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("MESSAGEID:host");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-msgid-host: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("MESSAGEID:domain");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-msgid-domain: ".join("[LIST]", @l)."[END]\n";
> +
> +  @l = $opts->{permsgstatus}->get("Received:ip");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-received-ip: ".join("[LIST]", @l)."[END]\n";
>  
> +  @l = $opts->{permsgstatus}->get("Received:revip");
> +  foreach (@l) { s/\n/[\\n]/gs; s/\t/[\\t]/gs; s/\n+//gs; }
> +  print STDOUT "list-text-received-revip: ".join("[LIST]", @l)."[END]\n";
>  }
>  
>  1;
> 
> Modified: spamassassin/trunk/t/data/nice/unicode1
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/data/nice/unicode1?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/data/nice/unicode1 (original)
> +++ spamassassin/trunk/t/data/nice/unicode1 Fri Apr 30 18:17:51 2021
> @@ -6,7 +6,7 @@ Received: from mail-ig0-x248.esempio-uni
>    by Sörensen.example.com (Postfix) with UTF8SMTPS
>    for <Dörte@Sörensen.example.com>; Thu,  8 Oct 2015 07:45:14 +0200 (CEST)
>  From: =?ISO-8859-1?Q?Maril=F9?= Gioffré ♥ <Marilù.Gioffré@esempio-università.it>
> -To: =?iso-8859-1*sv?Q?D=F6rte_=C5._S=F6rensen,_Jr.?=
> +To: =?iso-8859-1*sv?Q?D=F6rte_=C5._S=F6rensen=2C_Jr.?=
>    <Dörte@Sörensen.example.com>
>  Cc: θσερ@εχαμπλε.ψομ
>  Subject: =?iso-8859-2*sl?Q?Doma=e8e?=
> 
> Added: spamassassin/trunk/t/data/spam/freemail1
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/data/spam/freemail1?rev=1889337&view=auto
> ==============================================================================
> --- spamassassin/trunk/t/data/spam/freemail1 (added)
> +++ spamassassin/trunk/t/data/spam/freemail1 Fri Apr 30 18:17:51 2021
> @@ -0,0 +1,15 @@
> +Return-Path: <te...@gmail.com>
> +Received: from google-public-dns-a.google.com (google-public-dns-a.google.com [8.8.8.8])
> +	by in.example.com (Postfix) with ESMTPS
> +	for <te...@example.com>; Wed, 18 Jul 2018 21:12:22 +0200 (CEST)
> +Received: by google-public-dns-a.google.com with SMTP id f21-v6so3811271wmc.5
> +        for <te...@example.com>; Wed, 18 Jul 2018 12:12:22 -0700 (PDT)
> +From: <te...@gmail.com>
> +To: test@example.com
> +Reply-To: "Spammer" <an...@gmail.com>
> +Subject: Freemail test
> +Date: Wed, 18 Jul 2018 12:12:00 -0700 (PDT)
> +MIME-Version: 1.0
> +Message-Id: <20...@gmail.com>
> +
> +Freemail test
> 
> Added: spamassassin/trunk/t/data/spam/freemail2
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/data/spam/freemail2?rev=1889337&view=auto
> ==============================================================================
> --- spamassassin/trunk/t/data/spam/freemail2 (added)
> +++ spamassassin/trunk/t/data/spam/freemail2 Fri Apr 30 18:17:51 2021
> @@ -0,0 +1,15 @@
> +Return-Path: <te...@gmail.com>
> +Received: from google-public-dns-a.google.com (google-public-dns-a.google.com [8.8.8.8])
> +	by in.example.com (Postfix) with ESMTPS
> +	for <te...@example.com>; Wed, 18 Jul 2018 21:12:22 +0200 (CEST)
> +Received: by google-public-dns-a.google.com with SMTP id f21-v6so3811271wmc.5
> +        for <te...@example.com>; Wed, 18 Jul 2018 12:12:22 -0700 (PDT)
> +From: <te...@gmail.com>
> +To: test@example.com
> +Reply-To: innocent@example.com, "Spammer" <an...@gmail.com>
> +Subject: Freemail test
> +Date: Wed, 18 Jul 2018 12:12:00 -0700 (PDT)
> +MIME-Version: 1.0
> +Message-Id: <20...@gmail.com>
> +
> +Freemail test with multiple Reply-To's
> 
> Added: spamassassin/trunk/t/data/spam/freemail3
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/data/spam/freemail3?rev=1889337&view=auto
> ==============================================================================
> --- spamassassin/trunk/t/data/spam/freemail3 (added)
> +++ spamassassin/trunk/t/data/spam/freemail3 Fri Apr 30 18:17:51 2021
> @@ -0,0 +1,15 @@
> +Return-Path: <te...@gmail.com>
> +Received: from google-public-dns-a.google.com (google-public-dns-a.google.com [8.8.8.8])
> +	by in.example.com (Postfix) with ESMTPS
> +	for <te...@example.com>; Wed, 18 Jul 2018 21:12:22 +0200 (CEST)
> +Received: by google-public-dns-a.google.com with SMTP id f21-v6so3811271wmc.5
> +        for <te...@example.com>; Wed, 18 Jul 2018 12:12:22 -0700 (PDT)
> +From: <te...@gmail.com>
> +To: test@example.com
> +Subject: Freemail test
> +Date: Wed, 18 Jul 2018 12:12:00 -0700 (PDT)
> +MIME-Version: 1.0
> +Message-Id: <20...@gmail.com>
> +
> +Freemail test with body email
> +another1@gmail.com
> 
> Modified: spamassassin/trunk/t/freemail.t
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/freemail.t?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/freemail.t (original)
> +++ spamassassin/trunk/t/freemail.t Fri Apr 30 18:17:51 2021
> @@ -5,19 +5,46 @@ use SATest; sa_t_init("freemail");
>  
>  use Test::More;
>  
> -plan tests => 4;
> +plan tests => 23;
>  
>  # ---------------------------------------------------------------------------
>  
> +# Global
>  tstprefs ("
>    freemail_domains gmail.com
> +");
> +
> +## Standard + whitelist should not hit
> +
> +tstlocalrules (q{
>    freemail_import_whitelist_auth 0
> -  whitelist_auth test\@gmail.com
> +  whitelist_auth test@gmail.com
>    header FREEMAIL_FROM eval:check_freemail_from()
> -");
> +  score FREEMAIL_FROM 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_ENVFROM_END_DIGIT  eval:check_freemail_header('EnvelopeFrom', '\d@')
> +  score FREEMAIL_ENVFROM_END_DIGIT 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
>  
>  %patterns = (
> -  q{ FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +);
> +%anti_patterns = (
> +  # No Reply-To or body
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_ENVFROM_END_DIGIT }, 'FREEMAIL_ENVFROM_END_DIGIT',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
>  );
>  
>  ok sarun ("-L -t < data/spam/relayUS.eml", \&patterns_run_cb);
> @@ -28,16 +55,85 @@ clear_pattern_counters();
>  
>  %patterns = ();
>  %anti_patterns = (
> -  q{ FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
>  );
>  
> -tstprefs ("
> -  freemail_domains gmail.com
> +tstlocalrules (q{
>    freemail_import_whitelist_auth 1
> -  whitelist_auth test\@gmail.com
> +  whitelist_auth test@gmail.com
>    header FREEMAIL_FROM eval:check_freemail_from()
> -");
> +  score FREEMAIL_FROM 3.3
> +});
>  
>  ok sarun ("-L -t < data/spam/relayUS.eml", \&patterns_run_cb);
>  ok_all_patterns();
>  
> +## From and Reply-To different
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_ENVFROM_END_DIGIT }, 'FREEMAIL_ENVFROM_END_DIGIT',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_FROM eval:check_freemail_from()
> +  score FREEMAIL_FROM 3.3
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_ENVFROM_END_DIGIT  eval:check_freemail_header('EnvelopeFrom', '\d@')
> +  score FREEMAIL_ENVFROM_END_DIGIT 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail1", \&patterns_run_cb);
> +ok_all_patterns();
> +
> +## Multiple Reply-To values, no email on body
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail2", \&patterns_run_cb);
> +ok_all_patterns();
> +
> +## No Reply-To, another freemail in body
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail3", \&patterns_run_cb);
> +ok_all_patterns();
> +
> 
> Modified: spamassassin/trunk/t/freemail_welcome_block.t
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/freemail_welcome_block.t?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/freemail_welcome_block.t (original)
> +++ spamassassin/trunk/t/freemail_welcome_block.t Fri Apr 30 18:17:51 2021
> @@ -1,23 +1,50 @@
>  #!/usr/bin/perl -T
>  
>  use lib '.'; use lib 't';
> -use SATest; sa_t_init("freemail_welcome_block");
> +use SATest; sa_t_init("freemail");
>  
>  use Test::More;
>  
> -plan tests => 4;
> +plan tests => 23;
>  
>  # ---------------------------------------------------------------------------
>  
> +# Global
>  tstprefs ("
>    freemail_domains gmail.com
> +");
> +
> +## Standard + welcomelist should not hit
> +
> +tstlocalrules (q{
>    freemail_import_welcomelist_auth 0
> -  welcomelist_auth test\@gmail.com
> +  welcomelist_auth test@gmail.com
>    header FREEMAIL_FROM eval:check_freemail_from()
> -");
> +  score FREEMAIL_FROM 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_ENVFROM_END_DIGIT  eval:check_freemail_header('EnvelopeFrom', '\d@')
> +  score FREEMAIL_ENVFROM_END_DIGIT 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
>  
>  %patterns = (
> -  q{ FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +);
> +%anti_patterns = (
> +  # No Reply-To or body
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_ENVFROM_END_DIGIT }, 'FREEMAIL_ENVFROM_END_DIGIT',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
>  );
>  
>  ok sarun ("-L -t < data/spam/relayUS.eml", \&patterns_run_cb);
> @@ -28,16 +55,85 @@ clear_pattern_counters();
>  
>  %patterns = ();
>  %anti_patterns = (
> -  q{ FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
>  );
>  
> -tstlocalrules ("
> -  freemail_domains gmail.com
> +tstlocalrules (q{
>    freemail_import_welcomelist_auth 1
> -  welcomelist_auth test\@gmail.com
> +  welcomelist_auth test@gmail.com
>    header FREEMAIL_FROM eval:check_freemail_from()
> -");
> +  score FREEMAIL_FROM 3.3
> +});
>  
>  ok sarun ("-L -t < data/spam/relayUS.eml", \&patterns_run_cb);
>  ok_all_patterns();
>  
> +## From and Reply-To different
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_FROM }, 'FREEMAIL_FROM',
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_ENVFROM_END_DIGIT }, 'FREEMAIL_ENVFROM_END_DIGIT',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_FROM eval:check_freemail_from()
> +  score FREEMAIL_FROM 3.3
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_ENVFROM_END_DIGIT  eval:check_freemail_header('EnvelopeFrom', '\d@')
> +  score FREEMAIL_ENVFROM_END_DIGIT 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail1", \&patterns_run_cb);
> +ok_all_patterns();
> +
> +## Multiple Reply-To values, no email on body
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_REPLYTO }, 'FREEMAIL_REPLYTO',
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +  q{ 3.3 FREEMAIL_REPLYTO_END_DIGIT }, 'FREEMAIL_REPLYTO_END_DIGIT',
> +  q{ 3.3 FREEMAIL_HDR_REPLYTO }, 'FREEMAIL_HDR_REPLYTO',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_REPLYTO eval:check_freemail_replyto('replyto')
> +  score FREEMAIL_REPLYTO 3.3
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +  header FREEMAIL_REPLYTO_END_DIGIT  eval:check_freemail_header('Reply-To', '\d@')
> +  score FREEMAIL_REPLYTO_END_DIGIT 3.3
> +  header FREEMAIL_HDR_REPLYTO eval:check_freemail_header('Reply-To')
> +  score FREEMAIL_HDR_REPLYTO 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail2", \&patterns_run_cb);
> +ok_all_patterns();
> +
> +## No Reply-To, another freemail in body
> +
> +%patterns = (
> +  q{ 3.3 FREEMAIL_REPLYXX }, 'FREEMAIL_REPLYXX',
> +);
> +%anti_patterns = ();
> +
> +tstlocalrules (q{
> +  header FREEMAIL_REPLYXX eval:check_freemail_replyto('reply')
> +  score FREEMAIL_REPLYXX 3.3
> +});
> +
> +ok sarun ("-L -t < data/spam/freemail3", \&patterns_run_cb);
> +ok_all_patterns();
> +
> 
> Modified: spamassassin/trunk/t/get_all_headers.t
> URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/get_all_headers.t?rev=1889337&r1=1889336&r2=1889337&view=diff
> ==============================================================================
> --- spamassassin/trunk/t/get_all_headers.t (original)
> +++ spamassassin/trunk/t/get_all_headers.t Fri Apr 30 18:17:51 2021
> @@ -2,14 +2,34 @@
>  
>  use lib '.'; use lib 't';
>  use SATest; sa_t_init("get_all_headers");
> -use Test::More tests => 5;
> +use Test::More;
> +
> +use constant HAS_EMAIL_ADDRESS_XS => eval { require Email::Address::XS; };
> +
> +$tests = 19;
> +$tests += 19 if (HAS_EMAIL_ADDRESS_XS);
> +plan tests => $tests;
>  
>  # ---------------------------------------------------------------------------
>  
>  %patterns = (
> -  q{ MIME-Version: 1.0 } => 'no-extra-space',
> -  q{/text-all-raw: Received: from yahoo\.com\[\\\\n\]    \(PPPa33-ResaleLosAngelesMetroB2-2R7452\.dialinx\.net \[4\.48\.136\.190\]\) by\[\\\\n\]    www\.goabroad\.com\.cn \(8\.9\.3/8\.9\.3\) with SMTP id TAA96146; Thu,\[\\\\n\]    30 Aug 2001 19:06:45 \+0800 \(CST\) \(envelope-from\[\\\\n\]    pertand\@email\.mondolink\.com\)\[\\\\n\]From  :<tst1\@example\.com>\[\\\\n\]X-Mailer: Mozilla 4\.04 \[en\]C-bls40  \(Win95; U\)\[\\\\n\]To: jenny33436\@netscape\.net\[\\\\n\]Subject: via\.gra\[\\\\n\]From:\[\\\\t\]  <tst2\@example\.com>\[\\\\n\]DATE: Fri, 7 Dec 2001 07:01:03\[\\\\n\]MIME-Version: 1\.0\[\\\\n\]Message-Id: <20011206235802\.4FD6F1143D6\@mail\.netnoteinc\.com>\[\\\\n\]Sender: travelincentives\@aol\.com\[\\\\n\]Content-Type: text/plain; charset="us-ascii"\[\\\\n\]/} => 'full-headers-raw',
> -  q{/text-all-noraw: Received: from yahoo\\.com \\(PPPa33-ResaleLosAngelesMetroB2-2R7452\\.dialinx\\.net \\[4\\.48\\.136\\.190\\]\\) by www\\.goabroad\\.com\\.cn \\(8\\.9\\.3/8\\.9\\.3\\) with SMTP id TAA96146; Thu, 30 Aug 2001 19:06:45 \\+0800 \\(CST\\) \\(envelope-from pertand\\@email\\.mondolink\\.com\\)\[\\\\n\]From: <tst1\\@example\\.com>\[\\\\n\]X-Mailer: Mozilla 4\\.04 \\[en\\]C-bls40  \\(Win95; U\\)\[\\\\n\]To: jenny33436\\@netscape\\.net\[\\\\n\]Subject: via\\.gra\[\\\\n\]From: <tst2\\@example\\.com>\[\\\\n\]DATE: Fri, 7 Dec 2001 07:01:03\[\\\\n\]MIME-Version: 1\\.0\[\\\\n\]Message-Id: <20011206235802\\.4FD6F1143D6\\@mail\\.netnoteinc\\.com>\[\\\\n\]Sender: travelincentives\\@aol\\.com\[\\\\n\]Content-Type: text/plain; charset="us-ascii"\[\\\\n\]/} => 'full-headers-noraw',
> +  q{'MIME-Version: 1.0'} => 'no-extra-space',
> +  q{'scalar-text-all-raw: Received: from yahoo.com[\n]    (PPPa33-ResaleLosAngelesMetroB2-2R7452.dialinx.net [4.48.136.190]) by[\n]    www.goabroad.com.cn (8.9.3/8.9.3) with SMTP id TAA96146; Thu,[\n]    30 Aug 2001 19:06:45 +0800 (CST) (envelope-from[\n]    pertand@email.mondolink.com)[\n]From  :<ts...@example.com>[\n]X-Mailer: Mozilla 4.04 [en]C-bls40  (Win95; U)[\n]To: jenny33436@netscape.net[\n]Subject: via.gra[\n]From:[\t]  <ts...@example.com>[\n]DATE: Fri, 7 Dec 2001 07:01:03[\n]MIME-Version: 1.0[\n]Message-Id: <20...@mail.netnoteinc.com>[\n]Sender: travelincentives@aol.com[\n]Content-Type: text/plain; charset="us-ascii"[\n][END]'} => 'scalar-text-all-raw',
> +  q{'scalar-text-all-noraw: Received: from yahoo.com (PPPa33-ResaleLosAngelesMetroB2-2R7452.dialinx.net [4.48.136.190]) by www.goabroad.com.cn (8.9.3/8.9.3) with SMTP id TAA96146; Thu, 30 Aug 2001 19:06:45 +0800 (CST) (envelope-from pertand@email.mondolink.com)[\n]From: <ts...@example.com>[\n]X-Mailer: Mozilla 4.04 [en]C-bls40  (Win95; U)[\n]To: jenny33436@netscape.net[\n]Subject: via.gra[\n]From: <ts...@example.com>[\n]DATE: Fri, 7 Dec 2001 07:01:03[\n]MIME-Version: 1.0[\n]Message-Id: <20...@mail.netnoteinc.com>[\n]Sender: travelincentives@aol.com[\n]Content-Type: text/plain; charset="us-ascii"[\n][END]'} => 'scalar-text-all-noraw',
> +  q{'scalar-text-from-raw: <ts...@example.com>[\n][\t]  <ts...@example.com>[\n][END]'} => 'scalar-text-from-raw',
> +  q{'scalar-text-from-noraw: <ts...@example.com>[\n][END]'} => 'scalar-text-from-noraw',
> +  q{'scalar-text-from-addr: tst1@example.com[END]'} => 'scalar-text-from-addr',
> +  q{'list-text-all-raw: Received: from yahoo.com[\n]    (PPPa33-ResaleLosAngelesMetroB2-2R7452.dialinx.net [4.48.136.190]) by[\n]    www.goabroad.com.cn (8.9.3/8.9.3) with SMTP id TAA96146; Thu,[\n]    30 Aug 2001 19:06:45 +0800 (CST) (envelope-from[\n]    pertand@email.mondolink.com)[\n][LIST]From  :<ts...@example.com>[\n][LIST]X-Mailer: Mozilla 4.04 [en]C-bls40  (Win95; U)[\n][LIST]To: jenny33436@netscape.net[\n][LIST]Subject: via.gra[\n][LIST]From:[\t]  <ts...@example.com>[\n][LIST]DATE: Fri, 7 Dec 2001 07:01:03[\n][LIST]MIME-Version: 1.0[\n][LIST]Message-Id: <20...@mail.netnoteinc.com>[\n][LIST]Sender: travelincentives@aol.com[\n][LIST]Content-Type: text/plain; charset="us-ascii"[\n][END]'} => 'list-text-all-raw',
> +  q{'list-text-all-noraw: Received: from yahoo.com (PPPa33-ResaleLosAngelesMetroB2-2R7452.dialinx.net [4.48.136.190]) by www.goabroad.com.cn (8.9.3/8.9.3) with SMTP id TAA96146; Thu, 30 Aug 2001 19:06:45 +0800 (CST) (envelope-from pertand@email.mondolink.com)[\n][LIST]From: <ts...@example.com>[\n][LIST]X-Mailer: Mozilla 4.04 [en]C-bls40  (Win95; U)[\n][LIST]To: jenny33436@netscape.net[\n][LIST]Subject: via.gra[\n][LIST]From: <ts...@example.com>[\n][LIST]DATE: Fri, 7 Dec 2001 07:01:03[\n][LIST]MIME-Version: 1.0[\n][LIST]Message-Id: <20...@mail.netnoteinc.com>[\n][LIST]Sender: travelincentives@aol.com[\n][LIST]Content-Type: text/plain; charset="us-ascii"[\n][END]'} => 'list-text-all-noraw',
> +  q{'list-text-from-raw: <ts...@example.com>[\n][LIST][\t]  <ts...@example.com>[\n][END]'} => 'list-text-from-raw',
> +  q{'list-text-from-noraw: <ts...@example.com>[\n][END]'} => 'list-text-from-noraw',
> +  q{'list-text-from-addr: tst1@example.com[LIST]tst2@example.com[END]'} => 'list-text-from-addr',
> +  q{'list-text-from-first-addr: tst1@example.com[END]'} => 'list-text-from-first-addr',
> +  q{'list-text-from-last-addr: tst2@example.com[END]'} => 'list-text-from-last-addr',
> +  q{'list-text-msgid-host: mail.netnoteinc.com[END]'} => 'list-text-msgid-host',
> +  q{'list-text-msgid-domain: netnoteinc.com[END]'} => 'list-text-msgid-domain',
> +  q{'list-text-received-ip: 4.48.136.190[END]'} => 'list-text-received-ip',
> +  q{'list-text-received-revip: 190.136.48.4[END]'} => 'list-text-received-revip',
>  );
>  
>  %anti_patterns = (
> @@ -20,6 +40,15 @@ tstprefs ("
>    loadplugin Dumpheaders ../../../data/Dumpheaders.pm
>  ");
>  
> +# Internal parser
> +$ENV{'SA_HEADER_ADDRESS_PARSER'} = 1;
>  ok (sarun ("-L -t < data/spam/008", \&patterns_run_cb));
>  ok_all_patterns();
>  
> +if (HAS_EMAIL_ADDRESS_XS) {
> +  # Email::Address::XS
> +  $ENV{'SA_HEADER_ADDRESS_PARSER'} = 2;
> +  ok (sarun ("-L -t < data/spam/008", \&patterns_run_cb));
> +  ok_all_patterns();
> +} else { warn "Not running Email::Address::XS tests, module missing\n"; }
> +
>
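
The list-context versus scalar-context calling convention described in the
commit log can be illustrated with a small stand-alone sketch.  The
get_values sub below is a hypothetical stand-in, not the actual $pms->get()
implementation; it only shows the convention, using Perl's wantarray.  Note
that the test expectations in the diff above show From:addr yielding a single
address in scalar context, so individual modifiers may behave differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a header accessor; NOT the real $pms->get().
# Returns every value in list context and a newline-joined string in
# scalar context, following the convention described in the commit log.
sub get_values {
  my @values = @_;
  return wantarray ? @values : join("\n", @values);
}

my @addrs = ('tst1@example.com', 'tst2@example.com');

# List context: one element per address (the recommended style).
my @list = get_values(@addrs);

# Scalar context: values newline separated (the older calling style).
my $scalar = get_values(@addrs);

print "list:   ", join("[LIST]", @list), "\n";
print "scalar: ", join("[\\n]", split(/\n/, $scalar)), "\n";
```

Callers that only want the first value can still take it from the list, for
example with my ($first) = get_values(@addrs);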

Re: header address parser changeset committed

Posted by "Kevin A. McGrail" <km...@apache.org>.

I'll get a fire under the wlbl changes.  Glad we are moving towards a 4.0
release.    -KAM

On Sun, May 2, 2021, 04:32 Giovanni Bechis <gi...@paclan.it> wrote:

> On Sat, May 01, 2021 at 04:57:21PM +0300, Henrik K wrote:
> > On Sat, May 01, 2021 at 06:38:38AM -0700, John Hardin wrote:
> > >
> > > Would we be in the same boat, though? When the RE gets compiled, would that
> > > freeze the value of the variable at that time in the RE rather than
> > > interpolating it at execution?
> >
> > Yes, this would have to be implemented dynamically along with all the other
> > problems that might come with rule dependencies etc (also the whole meta
> > rules logic needs to be separated from priorities, Bug 7735).  Should be
> > good if it's designed properly instead of trying to hack up replacetags..
> >
> > I don't think I have stamina for that right now, but I'd like to see 4.0 get
> > going asap.  I don't think there's anything major lacking now that address
> > parser works too.
> >
> I'd like to see 4.0 out asap as well, I do not know the status of white^Wwelcomelist
> and if it may be a blocker.
>
>  Giovanni
>

Re: header address parser changeset committed

Posted by Giovanni Bechis <gi...@paclan.it>.

On Sat, May 01, 2021 at 04:57:21PM +0300, Henrik K wrote:
> On Sat, May 01, 2021 at 06:38:38AM -0700, John Hardin wrote:
> > 
> > Would we be in the same boat, though? When the RE gets compiled, would that
> > freeze the value of the variable at that time in the RE rather than
> > interpolating it at execution?
> 
> Yes, this would have to be implemented dynamically along with all the other
> problems that might come with rule dependencies etc (also the whole meta
> rules logic needs to be separated from priorities, Bug 7735).  Should be
> good if it's designed properly instead of trying to hack up replacetags..
> 
> I don't think I have stamina for that right now, but I'd like to see 4.0 get
> going asap.  I don't think there's anything major lacking now that address
> parser works too.
> 
I'd like to see 4.0 out asap as well, I do not know the status of white^Wwelcomelist
and if it may be a blocker.

 Giovanni

Re: header address parser changeset committed

Posted by Henrik K <he...@hege.li>.

On Sat, May 01, 2021 at 06:38:38AM -0700, John Hardin wrote:
> 
> Would we be in the same boat, though? When the RE gets compiled, would that
> freeze the value of the variable at that time in the RE rather than
> interpolating it at execution?

Yes, this would have to be implemented dynamically along with all the other
problems that might come with rule dependencies etc (also the whole meta
rules logic needs to be separated from priorities, Bug 7735).  Should be
good if it's designed properly instead of trying to hack up replacetags..

I don't think I have stamina for that right now, but I'd like to see 4.0 get
going asap.  I don't think there's anything major lacking now that address
parser works too.
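
A minimal sketch of the "dynamic" approach, assuming a hypothetical pair of
rules where the first captures a value and the second builds its pattern only
when it runs.  None of the sub names below exist in SpamAssassin; they only
illustrate evaluating the dependent regex per message instead of at config
parse time.

```perl
use strict;
use warnings;

# Hypothetical first rule: capture the local part of a To address.
sub capture_to_localpart {
  my ($to_header) = @_;
  return $to_header =~ /<?([^\s\@<]+)\@/ ? $1 : undef;
}

# Hypothetical dependent rule: the pattern is built when the rule runs,
# so it sees the value captured from the current message.
sub body_mentions {
  my ($captured, $body) = @_;
  return 0 unless defined $captured;   # dependency did not hit
  return $body =~ /\b\Q$captured\E\b/i ? 1 : 0;
}

for my $msg (
  { to => 'alice@example.com', body => 'Dear alice, you have won' },
  { to => '<bob@example.com>', body => 'Dear alice, you have won' },
) {
  my $captured = capture_to_localpart($msg->{to});
  printf "%s => %d\n", $captured, body_mentions($captured, $msg->{body});
}
```

The cost of this design is that the dependent pattern is recompiled whenever
the captured value changes, which is the trade-off against rules that are
compiled once at startup.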

Re: header address parser changeset committed

Posted by John Hardin <jh...@impsec.org>.

On Sat, 1 May 2021, Henrik K wrote:

> On Fri, Apr 30, 2021 at 10:58:09PM +0300, Henrik K wrote:
>> On Fri, Apr 30, 2021 at 09:36:00PM +0300, Henrik K wrote:
>>> On Fri, Apr 30, 2021 at 11:30:37AM -0700, John Hardin wrote:
>>>>
>>>> Generating a RE fragment that would match on any of the extracted to/cc
>>>> header email addresses would probably be fairly easy as part of this. But
>>>> how would we incorporate that fragment in rules?
>>>
>>> Probably impossible to implement for normal regex rules, because they are
>>> all compiled at start..  it would require modifying them on the fly by
>>> replacing some tag or so.
>>>
>>> Designing a plugin for it would be much easier and more versatile..
>>
>> Ok thinking about it more, Replacetags already has the base functionality..
>> could extend it a bit more to replace static metadata like To addresses,
>> which is already parsed and known before any rules run.  I'll give it a
>> try.
>
> Yeah it gets too complicated.  The compiled regexes live for the whole
> duration of the process, which can process many messages.  If you change the
> regex any later than the config parse stage, then it would require some new
> logic to always restore the original regex when a new message is processed,
> and yet again replace it.

Rats. A plugin does sound better.

> Doing all that for some static header field data seems very inflexible
> too.  What if you want to match some other string, maybe something from the
> body text?  It seems a better solution would be making regex rules able to
> capture something, and then you have some rules which would depend on that
> hitting and use the captured variable.

Yeah, that's something I've been wanting a long time as well.

> "Rules dependency/chaining" is also already a feature we need,
> mentioned for example in
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6855
>
> It wouldn't be much more of a stretch to add some capture variables to the mix.

Would we be in the same boat, though? When the RE gets compiled, would 
that freeze the value of the variable at that time in the RE rather than 
interpolating it at execution?
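
To make the question concrete, here is a minimal sketch of the two behaviours. It is written in Python rather than Perl purely for illustration, the function names are hypothetical, and it says nothing about how SpamAssassin actually compiles its rules. A pattern compiled once keeps whatever value was interpolated at that moment; matching against the current value means building the pattern per message.

```python
import re

def compile_once(recipient):
    """Bake the recipient into the pattern at compile time (hypothetical helper)."""
    return re.compile(re.escape(recipient), re.IGNORECASE)

def match_per_message(recipient, text):
    """Build the pattern from the current message's recipient at match time."""
    return re.search(re.escape(recipient), text, re.IGNORECASE) is not None

# Compiled while the first message (to alice) was being processed.
frozen = compile_once("alice@example.com")

# A later message is addressed to bob; the frozen pattern still only knows alice.
body = "Please verify bob@example.com now"
print(frozen.search(body) is not None)
print(match_per_message("bob@example.com", body))
```

The cost of the second form is a pattern build for every message, which is the trade-off behind the question.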

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The yardstick you should use when considering whether to support a
   given piece of legislation is "what if my worst enemy is chosen to
   administer this law?"
-----------------------------------------------------------------------
  Today: May Day - Remember 110 million people murdered by Communism

Re: header address parser changeset committed

Posted by Henrik K <he...@hege.li>.

On Fri, Apr 30, 2021 at 10:58:09PM +0300, Henrik K wrote:
> On Fri, Apr 30, 2021 at 09:36:00PM +0300, Henrik K wrote:
> > On Fri, Apr 30, 2021 at 11:30:37AM -0700, John Hardin wrote:
> > > 
> > > Generating a RE fragment that would match on any of the extracted to/cc
> > > header email addresses would probably be fairly easy as part of this. But
> > > how would we incorporate that fragment in rules?
> > 
> > Probably impossible to implement for normal regex rules, because they are
> > all compiled at start..  it would require modifying them on the fly by
> > replacing some tag or so.
> > 
> > Designing a plugin for it would be much easier and more versatile..
> 
> Ok thinking about it more, ReplaceTags already has the base functionality..
> could extend it a bit more to replace static metadata like To addresses,
> which is already parsed and known before any rules run.  I'll give it a
> try.

Yeah it gets too complicated.  The compiled regexes live for the whole
duration of the process, which can process many messages.  If you change the
regex any later than the config parse stage, then it would require some new
logic to always restore the original regex when a new message is processed,
and then replace it yet again.

Doing all that for some static header field data seems very inflexible
too.  What if you want to match some other string, maybe something from the
body text?  It seems a better solution would be making regex rules able to
capture something, and then you have some rules which would have dependency
on that hitting and use the captured variable.

"Rules dependency/chaining" is also already a feature we need,
mentioned for example in
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=6855

Wouldn't be much more of a stretch to add some capture variables to the mix.
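
As a rough illustration of the capture-plus-dependency idea, here is a minimal sketch in Python. Everything in it is an assumption made for the example: the rule names, the %{NAME} placeholder syntax, and the storage.example.com URL are invented, and none of it reflects existing SpamAssassin rule syntax. One rule captures a value from the headers; a dependent rule only runs if that capture exists, and gets the captured value substituted in per message.

```python
import re

# Hypothetical capturing rule: grabs the first address on the To: line.
CAPTURE_RULES = {
    "TO_ADDR": re.compile(r"^To:.*?(?P<TO_ADDR>[^\s<>]+@[^\s<>]+)", re.M | re.I),
}

# Hypothetical dependent rule: %{TO_ADDR} is filled in for each message.
DEPENDENT_RULES = {
    "BODY_HAS_TO_ADDR": r"storage\.example\.com/\S*%{TO_ADDR}",
}

PLACEHOLDER = re.compile(r"%\{(\w+)\}")

def run_rules(headers, body):
    """Return the set of rule names that hit for one message."""
    captured = {}
    for name, rx in CAPTURE_RULES.items():
        m = rx.search(headers)
        if m:
            captured[name] = m.group(name)
    hits = set(captured)
    for name, template in DEPENDENT_RULES.items():
        needed = PLACEHOLDER.findall(template)
        # Dependency check: skip the rule unless every capture it needs exists.
        if not all(n in captured for n in needed):
            continue
        # Escape the captured text so it is matched literally, not as regex.
        pattern = PLACEHOLDER.sub(lambda m: re.escape(captured[m.group(1)]), template)
        if re.search(pattern, body, re.I):
            hits.add(name)
    return hits
```

Because the substituted pattern is built per message, nothing compiled at config parse time has to be modified or restored afterwards.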

Re: header address parser changeset committed

Posted by Henrik K <he...@hege.li>.

On Fri, Apr 30, 2021 at 09:36:00PM +0300, Henrik K wrote:
> On Fri, Apr 30, 2021 at 11:30:37AM -0700, John Hardin wrote:
> > 
> > Generating a RE fragment that would match on any of the extracted to/cc
> > header email addresses would probably be fairly easy as part of this. But
> > how would we incorporate that fragment in rules?
> 
> Probably impossible to implement for normal regex rules, because they are
> all compiled at start..  it would require modifying them on the fly by
> replacing some tag or so.
> 
> Designing a plugin for it would be much easier and more versatile..

Ok thinking about it more, ReplaceTags already has the base functionality..
could extend it a bit more to replace static metadata like To addresses,
which is already parsed and known before any rules run.  I'll give it a
try.

Re: header address parser changeset committed

Posted by Henrik K <he...@hege.li>.

On Fri, Apr 30, 2021 at 11:30:37AM -0700, John Hardin wrote:
> 
> Generating a RE fragment that would match on any of the extracted to/cc
> header email addresses would probably be fairly easy as part of this. But
> how would we incorporate that fragment in rules?

Probably impossible to implement for normal regex rules, because they are
all compiled at start..  it would require modifying them on the fly by
replacing some tag or so.

Designing a plugin for it would be much easier and more versatile..

Re: header address parser changeset committed

Posted by John Hardin <jh...@impsec.org>.

On Fri, 30 Apr 2021, Henrik K wrote:

>
> Please note the large changeset and have a try.  I've been tweaking it all
> week, should be good for general use.
>
>
> On Fri, Apr 30, 2021 at 06:17:51PM -0000, hege@apache.org wrote:
>> Author: hege
>> Date: Fri Apr 30 18:17:51 2021
>> New Revision: 1889337
>>
>> URL: http://svn.apache.org/viewvc?rev=1889337&view=rev
>> Log:
>> - Improved internal header address (From/To/Cc) parser, now also handles
>>   multiple addresses.  Optional support for external Email::Address::XS
>>   parser, which can handle nested comments and other oddities.

Ooo...

While you're digging around in that maybe this would be something that 
could naturally fall out of it:

For a long time I've been wishing for some rule RE syntax where we could 
interpolate the recipient email address into rules without writing a FULL 
rule that tries to match it in the headers and then in the body - for 
example, the recent google storage URI with recipient's email address 
phishing rule I added which uses a very loose generic email format match 
would benefit from an explicit match on the recipient email address.

Generating a RE fragment that would match on any of the extracted to/cc 
header email addresses would probably be fairly easy as part of this. But 
how would we incorporate that fragment in rules?
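
For reference, generating such a fragment could look roughly like the sketch below. It is Python for illustration only, the function name is hypothetical, and it assumes the addresses have already been extracted from the To/Cc headers by the parser. Each address is escaped so that dots are matched literally, duplicates are dropped, and the result is a non-capturing alternation.

```python
import re

def recipient_fragment(addresses):
    """Build a regex fragment matching any of the given addresses (hypothetical helper).

    Returns None for an empty list, since an empty alternation would match anything.
    """
    unique = {a.strip().lower() for a in addresses if a.strip()}
    if not unique:
        return None
    # Longest first, so a shorter address that is a prefix of a longer one
    # cannot win the alternation early.
    ordered = sorted(unique, key=lambda a: (-len(a), a))
    return "(?:" + "|".join(re.escape(a) for a in ordered) + ")"

fragment = recipient_fragment(["Bob@Example.com", "alice@example.com", "bob@example.com"])
rx = re.compile(fragment, re.IGNORECASE)
print(bool(rx.search("https://storage.example.com/x/ALICE@example.com")))
```

How to get the fragment into a rule is still the open question; the sketch only covers the generation step.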

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Activism is a way for useless people to feel important,
   even if the consequences of their activism are counterproductive
   for those they claim to be helping and damaging
   to the fabric of society as a whole.               -- Thomas Sowell
-----------------------------------------------------------------------
  Tomorrow: May Day - Remember 110 million people murdered by Communism