You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@spamassassin.apache.org by ram01 <ra...@yahoo.com> on 2007/03/06 07:39:45 UTC

auto-learn learned_points

"auto-learn? no: scored as spam but learner indicated ham"
is given if  if ($learned_points < $learner_said_ham_points)    where
$learner_said_ham_points = -1.0

what exactly is learned_points
-- 
View this message in context: http://www.nabble.com/auto-learn-learned_points-tf3353775.html#a9326859
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.

Re: auto-learn learned_points

Posted by Matt Kettler <mk...@verizon.net>.

ram01 wrote:
> "auto-learn? no: scored as spam but learner indicated ham"
> is given if  if ($learned_points < $learner_said_ham_points)    where
> $learner_said_ham_points = -1.0
>
> what exactly is learned_points
>   
It is a recalculation of the message score, based on the following
changes from the normal score calculation:

1) All "userconf" tests disabled. ie: whitelist/blacklists. This is to
prevent an errant whitelist_from from poisoning the autolearning.
2) All "learning" subsystems are disabled, ie: bayes and AWL. This is to
prevent "self feedback".
3) The score set is changed, because bayes is disabled.

Re: [2] auto-learn learned_points

Posted by ram01 <ra...@yahoo.com>.

Thanks for the reply, but I think that you are referring to autolearn_points.
As computed in PerMsgStatus.pm and is used in AutoLearningThreshold.pm. They
are computed in the same function but they are not the same.  Notice that in
the get_autolearn_points autolearn_points is $score where learned points is 
      $self->{learned_points} +=
$self->{conf}->{scoreset}->[$orig_scoreset]->{$test}; which is inside a loop
and a conditional.  I am not very familiar with perl and was kind of lost in
the syntactics of the for and the if, but I assume that += means the same as
in say c/c++ so this is some kind of cumulative sum of something.  On one
run of sa-learn in debug mode I got the following numbers back:

[28135] dbg: learn: auto-learn: currently using scoreset 3, recomputing
score based on scoreset 1
[28135] dbg: learn: auto-learn: message score: 10.955, computed score for
autolearn: 12.011
[28135] dbg: learn: auto-learn? ham=12, spam=1, body-points=0,
head-points=10.813, learned-points=-2.599

so it is definitely not the same score, but what is it?

here's a snippet of AutoLearnThreshold.pm
sub autolearn_discriminator {
  my ($self, $params) = @_;

  my $scan = $params->{permsgstatus};
  my $conf = $scan->{conf};

  # Figure out min/max for autolearning.
  # Default to specified auto_learn_threshold settings
  my $min = $conf->{bayes_auto_learn_threshold_nonspam};
  my $max = $conf->{bayes_auto_learn_threshold_spam};

  # Find out what score we should consider this message to have ...
  my $score = $scan->get_autolearn_points();
  my $body_only_points = $scan->get_body_only_points();
  my $head_only_points = $scan->get_head_only_points();
  my $learned_points = $scan->get_learned_points();

  dbg("learn: auto-learn? ham=$min, spam=$max, ".
                "body-points=".$body_only_points.", ".
                "head-points=".$head_only_points.", ".
                "learned-points=".$learned_points);

  my $isspam;
  if ($score < $min) {
    $isspam = 0;
  } elsif ($score >= $max) {
    $isspam = 1;
  } else {
    dbg("learn: auto-learn? no: inside auto-learn thresholds, not considered
ham or spam");
    return;
  }

  my $learner_said_ham_points = -1.0;
  my $learner_said_spam_points = 1.0;

  if ($isspam) {
    my $required_body_points = 3;
    my $required_head_points = 3;

    if ($body_only_points < $required_body_points) {
      dbg("learn: auto-learn? no: scored as spam but too few body points (".
          $body_only_points." < ".$required_body_points.")");
      return;
    }
    if ($head_only_points < $required_head_points) {
      dbg("learn: auto-learn? no: scored as spam but too few head points (".
          $head_only_points." < ".$required_head_points.")");
      return;
    }
    if ($learned_points < $learner_said_ham_points) {
      dbg("learn: auto-learn? no: scored as spam but learner indicated ham
(".
          $learned_points." < ".$learner_said_ham_points.")");
      return;
    }

    if (!$scan->is_spam()) {
      dbg("learn: auto-learn? no: scored as ham but autolearn wanted spam");
      return;
    }

  } else {
    if ($learned_points > $learner_said_spam_points) {
      dbg("learn: auto-learn? no: scored as ham but learner indicated spam
(".
          $learned_points." > ".$learner_said_spam_points.")");
      return;
    }

    if ($scan->is_spam()) {
      dbg("learn: auto-learn? no: scored as spam but autolearn wanted ham");
      return;
    }
  }

  dbg("learn: auto-learn? yes, ".($isspam?"spam ($score > $max)":"ham
($score < $min)"));
  return $isspam;
}
****************************
here's a snippet of PerMsgStatus.pm
sub _get_autolearn_points {
  my ($self) = @_;

  return if (exists $self->{autolearn_points});
  # ensure it only gets computed once, even if we return early
  $self->{autolearn_points} = 0;

  # This function needs to use use sum($score[scoreset % 2]) not just
{score}.
  # otherwise we shift what we autolearn on and it gets really wierd.  - tvd
  my $orig_scoreset = $self->{conf}->get_score_set();
  my $new_scoreset = $orig_scoreset;
  my $scores = $self->{conf}->{scores};

  if (($orig_scoreset & 2) == 0) { # we don't need to recompute
    dbg("learn: auto-learn: currently using scoreset $orig_scoreset");
  }
  else {
    $new_scoreset = $orig_scoreset & ~2;
    dbg("learn: auto-learn: currently using scoreset $orig_scoreset,
recomputing score based on scoreset $new_scoreset");
    $scores = $self->{conf}->{scoreset}->[$new_scoreset];
  }

  my $tflags = $self->{conf}->{tflags};
  my $points = 0;

  # Just in case this function is called multiple times, clear out the
  # previous calculated values
  $self->{learned_points} = 0;
  $self->{body_only_points} = 0;
  $self->{head_only_points} = 0;

  foreach my $test (@{$self->{test_names_hit}}) {
    # According to the documentation, noautolearn, userconf, and learn
    # rules are ignored for autolearning.
    if (exists $tflags->{$test}) {
      next if $tflags->{$test} =~ /\bnoautolearn\b/;
      next if $tflags->{$test} =~ /\buserconf\b/;

      # Keep track of the learn points for an additional autolearn check.
      # Use the original scoreset since it'll be 0 in sets 0 and 1.
      if ($tflags->{$test} =~ /\blearn\b/) {
	# we're guaranteed that the score will be defined
        $self->{learned_points} +=
$self->{conf}->{scoreset}->[$orig_scoreset]->{$test};
	next;
      }
    }

    # ignore tests with 0 score in this scoreset
    next if ($scores->{$test} == 0);

    # Go ahead and add points to the proper locations
    if (!$self->{conf}->maybe_header_only ($test)) {
      $self->{body_only_points} += $scores->{$test};
    }
    if (!$self->{conf}->maybe_body_only ($test)) {
      $self->{head_only_points} += $scores->{$test};
    }

    $points += $scores->{$test};
  }

  # Figure out the final value we'll use for autolearning
  $points = (sprintf "%0.3f", $points) + 0;
  dbg("learn: auto-learn: message score: ".$self->{score}.", computed score
for autolearn: $points");

  $self->{autolearn_points} = $points;
}


-- 
View this message in context: http://www.nabble.com/auto-learn-learned_points-tf3353775.html#a9335682
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.