Posted to dev@spamassassin.apache.org by fe...@crowfix.com on 2005/09/07 20:16:45 UTC

SpamAssassin perceptron curiosity

(Originally posted to users@ but reposted to dev now)

I got a bit of curiosity in my brain about neural networks, and
someone suggested I take a look at how SpamAssassin trains itself.  I
have been looking into .../masses and come across some things which
set off warning bells.  I don't think I have actually found any bugs,
but it isn't clear to me what is going on, there are some unused
variables, and I pathetically justify my intrusion on your time with
the thought that there *might* be a bug ... :-)

The code generated in tmp/scores.h by logs-to-c includes these three
variables:

    ny_hit[$num_mutable]
    yn_hit[$num_mutable]
    lookup[$num_mutable]

which appear to never be used by either perceptron.c or any generated
code.

It also looks like $num_mutable has almost no use; besides setting the
size of these unused arrays, it governs the weight decay loop, which
looks to be bypassed under default conditions.

A bit more poking shows that num_scores in perceptron.c, set from
$size in logs-to-c, is used for all other array sizes, including the
weights, and for all related loops, including scaling and printing the
weights.  What puzzles me is the print loop at the end of write_weights():

  for (i = 0; i < num_scores; i++) {
    if ( is_mutable[i] )  {
      fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
    } else {
      fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
    }
  }

The weight decay loop operates only on the first num_mutable entries
of the weights array, implying that it, and presumably all other
arrays sized by num_scores, are set up with mutable scores first,
followed by non-mutable scores.  Thus this loop could be rewritten
like this:

  for (i = 0; i < num_scores; i++) {
    if ( i < num_mutable )  {
      fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
    } else {
      fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
    }
  }

or even like this:

  for (i = 0; i < num_mutable; i++) {
    fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
  }
  for (; i < num_scores; i++) {
    fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
  }

Is this right?  I have been doing so much Perl recently that C is
beginning to look funny, like reading Mark Twain after too much
Charles Dickens.  Redundant variables set off alarm bells in my head.
If these are redundant, that would be nice to know, and if not
redundant, the code looks wrong.
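
A quick way to test the assumption that mutable scores come first
might be a check like the sketch below.  The array contents are made
up for illustration (the real ones are generated into tmp/scores.h),
and first_ordering_mismatch is a hypothetical helper name, not
something that exists in perceptron.c.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the values generated in tmp/scores.h;
   the real contents come from logs-to-c. */
static const int num_scores = 5;
static const int num_mutable = 3;
static const int is_mutable[] = { 1, 1, 1, 0, 0 };

/* Returns -1 if is_mutable[i] agrees with (i < num_mutable) for every
   index, otherwise the first index where the two tests disagree. */
static int first_ordering_mismatch(const int *flags, int n_scores, int n_mutable)
{
  int i;
  for (i = 0; i < n_scores; i++) {
    if ((flags[i] != 0) != (i < n_mutable))
      return i;
  }
  return -1;
}

/* Prints a one-line verdict; returns 1 if the ordering holds, else 0. */
static int report_ordering(void)
{
  int bad = first_ordering_mismatch(is_mutable, num_scores, num_mutable);
  if (bad < 0) {
    printf("mutable scores come first\n");
    return 1;
  }
  printf("ordering mismatch at index %d\n", bad);
  return 0;
}
```

If a check like this returned -1 against the real generated arrays,
the is_mutable[i] test and the i < num_mutable test would pick the
same branch for every score, and the rewritten loops would print the
same output as the original.  Any other result would mean the arrays
are not ordered mutable-first and the rewrites are not equivalent.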

What I am really trying to do is understand the neural network part of
SpamAssassin and I seem to have gotten sidetracked, as with all fun
projects :-)  I have gotten hung up on what mutable means for the code
in .../masses/, and it does not seem particularly clear yet.

-- 
            ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
     Felix Finch: scarecrow repairman & rocket surgeon / felix@crowfix.com
  GPG = E987 4493 C860 246C 3B1E  6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o