Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/09/07 20:57:09 UTC

Re: SpamAssassin perceptron curiousity

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


I think Henry needs to comment on this one, since he wrote that code.
Re-aiming directly at Henry ;)

- --j.

felix@crowfix.com writes:
> (Originally posted to users@ but reposted to dev now)
> 
> I got a bit of curiosity in my brain about neural networks, and
> someone suggested I take a look at how SpamAssassin trains itself.  I
> have been looking into .../masses and have come across some things
> which set off warning bells.  I don't think I have actually found any
> bugs, but it isn't clear to me what is going on and there are some
> unused variables, so I pathetically justify my intrusion on your time
> with the thought that there *might* be a bug ... :-)
> 
> The code generated in tmp/scores.h by logs-to-c includes these three
> variables:
> 
>     ny_hit[$num_mutable]
>     yn_hit[$num_mutable]
>     lookup[$num_mutable]
> 
> which appear to never be used by either perceptron.c or any generated
> code.
> 
> It also looks like $num_mutable has almost no use; besides setting the
> size of these unused arrays, it governs the weight decay loop, which
> looks to be bypassed under default conditions.
> 
> A bit more poking shows that num_scores in perceptron.c, set from
> $size in logs-to-c, is used for all other array sizes, including the
> weights, and for all related loops, including scaling and printing the
> weights.  What puzzles me is the print loop at the end of write_weights():
> 
>   for (i = 0; i < num_scores; i++) {
>     if ( is_mutable[i] )  {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>     }
>   }
> 
> The weight decay loop operates only on the first num_mutable entries
> of the weights array, implying that it, and presumably all other
> arrays sized by num_scores, are set up with mutable scores first,
> followed by non-mutable scores.  Thus this loop could be rewritten
> like this:
> 
>   for (i = 0; i < num_scores; i++) {
>     if ( i < num_mutable )  {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>     }
>   }
> 
> or even like this:
> 
>   for (i = 0; i < num_mutable; i++) {
>     fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i], weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>   }
>   for (; i < num_scores; i++) {
>     fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>   }
> 
> Is this right?  I have been doing so much Perl recently that C is
> beginning to look funny, like reading Mark Twain after too much
> Charles Dickens.  Redundant variables set off alarm bells in my head.
> If these are redundant, that would be nice to know, and if not
> redundant, the code looks wrong.
> 
> What I am really trying to do is understand the neural network part of
> SpamAssassin, and I seem to have gotten sidetracked, as with all fun
> projects :-)  I have gotten hung up on what mutable means for the code
> in .../masses/, and it does not seem particularly clear yet.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDHzgFMJF5cimLx9ARAmNZAKCgP8JUNEYvfA++dfBhETPLwv5cOwCfZlpz
k+Qca8K/GYcRgwFMVfGBgzI=
=XE1W
-----END PGP SIGNATURE-----
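
The question above turns on one assumption: that is_mutable[i] is true
exactly when i < num_mutable, i.e. that the generated arrays list
mutable scores first. The thread does not confirm that ordering, so the
following is only a minimal sketch of how it could be checked, not the
actual perceptron.c code. The array names and format strings are taken
from the quoted loop; the array contents, the NUM_SCORES/NUM_MUTABLE
constants, the helper functions, and the identity weight_to_score() are
all made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the arrays generated into tmp/scores.h.
   All values are invented; only the names come from the quoted code. */
enum { NUM_SCORES = 4, NUM_MUTABLE = 2 };

const char *score_names[NUM_SCORES] = { "TEST_A", "TEST_B", "TEST_C", "TEST_D" };
const int is_mutable[NUM_SCORES]    = { 1, 1, 0, 0 };
const double range_lo[NUM_SCORES]   = { 0.0, -1.0, 2.5, 0.1 };
const double range_hi[NUM_SCORES]   = { 3.0,  1.0, 2.5, 0.1 };
const double weights[NUM_SCORES]    = { 1.2, -0.4, 0.0, 0.0 };

/* Assumption: the real weight_to_score() may rescale; identity is used here. */
double weight_to_score(double w) { return w; }

/* Returns 1 if flags[i] is set exactly for i < n_mutable (mutable entries first). */
int mutable_first(const int *flags, int n, int n_mutable) {
    for (int i = 0; i < n; i++) {
        if ((flags[i] != 0) != (i < n_mutable)) return 0;
    }
    return 1;
}

/* Appends one score line to buf, using the format strings from the thread. */
static size_t emit(char *buf, size_t cap, size_t used, int i, int as_mutable) {
    int n;
    if (used >= cap) return used;  /* buffer already full; skip */
    if (as_mutable) {
        n = snprintf(buf + used, cap - used, "score %-30s %2.3f # [%2.3f..%2.3f]\n",
                     score_names[i], weight_to_score(weights[i]),
                     range_lo[i], range_hi[i]);
    } else {
        n = snprintf(buf + used, cap - used, "score %-30s %2.3f # not mutable\n",
                     score_names[i], range_lo[i]);
    }
    return n < 0 ? used : used + (size_t)n;
}

/* Original form: branch on the is_mutable[] flag. */
void write_by_flag(char *buf, size_t cap) {
    size_t used = 0;
    buf[0] = '\0';
    for (int i = 0; i < NUM_SCORES; i++) used = emit(buf, cap, used, i, is_mutable[i]);
}

/* Proposed form: two loops split at NUM_MUTABLE. */
void write_by_split(char *buf, size_t cap) {
    size_t used = 0;
    int i;
    buf[0] = '\0';
    for (i = 0; i < NUM_MUTABLE; i++) used = emit(buf, cap, used, i, 1);
    for (; i < NUM_SCORES; i++)       used = emit(buf, cap, used, i, 0);
}

/* Returns 1 if both forms produce byte-identical output for the sample data. */
int outputs_match(void) {
    char a[512], b[512];
    assert(mutable_first(is_mutable, NUM_SCORES, NUM_MUTABLE)); /* precondition */
    write_by_flag(a, sizeof a);
    write_by_split(b, sizeof b);
    return strcmp(a, b) == 0;
}
```

Calling outputs_match() from a small main() compares the two loop forms
on the sample data. The mutable_first() check is the important part: if
a real scores.h interleaved mutable and immutable entries, the split
loops would mislabel some scores and the rewrite would not be safe.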


Re: SpamAssassin perceptron curiousity

Posted by Henry Stern <he...@stern.ca>.
Most of this stuff is legacy code from the craig-evolve.c days.  I
didn't modify logs-to-c's output function.  "If it ain't broke, don't
fix it."

num_mutable is the number of mutable tests (as opposed to immutable tests).

Thanks for your attention to detail.

Henry
