Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/09/07 20:57:09 UTC
Re: SpamAssassin perceptron curiousity
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
I think Henry needs to comment on this one; he wrote that code.
Re-aiming directly at Henry ;)
- --j.
felix@crowfix.com writes:
> (Originally posted to users@ but reposted to dev now)
>
> I got a bit of curiosity in my brain about neural networks, and
> someone suggested I take a look at how SpamAssassin trains itself. I
> have been looking into .../masses and have come across some things which
> set off warning bells. I don't think I have actually found any bugs,
> but it isn't clear to me what is going on and there are some unused
> variables, so I pathetically justify my intrusion on your time with
> the thought that there *might* be a bug ... :-)
>
> The code generated in tmp/scores.h by logs-to-c includes these three
> variables:
>
>   ny_hit[$num_mutable]
>   yn_hit[$num_mutable]
>   lookup[$num_mutable]
>
> which appear to never be used by either perceptron.c or any generated
> code.
>
> It also looks like $num_mutable has almost no use; besides setting the
> size of these unused arrays, it governs the weight decay loop, which
> looks to be bypassed under default conditions.
>
> A bit more poking shows that num_scores in perceptron.c, set from
> $size in logs-to-c, is used for all other array sizes, including the
> weights, and for all related loops, including scaling and printing the
> weights. What puzzles me is the print loop at the end of write_weights():
>
>   for (i = 0; i < num_scores; i++) {
>     if ( is_mutable[i] ) {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>               weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i],
>               range_lo[i]);
>     }
>   }
>
> The weight decay loop operates only on the first num_mutable entries
> of the weights array, which implies that the weights array, and
> presumably every other array sized by num_scores, is laid out with
> mutable scores first, followed by non-mutable scores. If so, this
> loop could be rewritten like this:
>
>   for (i = 0; i < num_scores; i++) {
>     if ( i < num_mutable ) {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>               weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i],
>               range_lo[i]);
>     }
>   }
>
> or even like this:
>
>   for (i = 0; i < num_mutable; i++) {
>     fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>             weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>   }
>   for (; i < num_scores; i++) {
>     fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i],
>             range_lo[i]);
>   }
>
> Is this right? I have been doing so much Perl recently that C is
> beginning to look funny, like reading Mark Twain after too much
> Charles Dickens. Redundant variables set off alarm bells in my head.
> If these are redundant, that would be nice to know, and if not
> redundant, the code looks wrong.
>
> What I am really trying to do is understand the neural network part of
> SpamAssassin and I seem to have gotten sidetracked, as with all fun
> projects :-) I have gotten hung up on what mutable means for the code
> in .../masses/, and it does not seem particularly clear yet.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFDHzgFMJF5cimLx9ARAmNZAKCgP8JUNEYvfA++dfBhETPLwv5cOwCfZlpz
k+Qca8K/GYcRgwFMVfGBgzI=
=XE1W
-----END PGP SIGNATURE-----
Re: SpamAssassin perceptron curiousity
Posted by Henry Stern <he...@stern.ca>.
Most of this stuff is legacy code from the craig-evolve.c days. I
didn't modify logs-to-c's output function. "If it ain't broke, don't
fix it."
num_mutable is the number of mutable tests (as opposed to immutable tests).
Thanks for your attention to detail.
Henry
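The two rewrites proposed earlier in the thread match the original print
loop only if is_mutable[i] is true exactly for the first num_mutable
indices. Nothing in the thread confirms that ordering, so here is a minimal
sketch of how one might check it. The helper names and the choice of int for
the is_mutable array are assumptions made for illustration; neither function
is part of perceptron.c or logs-to-c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from perceptron.c): returns true when
 * is_mutable[i] agrees with (i < num_mutable) for every index, i.e.
 * when all mutable entries come before all non-mutable ones.
 * Assumes is_mutable is an int array where nonzero means "mutable".
 */
bool mutable_entries_come_first(const int *is_mutable, size_t num_scores,
                                size_t num_mutable)
{
    size_t i;

    /* num_mutable counts a subset of the scores, so it cannot exceed them. */
    assert(num_mutable <= num_scores);

    for (i = 0; i < num_scores; i++) {
        bool by_flag = is_mutable[i] != 0;   /* test used by the original loop */
        bool by_index = i < num_mutable;     /* test used by the rewrites */
        if (by_flag != by_index) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: counts the entries flagged as mutable. */
size_t count_mutable(const int *is_mutable, size_t num_scores)
{
    size_t i;
    size_t count = 0;

    for (i = 0; i < num_scores; i++) {
        if (is_mutable[i]) {
            count++;
        }
    }
    return count;
}
```

Calling mutable_entries_come_first(is_mutable, num_scores, num_mutable)
before write_weights() would show whether the is_mutable[i] test and the
i < num_mutable test select the same entries. If it returns false, the
rewritten loops would label some scores differently from the original.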