Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/12/27 08:28:16 UTC

[Bug 4052] New: SpamAssassin rules file: Chinese subject and body tests

http://bugzilla.spamassassin.org/show_bug.cgi?id=4052

           Summary: SpamAssassin rules file: Chinese subject and body tests
           Product: Spamassassin
           Version: unspecified
          Platform: Other
               URL: http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: qa@ccert.edu.cn
                CC: quinlan@pathname.com


Dear Colleagues,

I would like to contribute the Chinese_rules.cf (see the URL above) to 
SpamAssassin.

Chinese_rules.cf is built by a statistical method (we reported this method 
as "Statistical rules" at http://ccas.org.cn/lecture.html#3). As a result, 
Chinese_rules.cf can be updated very quickly (e.g. twice a week).

Chinese_rules.cf is built from a Chinese spam database owned by the CCERT 
anti-spam service. Because it can be updated frequently, it can catch 
very "recent" Chinese spam.

According to our statistics, about 400 email servers already use 
Chinese_rules.cf (with SpamAssassin, of course) to mark Chinese spam, and 
update it through the CCERT website. The number is increasing, so we hope 
that the name "Chinese_rules.cf" will be kept unchanged if the file is 
placed in the SpamAssassin distribution.

I would appreciate receiving any suggestions.

Best regards,

Quang-Anh Tran,
CERNET Computer Emergency Response Team (CCERT)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

------- Additional Comments From qa@ccert.edu.cn  2005-01-10 23:30 -------
> Great!  Can you please submit an Individual CLA as well?  We need that
> too.

Sure! I have submitted it by facsimile, please check.

best,
Tran




------- Additional Comments From quinlan@pathname.com  2005-01-10 00:47 -------
Subject: Re:  SpamAssassin rules file: Chinese subject and body tests

> We have faxed a signed CCLA (Duan Haixin signed) to the ASF. Please
> notify me when you receive it.

Great!  Can you please submit an Individual CLA as well?  We need that
too.





------- Additional Comments From qa@ccert.edu.cn  2005-01-04 02:31 -------
Dear Colleagues,

Thank you very much for your attention to the Chinese_rules.cf.

Yes, the recall/error rates have improved considerably since I started using 
the perceptron code (in SpamAssassin 3.x) to score the rules. (Before that I 
used the GA.)

The research group I work for (CCERT) will sign the CCLA and send it to the 
ASF soon.

I am considering your questions 1 and 2 and will answer later.

For question 3: My research group (CCERT) hopes to assign a team to work on the 
SpamAssassin project to make SpamAssassin more effective against Chinese mail. 
We hope to do two things: 1. Maintain Chinese_rules.cf; 2. Set up nightly 
mass-checks against our Chinese-ham corpus (it would be the daily working email 
of CCERT or even CERNET). To do all this, we need something like "an 
agreement" between the SpamAssassin Project and CCERT, so that we can apply 
for financial support from our university or elsewhere. Could you please 
tell me whether such a co-operation is possible?

(About CCERT: CCERT is the CERT of the China Education and Research Network 
(CERNET), and also belongs to Tsinghua University. We have the opportunity to 
manage CERNET's network resources.)

Best Regards,
Tran



------- Additional Comments From lwilton@earthlink.net  2005-01-06 17:53 -------
Subject: Re:  SpamAssassin rules file: Chinese subject and body tests

Minor thought Justin - would it be feasible to have a "utf" option that
could be set in the site config that governed how rules were handled?  Or
perhaps also an override flag in the rule flags?

My gut guess is such an option would be a horror to implement, but perhaps
it wouldn't be.  And it would probably be useful if it was feasible.





------- Additional Comments From qa@ccert.edu.cn  2005-01-08 06:36 -------
Hi,

two questions:

1. Each rule in SpamAssassin has 4 scores. How does SpamAssassin set the last 3 
scores?

2. If the last 3 scores are absent, will SpamAssassin use the first score in 
all cases?

best,
Tran
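
For reference, the SpamAssassin configuration documentation describes the four 
values on a "score" line as belonging to four "score sets", chosen by whether 
Bayes and network tests are enabled; a line with a single value uses that value 
in every case. The selection can be sketched in Python as below. The function 
name and the sample scores are hypothetical, and this is an illustration rather 
than SpamAssassin's own code.

```python
def select_score(scores, use_bayes, use_net):
    """Pick the score that applies to the current scanning mode.

    scores: list of 1 or 4 floats, as written on a `score` line.
    Score sets, in order:
      0 = Bayes off, network tests off
      1 = Bayes off, network tests on
      2 = Bayes on,  network tests off
      3 = Bayes on,  network tests on
    """
    if len(scores) == 1:
        # A single score is used for every score set.
        return scores[0]
    if len(scores) != 4:
        raise ValueError("expected 1 or 4 scores")
    index = (2 if use_bayes else 0) + (1 if use_net else 0)
    return scores[index]


# Hypothetical rule scores, for illustration only.
four = [1.5, 1.2, 0.9, 0.7]
print(select_score(four, use_bayes=False, use_net=False))  # score set 0
print(select_score(four, use_bayes=True, use_net=True))    # score set 3
print(select_score([2.0], use_bayes=True, use_net=False))  # single score
```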



------- Additional Comments From qa@ccert.edu.cn  2005-01-05 02:59 -------
> 1. In our experience, patterns which span 4 or more words, are often more
> effective at catching a small set of spam, but with very low false positive
> rates, than patterns which match only 1 or 2 words.

> Have you tried modifying the generator so that it generates longer patterns
> from the corpus?

OK, I am running experiments with different pattern lengths and will show you 
the recall/error results for each.


> 2. We have poor support for decoding between character sets (e.g. converting
> all text strings in mails to UTF-8 where possible).  Has this proved to be a
> noticeable issue for this ruleset? (Just wondering!)

Chinese_rules.cf is built to catch Chinese spam written in the GB2312 encoding 
(simplified Chinese, mainly used in mainland China).

If SpamAssassin in future converts all text strings in mails to UTF-8 before 
applying the ruleset, the current version of Chinese_rules.cf will not work. 
However, I can convert the ruleset to UTF-8 if necessary.
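
The reason is that the same characters have different byte sequences in the 
two encodings, so a pattern stored as GB2312 bytes cannot match UTF-8 text. 
A minimal Python sketch of this, using a sample two-character string chosen 
purely for illustration (it is not taken from Chinese_rules.cf):

```python
import re

# Sample string ("free of charge"); an illustrative choice, not a real rule.
text = "免费"

gb_bytes = text.encode("gb2312")    # 2 bytes per character
utf8_bytes = text.encode("utf-8")   # 3 bytes per character for these characters

# A rule pattern stored as raw GB2312 bytes.
pattern = re.compile(re.escape(gb_bytes))

matches_gb = pattern.search(b"Subject: " + gb_bytes) is not None
matches_utf8 = pattern.search(b"Subject: " + utf8_bytes) is not None

# Re-encoding the pattern itself restores the match against UTF-8 text.
converted = re.compile(re.escape(gb_bytes.decode("gb2312").encode("utf-8")))
matches_after_conversion = converted.search(b"Subject: " + utf8_bytes) is not None

print(len(gb_bytes), len(utf8_bytes))
print(matches_gb, matches_utf8, matches_after_conversion)
```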

Best,
Tran





------- Additional Comments From qa@ccert.edu.cn  2005-01-09 23:39 -------
> Would it be possible for you to sign and fax an Apache CLA so that we can
> incorporate these (or at least test them)?

We have faxed a signed CCLA (Duan Haixin signed) to the ASF. Please notify me 
when you receive it.

best,
Tran




------- Additional Comments From jm@jmason.org  2005-01-06 17:45 -------
'In future, if SpamAssassin converts all text strings in mails to UTF-8 before 
applying the ruleset, the current version of Chinese_rules.cf will not work. 
However, I can convert the ruleset to UTF-8 if necessary.'

don't worry -- I think if we were to make "body" rules suddenly change to
matching UTF-8, and not the original charset, we would have to make that a
major-release change -- as many third-party rules in non-English languages would
immediately start to fail!   It's not planned.



qa@ccert.edu.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|quinlan@pathname.com        |



------- Additional Comments From qa@ccert.edu.cn  2005-01-09 23:31 -------
> 1. In our experience, patterns which span 4 or more words, are often more
> effective at catching a small set of spam, but with very low false positive
> rates, than patterns which match only 1 or 2 words.

> Have you tried modifying the generator so that it generates longer patterns
> from the corpus?

We have finished the experiments on this issue. Versions of Chinese_rules.cf 
with different pattern lengths are at the following links:

(Note: a Chinese character is encoded as 2 bytes.)


1. Subject and Body patterns are about 4 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_4

2. Subject patterns are about 4 bytes; Body patterns are about 6 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_4_6

3. Subject patterns are about 6 bytes; Body patterns are about 8 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_6_8

4. Subject patterns are about 8 bytes; Body patterns are about 10 bytes:
http://www.ccert.edu.cn/spam/sa/Chinese_rules.cf_8_10

Our experience is that subject patterns spanning 4 or more bytes and body 
patterns spanning 6 or more bytes are a good choice.
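
To make the byte lengths concrete: with 2 bytes per character, a 4-byte 
pattern covers 2 characters and a 6-byte pattern covers 3. The Python sketch 
below extracts fixed-length character sequences that recur in spam and never 
appear in ham. It is an assumed, simplified stand-in and not the CCERT 
generator; the function names, the selection criterion and the tiny sample 
corpus are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_patterns(spam, ham, n_chars, min_spam_hits=2):
    """Patterns seen in at least min_spam_hits spam messages and in no ham."""
    spam_counts = Counter(
        gram for msg in spam for gram in set(char_ngrams(msg, n_chars))
    )
    ham_grams = {gram for msg in ham for gram in char_ngrams(msg, n_chars)}
    return sorted(
        gram for gram, count in spam_counts.items()
        if count >= min_spam_hits and gram not in ham_grams
    )


# Tiny made-up corpus, for illustration only.
spam = ["免费代开发票", "代开发票优惠"]   # invoice-selling style subjects
ham = ["明天开会", "发票已收到"]          # "meeting tomorrow", "invoice received"

two_char = candidate_patterns(spam, ham, 2)    # 4-byte patterns in GB2312
three_char = candidate_patterns(spam, ham, 3)  # 6-byte patterns in GB2312

for pattern in two_char + three_char:
    print(pattern, len(pattern.encode("gb2312")), "bytes")
```

In this toy corpus the 2-character sequence for "invoice" is rejected because 
it also occurs in ham, while the 3-character sequences containing it survive, 
which is the kind of effect longer patterns are meant to have.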

We have updated our generator (thanks to Justin for the suggestion) and I think 
the coming versions of Chinese_rules.cf will reach the following recall/error 
rates:

# Test against 20322 spam and 99689 ham
# (using only the Chinese_rules.cf)
#
#       Threshold       Spam recall     Ham error
#       0.5             92.8%           1.2%
#       1.0             90.6%           0.5%
#       1.5             89.0%           0.3%
#       2.0             86.7%           0.1%
#       2.5             84.6%           0.1%
#       3.0             82.3%           0.0%
#       3.5             80.2%           0.0%
#       4.0             78.3%           0.0%
#       4.5             76.5%           0.0%
#
# It takes 0.03 seconds to scan an email with size 2013.54 bytes (P4-2.8G CPU)
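
For readers who want to reproduce this kind of table, the two columns can be 
computed as sketched below in Python. The sketch assumes a message is flagged 
when its total score is at or above the threshold; the function name and the 
sample scores are hypothetical and are not the data behind the figures above.

```python
def recall_and_error(spam_scores, ham_scores, threshold):
    """Return (spam recall, ham error) as fractions at the given threshold.

    A message counts as flagged when its score is >= threshold.
    """
    spam_hits = sum(1 for score in spam_scores if score >= threshold)
    ham_hits = sum(1 for score in ham_scores if score >= threshold)
    return spam_hits / len(spam_scores), ham_hits / len(ham_scores)


# Made-up per-message scores, for illustration only.
spam_scores = [0.4, 1.2, 2.6, 3.1, 5.0]
ham_scores = [0.0, 0.0, 0.6, 0.2]

for threshold in (0.5, 1.0, 3.0):
    recall, error = recall_and_error(spam_scores, ham_scores, threshold)
    print(f"{threshold:.1f}\t{recall:.1%}\t{error:.1%}")
```

Raising the threshold can only lower both numbers, which matches the trend in 
the table: recall is traded away to push the ham error towards zero.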

Best,
Tran



------- Additional Comments From qa@ccert.edu.cn  2005-01-10 23:19 -------
> Great!  Can you please submit an Individual CLA as well?  We need that
> too.

Sure!
I have submitted it. Please check.

best,
Tran




------- Additional Comments From jm@jmason.org  2005-01-06 18:12 -------
yeah, I think tflags would be the best option -- it's a feature of the rule, not
of the user or the scanning host.

but no need to worry too much about it right now ;)



------- Additional Comments From jm@jmason.org  2005-01-03 11:04 -------
Hi --

these look very interesting, and I like the methodology!  (I also notice that
the recall/error rates have improved from the figures quoted in the
presentations, according to the .cf file's comments; the current figures look
very useful!)

Would it be possible for you to sign and fax an Apache CLA so that we can
incorporate these (or at least test them)?  details are at:
http://www.apache.org/licenses/#clas

OK, a few questions:

1. In our experience, patterns which span 4 or more words, are often more
effective at catching a small set of spam, but with very low false positive
rates, than patterns which match only 1 or 2 words.

Have you tried modifying the generator so that it generates longer patterns from
the corpus?  It would increase memory use in the generator, but should generate
a smaller number of more-reliable rules that can supplement the shorter rules. 
This small set of long rules would then possibly warrant higher score values
than the larger set of short rules.

2. We have poor support for decoding between character sets (e.g. converting all
text strings in mails to UTF-8 where possible).  Has this proved to be a
noticeable issue for this ruleset? (Just wondering!)

3. Our default ruleset is not very good against Chinese mail in general,
apparently missing a lot of spam and causing false positives on ham messages.  
It would be *very* useful if we could set up nightly mass-checks against a good
Chinese-ham corpus, in order to avoid future FPs.   

There are two ways to do that -- either one of the existing developers
obtains a (confidential) copy of the corpus and adds it to their
collection, if that's permissible, or your group sets up a nightly
mass-check as described here: http://wiki.apache.org/spamassassin/NightlyMassCheck

If either would be possible, that would be really great ;)

--j.



