Posted to users@spamassassin.apache.org by Dan <a...@patnode.net> on 2006/04/30 23:14:27 UTC

Parsing DCC

This is partly about DCC and partly about regex (yes, I've ordered  
two more regex books).


First, there's the basic all or nothing output:

	X-DCC-servers-Metrics: ui1 1049; bulk Body=many Fuz1=many Fuz2=many
	X-DCC-servers-Metrics: ui1 1049; bulk Body=0 Fuz1=0 Fuz2=0

...that can be captured with basic rules:

	header DCCBODY_m ALL =~ /X-DCC-.{1,500}Body=many/i
	header DCCFUZ1_m ALL =~ /X-DCC-.{1,500}Fuz1=many/i
	header DCCFUZ2_m ALL =~ /X-DCC-.{1,500}Fuz2=many/i

1) Is capturing header output text the best way to implement DCC in SA?


Then there are variations in between 0 and many (these are actual):

	X-DCC-servers-Metrics: ui1 1049; bulk Body=0 Fuz1=0 Fuz2=1027
	X-DCC-servers-Metrics: ui1 1049; bulk Body=many Fuz1=many Fuz2=230
	X-DCC-CTc-dcc2-Metrics: ui1 1031; bulk Body=40 Fuz1=0 Fuz2=0
	X-DCC-servers-Metrics: ui1 1049; bulk Body=0 Fuz1=0 Fuz2=2
	X-DCC-servers-Metrics: ui1 1049; bulk Body=0 Fuz1=1 Fuz2=1

2) Are DCC counts lower than "many" or the 1000s worth scoring,
particularly 1s and 2s?


3) If so, is their relevance (likely ham or likely spam) linear and
segmentable into 1s, 10s, 100s and 1000s, such that this might work?

	header DCCBODY_4 ALL =~ /X-DCC-.{1,500}Body=[0-9]{4}\b/i
	header DCCFUZ1_4 ALL =~ /X-DCC-.{1,500}Fuz1=[0-9]{4}\b/i
	header DCCFUZ2_4 ALL =~ /X-DCC-.{1,500}Fuz2=[0-9]{4}\b/i

	header DCCBODY_3 ALL =~ /X-DCC-.{1,500}Body=[0-9]{3}\b/i
	header DCCFUZ1_3 ALL =~ /X-DCC-.{1,500}Fuz1=[0-9]{3}\b/i
	header DCCFUZ2_3 ALL =~ /X-DCC-.{1,500}Fuz2=[0-9]{3}\b/i

	header DCCBODY_2 ALL =~ /X-DCC-.{1,500}Body=[0-9]{2}\b/i
	header DCCFUZ1_2 ALL =~ /X-DCC-.{1,500}Fuz1=[0-9]{2}\b/i
	header DCCFUZ2_2 ALL =~ /X-DCC-.{1,500}Fuz2=[0-9]{2}\b/i

	header DCCBODY_1 ALL =~ /X-DCC-.{1,500}Body=[1-9]{1}\b/i
	header DCCFUZ1_1 ALL =~ /X-DCC-.{1,500}Fuz1=[1-9]{1}\b/i
	header DCCFUZ2_1 ALL =~ /X-DCC-.{1,500}Fuz2=[1-9]{1}\b/i
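
One possible tightening of the rules above, offered as a sketch rather than a tested ruleset. It assumes the ALL pseudo-header presents one header per line (folded headers would need extra handling). Each pattern is anchored to the start of an X-DCC-*-Metrics line so the text cannot match inside some other header, and each numeric bucket requires a non-zero leading digit. The rule names are hypothetical, and only the Body variants are shown; Fuz1 and Fuz2 would follow the same pattern:

```
# Hypothetical rule names; Body variants only.
# /m lets ^ match at the start of each header line inside ALL.
# [^\n] keeps the match on a single header line.
header DCCBODY_MANY ALL =~ /^X-DCC-[^:\n]{1,100}-Metrics:[^\n]{0,500}\bBody=many\b/mi
header DCCBODY_D4   ALL =~ /^X-DCC-[^:\n]{1,100}-Metrics:[^\n]{0,500}\bBody=[1-9][0-9]{3}\b/mi
header DCCBODY_D3   ALL =~ /^X-DCC-[^:\n]{1,100}-Metrics:[^\n]{0,500}\bBody=[1-9][0-9]{2}\b/mi
header DCCBODY_D2   ALL =~ /^X-DCC-[^:\n]{1,100}-Metrics:[^\n]{0,500}\bBody=[1-9][0-9]\b/mi
header DCCBODY_D1   ALL =~ /^X-DCC-[^:\n]{1,100}-Metrics:[^\n]{0,500}\bBody=[1-9]\b/mi
```

Because each bucket ends in \b, a value such as Body=1027 can only hit the four-digit rule, and Body=0 hits none of them.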


4) If so, is this the way to do it?

5) Are these regexes adequate for what I want and do not want to
"see", and can they be improved?

Thanks!
Dan


Re: Parsing DCC

Posted by Matt Kettler <mk...@comcast.net>.
Matt Kettler wrote:
> 1) Is capturing header output text the best way to implement DCC in SA?
>   
>
> No, using the DCC plugin that already comes with SA is the best way.
>
> Edit your v310.pre and load the dcc plugin. SA already has pre-scored
> and tested rules built in. No further work needed.
>
>   
One more note: when you load the DCC plugin, SA will actually call DCC
itself, so you can remove whatever is adding those headers.

SA will attempt to find a dccifd socket, and use that if present. If
dccifd is not running, SA will call dccproc.
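
For reference, a minimal sketch of the change described above. It assumes a stock SpamAssassin 3.1 install, where v310.pre ships with the DCC line commented out. The directory holding that file varies by system and is not stated in this thread:

```
# v310.pre
# Enable the bundled DCC plugin by uncommenting (or adding) this line:
loadplugin Mail::SpamAssassin::Plugin::DCC
```

Running "spamassassin --lint" afterwards checks that the configuration still parses. Debug output from "spamassassin -D" on a test message should indicate whether dccifd or dccproc was used.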


Re: Parsing DCC

Posted by Dan <a...@patnode.net>.
Never mind, I found the entry:


use_dcc { 0 | 1 } (default: 1)
	Whether to use DCC, if it is available.

dcc_timeout n (default: 10)
	How many seconds you wait for dcc to complete before you go on
	without the results.

dcc_body_max NUMBER
dcc_fuz1_max NUMBER
dcc_fuz2_max NUMBER
	DCC (Distributed Checksum Clearinghouse) is a system similar to
	Razor. This option sets how often a message's body/fuz1/fuz2 checksum
	must have been reported to the DCC server before SpamAssassin will
	consider the DCC check as matched.

	As nearly all DCC clients are auto-reporting these checksums you
	should set this to a relatively high value, e.g. 999999 (this is
	DCC's MANY count).

	The default is 999999 for all these options.
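
A sketch of how these options might look in local.cf. The active lines simply write out the documented defaults quoted above. The commented-out line is a hypothetical lower threshold for experimentation only, not a recommendation:

```
# local.cf
# Use DCC if it is available (documented default).
use_dcc 1
# Seconds to wait for DCC before carrying on without a result (default).
dcc_timeout 10
# Report counts required before the DCC check counts as matched.
# 999999 is DCC's "many" and is the default for all three.
dcc_body_max 999999
dcc_fuz1_max 999999
dcc_fuz2_max 999999
# Hypothetical experiment: match once Fuz2 has been reported 1000 times.
# dcc_fuz2_max 1000
```

Since these are all defaults, none of the active lines is strictly needed; they are shown only to make the thresholds visible in one place.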

Re: Parsing DCC

Posted by Dan <a...@patnode.net>.
> All that said, I can't see why you'd want to do anything else with  
> DCC.
> The FP rate on DCC, even with the defaults of 999999 for fuzz counts,
> is significant. In the SA 3.1.0 set3 mass-checks, DCC_CHECK had a S/O
> of 0.979, meaning that 2.1% of email matched by it was nonspam.

So more detail is not needed.  Is the level you're describing  
equivalent to "many"?

Dan


Re: Parsing DCC

Posted by Matt Kettler <mk...@comcast.net>.
Graham Murray wrote:
> Matt Kettler <mk...@comcast.net> writes:
>
>   
>> All that said, I can't see why you'd want to do anything else with DCC.
>> The FP rate on DCC, even with the defaults of 999999 for fuzz counts,
>> is significant. In the SA 3.1.0 set3 mass-checks, DCC_CHECK had a S/O
>> of 0.979, meaning that 2.1% of email matched by it was nonspam.
>>     
>
> Is that with using DCC 'out-of-the-box' or after whitelisting received
> mailing lists and other regular solicited bulk senders, as recommended
> by DCC?
>
>   
I do not know for sure; however, I suspect it's a mixture.

That said, the DCC whitelisting approach is really only practical for
small sites. I think it would be most appropriate for the SA mass-checks
to be based on DCC's performance without any whitelisting.

I administer SA for over 100 users, all of whom have different bulk
senders. I whitelist some of them, and also handle them on a post-FP
basis when reported, but there's no way I can keep track of all of the
thousands of legitimate bulk senders at my site.

Now picture the problems faced by someone who administers an email
system with 10,000+ users.

Since SA is really targeted at server-side use, it needs to be focused
on some of the practicalities of large-scale deployment.



Re: Parsing DCC

Posted by Graham Murray <gr...@gmurray.org.uk>.
Matt Kettler <mk...@comcast.net> writes:

> All that said, I can't see why you'd want to do anything else with DCC.
> The FP rate on DCC, even with the defaults of 999999 for fuzz counts,
> is significant. In the SA 3.1.0 set3 mass-checks, DCC_CHECK had a S/O
> of 0.979, meaning that 2.1% of email matched by it was nonspam.

Is that with using DCC 'out-of-the-box' or after whitelisting received
mailing lists and other regular solicited bulk senders, as recommended
by DCC?

Re: Parsing DCC

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
>>> 1) Is capturing header output text the best way to implement DCC in SA?
>>
>> No, using the DCC plugin that already comes with SA is the best way.
>>
>> Edit your v310.pre and load the dcc plugin. SA already has pre-scored
>> and tested rules built in. No further work needed.
>
> Excellent Matt.  Is there a way to process the various DCC outputs
> with this architecture?  Searching the "factory" configuration, this
> entry seems to handle scoring?:
>
>     ifplugin Mail::SpamAssassin::Plugin::DCC
>     score DCC_CHECK 0 1.37 0 2.17
>     endif # Mail::SpamAssassin::Plugin::DCC
>
> This looks a bit inflexible, can the plugin do more than take a single
> DCC score and assign 3 weights to the output? 

No. At this time the DCC plugin either hits or it doesn't. You can adjust
the fuzz threshold with the dcc_*_max options. See the plugin docs at:

http://spamassassin.apache.org/full/3.1.x/dist/doc/Mail_SpamAssassin_Plugin_DCC.html
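
For readers puzzling over the numbers on the score line quoted above: SpamAssassin accepts up to four scores per rule, one per score set, and picks the set according to whether network tests and Bayes are enabled. Below is an annotated copy of the same stock entry; the comments are added here and are not part of the shipped file:

```
ifplugin Mail::SpamAssassin::Plugin::DCC
# Columns, left to right:
#   set 0: network tests off, Bayes off  -> 0     (DCC needs the network)
#   set 1: network tests on,  Bayes off  -> 1.37
#   set 2: network tests off, Bayes on   -> 0
#   set 3: network tests on,  Bayes on   -> 2.17
score DCC_CHECK 0 1.37 0 2.17
endif # Mail::SpamAssassin::Plugin::DCC
```

So the entry holds one weight per score set for a single hit-or-miss result. A local override with just one number, for example "score DCC_CHECK 3.0" in local.cf (a hypothetical value), would apply that score in every set.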


All that said, I can't see why you'd want to do anything else with DCC.
The FP rate on DCC, even with the defaults of 999999 for fuzz counts,
is significant. In the SA 3.1.0 set3 mass-checks, DCC_CHECK had a S/O
of 0.979, meaning that 2.1% of email matched by it was nonspam.

I can't see how any lower fuzz values would be of any use, as they
should, theoretically, have lower S/O's, and would only be worth small
fractions of a point.




Re: Parsing DCC

Posted by Dan <a...@patnode.net>.
>> 1) Is capturing header output text the best way to implement DCC  
>> in SA?
>
> No, using the DCC plugin that already comes with SA is the best way.
>
> Edit your v310.pre and load the dcc plugin. SA already has pre-scored
> and tested rules built in. No further work needed.

Excellent Matt.  Is there a way to process the various DCC outputs  
with this architecture?  Searching the "factory" configuration, this  
entry seems to handle scoring?:

	ifplugin Mail::SpamAssassin::Plugin::DCC
	score DCC_CHECK 0 1.37 0 2.17
	endif # Mail::SpamAssassin::Plugin::DCC

This looks a bit inflexible, can the plugin do more than take a  
single DCC score and assign 3 weights to the output?

Thanks!
Dan

Re: Parsing DCC

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
> This is partly about DCC and partly about regex (yes, I've ordered two
> more regex books).  
>
>
> First, there's the basic all or nothing output:
>
> X-DCC-servers-Metrics: ui1 1049; bulk Body=many Fuz1=many Fuz2=many
> X-DCC-servers-Metrics: ui1 1049; bulk Body=0 Fuz1=0 Fuz2=0
>
> ...that can be captured with basic rules:
>
> header DCCBODY_m ALL =~ /X-DCC-.{1,500}Body=many/i
> header DCCFUZ1_m ALL =~ /X-DCC-.{1,500}Fuz1=many/i
> header DCCFUZ2_m ALL =~ /X-DCC-.{1,500}Fuz2=many/i
>
> 1) Is capturing header output text the best way to implement DCC in SA?

No, using the DCC plugin that already comes with SA is the best way.

Edit your v310.pre and load the dcc plugin. SA already has pre-scored
and tested rules built in. No further work needed.