
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2008/05/01 13:03:40 UTC

[Bug 5896] New: try out enemieslist

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896

           Summary: try out enemieslist
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org


Steven Champeon has been in touch regarding 'testing my enemieslist rDNS
patterns data against the SpamAssassin spam/ham corpus(es) to see if there's a
reason for us to collaborate.'

I think this could be very useful.

He says:

'As you may or may not know, enemieslist is my dataset of regular
expressions of rDNS naming conventions, classified by various things
like assignment type/duration (dynamic/static/generic provider-assigned,
and so forth), tech in use (cable/dialup/dsl/wireless/etc), and also by
resnet (.edu residential networks), webhost (mass virtual hosting), and
the like. More can be found here:

 http://enemieslist.com/how/use.html

The basic idea is that EL generic/dynamic/static pats are often bots;
webhost suggests higher risk of phish attacks; outmx suggests that an
outright rejection might be ill-advised, and so forth for the other
classifications. The stats differ between classifications and for PTR as
opposed to HELO; generic HELO of most types often indicates bots,
whereas dynamic/generic PTR is merely suggestive but useful in a scoring
context in my experience with the sendmail package I developed that uses
the EL data. There are currently almost 29K patterns in the dataset. I
ran a list of 100K known Storm bot IPs against it a few weeks ago,
courtesy Randy Vaughn at Baylor, and EL matched > 99.998% of those that
had rDNS. I ran the CBL against it back in late December, and got about
a 94.7% match rate against those IPs that had rDNS. It's pretty
comprehensive. All patterns are fully qualified, and organized by
domain, it's not just a big ugly single regex.

I'm curious to see how incorporating EL DNSBL lookups into SpamAssassin
might be useful; we have a DNSBL mirror network (currently three hosts,
with more on the way) or I can talk about how to use it with a patched
rbldnsd if you wanted to do some local testing. It'd be really
interesting to see how the various classifications compared and how to
best score them (for both PTR and HELO string) as a module in SA. I'm
also looking to see what sort of scaling I'd need to have the DNSBLs
support if we were to introduce an SA module.'
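
The "organized by domain" point can be pictured with a minimal Python sketch. The pattern data, the class labels, and the names PATTERNS and classify below are invented for illustration and are not taken from the enemieslist dataset; the idea is only that a hostname is checked against the few patterns filed under its parent domain rather than against one large regex.

```python
import re

# Hypothetical sample data: the real enemieslist patterns and class names
# are not reproduced here.
PATTERNS = {
    "example.net": [
        (re.compile(r"^dyn-\d+-\d+-\d+-\d+\.dsl\.example\.net$"), "dynamic/dsl"),
        (re.compile(r"^static-\d+-\d+-\d+-\d+\.cable\.example\.net$"), "static/cable"),
    ],
}


def classify(hostname):
    """Return the class of the first matching pattern, or None."""
    host = hostname.lower().rstrip(".")
    labels = host.split(".")
    # Try each parent domain, longest first, so only that domain's
    # patterns are evaluated instead of one large combined regex.
    for i in range(len(labels) - 1):
        domain = ".".join(labels[i:])
        for regex, label in PATTERNS.get(domain, ()):
            if regex.match(host):
                return label
    return None
```

The same function could be applied to either the PTR name or the HELO string, with the two results scored separately.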

Also, in response to a mail from me:

> We already have a rudimentary set of the ~20 most common rDNS naming schemes
> for dynamic hosts, but EL sounds a lot more exhaustive, and I suspect
> there'll be good correlation between EL rules and other rules in our
> ruleset.  It should be quite easy to figure that out.

OK, sounds good. I'm really interested in seeing what the various FP
rates would be for both the HELO and PTR for the various return values;
I'm also interested in seeing what rates are for the different
subclasses (as formed by the combination of A response and TXT response
for the same lookup, so "static/cable" or "dynamic/dsl" or
"natproxy/vpn"). Basically, I'm using these today as very blunt hammers,
and I want to make sure I have a good sense of how to better tune the
scoring. And you guys have such great stats, so I came to you :)

> So, these are generally run against the SMTP connecting host's
> rDNS, right?

Both PTR and HELO/EHLO string, yes. We've found that PTR is a good
indicator, but when the HELO string is a match for some EL pattern it's
a very reliable indicator of bot activity with a very low FP rate, so we
test both when available. Of course, this differs between the various
types, so I wouldn't assume webhost or outmx or static PTR are
necessarily bad, just indicative. But we'll see what the numbers
look like after we run some tests, I suppose :)

> By the way, do you mind if we conduct this conversation on a public
> Bugzilla entry?  that's generally how we do it.  Doing that in the
> open is also more likely to get useful info on how other hosts
> have found the increased load from SpamAssassin lookups, too.

No, not at all, though I definitely want to know how adding this to
SA would affect our load; and give me time to throw a few more rbldnsd
mirrors into the rotation if required. (Running lookups against the
patterns is very fast, 75K/s here on my macbook, but once you add
logging and DNS overhead it slows down considerably :-/)

So, what next? Should we look at setting up a local rbldnsd instance
to isolate testing from our production machines? Was the doc I sent
a URL for in my last email sufficient to tweak whatever SA rules
you need to test? I'm here to answer any questions you have :)



Anyway, usage details are here:  http://enemieslist.com/how/use.html -- we'd
need to add some rules to do this.  I've been meaning to do this for several
weeks(!) but things have been busy :( so here's a new ticket.


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 5896] RFE: rules for enemieslist

Posted by bu...@issues.apache.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896

Mark Martinec <Ma...@ijs.si> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|3.3.2                       |Future


[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896

Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #11 from Karsten Bräckelmann <gu...@rudersport.de> 2010-03-23 17:43:09 UTC ---
Moving back off of Security, which got changed by accident during the mass
Target Milestone move.


[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896

Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #9 from Justin Mason <jm...@jmason.org> 2010-01-27 03:16:50 UTC ---
reassigning, too


RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.

> -----Original Message-----
> From: Loren Wilton [mailto:lwilton@earthlink.net]
>
> Is there a way to clear the noautolearn for the whitelist rules?
> Normal
> rules could probably do it with tflags.  Except I'm not sure that you
> can
> necessarily negate a previously set tflags value with a later tflags
> value.
> (If not, maybe it would be worth an enhancement request.)

I tried that already. No change. It seems I can't override the tflags of the USER_IN_WHITELIST rule/shortcircuit.
(I tried to override in local.cf)


> Another solution in this case would be to not use the whitelist.  Just
> make
> a rule, or several rules and meta them together, and give the overall
> rule a
> score of -100 and set the shortcircuit and autolearn flags on the rule.
> As
> everyone has mentioned, this can still end up poisoning your database
> if
> any of those senders get joe-jobbed.  But then again, you might be
> lucky and
> it would work.

Thanks but I think they convinced me.

Harry




----------------------------------------------------
JAM Software GmbH
Geschäftsführer: Joachim Marder
Bruchhausenstr. 1 * 54290 Trier * Germany
Tel: 0700-70707050 * Fax: 0700-70707059
(max. 12,4 ct/min, Preise aus Mobilfunknetzen können abweichen)
Handelsregister Nr. HRB 4920 (AG Wittlich)  http://www.jam-software.de

Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Loren Wilton <lw...@earthlink.net>.

Is there a way to clear the noautolearn for the whitelist rules?  Normal 
rules could probably do it with tflags.  Except I'm not sure that you can 
necessarily negate a previously set tflags value with a later tflags value. 
(If not, maybe it would be worth an enhancement request.)

Another solution in this case would be to not use the whitelist.  Just make 
a rule, or several rules and meta them together, and give the overall rule a 
score of -100 and set the shortcircuit and autolearn flags on the rule.  As 
everyone has mentioned, this can still end up poisoning your database if 
any of those senders get joe-jobbed.  But then again, you might be lucky and 
it would work.

        Loren

new eval functions comparing the matches of two regular expression?

Posted by Harald Binkle <bi...@jam-software.com>.

What about new eval functions that compare the matches of two regular expressions?
If there were functions like

 eval:Equals(/regex1/,/regex2/)
and
 eval:NOTEquals(/regex1/,/regex2/)

it would be easy to define rules like:

a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address, but the two have different 'name parts'.
Like:
TO: "User Name" <us...@domain.com>
FROM: "viagra offer" <us...@domain.com>

There are a lot of spam mails with that structure trying to get through because many people have their own domain on the whitelist.
I tried to set this up as a rule but with no luck. I fear it is not possible to do this with a regular expression,
as it is not possible to compare the results of one regular expression inside another.
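
For illustration only, the check described above (a single recipient whose address equals the sender's address while the display names differ) might look like the following Python sketch. This is not SpamAssassin code and not an implementation of the proposed eval:Equals; the function name and the sample addresses are made up, and only the standard library's email.utils parsers are used.

```python
from email.utils import getaddresses, parseaddr


def same_address_different_name(to_header, from_header):
    """True if To: has exactly one recipient whose address equals the
    From: address while the display names ('name parts') differ."""
    recipients = getaddresses([to_header])
    if len(recipients) != 1:
        return False
    to_name, to_addr = recipients[0]
    from_name, from_addr = parseaddr(from_header)
    if not to_addr or not from_addr:
        return False
    same_addr = to_addr.lower() == from_addr.lower()
    same_name = to_name.strip().lower() == from_name.strip().lower()
    return same_addr and not same_name
```

A rule built on such a check would still hit people who mail themselves under a different display name, so a small score like the 0.8 points suggested above seems the safer choice.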

Could someone implement this?

Greetings

Harry



----------------------------------------------------
JAM Software GmbH
Geschäftsführer: Joachim Marder
Max-Planck-Str. 22 * 54296 Trier * Germany
Tel: 0700-70707050 * Fax: 0700-70707059
(max. 12,4 ct/min, Preise aus Mobilfunknetzen können abweichen)
Handelsregister Nr. HRB 4920 (AG Wittlich)  http://www.jam-software.de

RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.

Sidney,
thank you very much for your answers and explanations.
I just looked over the code of check_forged_in_whitelist and I think it would be hard to use for my purpose.
I will wait a few days to see whether someone else replies to the request to implement eval:Equals(/regex1/,/regex2/) and eval:NOTEquals(/regex1/,/regex2/).
If no one answers, I'll post the request to the dev list again in one or two weeks, with a more appropriate subject, and see what others say.
The problem I have is that we use the Windows version of SpamAssassin (http://sourceforge.net/projects/sawin32/), so implementing a plugin that provides those two functions is not easy (much work).

I think those evals would make it possible to write more powerful rules without having to implement small things in plugins, since it is not possible to compare the matches of regular expressions within a single regular expression.

Harry

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Wednesday, May 07, 2008 10:19 AM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 7:46 PM:
> > Sorry, I thought a discussion for switching the default behavior
> would be right to be
> > in dev list.
>
> Yes, I'm the one who brought up the related issues of how to handle
> learning and
> whitelisting, and I said what I did to make sure that any further
> digression to those
> topics should go to the users list. Your questions about changing the
> default behavior and
> about new eval rules would go in this list.
>
> > And what about a discussion about a new eval function comparing the
> matches of two
> > regular expression. If there would be functions
> eval:Equals(/regex1/,/regex2/) and
> > eval:NOTEquals(/regex1/,/regex2/)  it would be easy to define rules
> like the one I
> > mentioned in my last mail.
>
> I don't have an immediate opinion about this. Perhaps you could try it
> out in a plugin and
> see how it works out compared to simply using whitelist_from_rcvd to
> make the whitelisting
> work.
>
> I did once try to catch that kind of spam with an eval rule that calls
> check_forged_in_whitelist which is supposed to catch anything that
> matched the address
> portion of a whitelist_from_rcvd but doesn't match the received part of
> the test. I don't
> remember now why we don't have any rules that use that eval, it may be
> that it doesn't
> really work. You might try defining a rule
>
>    header FORGED_USER_IN_WHITELIST  eval:check_forged_in_whitelist()
>
> and also define some whitelist_from_rcvd entries and see if that rule
> has any success at
> catching those.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.

Harald Binkle wrote, On 7/5/08 7:46 PM:
> Sorry, I thought a discussion for switching the default behavior would be right to be
> in dev list.

Yes, I'm the one who brought up the related issues of how to handle learning and
whitelisting, and I said what I did to make sure that any further digression to those
topics should go to the users list. Your questions about changing the default behavior and
about new eval rules would go in this list.

> And what about a discussion about a new eval function comparing the matches of two
> regular expression. If there would be functions eval:Equals(/regex1/,/regex2/) and
> eval:NOTEquals(/regex1/,/regex2/)  it would be easy to define rules like the one I
> mentioned in my last mail.

I don't have an immediate opinion about this. Perhaps you could try it out in a plugin and
see how it works out compared to simply using whitelist_from_rcvd to make the whitelisting
work.

I did once try to catch that kind of spam with an eval rule that calls
check_forged_in_whitelist which is supposed to catch anything that matched the address
portion of a whitelist_from_rcvd but doesn't match the received part of the test. I don't
remember now why we don't have any rules that use that eval, it may be that it doesn't 
really work. You might try defining a rule

   header FORGED_USER_IN_WHITELIST  eval:check_forged_in_whitelist()

and also define some whitelist_from_rcvd entries and see if that rule has any success at
catching those.

  -- sidney

RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.

I see.
Sorry, I thought a discussion about switching the default behavior would belong on the dev list.
And what about a discussion of a new eval function comparing the matches of two regular expressions?
If there were functions eval:Equals(/regex1/,/regex2/) and eval:NOTEquals(/regex1/,/regex2/), it would be easy to define rules like the one I mentioned in my last mail.

(Create a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address, but the two have different 'name parts'.
Like:
TO: "User Name" <us...@domain.com>
FROM: "viargre offer" <us...@domain.com>

There are a lot of spam mails with that structure trying to get through because many people have their own domain on the whitelist.
I tried to set this up as a rule but with no luck. I fear it is not possible to do with a regular expression.)


Harry

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Wednesday, May 07, 2008 9:07 AM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 6:30 PM:
> > Hi, ok, these are good reasons, I see. But I wrote a script setting
> all recipients of
> > outgoing mails on the whitelist. So everyone I send a message to will
> be on the
> > whitelist. Meanwhile nearly all people I have contact to are on my
> whitelist so there
> > are almost no mails I receive which will be automatically learned as
> ham.
>
> Autolearn is a way of doing the best that you can with no work, but you
> are seeing some of
> its failings. There is really no substitute for a manual learning
> procedure where you find
> a way to make it easy to specify whether email is really typical ham or
> spam and send it
> to the learner, avoiding sending atypical ham that contains words that
> you would not want
> to learn as ham. I could get into a discussion about ideas on how to do
> that without
> having to classify all your mail by hand, which of course is what you
> use SpamAssassin to
> avoid in the first place, but that's the kind of discussion that the
> SpamAssassin users
> mailing list is for.
>
> > There are a lot of spam mails with that structure trying to get
> through because many
> > people have their own domain on the whitelist. I tried to set this up
> as rule but with
> > no luck. I fear it is not possible to do with an regular expression.
>
> The proper way to do it is to use whitelist_from_rcvd instead of
> whitelist_from and put in
> a rule for each sending mail server that the person uses. Again, this
> is a topic for the
> sa-users mailing list rather than the dev list.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.

Harald Binkle wrote, On 7/5/08 6:30 PM:
> Hi, ok, these are good reasons, I see. But I wrote a script setting all recipients of
> outgoing mails on the whitelist. So everyone I send a message to will be on the
> whitelist. Meanwhile nearly all people I have contact to are on my whitelist so there
> are almost no mails I receive which will be automatically learned as ham.

Autolearn is a way of doing the best that you can with no work, but you are seeing some of 
its failings. There is really no substitute for a manual learning procedure where you find 
a way to make it easy to specify whether email is really typical ham or spam and send it 
to the learner, avoiding sending atypical ham that contains words that you would not want 
to learn as ham. I could get into a discussion about ideas on how to do that without 
having to classify all your mail by hand, which of course is what you use SpamAssassin to 
avoid in the first place, but that's the kind of discussion that the SpamAssassin users 
mailing list is for.

> There are a lot of spam mails with that structure trying to get through because many
> people have their own domain on the whitelist. I tried to set this up as rule but with
> no luck. I fear it is not possible to do with an regular expression.

The proper way to do it is to use whitelist_from_rcvd instead of whitelist_from and put in 
a rule for each sending mail server that the person uses. Again, this is a topic for the 
sa-users mailing list rather than the dev list.
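
As a rough model of why whitelist_from_rcvd resists the forged-sender spam described above, consider the Python sketch below. It is a simplified illustration, not SpamAssassin's implementation: the entry list, the function name, and the matching details (a glob on the address plus a domain-suffix match on the relay's rDNS) are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical entries in the spirit of whitelist_from_rcvd:
# (address glob, rDNS domain the trusted relay must be in).
ENTRIES = [("*@example.com", "mail.example.com")]


def is_whitelisted(from_addr, relay_rdns):
    """Whitelist only if the address matches AND the relay's rDNS is in the
    expected domain; a forged From: sent through some other host fails."""
    addr = from_addr.lower()
    rdns = relay_rdns.lower().rstrip(".")
    for glob, domain in ENTRIES:
        if not fnmatchcase(addr, glob):
            continue
        if rdns == domain or rdns.endswith("." + domain):
            return True
    return False
```

With plain whitelist_from, only the first of the two conditions is checked, which is what lets mail with a forged own-domain sender through.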

  -- sidney

RE: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.

Hi,
OK, these are good reasons, I see.
But I wrote a script that puts all recipients of outgoing mail on the whitelist.
So everyone I send a message to will be on the whitelist.
Meanwhile, nearly everyone I am in contact with is on my whitelist, so almost none of the mail I receive is automatically learned as ham.

Another thing regarding your answer, Matt:
Why not create a rule scoring, say, 0.8 points if there is only one recipient address and it equals the sender's address, but the two have different 'name parts'?
Like:
TO: "User Name" <us...@domain.com>
FROM: "viargre offer" <us...@domain.com>

There are a lot of spam mails with that structure trying to get through because many people have their own domain on the whitelist.
I tried to set this up as a rule but with no luck. I fear it is not possible to do with a regular expression.


Harry


> -----Original Message-----
> From: Matt Kettler [mailto:mkettler_sa@verizon.net]
> Sent: Wednesday, May 07, 2008 7:19 AM
> To: Sidney Markowitz
> Cc: Harald Binkle; 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Sidney Markowitz wrote:
> > Harald Binkle wrote, On 7/5/08 1:33 AM:
> >> Hi, I just wondered why my bayes filter does not learn as much ham
> >> mails as before. Then I realized that the USER_IN_WHITELIST
> >> shortcircuit is set to spam which has tflags
> >> noautolearn. Does this really make sense?
> >
> > The rationale is that you put an address on the whitelist when they
> > might send mail that looks like spam but you know it is really ham.
> If
> > it looks like spam, you don't want the Bayes filter to learn that it
> > is ham, because from anyone else it would be spam.
>
> Another reason not to do so is the frequency with which people
> mis-configure their whitelists.
>
> If you mistakenly whitelist_from *@mydomain.com, as many people have
> done when first setting up SA, your Bayes database will be poisoned
> rather
> quickly if it allows such messages to autolearn.
>

&&&&&&&&&&&&&&&&&&&&

> -----Original Message-----
> From: Sidney Markowitz [mailto:sidney@sidney.com]
> Sent: Tuesday, May 06, 2008 10:41 PM
> To: Harald Binkle
> Cc: 'dev@spamassassin.apache.org'
> Subject: Re: shortcircuit for USER_IN_WHITELIST --> noautolearn??
> ==>learn!
>
> Harald Binkle wrote, On 7/5/08 1:33 AM:
> > Hi, I just wondered why my bayes filter does not learn as much ham
> mails as before.
> > Then I realized that the USER_IN_WHITELIST shortcircuit is set to
> spam which has tflags
> > noautolearn. Does this really make sense?
>
> The rationale is that you put an address on the whitelist when they
> might send mail that
> looks like spam but you know it is really ham. If it looks like spam,
> you don't want the
> Bayes filter to learn that it is ham, because from anyone else it would
> be spam.
>
> Of course, someone on your whitelist can also send mail that looks like
> ham. The Bayes
> filter can't learn anything one way or the other from that mail, so it
> is sent to noautolearn.
>
>   -- sidney





Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Matt Kettler <mk...@verizon.net>.

Sidney Markowitz wrote:
> Harald Binkle wrote, On 7/5/08 1:33 AM:
>> Hi, I just wondered why my bayes filter does not learn as much ham 
>> mails as before. Then I realized that the USER_IN_WHITELIST 
>> shortcircuit is set to spam which has tflags
>> noautolearn. Does this really make sense?
>
> The rationale is that you put an address on the whitelist when they 
> might send mail that looks like spam but you know it is really ham. If 
> it looks like spam, you don't want the Bayes filter to learn that it 
> is ham, because from anyone else it would be spam.

Another reason not to do so is the frequency with which people 
mis-configure their whitelists.

If you mistakenly whitelist_from *@mydomain.com, as many people have 
done when first setting up SA, your Bayes database will be poisoned rather 
quickly if it allows such messages to autolearn.

Re: shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Sidney Markowitz <si...@sidney.com>.

Harald Binkle wrote, On 7/5/08 1:33 AM:
> Hi, I just wondered why my bayes filter does not learn as much ham mails as before. 
> Then I realized that the USER_IN_WHITELIST shortcircuit is set to spam which has tflags
> noautolearn. Does this really make sense?

The rationale is that you put an address on the whitelist when they might send mail that 
looks like spam but you know it is really ham. If it looks like spam, you don't want the 
Bayes filter to learn that it is ham, because from anyone else it would be spam.

Of course, someone on your whitelist can also send mail that looks like ham. The Bayes 
filter can't learn anything one way or the other from that mail, so it is sent to noautolearn.

  -- sidney

shortcircuit for USER_IN_WHITELIST --> noautolearn?? ==>learn!

Posted by Harald Binkle <bi...@jam-software.com>.

Hi,
I just wondered why my Bayes filter is not learning as much ham mail as before.
Then I realized that the USER_IN_WHITELIST shortcircuit is set to spam which has tflags noautolearn.
Does this really make sense?
The only case in which a mail from a user on the whitelist is not ham would be if the sender's machine is infected by a virus or a Trojan.
So why not change it back so that mails from users on the whitelist are learned by Bayes?

How can I change it back for myself so that mails from users on the whitelist are learned by Bayes?

Greetings

Harry




Re: [Bug 5896] New: try out enemieslist

Posted by Michael Peddemors <mi...@linuxmagic.com>.

On Thursday 01 May 2008 04:03, bugzilla-daemon@bugzilla.spamassassin.org 
wrote:
> Steven Champeon has been in touch regarding 'testing my enemieslist rDNS
> patterns data against the SpamAssassin spam/ham corpus(es) to see if
> there's a reason for us to collaborate.'

> I'm curious to see how incorporating EL DNSBL lookups into SpamAssassin
> might be useful; we have a DNSBL mirror network (currently three hosts,
> with more on the way) or I can talk about how to use it with a patched
> rbldnsd if you wanted to do some local testing. It'd be really

Actually, it is surprising that SA hasn't looked at something like this 
already.. We also use a similar method in our Mail Server technologies, 
albeit we do it in the SMTP layer.. but I think this raises a few questions..

o Should it be RBL based..

In the past SA users have been stung with RBL based lookups, when RBL's get 
blocked etc.. leading to very high system loads..

o Should SA start integrating a definition update program for something like 
this?

Compiling even 10k regex patterns takes very little overhead, and by doing 
daily updates of a locally cached list there is little risk of problems even 
when the updater fails, the latest regex's will always be on hand.

o Should this use one regex supplier, or community based?

This might be more helpful, since there are projects like Enemies List, our 
own DynaRegex .. or other companies, projects etc.. that might evolve out of 
this.

It also could have several different types of regex patterns, as mentioned 
below  so that SA users could choose score settings for some patterns 
differently than others..  Some patterns are safe enough to score very high, 
while generic shared webhost patterns may want to be scored a little lower.

I think that the regex pattern database would be an excellent candidate for 
building out an SA definition updater..
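
The "latest regexes will always be on hand" behaviour could be sketched roughly as follows in Python. Nothing here is an existing SA updater: the function names are hypothetical, and fetch stands in for whatever would download or read the daily pattern list. The point is only that a failed or corrupt update leaves the previously compiled list in place.

```python
import re


def compile_patterns(lines):
    """Compile every non-empty, non-comment line; raises re.error on a bad one."""
    stripped = (line.strip() for line in lines)
    return [re.compile(s) for s in stripped if s and not s.startswith("#")]


def refresh(current, fetch):
    """Return a freshly compiled list, or `current` if the update fails.

    `fetch` is a hypothetical callable returning the new pattern lines.
    """
    try:
        updated = compile_patterns(fetch())
    except (OSError, re.error):
        # Download failed or the new list is broken: keep the cached set.
        return current
    # An empty update is treated as a failure as well.
    return updated or current
```

Separate lists per pattern class (dynamic, webhost, and so on) could be refreshed the same way, each with its own score.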

> OK, sounds good. I'm really interested in seeing what the various FP
> rates would be for both the HELO and PTR for the various return values;
> I'm also interested in seeing what rates are for the different
> subclasses (as formed by the combination of A response and TXT response
> for the same lookup, so "static/cable" or "dynamic/dsl" or
> "natproxy/vpn"). Basically, I'm using these today as very blunt hammers,
> and I want to make sure I have a good sense of how to better tune the
> scoring. And you guys have such great stats, so I came to you :)
>
> > So, these are generally run against the SMTP connecting host's
> > rDNS, right?
>
> Both PTR and HELO/EHLO string, yes. We've found that PTR is a good
> indicator, but when the HELO string is a match for some EL pattern it's
> a very reliable indicator of bot activity with a very low FP rate, so we
> test both when available. Of course, this differs between the various
> types, so I wouldn't assume webhost or outmx or static PTR are
> necessarily bad, just indicative. But we'll see what the numbers
> look like after we run some tests, I suppose :)
>
> > By the way, do you mind if we conduct this conversation on a public
> > Bugzilla entry?  that's generally how we do it.  Doing that in the
> > open is also more likely to get useful info on how other hosts
> > have found the increased load from SpamAssassin lookups, too.
>
> No, not at all, though I definitely want to know how adding this to
> SA would affect our load; and give me time to throw a few more rbldnsd
> mirrors into the rotation if required. (Running lookups against the
> patterns is very fast, 75K/s here on my macbook, but once you add
> logging and DNS overhead it slows down considerably :-/)
>
> So, what next? Should we look at setting up a local rbldnsd instance
> to isolate testing from our production machines? Was the doc I sent
> a URL for in my last email sufficient to tweak whatever SA rules
> you need to test? I'm here to answer any questions you have :)
>
>
>
> Anyway, usage details are here:  http://enemieslist.com/how/use.html --
> we'd need to add some rules to do this.  I've been meaning to do this for
> several weeks(!) but things have been busy :( so here's a new ticket.

-- 
--
"Catch the Magic of Linux..."
------------------------------------------------------------------------
Michael Peddemors - President/CEO - LinuxMagic
Products, Services, Support and Development
Visit us at http://www.linuxmagic.com
------------------------------------------------------------------------
A Wizard IT Company - For More Info http://www.wizard.ca
"LinuxMagic" is a Registered TradeMark of Wizard Tower TechnoServices Ltd.
------------------------------------------------------------------------
604-589-0037 Beautiful British Columbia, Canada

This email and any electronic data contained are confidential and intended 
solely for the use of the individual or entity to which they are addressed. 
Please note that any views or opinions presented in this email are solely 
those of the author and are not intended to  represent those of the company.

[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896


Steven Champeon <sc...@hesketh.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schampeo@hesketh.com




--- Comment #7 from Steven Champeon <sc...@hesketh.com>  2008-09-03 14:33:36 PST ---
(In reply to comment #6)
> Sorry to the listeners for the triple post, but just so it is documented as
> well, when looking at this regex pattern inclusion idea, there should be
> multiple sets of regex patterns.  Some will be guaranteed to be spam sources,
> some guaranteed to be dialup, and some regexes need to test if this is a
> customer relaying outbound, vs MTA -> MTA expected traffic etc.. Some regexes
> may even be confirmed by the network operator, and all should be treated
> differently.. Possibly we need to have separate static regex files for
> different classes, along with contrib/testing/approved separate files for each
> class.

I'm working on a ruleset, though what I have at present requires modifications
to DNSEval.pm to support the way the enemieslist DNSBL works (prepend a
hostname or HELO string to a zone, an A record lookup returns 127.0.x.x showing
how we've classified the naming convention, a TXT record lookup returns a
string showing what else we know about the technology in use). I'll also look
at doing a separate plugin so as not to require mods to DNSEval.pm, and submit
it for testing.
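
The lookup flow described above can be sketched roughly as follows. This is
only an illustration in Python, not the DNSEval.pm change or the planned
plugin. The zone name is a made-up placeholder, and because the thread does not
say what the two variable octets mean, the sketch only extracts them. The TXT
lookup is left out because the Python standard library has no TXT resolver.

```python
import ipaddress
import socket

# Hypothetical zone name for illustration; the real zone is not given here.
EXAMPLE_ZONE = "el.example.invalid"


def build_query_name(host_or_helo, zone=EXAMPLE_ZONE):
    """Prepend the PTR hostname or HELO string to the DNSBL zone."""
    return host_or_helo.strip().rstrip(".").lower() + "." + zone


def parse_return_code(address):
    """Return the last two octets of a 127.0.x.x answer, else None."""
    octets = ipaddress.IPv4Address(address).packed
    if octets[0] != 127 or octets[1] != 0:
        return None
    return octets[2], octets[3]


def classify(host_or_helo, zone=EXAMPLE_ZONE):
    """Do the A record lookup; None means not listed or lookup failed."""
    try:
        answer = socket.gethostbyname(build_query_name(host_or_helo, zone))
    except (OSError, UnicodeError):
        return None
    return parse_return_code(answer)
```

A rule would then map the returned octet pair to a score; that mapping is
deliberately not shown, since it depends on the list's own return codes.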

With respect to your comments above:

1) the DNSBL has three mirrors and is a slightly modified rbldnsd; it should
be relatively easy to get more mirrors if needed (and I have no illusions about
the need should this ruleset be incorporated into SpamAssassin)

2) I'd prefer not to incorporate the actual patterns themselves into
SpamAssassin; licensing the patterns to corporate users is how we sustain the
project, so if we were to require the patterns be distributed as part of
SpamAssassin we'd also have to discuss licensing terms, etc. So let's stick
with the DNSBL for now.

3) EL doesn't list "spam sources", it classifies hosts based on their PTR
naming and is not to be used in deep header inspection (except possibly to
frown on possible forged hostnames within a given domain, but that's more
complex than the ruleset I've got right now). I've found that listing spammers
with static naming using regexes is worse than useless, due to the high rate of
change and turnover.

4) I don't have a problem with SA coming up with a community-based scheme for
maintenance on a superset of patterns, as long as it's compatible with my
current DNSBL return codes, but I doubt there'd be much interest beyond people
submitting hostnames and expecting someone else to do the regexes (based on ~5
years experience with my project).

5) We use a DNSBL primarily because distribution and updating of large regex
files to multiple users was an annoying pain in the butt.

So, lots of questions, about reliability, licensing, and distribution. But
let's keep this discussion going.


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896





--- Comment #2 from Michael Peddemors <mi...@linuxmagic.com>  2008-05-01 12:48:30 PST ---
Actually, it is surprising that SA hasn't looked at something like this
already.. We also use a similar method in our Mail Server technologies,
albeit we do it in the SMTP layer.. but I think this begs a few questions..

o Should it be RBL based..

In the past SA users have been stung with RBL based lookups, when RBLs get
blocked etc.. leading to very high system loads..

o Should SA start integrating a definition update program for something like 
this?

Compiling even 10k regex patterns takes very little overhead, and by doing
daily updates of a locally cached list there is little risk of problems: even
when the updater fails, the latest regexes will always be on hand.
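
A minimal sketch of that "keep the last good copy" idea, in Python. It assumes
a plain text file with one regex per line; `fetch` is a hypothetical callable
standing in for whatever actually downloads the list, and none of this is how
SA itself does updates.

```python
import os
import re
import tempfile


def load_patterns(path):
    """Compile one regex per non-empty, non-comment line of the cache file."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle]
    return [re.compile(line) for line in lines
            if line and not line.startswith("#")]


def refresh_cache(fetch, cache_path):
    """Try to replace the cached list; keep the old file if anything fails.

    `fetch` is a hypothetical callable returning the new list as text.
    Returns True if the cache was replaced, False otherwise.
    """
    try:
        text = fetch()
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                re.compile(line)  # reject the whole update on a bad pattern
    except (OSError, re.error):
        return False
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    os.replace(tmp_path, cache_path)  # atomic swap on the same filesystem
    return True
```

Because the new file is validated before it replaces the old one, a failed
download or a broken pattern leaves yesterday's list in place.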

o Should this use one regex supplier, or community based?

Community based might be more helpful, since there are projects like Enemies
List, our own DynaRegex .. or other companies, projects etc.. that might evolve
out of this.

It also could have several different types of regex patterns, as mentioned 
below  so that SA users could choose score settings for some patterns 
differently than others..  Some patterns are safe enough to score very high, 
while generic shared webhost patterns may want to be scored a little lower.
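
To make the per-class scoring idea concrete, here is a small Python sketch. The
class names, patterns, and scores are invented for illustration and are not
taken from any real pattern set.

```python
import re

# Hypothetical classes, patterns, and default scores (illustration only).
PATTERN_CLASSES = {
    "dynamic": {
        "score": 2.5,
        "patterns": [r"^dsl-\d+-\d+-\d+-\d+\.example\.net$"],
    },
    "webhost": {
        "score": 0.5,
        "patterns": [r"^vhost\d+\.example\.com$"],
    },
}

# Compile once at startup rather than once per message.
COMPILED = {
    name: [re.compile(p, re.IGNORECASE) for p in spec["patterns"]]
    for name, spec in PATTERN_CLASSES.items()
}


def score_hostname(hostname, overrides=None):
    """Return (class_name, score) for the first matching class.

    `overrides` lets a site replace the default score for a class.
    Returns (None, 0.0) when no pattern matches.
    """
    overrides = overrides or {}
    for name, regexes in COMPILED.items():
        if any(regex.search(hostname) for regex in regexes):
            default = PATTERN_CLASSES[name]["score"]
            return name, overrides.get(name, default)
    return None, 0.0
```

A site that trusts the "dynamic" class more could call
`score_hostname(host, overrides={"dynamic": 4.0})` without touching the
patterns themselves.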

I think that the regex pattern database would be an excellent candidate for
building out an SA definition updater..



[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896





--- Comment #5 from Michael Peddemors <mi...@linuxmagic.com>  2008-05-02 14:02:05 PST ---
Oh, and the question is still open on the advantage of an RBL service for this
vs a statically distributed file of regex patterns that SA can compile locally.



[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896





--- Comment #6 from Michael Peddemors <mi...@linuxmagic.com>  2008-05-02 14:16:12 PST ---
Sorry to the listeners for the triple post, but just so it is documented as
well, when looking at this regex pattern inclusion idea, there should be
multiple sets of regex patterns.  Some will be guaranteed to be spam sources,
some guaranteed to be dialup, and some regexes need to test if this is a
customer relaying outbound, vs MTA -> MTA expected traffic etc.. Some regexes
may even be confirmed by the network operator, and all should be treated
differently.. Possibly we need to have separate static regex files for
different classes, along with contrib/testing/approved separate files for each
class.



[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896





--- Comment #3 from Justin Mason <jm...@jmason.org>  2008-05-02 13:23:45 PST ---
hi Michael --

regarding updates of rulesets: note that sa-update is already being used with
great success to do this (my "sought" ruleset for example).  it works.  so
that's pretty much a solved problem now.

Licensing is key for this stuff.  iirc you were in touch previously about your
rulesets, but the licensing was incompatible with the Apache license.  for us
to support a ruleset (or indeed a DNSBL lookup), the license has to be
something we can work with, and licenses that do not permit commercial use are
not compatible. (this will of course also apply to enemieslist eval too)



[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|try out enemieslist         |RFE: rules for enemieslist




--- Comment #1 from Justin Mason <jm...@jmason.org>  2008-05-01 04:04:20 PST ---
oh, and his email addr is schampeo at hesketh.com.



[Bug 5896] RFE: rules for enemieslist

Posted by bu...@bugzilla.spamassassin.org.

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5896





--- Comment #4 from Michael Peddemors <mi...@linuxmagic.com>  2008-05-02 13:44:41 PST ---
No, I mean that if we use a special system of updaters for regex patterns,
you could possibly get the patterns from all the different sets, and submission
to the central repository would make those patterns available to all, no matter
where they come from. sa-update is currently used to pull rulesets, and that of
course would be the candidate tool to use for something like this, but I think
that there needs to be an SA approval process for regex patterns, so users of
this approach know that a new regex pattern that sa-update supplies won't
accidentally increase blocked mail.. possibly a contrib, tested, and approved
set.. a method of ham/spam scoring on regexes etc. I know that we would be a lot
more willing to contribute our information freely into a system like this..
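
One way the contrib/tested/approved idea could work is sketched below in
Python. The tier rules and thresholds are assumptions made up for this example,
not an existing SA process: a pattern is run against ham and spam hostname
corpora and is only promoted if it hits enough spam and stays under a ham
limit.

```python
import re


def corpus_hits(pattern, hostnames):
    """Count how many hostnames in a corpus the pattern matches."""
    regex = re.compile(pattern)
    return sum(1 for name in hostnames if regex.search(name))


def assign_tier(pattern, spam_hosts, ham_hosts,
                min_spam_hits=2, max_ham_rate=0.0):
    """Place a submitted pattern in a tier (hypothetical thresholds).

    "contrib":  no corpus available yet, so the pattern is unvalidated.
    "approved": enough spam hits and ham hit rate within the limit.
    "tested":   evaluated, but not good enough to approve.
    """
    if not spam_hosts and not ham_hosts:
        return "contrib"
    spam_hits = corpus_hits(pattern, spam_hosts)
    ham_hits = corpus_hits(pattern, ham_hosts)
    ham_rate = ham_hits / len(ham_hosts) if ham_hosts else 0.0
    if spam_hits >= min_spam_hits and ham_rate <= max_ham_rate:
        return "approved"
    return "tested"
```

With the default `max_ham_rate=0.0`, a single ham hit is enough to keep a
pattern out of the approved set.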

