Posted to users@spamassassin.apache.org by Chris Lear <ch...@laculine.com> on 2005/12/23 12:04:29 UTC

SARE_URI_EQUALS false positives

I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
therefore skewing the scoring of some mail quite badly.
The weird thing is that the uris that spamassassin is complaining about
aren't uris at all. The mail in question is auto-created reports of cvs
diffs, so it's slightly unusual.
I've tried to condense the debug information. Here it is:

This is some of the output from spamassassin -D <false_positive

[16733] dbg: uri: parsed uri found, updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, updated.by=Mis
[16733] dbg: uri: parsed uri found, http://updated.by=Mis
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
[16733] dbg: uri: parsed uri found, updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
[16733] dbg: uri: parsed uri found, http://updated.by=Updated
[16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated

These "parsed uris" are not links in the e-mail. They are just text.

I've had a bit of a look at the regexps that spamassassin uses to work
out what is a uri, and it seems that "updated.by=Updated" is treated as
a uri because .by is a valid tld and spamassassin looks for "schemeless"
uris, then prepends http:// for the tests.
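
The behaviour described above can be approximated with a small Python
sketch. This is not SpamAssassin's actual regexp (which is written in
Perl and is considerably more involved); the TLD list, the pattern and
the function name candidate_uris are all assumptions made up for
illustration. It only shows how a "name.tld" token in plain text can be
picked up and given an http:// prefix.

```python
import re

# Illustrative subset of top-level domains; a real list is far longer.
TLDS = ("by", "com", "net", "org")

# A bare "name.tld" token with no scheme, plus any non-space characters
# that follow it. Hypothetical pattern, not SpamAssassin's own.
SCHEMELESS = re.compile(
    r"(?<![\w./:@-])((?:[a-z0-9-]+\.)+(?:%s))(?![a-z0-9.-])(\S*)" % "|".join(TLDS),
    re.IGNORECASE,
)


def candidate_uris(text):
    """Return each schemeless match followed by its http:// form."""
    found = []
    for match in SCHEMELESS.finditer(text):
        raw = match.group(0)
        found.append(raw)
        found.append("http://" + raw)
    return found


print(candidate_uris("updated.by=Updated"))
print(candidate_uris("updated_by=Updated"))
```

Under these assumptions the properties-file line is reported as a uri
only because "by" appears in the TLD list; the underscore form has no
dot and is not matched at all.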

I'm running spamassassin 3.1.0 on perl 5.8.2.

Does anyone have any suggestions, apart from simply reducing the score
for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
guarantee that only real uris are parsed as such?

Chris

Re: SARE_URI_EQUALS false positives

Posted by Chris Lear <ch...@laculine.com>.

* jdow wrote (23/12/05 12:06):
> From: "Chris Lear" <ch...@laculine.com>
>>* jdow wrote (23/12/05 11:26):
>>> From: "Chris Lear" <ch...@laculine.com>
>>> 
>>>> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
>>>> therefore skewing the scoring of some mail quite badly.
>>>> The weird thing is that the uris that spamassassin is complaining about
>>>> aren't uris at all. The mail in question is auto-created reports of cvs
>>>> diffs, so it's slightly unusual.
>> 
>> [...]
>>>> 
>>>> I've had a bit of a look at the regexps that spamassassin uses to work
>>>> out what is a uri, and it seems that "updated.by=Updated" is treated as
>>>> a uri because .by is a valid tld and spamassassin looks for "schemeless"
>>>> uris, then prepends http:// for the tests.
>>>> 
>>>> I'm running spamassassin 3.1.0 on perl 5.8.2.
>>>> 
>>>> Does anyone have any suggestions, apart from simply reducing the score
>>>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>>>> guarantee that only real uris are parsed as such?
>>> 
>>> Before you drop the score precipitously check if there is some other
>>> characteristic of the emails that trigger falsely which can be used to
>>> apply a negative score. If there is such a characteristic then generate
>>> the appropriate negative score. If not weigh how effective the rule is
>>> for you. The version of "sa-stats.pl" that is on the SARE site helps
>>> figure this out nicely.
>>> 
>>> That said it's close to a "50/50" rule that hits on very few messages
>>> here so should have a low score. (It hit on 6 messages out of 75000.)
>>> Cutting it out completely here seems like it would be effective TODAY.
>>> That could change. At one time it was quite necessary. Spammer fads
>>> change.)
>> 
>> I've reduced the score, and a quick check shows that that rule hits
>> almost nothing anyway, so it's not a big problem. The bayes rules were
>> keeping the false positives from doing much damage, anyway.
>> But spamassassin uses uris for lots of things, and if it's commonly
>> parsing (reasonably) normal text as uris, I would expect that to be a
>> problem in more rules than just SARE_URI_EQUALS.
> 
> That is a standalone rule.
> 
> And I do note that many of the SARE rules have severe problems in very
> specific cases. There are some mailing lists that are not well filtered
> for spam which have postings which trigger some of the "too effective
> to toss" SARE rules. I've developed some massive meta rules to at least
> partially get a handle on the problem. (A number of times XXX hit option
> would be nice to have for this.)

Sorry to go on, but I wonder whether you've missed my point. The
SARE_URI_EQUALS rule is working fine. It just looks in the uris that
spamassassin gives it, and complains when they contain "=".
The problem is that spamassassin is treating things that aren't uris as
uris. So SARE_URI_EQUALS is working on dud data.

In this specific case, the e-mail contains the text
"updated.by=Updated". This is not a uri, and nor should it be treated as
one. But spamassassin thinks it is (because .by is a valid tld), so, as
far as I can tell, *all* uri rules will check it. It so happens that
SARE_URI_EQUALS hits in this case, but other uri rules are vulnerable to
false positives if the uri parsing is wrong, aren't they?

Chris

Re: SARE_URI_EQUALS false positives

Posted by jdow <jd...@earthlink.net>.

From: "Chris Lear" <ch...@laculine.com>
>* jdow wrote (23/12/05 11:26):
>> From: "Chris Lear" <ch...@laculine.com>
>> 
>>> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
>>> therefore skewing the scoring of some mail quite badly.
>>> The weird thing is that the uris that spamassassin is complaining about
>>> aren't uris at all. The mail in question is auto-created reports of cvs
>>> diffs, so it's slightly unusual.
> 
> [...]
>>> 
>>> I've had a bit of a look at the regexps that spamassassin uses to work
>>> out what is a uri, and it seems that "updated.by=Updated" is treated as
>>> a uri because .by is a valid tld and spamassassin looks for "schemeless"
>>> uris, then prepends http:// for the tests.
>>> 
>>> I'm running spamassassin 3.1.0 on perl 5.8.2.
>>> 
>>> Does anyone have any suggestions, apart from simply reducing the score
>>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>>> guarantee that only real uris are parsed as such?
>> 
>> Before you drop the score precipitously check if there is some other
>> characteristic of the emails that trigger falsely which can be used to
>> apply a negative score. If there is such a characteristic then generate
>> the appropriate negative score. If not weigh how effective the rule is
>> for you. The version of "sa-stats.pl" that is on the SARE site helps
>> figure this out nicely.
>> 
>> That said it's close to a "50/50" rule that hits on very few messages
>> here so should have a low score. (It hit on 6 messages out of 75000.)
>> Cutting it out completely here seems like it would be effective TODAY.
>> That could change. At one time it was quite necessary. Spammer fads
>> change.)
> 
> I've reduced the score, and a quick check shows that that rule hits
> almost nothing anyway, so it's not a big problem. The bayes rules were
> keeping the false positives from doing much damage, anyway.
> But spamassassin uses uris for lots of things, and if it's commonly
> parsing (reasonably) normal text as uris, I would expect that to be a
> problem in more rules than just SARE_URI_EQUALS.

That is a standalone rule.

And I do note that many of the SARE rules have severe problems in very
specific cases. There are some mailing lists that are not well filtered
for spam which have postings which trigger some of the "too effective
to toss" SARE rules. I've developed some massive meta rules to at least
partially get a handle on the problem. (A number of times XXX hit option
would be nice to have for this.)

{^_^}

Re: SARE_URI_EQUALS false positives

Posted by Chris Lear <ch...@laculine.com>.

* jdow wrote (23/12/05 11:26):
> From: "Chris Lear" <ch...@laculine.com>
> 
>> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
>> therefore skewing the scoring of some mail quite badly.
>> The weird thing is that the uris that spamassassin is complaining about
>> aren't uris at all. The mail in question is auto-created reports of cvs
>> diffs, so it's slightly unusual.

[...]
>> 
>> I've had a bit of a look at the regexps that spamassassin uses to work
>> out what is a uri, and it seems that "updated.by=Updated" is treated as
>> a uri because .by is a valid tld and spamassassin looks for "schemeless"
>> uris, then prepends http:// for the tests.
>> 
>> I'm running spamassassin 3.1.0 on perl 5.8.2.
>> 
>> Does anyone have any suggestions, apart from simply reducing the score
>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>> guarantee that only real uris are parsed as such?
> 
> Before you drop the score precipitously check if there is some other
> characteristic of the emails that trigger falsely which can be used to
> apply a negative score. If there is such a characteristic then generate
> the appropriate negative score. If not weigh how effective the rule is
> for you. The version of "sa-stats.pl" that is on the SARE site helps
> figure this out nicely.
> 
> That said it's close to a "50/50" rule that hits on very few messages
> here so should have a low score. (It hit on 6 messages out of 75000.)
> Cutting it out completely here seems like it would be effective TODAY.
> That could change. At one time it was quite necessary. Spammer fads
> change.)

I've reduced the score, and a quick check shows that that rule hits
almost nothing anyway, so it's not a big problem. The bayes rules were
keeping the false positives from doing much damage, anyway.
But spamassassin uses uris for lots of things, and if it's commonly
parsing (reasonably) normal text as uris, I would expect that to be a
problem in more rules than just SARE_URI_EQUALS.

Chris

Re: SARE_URI_EQUALS false positives

Posted by jdow <jd...@earthlink.net>.

From: "Chris Lear" <ch...@laculine.com>

> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
> therefore skewing the scoring of some mail quite badly.
> The weird thing is that the uris that spamassassin is complaining about
> aren't uris at all. The mail in question is auto-created reports of cvs
> diffs, so it's slightly unusual.
> I've tried to condense the debug information. Here it is:
> 
> This is some of the output from spamassassin -D <false_positive
> 
> [16733] dbg: uri: parsed uri found, updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, updated.by=Mis
> [16733] dbg: uri: parsed uri found, http://updated.by=Mis
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Mis
> [16733] dbg: uri: parsed uri found, updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
> [16733] dbg: uri: parsed uri found, http://updated.by=Updated
> [16733] dbg: uri: cleaned parsed uri, http://updated.by=Updated
> 
> These "parsed uris" are not links in the e-mail. They are just text.
> 
> I've had a bit of a look at the regexps that spamassassin uses to work
> out what is a uri, and it seems that "updated.by=Updated" is treated as
> a uri because .by is a valid tld and spamassassin looks for "schemeless"
> uris, then prepends http:// for the tests.
> 
> I'm running spamassassin 3.1.0 on perl 5.8.2.
> 
> Does anyone have any suggestions, apart from simply reducing the score
> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
> guarantee that only real uris are parsed as such?

Before you drop the score precipitously check if there is some other
characteristic of the emails that trigger falsely which can be used to
apply a negative score. If there is such a characteristic then generate
the appropriate negative score. If not weigh how effective the rule is
for you. The version of "sa-stats.pl" that is on the SARE site helps
figure this out nicely.

That said it's close to a "50/50" rule that hits on very few messages
here so should have a low score. (It hit on 6 messages out of 75000.)
Cutting it out completely here seems like it would be effective TODAY.
That could change. At one time it was quite necessary. Spammer fads
change.

{^_^}

Re: SARE_URI_EQUALS false positives

Posted by Chris Lear <ch...@laculine.com>.

* Loren Wilton wrote (24/12/2005 00:23):
>> Does anyone have any suggestions, apart from simply reducing the score
>> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>> guarantee that only real uris are parsed as such?
> 
> Several.

Hi. Thanks for the response. I'm replying rather late due to pressures
of Christmas.

> 
> 1.    Change your report generator to remove the extraneous dot between
> updated and by.  Or change it to the more common underscore, if you insist
> on these words being connected for some reason.
> 
> 2.    Put spaces around the equal sign.

These are fine suggestions, but sadly not practical. The e-mails are
auto-generated diffs from cvs commits. The files being committed are
java properties files. In particular, the "updated.by" property contains
internationalised versions of the phrase "Updated by". The "more common
underscore" would be unusual in the java properties file, and expecting
the developers to change the way they work to avoid SARE misfires is a
slightly overzealous reaction to the spam problem, I think. However, it
is possible if there's no sensible alternative.
The second suggestion is only a workaround, not a fix, anyway, because
spamassassin will still check http://updated.by as a uri.

> 
> 3.    If you are reluctant for the correct fix, drop the score on the
> uri_equals rule to 4 or maybe 3, depending on what else your report manages
> to hit.

I am reluctant to use the "correct fix". Actually I'm inclined to think
that the word "correct" is being misapplied here. I've changed the
scores appropriately, though.

> 
> 4.    You could submit a Bugzilla on the parsing of that phrase.  But
> frankly I consider the bug in the report generation, not SA's parsing of
> strange syntax.

The reason I didn't submit a bug was that I was not sure there was one -
hence the original query. And I'm still not going to submit a bug,
because I'm persuaded that there is not one. What bothered me (and still
does a bit) was that the string "updated.by=anything" matches a rule
that looks for uris of the form "http(s)://*=*". I.e. the http(s) is
conjured out of nowhere for schemeless uris. I can see the point, but I
thought it would be worth bringing a possible problem to light. It's a
possible problem, not a bug per se, and the subsequent discussion shows
that people take different views on the seriousness of this kind of
parsing issue. One thing that hasn't been mentioned in respect of this
is that if spamassassin is looking aggressively for schemeless uris, it
could in some cases create quite a lot of unwanted uri checking traffic.
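
A rough Python sketch of the effect: the pattern below is a
hypothetical stand-in for a rule of the form "http(s)://*=*" as
described above, not the real SARE_URI_EQUALS definition, and the name
rule_hits is invented for the example. It shows that the raw text does
not match such a rule, while the version with the conjured http://
prefix does.

```python
import re

# Hypothetical stand-in for a rule shaped like "http(s)://*=*".
# This is NOT the actual SARE_URI_EQUALS definition.
EQUALS_IN_URI = re.compile(r"^https?://.*=", re.IGNORECASE)


def rule_hits(uris):
    """Return the uris from the list that the stand-in rule matches."""
    return [u for u in uris if EQUALS_IN_URI.search(u)]


# The two forms that appear in the debug output for the same text.
parsed = ["updated.by=Updated", "http://updated.by=Updated"]
print(rule_hits(parsed))
```

So, in this sketch at least, the hit depends entirely on the prefix
that was added during parsing, not on anything present in the mail.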

I'm happy to stick with what I've got now. I've sent some examples off
as indicated so that the SARE corpus will contain my mail in future.

Chris

Re: SARE_URI_EQUALS false positives

Posted by Theo Van Dinter <fe...@apache.org>.

On Tue, Dec 27, 2005 at 09:17:09PM +0100, mouss wrote:
> are you sure? my understanding is that query part must be in the
> url-path, so must come after at least one slash. something like

I don't know about "=bar", but if it were "?bar", many browsers will assume
there's supposed to be a "/" before the "?".

One could argue that "=bar" looks like quoted-printable encoding, but that's
another discussion. ;)

-- 
Randomly Generated Tagline:
"I'm Bond ... Covalent Bond."            - Farside Cartoon

Re: SARE_URI_EQUALS false positives

Posted by mouss <us...@free.fr>.

Kai Schaetzl a écrit :
> Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100:
> 
> 
>>Is foo.tld=bar a valid hostname part in a URI?
> 
> 
> "foo.tld=bar" is a valid URL with "foo.tld" being the hostname and "=bar" 
> being the query part.
> 

are you sure? my understanding is that query part must be in the
url-path, so must come after at least one slash. something like
	scheme://[user[:pass]@]host[:port]/[path[?queryargs]]
plus the fact that:
	scheme://[user[:pass]@]host[:port]  is the same as the one with a
trailing slash, and
	absence of the scheme part assumes http.

running http://www.google.com=test on my firefox results in a dns lookup
error.
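
One way to check this reading is with a generic URL parser. The sketch
below uses Python's standard urllib.parse, which is only one
implementation's interpretation and says nothing about how
SpamAssassin or any particular browser or MUA behaves; the helper name
host_and_query is made up for the example.

```python
from urllib.parse import urlsplit


def host_and_query(url):
    """Split a URL into its authority (host) part and its query string."""
    parts = urlsplit(url)
    return parts.netloc, parts.query


# "=" with no slash or "?" before it: does it end the host part?
print(host_and_query("http://www.google.com=test"))
# The conventional form, with a path and a "?" introducing the query.
print(host_and_query("http://www.google.com/?q=test"))
# A "?" directly after the host, with no slash.
print(host_and_query("http://foo.tld?bar"))
```

With this parser the "=test" stays part of the host and the query is
empty, which is consistent with the dns lookup error; only "/", "?" or
"#" end the host part.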

Re: SARE_URI_EQUALS false positives

Posted by Kai Schaetzl <ma...@conactive.com>.

Mouss wrote on Tue, 27 Dec 2005 00:04:34 +0100:

> Is foo.tld=bar a valid hostname part in a URI?

"foo.tld=bar" is a valid URL with "foo.tld" being the hostname and "=bar" 
being the query part.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com

Re: SARE_URI_EQUALS false positives

Posted by mouss <us...@free.fr>.

Loren Wilton a écrit :
>>Does anyone have any suggestions, apart from simply reducing the score
>>for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
>>guarantee that only real uris are parsed as such?
> 
> 
> Several.
> 
> 1.    Change your report generator to remove the extraneous dot between
> updated and by.  Or change it to the more common underscore, if you insist
> on these words being connected for some reason.
> 
> 2.    Put spaces around the equal sign.
> 
> 3.    If you are reluctant for the correct fix, drop the score on the
> uri_equals rule to 4 or maybe 3, depending on what else your report manages
> to hit.
> 
> 4.    You could submit a Bugzilla on the parsing of that phrase.  But
> frankly I consider the bug in the report generation, not SA's parsing of
> strange syntax.
> 

Is foo.tld=bar a valid hostname part in a URI? I doubt that. now, would
a MUA show that as a URI followed by "bar"?

I think that SA should provide an option to enable/disable:
uri_broken_mua, so that people not caring for "broken" MUAs can avoid
such false positives.

Re: SARE_URI_EQUALS false positives

Posted by Loren Wilton <lw...@earthlink.net>.

> Does anyone have any suggestions, apart from simply reducing the score
> for SARE_URI_EQUALS? Is this a spamassassin bug, or is there no way to
> guarantee that only real uris are parsed as such?

Several.

1.    Change your report generator to remove the extraneous dot between
updated and by.  Or change it to the more common underscore, if you insist
on these words being connected for some reason.

2.    Put spaces around the equal sign.

3.    If you are reluctant to apply the correct fix, drop the score on the
uri_equals rule to 4 or maybe 3, depending on what else your report manages
to hit.

4.    You could submit a Bugzilla on the parsing of that phrase.  But
frankly I consider the bug in the report generation, not SA's parsing of
strange syntax.

        Loren

Re: SARE_URI_EQUALS false positives

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Chris,

Friday, December 23, 2005, 3:04:29 AM, you wrote:

CL> I'm getting false positives for SARE_URI_EQUALS, which scores 5 and is
CL> therefore skewing the scoring of some mail quite badly. ...

CL> Does anyone have any suggestions, apart from simply reducing the
CL> score for SARE_URI_EQUALS? Is this a spamassassin bug, or is there
CL> no way to guarantee that only real uris are parsed as such?

Send me a couple of sample emails with this problem so I can add them
to my ham corpus, and SARE_URI_EQUALS will automagically drop its
score to 1.666 (or lower).  No SARE rule with ham scores more than
1.666.

Best is to put them into an mbox file, zip, and email.  Thanks.

Bob Menschel