Posted to dev@spamassassin.apache.org by bu...@issues.apache.org on 2010/04/14 10:52:35 UTC

[Bug 6408] New: URIs with "http" not in lower-case are not detected

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

           Summary: URIs with "http" not in lower-case are not detected
           Product: Spamassassin
           Version: 3.3.0
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: cm@coretec.at


I just got a spam with the following URI:

Http://bestwebcazinos.net/de

which is not extracted by SA. A quick test confirms that any uppercase letters
in the URI scheme prevent the URI from being extracted, and therefore from
being checked against URIBLs.

I suppose there's just a /i missing in some regex.

I'm setting this to "major" since it is very easy for spammers to bypass URI
checks this way.

regards,

cm.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
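
The failure mode described in the report can be sketched as follows. This is
a minimal illustration in Python, not SpamAssassin's Perl code: URI_PATTERN is
a hypothetical, heavily simplified stand-in for the real extraction regexes,
and the only input taken from the report is the spam URI itself.

```python
import re

# Hypothetical, heavily simplified URI pattern (an assumption for this
# sketch); SpamAssassin's real extraction regexes are far more elaborate.
URI_PATTERN = r"\b(?:https?|ftp)://[^\s<>\"]+"

# Without the equivalent of Perl's /i, only an all-lowercase scheme matches.
URI_RE_STRICT = re.compile(URI_PATTERN)
# With re.IGNORECASE (Perl's /i), the case of the scheme no longer matters.
URI_RE_NOCASE = re.compile(URI_PATTERN, re.IGNORECASE)

body = "Visit Http://bestwebcazinos.net/de today"

strict_hits = URI_RE_STRICT.findall(body)
nocase_hits = URI_RE_NOCASE.findall(body)

print(strict_hits)
print(nocase_hits)
```

The strict pattern finds nothing in this body, so nothing would be handed to a
URIBL lookup, while the case-insensitive pattern recovers the URI unchanged.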

[Bug 6408] [Review] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|URIs with "http" not in     |[Review] URIs with "http"
                   |lower-case are not detected |not in lower-case are not
                   |                            |detected


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #16 from Sidney Markowitz <si...@sidney.com> 2010-04-19 14:24:23 EDT ---
(in reply to comment #13 and comment #14)

By definition of how they are parsed the URIs involved have to begin with
https? or ftp or mailto, so the extra generality doesn't buy anything. I wanted
to see what the results were with https? independent of ftp and mailto, then
was going to add separate rules for those to fine tune if the results looked
interesting.

Good catch on the fixes. They don't seem to have as general an effect, but +1
for putting them in the 3.3 branch. However, let's put them in their own bug.
Their symptoms are different, and the fact that this bug is closed makes it an
awkward place to keep talking about them. In particular, we shouldn't reopen
this bug to mark it for review and voting.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #14 from Mark Martinec <Ma...@ijs.si> 2010-04-19 11:20:50 EDT ---
Created an attachment (id=4749)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4749)
fixes more cases of case-sensitive URI schemes


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #8 from Mark Martinec <Ma...@ijs.si> 2010-04-17 21:06:44 EDT ---
Just to see what the specs say about case sensitivity, here is
the relevant section from RFC 3986.



RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

3.1.  Scheme

   Each URI begins with a scheme name that refers to a specification for
   assigning identifiers within that scheme.  As such, the URI syntax is
   a federated and extensible naming system wherein each scheme's
   specification may further restrict the syntax and semantics of
   identifiers using that scheme.

   Scheme names consist of a sequence of characters beginning with a
   letter and followed by any combination of letters, digits, plus
   ("+"), period ("."), or hyphen ("-").  Although schemes are case-
   insensitive, the canonical form is lowercase and documents that
   specify schemes must do so with lowercase letters.  An implementation
   should accept uppercase letters as equivalent to lowercase in scheme
   names (e.g., allow "HTTP" as well as "http") for the sake of
   robustness but should only produce lowercase scheme names for
   consistency.

      scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
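
The grammar and the robustness advice quoted above can be sketched in a few
lines of Python. This is an illustration only: canonical_scheme is a
hypothetical helper name, and example.com is a placeholder host, neither of
which comes from the RFC or from SpamAssassin.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":".
# Explicit ASCII classes are used because ABNF ALPHA and DIGIT are ASCII-only.
SCHEME_RE = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def canonical_scheme(uri):
    """Return the lowercase scheme of `uri`, or None if it has no valid scheme.

    Hypothetical helper: accepts any letter case on input and only ever
    produces the canonical lowercase form.
    """
    match = SCHEME_RE.match(uri)
    return match.group(1).lower() if match else None


print(canonical_scheme("HTTP://example.com/"))
print(canonical_scheme("Http://bestwebcazinos.net/de"))
print(canonical_scheme("1http://example.com/"))
```

Accepting mixed case while emitting only lowercase means later comparisons
against "http", "ftp" and so on can stay simple string equality checks.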


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #18 from Sidney Markowitz <si...@sidney.com> 2010-04-19 15:17:44 EDT ---
(in reply to comment #17)

Yes, this bug was a regression of those earlier bugs, which had been fixed,
caused by new code I added in 3.3. This time I added some test cases, although
it would be good to add some more tests for the cases Mark caught in
comment #14.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #13 from Mark Martinec <Ma...@ijs.si> 2010-04-19 08:05:39 EDT ---
> uri T_UPPERCASE_HTTP  /^(?:H|hT|htT|httP|httpS)/

How about:

uri T_UPPERCASE_SCHEME /^(?: [A-Z] | [a-z] [a-z0-9.+-]* [A-Z]) [A-Za-z0-9.+-]* :/x
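
To see what the generalised pattern accepts, it can be transcribed into Python
with re.VERBOSE, which plays the role of Perl's /x. This is a sketch for
experimenting with the regex only; the example.com URIs are made-up test
inputs, and nothing here reflects how SpamAssassin itself evaluates uri rules.

```python
import re

# Direct transcription of the proposed T_UPPERCASE_SCHEME pattern.
UPPERCASE_SCHEME = re.compile(
    r"""^(?: [A-Z]                      # scheme starts with an uppercase letter
         |   [a-z] [a-z0-9.+-]* [A-Z]   # or the first uppercase letter comes later
         )
         [A-Za-z0-9.+-]*                # remainder of the scheme
         :                              # scheme delimiter
    """,
    re.VERBOSE,
)


def has_uppercase_scheme(uri):
    """True if the scheme part of `uri` contains at least one uppercase letter."""
    return UPPERCASE_SCHEME.match(uri) is not None


for uri in (
    "Http://bestwebcazinos.net/de",
    "hTTp://example.com/",
    "MAILTO:user@example.com",
    "http://example.com/Path",
):
    print(uri, has_uppercase_scheme(uri))
```

Uppercase letters after the colon, as in the last input, do not count, because
every character class before the colon excludes the colon itself.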


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #17 from John Hardin <jh...@impsec.org> 2010-04-19 14:56:32 EDT ---
(In reply to comment #12)
> I was way overthinking this!
> 
> Here is the test rule I committed to my sandbox to count how often https?
> appears with one or more upper case characters as the scheme of a URI. I don't
> think there is a reason to bother with ftp or mailto for this.
> 
> uri T_UPPERCASE_HTTP  /^(?:H|hT|htT|httP|httpS)/
> 
> We'll see what the hits look like in mass check in a couple of days

Howzabout:

   uri  T_URI_UC  /^[^:]*[A-Z]/

?

Also, this has come up before...

bug 3092
bug 3286
bug 4111
bug 4529
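
For comparison, the shorter T_URI_UC pattern can be tried out the same way.
This is only a sketch of the regex's behaviour in Python; the example.com
inputs are invented, and whether a URI without any colon can ever reach a uri
rule is not established in this thread.

```python
import re

# Transcription of the proposed T_URI_UC pattern: an uppercase letter that
# can be reached from the start of the string without crossing a colon.
URI_UC = re.compile(r"^[^:]*[A-Z]")


def hits(uri):
    """True if the T_URI_UC pattern matches `uri`."""
    return URI_UC.search(uri) is not None


print(hits("Http://bestwebcazinos.net/de"))  # uppercase before the first colon
print(hits("http://Example.com/Path"))       # uppercase only after the colon
print(hits("www.Example.com/Path"))          # no colon at all
```

When a colon is present, the pattern only looks at the text before it, which is
the scheme. Without a colon it matches an uppercase letter anywhere, so it
relies on the parser having already restricted URIs to ones that carry a
scheme.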


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #9 from Mark Martinec <Ma...@ijs.si> 2010-04-17 21:09:25 EDT ---
> Although schemes are case-
> insensitive, the canonical form is lowercase and documents that
> specify schemes must do so with lowercase letters.  An implementation
> should accept ... but should only produce lowercase scheme names for
> consistency.

Sounds like a good reason to assign score points for using
upper-case letters in a URL scheme.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #19 from Sidney Markowitz <si...@sidney.com> 2010-04-19 19:01:19 EDT ---
FYI, it is not a good test for spam. People do sometimes use uppercase in a
URL, or their word processing setup uppercases what it thinks is the start of a
sentence. In any case, there are too many FPs:

T_UPPERCASE_HTTP
0.0408% spam (55 of 134818 messages)
0.4037% ham (981 of 242985 messages)
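
The percentages above can be re-derived from the raw counts. The sketch below
also computes a normalised spam/(spam + ham) ratio of the two hit rates; that
summary figure and its formula are an assumption of this example, not
something reported in the thread.

```python
# Raw counts from the mass-check result for T_UPPERCASE_HTTP.
spam_hits, spam_total = 55, 134818
ham_hits, ham_total = 981, 242985

spam_rate = spam_hits / spam_total
ham_rate = ham_hits / ham_total

# Assumed summary figure: share of the combined hit rate that comes from
# spam. 0.5 would mean no signal; values well below 0.5 mean the rule
# fires mostly on ham.
spam_share = spam_rate / (spam_rate + ham_rate)

print(f"spam: {spam_rate:.4%}")
print(f"ham:  {ham_rate:.4%}")
print(f"spam share of hit rate: {spam_share:.3f}")
```

The rule fires roughly ten times as often on ham as on spam, which is why it
cannot be scored as a spam indicator.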


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED
            Summary|[Review] URIs with "http"   |URIs with "http" not in
                   |not in lower-case are not   |lower-case are not detected
                   |detected                    |
  Status Whiteboard|Needs 1 vote for 3.3 branch |

--- Comment #5 from Sidney Markowitz <si...@sidney.com> 2010-04-17 15:49:38 EDT ---
Committed to 3.3 branch revision 935237.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #10 from Kevin A. McGrail <km...@pccc.com> 2010-04-18 09:08:25 EDT ---
(In reply to comment #9)
> > Although schemes are case-
> > insensitive, the canonical form is lowercase and documents that
> > specify schemes must do so with lowercase letters.  An implementation
> > should accept ... but should only produce lowercase scheme names for
> > consistency.
> 
> Sounds like a good reason to assign score points for using
> upper-case letters in a URL scheme.


I'd like to vote +1 but have to vote -1 on that one.  Too likely to be a false
positive IMO.  

Case-insensitivity on the URI test is the only viable option from my
perspective.

Regards,
KAM


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

John Hardin <jh...@impsec.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jhardin@impsec.org

--- Comment #6 from John Hardin <jh...@impsec.org> 2010-04-17 15:56:46 EDT ---
(In reply to comment #4)
> (In reply to comment #3)
> > +1  Looks good
> 
> +1 for trunk and back-porting to any branches (3.3 definitely, but also 3.2??)

+1 for backporting to 3.2 as well.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #1 from Sidney Markowitz <si...@sidney.com> 2010-04-16 08:27:11 EDT ---
Created an attachment (id=4741)
 --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4741)
I'm testing this fix now and will check it in when it passes all tests

I had made the big regexps case insensitive but missed three places after the
URI was extracted where it was being processed based on the scheme. This is in
code that was added in version 3.3.

The patch also adds some test cases that would have caught this. There were
already tests for upper case in the domain TLD.
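
The kind of bug described here, where extraction is case-insensitive but a
later step still compares the scheme case-sensitively, can be sketched as
below. Both functions are hypothetical Python illustrations and do not
reproduce the actual patch or any SpamAssassin code; the test table at the end
mirrors the idea of adding cases that would have caught the regression.

```python
import re


def host_for_lookup_buggy(uri):
    """Hypothetical post-extraction step with a case-sensitive scheme check."""
    if uri.startswith("http://") or uri.startswith("https://"):
        return uri.split("/")[2].lower()
    return None  # "Http://..." falls through and is silently dropped


# The same check, made case-insensitive (the equivalent of adding /i).
SCHEME_PREFIX = re.compile(r"^https?://", re.IGNORECASE)


def host_for_lookup_fixed(uri):
    """Same step, but the scheme comparison ignores letter case."""
    if SCHEME_PREFIX.match(uri):
        return uri.split("/")[2].lower()
    return None


# Test cases of the sort that would have caught the regression.
cases = [
    "http://bestwebcazinos.net/de",
    "Http://bestwebcazinos.net/de",
    "HTTPS://bestwebcazinos.net/de",
]
for uri in cases:
    print(uri, host_for_lookup_buggy(uri), host_for_lookup_fixed(uri))
```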


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #7 from Sidney Markowitz <si...@sidney.com> 2010-04-17 16:11:57 EDT ---
Brain fade on my part ... The code with the bug is new in 3.3. This bug does
not appear in 3.2 branch, which contains different URI parsing bugs whose fix
we never backported.


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Sidney Markowitz <si...@sidney.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2                          |P1
                 CC|                            |sidney@sidney.com
   Target Milestone|Undefined                   |3.3.2
  Status Whiteboard|                            |Needs 2 votes for 3.3
                   |                            |branch

--- Comment #2 from Sidney Markowitz <si...@sidney.com> 2010-04-16 09:39:41 EDT ---
Committed to trunk revision 934866

This should be committed to 3.3 branch. Calling for votes


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #15 from Mark Martinec <Ma...@ijs.si> 2010-04-19 11:25:04 EDT ---
(In reply to comment #14)
> Created an attachment (id=4749)
>  --> (https://issues.apache.org/SpamAssassin/attachment.cgi?id=4749) [details]
> fixes more cases of case-sensitive URI schemes

trunk:
  Bug 6408: fix more cases of case-sensitive URI schemes
Sending lib/Mail/SpamAssassin/Dns.pm
Sending lib/Mail/SpamAssassin/MailingList.pm
Sending lib/Mail/SpamAssassin/Plugin/FreeMail.pm
Sending lib/Mail/SpamAssassin/Plugin/HeaderEval.pm
Sending lib/Mail/SpamAssassin/Plugin/URIDNSBL.pm
Sending lib/Mail/SpamAssassin/Plugin/URIEval.pm
Committed revision 935621.

Do we want it in 3.3 too?


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |me@junc.org

--- Comment #20 from Karsten Bräckelmann <gu...@rudersport.de> 2010-05-25 13:55:09 EDT ---
*** Bug 6438 has been marked as a duplicate of this bug. ***


[Bug 6408] [Review] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|Needs 2 votes for 3.3       |Needs 1 vote for 3.3 branch
                   |branch                      |

--- Comment #3 from Karsten Bräckelmann <gu...@rudersport.de> 2010-04-17 10:37:50 EDT ---
+1  Looks good


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #11 from Sidney Markowitz <si...@sidney.com> 2010-04-18 13:42:11 EDT ---
(in reply to comment #9 and comment #10)

I started working up a couple of test rules that should be able to get the
actual numbers. It's quite messy, but I think I can get it all into one big
regexp.
Like KAM, my first reaction is to expect FPs, but there's no substitute for
actually measuring. I'll need to find some time to finish it.


[Bug 6408] [Review] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com

--- Comment #4 from Kevin A. McGrail <km...@pccc.com> 2010-04-17 15:10:51 EDT ---
(In reply to comment #3)
> +1  Looks good

+1 for trunk and back-porting to any branches (3.3 definitely, but also 3.2??)


[Bug 6408] URIs with "http" not in lower-case are not detected

Posted by bu...@issues.apache.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6408

--- Comment #12 from Sidney Markowitz <si...@sidney.com> 2010-04-19 00:30:33 EDT ---
I was way overthinking this!

Here is the test rule I committed to my sandbox to count how often https?
appears with one or more upper case characters as the scheme of a URI. I don't
think there is a reason to bother with ftp or mailto for this.

uri T_UPPERCASE_HTTP  /^(?:H|hT|htT|httP|httpS)/

We'll see what the hits look like in mass check in a couple of days
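
The five alternatives in the rule enumerate the position of the first
uppercase letter in "http" or "https". A brute-force check in Python over every
case spelling of the two schemes is one way to confirm that; case_variants is a
hypothetical helper and example.com a placeholder host, both introduced only
for this sketch.

```python
import itertools
import re

# Transcription of the T_UPPERCASE_HTTP pattern.
UPPERCASE_HTTP = re.compile(r"^(?:H|hT|htT|httP|httpS)")


def case_variants(word):
    """Yield every upper/lower-case spelling of `word` (hypothetical helper)."""
    pairs = [(ch.lower(), ch.upper()) for ch in word]
    for combo in itertools.product(*pairs):
        yield "".join(combo)


# A spelling is a mismatch if the rule's verdict differs from the plain
# question "is this scheme anything other than all-lowercase?".
mismatches = [
    scheme
    for word in ("http", "https")
    for scheme in case_variants(word)
    if bool(UPPERCASE_HTTP.match(scheme + "://example.com/")) != (scheme != word)
]
print(mismatches)
```

The pattern does not check that the scheme really is http or https, so it
depends on the URIs having already been limited to known schemes by the
parser.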
