Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2009/02/07 10:06:20 UTC

[Bug 6061] New: Detect URI with unknown TLD

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6061

           Summary: Detect URI with unknown TLD
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules (Eval Tests)
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: alex.uribl@gmail.com


To help detect URIs in borked spams:

http://health.sharpdecimal
http://stock.prestigechaste
http://terminal.gloriousnext
http://masseurs.sharpdecimal

It would be useful to have an eval test that checks whether the URI is in a known
list and, if not, scores it somewhat.


-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

[Bug 6061] Detect URI with unknown TLD

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6061

Kevin A. McGrail <km...@pccc.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@pccc.com
   Target Milestone|3.3.2                       |3.4.1

--- Comment #18 from Kevin A. McGrail <km...@pccc.com> ---
Moving all open bugs where target is defined and 3.4.0 or lower to 3.4.1 target

--- Comment #12 from Sidney Markowitz <si...@sidney.com>  2009-02-08 09:34:39 PST ---
(in reply to comment#11)

I just took another look at the code in PerMsgStatus.pm and Util.pm and I see I
was remembering wrong about what I did there. The regexps only include the TLD
table for strings that do not begin with a scheme such as http://. If a string
does begin with a scheme it will be parsed as a URI so http://example.comm does
get parsed and only later get filtered for not having a valid TLD. So what is
being proposed here may be feasible with the current code.
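
A rough sketch of the two paths described in this comment, written in Python
rather than SpamAssassin's Perl. The names KNOWN_TLDS, WITH_SCHEME, SCHEMELESS
and parse are hypothetical, and the three-entry TLD set is only a stand-in; this
is not the code in PerMsgStatus.pm or Util.pm. It illustrates why a string with
a scheme can still be reported after filtering, while a schemeless string with
an unknown ending never matches in the first place.

```python
import re

# Assumption: a tiny stand-in for the real TLD table.
KNOWN_TLDS = {"com", "net", "org"}

HOST = r"[-a-z0-9]+(?:\.[-a-z0-9]+)+"

# Path 1: a scheme is present, so any dotted host name is accepted here
# and the TLD check happens afterwards.
WITH_SCHEME = re.compile(r"https?://(%s)" % HOST, re.I)

# Path 2: no scheme, so the TLD alternation is baked into the pattern itself.
SCHEMELESS = re.compile(
    r"(?:[-a-z0-9]+\.)+(?:%s)" % "|".join(sorted(KNOWN_TLDS)), re.I
)


def parse(text):
    """Return (parsed, filtered) host lists for whitespace-separated tokens."""
    parsed, filtered = [], []
    for token in text.split():
        match = WITH_SCHEME.match(token)
        if match:
            host = match.group(1).lower()
            if host.rsplit(".", 1)[1] in KNOWN_TLDS:
                parsed.append(host)
            else:
                # Parsed as a URI, then rejected for its TLD: the case
                # this bug would like to expose to rules.
                filtered.append(host)
        elif SCHEMELESS.fullmatch(token):
            parsed.append(token.lower())
    return parsed, filtered
```

In this sketch parse("http://example.comm") puts example.comm in the filtered
list, whereas a bare foo.sharpdecimal is dropped silently.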


--- Comment #2 from Matt Hampton <ma...@coders.co.uk>  2009-02-07 01:39:29 PST ---
IANA provide a list of valid TLDs

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I have found that the majority of invalid URLs come from incorrectly parsed
messages rather than borked spam.

I run a small black list and have (up until this point) always removed the
invalid URLs.  However this makes me wonder........
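
For reference, the IANA file is plain text: a leading "#" comment line followed
by one upper-case TLD per line. A minimal Python sketch of turning that format
into a lookup set follows. The function name load_tlds is hypothetical, and
SAMPLE is a made-up excerpt in the same layout, not the real file contents.

```python
# Assumption: SAMPLE imitates the layout of tlds-alpha-by-domain.txt;
# the header text and the entries are illustrative only.
SAMPLE = """\
# sample header line (the real file carries a version and date here)
COM
NET
ORG
UK
"""


def load_tlds(text):
    """Return the set of lower-cased TLDs found in IANA-style text."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        tlds.add(line.lower())
    return tlds
```

Lower-casing on load keeps later comparisons against host names simple.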


--- Comment #1 from AXB <al...@gmail.com>  2009-02-07 01:13:33 PST ---
(In reply to comment #0)
> To help detect URIs in borked spams:
> 
> http://health.sharpdecimal
> http://stock.prestigechaste
> http://terminal.gloriousnext
> http://masseurs.sharpdecimal
> 
> It would be useful to have an eval test that checks whether the URI is in a known
> list and, if not, scores it somewhat.
> 

where "known list" means the list of known TLDs


--- Comment #5 from AXB <al...@gmail.com>  2009-02-07 03:18:24 PST ---
(In reply to comment #4)
> It's clear what is a valid URL, and the current code is careful to only match
> valid URLs in deciding what to extract. How do you define a URL string that is
> not a valid URL? 

borked spam contains

http://health.sharpdecimal com

for many MUAs http://health.sharpdecimal is a valid URL

but it could have been

http://health.sharpdecimal-foo
etc

depending on MUA it will be shown as a URL


>I would like to see a specific definition of what these bad
> URLs look like and some indication that they are spam signs.

Whatever is at the end of the URL and smells like a TLD but is not in SA's TLD
definitions? Would that work?

ongoing borked URL spam flood in URIBL.com's spam feeds triggered this request.

http://medications.prestigechaste
http://health.gloriousnext


Having the eval method doesn't mean it has to be scored by default, but it would
be useful.

like you'd use a 
uri     __URI_IN_MSG   /\S/
in a meta.

hope this makes it clearer


--- Comment #13 from Karsten Bräckelmann <gu...@rudersport.de>  2009-02-08 09:44:25 PST ---
Maybe I'm just being dense, but what are these rules (see previous comments)
supposed to do? How else could a rule access that "broken URI" list, other than
via an internal, temporary header or the body?

The latter is possible already for rules, at least given the examples mentioned
so far.
  m~http://(?:[-a-z0-9]+\.)+[-a-z0-9]{4,}~


If I do understand comment 11 correctly, that basically boils down to another
uridnsbl/urirhssub rule, like URIBL_INVALID or something?


Justin Mason <jm...@jmason.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #15 from Justin Mason <jm...@jmason.org> 2010-01-27 03:16:09 UTC ---
reassigning, too

--- Comment #7 from AXB <al...@gmail.com>  2009-02-07 07:19:27 PST ---
(In reply to comment #6)
> You might find this useful. I created an RBL of TLDs. You can look up:
> 
> domain.com.rb.junkemailfilter.com
> 
> It returns 127.0.0.1 for single level tlds, 127.0.0.2 for 2 level tlds:
> 
> domain.com = 127.0.0.1
> domain.co.uk = 127.0.0.2
> 
> For unknown you get nothing. So if your lookup returns nothing you know it's
> invalid.
> 

Pls clue me in how SA is supposed to query something that it ignores.
I must be missing something ...


AXB <ax...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |INVALID

--- Comment #19 from AXB <ax...@gmail.com> ---
ancient and not relevant anymore

--- Comment #9 from AXB <al...@gmail.com>  2009-02-07 12:45:22 PST ---
> (in reply to comment #5)
> Is the idea to accept anything that begins with "http://" as a URL? I would
> like to have some idea as to how many false positives that leads to -- Not FPs
> on spam detection, although that is important too, but for this, how many false
> identifications of strings as URLs and how many resulting unnecessary calls to
> URIRBLs? The reason for the current URI parse code (in trunk -- I'm still
> waiting for that one more review and vote to put it in the 3.2 branch) is to
> only send to the RBL what are possibly real links.

- It's not supposed to trigger any queries.
- It's not supposed to be used to mark spam or ham, so FPs are not an issue.
- It IS supposed to check whether what the parser thinks is a TLD exists in the
TLD data or not.

if URL is example.comm and ".comm" IS NOT in known tld list return 0
if URL is example.com and ".com" IS in known tld list return 1

make the 0 available to a rule.

nothing else.
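
As a sketch only, the 0/1 behaviour above could look like this in Python. The
function name tld_is_known and the KNOWN_TLDS set are hypothetical stand-ins;
SpamAssassin's real eval tests are Perl and use its own TLD table.

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the known TLD list.
KNOWN_TLDS = {"com", "net", "org"}


def tld_is_known(uri):
    """Return 1 if the last label of the host is a known TLD, else 0."""
    # urlsplit only fills in hostname when a "//" authority marker is present.
    target = uri if "://" in uri else "//" + uri
    host = urlsplit(target).hostname or ""
    last_label = host.rstrip(".").rsplit(".", 1)[-1]
    return 1 if last_label in KNOWN_TLDS else 0
```

A rule would then act on the 0 result, with no DNS query involved.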

> Which brings up another point. Is health.sharpdecimal as opposed to
> health.sharpdecimal.com in the RBLs anyway? 

the URIBLs depend on SA's or other tld tables to list a domain.
If it's an unknown tld it won't be listed.

health.sharpdecimal won't ever be listed unless someone starts listing these
types.
No sober BL op I know of would do this :-)


> If not, what would be the point of  parsing it as a URL?

- to detect if domain is in the known tld list
- to create custom URI rules to detect stuff which won't ever be listed but
needs scoring (positive or negative, whatever may apply)
- if it's a new/obscure/frequent URI ending, add a util_rb_2tld entry to allow
SA to parse it as known tld


--- Comment #8 from Sidney Markowitz <si...@sidney.com>  2009-02-07 11:21:32 PST ---
(In reply to comment #6)
Thanks for the offer. However, we already have a table of valid TLDs in
SpamAssassin used in parsing URIs and I don't think the tradeoff of a network
access per parse versus the memory requirements would be worth it in this case.

(in reply to comment #5)
Is the idea to accept anything that begins with "http://" as a URL? I would
like to have some idea as to how many false positives that leads to -- Not FPs
on spam detection, although that is important too, but for this, how many false
identifications of strings as URLs and how many resulting unnecessary calls to
URIRBLs? The reason for the current URI parse code (in trunk -- I'm still
waiting for that one more review and vote to put it in the 3.2 branch) is to
only send to the RBL what are possibly real links.

Which brings up another point. Is health.sharpdecimal as opposed to
health.sharpdecimal.com in the RBLs anyway? If not, what would be the point of
parsing it as a URL?


--- Comment #11 from Kevin Golding <ca...@gmail.com>  2009-02-08 02:46:06 PST ---
It's not about parsing these things as standard URIs, it's about catching the
broken URIs.

Currently we see http://example.comm and run it through the known TLD list
because it looks like a link.  At the moment we would detect that it isn't a valid
URL and simply discard it - the suggestion is that instead of ignoring this
result we add a return code which we can then use to score these broken URIs.

We're basically doing the test already, the suggestion is just a way to flag
these broken URIs instead of ignoring them.


--- Comment #4 from Sidney Markowitz <si...@sidney.com>  2009-02-07 02:58:51 PST ---
It's clear what is a valid URL, and the current code is careful to only match
valid URLs in deciding what to extract. How do you define a URL string that is
not a valid URL? I would like to see a specific definition of what these bad
URLs look like and some indication that they are spam signs.


--- Comment #3 from AXB <al...@gmail.com>  2009-02-07 01:54:37 PST ---
(In reply to comment #2)
> IANA provide a list of valid TLDs
> 
> http://data.iana.org/TLD/tlds-alpha-by-domain.txt
> 
> I have found that the majority of invalid URLs come from incorrectly parsed
> messages rather than borked spam.
> 
> I run a small black list and have (up until this point) always removed the
> invalid URLs.  However this makes me wonder........
> 

SA uses the TLD list in RegistrarBoundaries.pm
plus any user-added util_rb_2tld entries (+ SA 3.3.x's util_rb_3tld)

so anything which is not in there may be worth a low score.

This could trip on corp names like http://workplace.intranet but checking those
"internal" names can be easily bypassed by uridnsbl_skip_domain.

having an URI_INVALID_TLD rule would allow extra play in meta rules.
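
A loose Python model of the layering described in this comment: a built-in TLD
list, user-added entries, and a skip list for internal names. All names here
(BUILTIN_TLDS, USER_ADDED, SKIP_DOMAINS, uri_invalid_tld) are hypothetical, and
the way the sets are combined is an assumption made for illustration. It is not
how RegistrarBoundaries.pm, util_rb_2tld or uridnsbl_skip_domain are actually
implemented.

```python
# Assumptions: illustrative stand-ins for the three sources of knowledge.
BUILTIN_TLDS = {"com", "net", "org"}     # built-in table
USER_ADDED = {"co.uk"}                   # entries a user might add
SKIP_DOMAINS = {"workplace.intranet"}    # internal names to leave alone


def uri_invalid_tld(host):
    """Return True when no trailing part of host is a known suffix."""
    host = host.lower().rstrip(".")
    if host in SKIP_DOMAINS or any(host.endswith("." + d) for d in SKIP_DOMAINS):
        return False  # explicitly skipped, never flagged
    suffixes = BUILTIN_TLDS | USER_ADDED
    labels = host.split(".")
    # Try every trailing suffix: "co.uk", then "uk", and so on.
    return not any(".".join(labels[i:]) in suffixes for i in range(1, len(labels)))
```

With these sets, health.sharpdecimal is flagged while workplace.intranet and
example.co.uk are not, which is the kind of result a meta rule could build on.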


--- Comment #10 from Sidney Markowitz <si...@sidney.com>  2009-02-07 18:35:51 PST ---
(in reply to comment#9)
> if URL is example.comm and ".comm" IS NOT in known tld list return 0

The current parser has the tld list built into its regexp so example.comm
would never get parsed as a URL. At least that's how I remember doing it,
rather than parsing lots of strings that might be URLs and then filtering them
later. The code in 3.2 might be doing it that way.


Karsten Bräckelmann <gu...@rudersport.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Group|security                    |
          Component|Security                    |Libraries
         AssignedTo|security@spamassassin.apach |dev@spamassassin.apache.org
                   |e.org                       |

--- Comment #17 from Karsten Bräckelmann <gu...@rudersport.de> 2010-03-23 17:42:27 UTC ---
Moving back off of Security, which got changed by accident during the mass
Target Milestone move.

--- Comment #6 from Marc Perkel <ma...@perkel.com>  2009-02-07 07:04:41 PST ---
You might find this useful. I created an RBL of TLDs. You can look up:

domain.com.rb.junkemailfilter.com

It returns 127.0.0.1 for single level tlds, 127.0.0.2 for 2 level tlds:

domain.com = 127.0.0.1
domain.co.uk = 127.0.0.2

For unknown you get nothing. So if your lookup returns nothing you know it's
invalid.
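
A small Python sketch of the lookup described above, assuming the zone still
answers as stated in this comment (it may not). The helper names query_name,
interpret and lookup are hypothetical. Note that a failed DNS query and an
unknown TLD look the same to this code, which is one reason to treat the result
with care.

```python
import socket

# Zone name taken from the comment above; availability is an assumption.
ZONE = "rb.junkemailfilter.com"

MEANING = {
    "127.0.0.1": "single-level TLD",
    "127.0.0.2": "two-level TLD",
}


def query_name(domain, zone=ZONE):
    """Build the DNS name to look up, e.g. domain.com.<zone>."""
    return "%s.%s" % (domain.strip(".").lower(), zone)


def interpret(answer):
    """Map an A-record answer (or None for no answer) to a description."""
    if answer is None:
        return "unknown TLD"
    return MEANING.get(answer, "unexpected answer")


def lookup(domain):
    """Perform the live lookup; needs network access."""
    try:
        answer = socket.gethostbyname(query_name(domain))
    except socket.gaierror:
        answer = None  # no record, or the query itself failed
    return interpret(answer)
```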

