Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/07/01 19:11:18 UTC

[Bug 4449] New: .cf files found by get_uri_list

http://bugzilla.spamassassin.org/show_bug.cgi?id=4449

           Summary: .cf files found by get_uri_list
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P3
         Component: spamassassin
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: dallase@nmgi.com


the new URI detection code that finds standalone URIs without an http:// in
front can cause unnecessary lookups...

for example,

 [20028] dbg: uridnsbl: query for sendmail.cf took 3 seconds to look up
(multi.uribl.com.:sendmail.cf)
 [20028] dbg: uridnsbl: query for 70_sc_top200.cf took 3 seconds to look up
(multi.surbl.org.:70_sc_top200.cf)
 [20028] dbg: uridnsbl: query for local.cf took 3 seconds to look up
(multi.uribl.com.:local.cf)
 [20028] dbg: uridnsbl: query for 70_sare_uri.cf took 3 seconds to look up
(multi.uribl.com.:70_sare_uri.cf)

is it worthwhile, or even possible, to remove some ccTLDs that could result in
false positives, like .cf?
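
To make the failure mode concrete, here is a minimal sketch in Python (not SpamAssassin's Perl, and not the actual get_uri_list code). The pattern, the trimmed TLD set, and the function name are all assumptions for illustration. It shows how a scheme-less detector that only checks the suffix against a TLD list will treat config filenames as domains, because "cf" is a real ccTLD.

```python
import re

# Hypothetical, heavily trimmed TLD set for illustration only;
# a real list is far longer. "cf" is a real country-code TLD.
KNOWN_TLDS = {"com", "net", "org", "info", "biz", "cf", "it", "us"}

# Assumed pattern: a name, a dot, then an alphabetic suffix, no scheme required.
CANDIDATE = re.compile(r"\b([a-z0-9][a-z0-9_-]*)\.([a-z]{2,})\b", re.IGNORECASE)

def schemeless_candidates(text):
    """Return lower-cased name.tld strings whose suffix is a known TLD."""
    return [
        f"{m.group(1)}.{m.group(2)}".lower()
        for m in CANDIDATE.finditer(text)
        if m.group(2).lower() in KNOWN_TLDS
    ]

sample = "I edited local.cf and 70_sare_uri.cf, then restarted spamd."
print(schemeless_candidates(sample))
```

Both filenames come back as lookup candidates, which matches the uridnsbl debug lines above.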

you guys are probably going to hate me by the end of the day at this rate  ;)

-d



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From dallase@nmgi.com  2005-07-01 11:16 -------
> However, just because something "could result in
> false positives" isn't enough IMO to drop a TLD.  
> "com" would be an obvious entry in that category 
> for instance.

right, that's why i said ccTLDs..  two-letter country codes are much more
likely to hit random STRING.CC entries due to

1) less chars after period:  2 (cctlds) versus 3-7 (gtlds)
2) number of tlds available: 245 (cctlds) versus 15 (gtlds)

if someone misses the spacebar and the next sentence starts with a common
two-letter word like 'TO', 'IT', 'DO', 'AT', 'SO', or 'US', for example:

"Say hi to them.It will make their day!"

it would not only produce an unneeded DNS query, but it also could produce a FP hit.

# host -tTXT them.it.multi.uribl.com
them.it.multi.uribl.com descriptive text "Listed on [black] - See
http://lookup.uribl.com/?domain=them.it"
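
The query name in the host command above is just the candidate domain prepended to the blocklist zone. A minimal Python sketch of that step, with no network access: the function name is hypothetical, and the pattern is the two-letter-suffix regex quoted later in this thread, translated to Python.

```python
import re

# Two-letter-suffix pattern from this thread, translated to Python syntax.
CCTLD_LIKE = re.compile(r"\b[a-z0-9\-]{2,}\.[a-z][a-z]\b", re.IGNORECASE)

def uribl_query_name(domain, zone="multi.uribl.com"):
    """Join a candidate domain and a blocklist zone into one DNS name.

    Hypothetical helper; it only builds the string and performs no lookup.
    """
    return f"{domain.strip('.').lower()}.{zone.strip('.')}"

sentence = "Say hi to them.It will make their day!"
for candidate in CCTLD_LIKE.findall(sentence):
    # Each printed name is one DNS query a scanner would send to the zone.
    print(uribl_query_name(candidate))
```

The missing space alone is enough to turn ordinary prose into a query for them.it.multi.uribl.com.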

i'm not so much worried about FP's against the URIBL's..  i'm more worried about
unneeded DNS queries.  Imagine a site (like rulesemporium.com) that passes info
about rulesets around via email all day long.

anyhow.. i'm just bringing things up as i see them... so like i said, use
INVALID as necessary 

d ;)








[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From dallase@nmgi.com  2005-07-01 13:36 -------
> I don't think the developers can take into
>  account the possible typos of everyone sending
> mail. 

nor do i expect that.  that was merely an example for demonstration.  why don't 
you run /\b[a-z0-9\-]{2,}\.[a-z][a-z]\b/i through a corpus and see how many 
more 'examples' you can find.  realize [a-z][a-z] is just a simple way to catch 
the 245 ccTLDs and quickly scan a corpus.   i really haven't a clue how many 
unnecessary queries will be produced, but this is why i am bringing this up. 
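
A rough Python sketch of that corpus scan. The regex is the one given above, translated from Perl syntax. The three-message list is a made-up stand-in for a real mail corpus, and the function name is hypothetical.

```python
import re
from collections import Counter

# The regex from the comment above, translated from Perl to Python syntax.
PATTERN = re.compile(r"\b[a-z0-9\-]{2,}\.[a-z][a-z]\b", re.IGNORECASE)

def count_two_letter_suffixes(messages):
    """Tally matches by their two-letter suffix across an iterable of bodies."""
    counts = Counter()
    for body in messages:
        for match in PATTERN.findall(body):
            counts[match.rsplit(".", 1)[1].lower()] += 1
    return counts

# Stand-in corpus (an assumption); a real run would read message bodies from disk.
corpus = [
    "Say hi to them.It will make their day!",
    "Copy local.cf and sendmail.cf to the new box.",
    "No candidates in this one.",
]
print(count_two_letter_suffixes(corpus))
```

Like the original regex, this treats any two-letter suffix as a ccTLD, so it overcounts compared with the real list of 245.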

> Also, if you're worried about a single DNS query

i'm not sure how you came to this assumption?   at uribl.com, we are on the 
other end of the thousands of queries coming in every minute...  so, it is my 
responsibility to be "concerned".  i'm sure jeff would be interested in this as 
well... they do like 50x more queries than we do per minute last i checked.

> , then you probably shouldn't be running with 
> net tests enabled anyway.

i am more than capable of taking care of my boxes, but thanks for the 
suggestion.

> I'd say close as INVALID or WONTFIX. 

and i'm fine with that...  




[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From felicity@apache.org  2005-07-01 10:25 -------
Subject: Re:   New: .cf files found by get_uri_list

On Fri, Jul 01, 2005 at 10:11:18AM -0700, bugzilla-daemon@bugzilla.spamassassin.org wrote:
>  [20028] dbg: uridnsbl: query for 70_sare_uri.cf took 3 seconds to look up
> (multi.uribl.com.:70_sare_uri.cf)
> 
> is it worthwhile, or possible to remove some ccTLD's that could result in false
> positives like .cf ?

Hrm.  IMO, we should aim for the top ones that get used (com, net, org, info,
biz) at least.

However, just because something "could result in false positives"
isn't enough IMO to drop a TLD.  "com" would be an obvious entry in that
category for instance.






[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From maddenj+spamassassin@skynet.ie  2005-07-01 12:05 -------
> How about the possibility of either an include list or an exclude list that
> could exist in local.cf or similar that would list the tlds to include or
> exclude?  Or maybe a little less formal than tlds, just a list of strings to
> match against the tail of the putative url to include or exclude from
> consideration.
> 
> Could probably start with com/biz/info/us in an include list, and let it
> grow from there.
> 
You can still get caught for random DNS queries :
"Check this out.Net is really slow"
"SpamAssassin 3.1 released.Info from the usual places" 
etc. etc.

I don't think the developers can take into account the possible typos of
everyone sending mail. Also, if you're worried about a single DNS query, then
you probably shouldn't be running with net tests enabled anyway.

I'd say close as INVALID or WONTFIX. 






[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From Bob@Menschel.net  2005-07-14 00:10 -------
Triage: possibly related to bug 4395




[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From lwilton@earthlink.net  2005-07-01 11:36 -------
Subject: Re:  .cf files found by get_uri_list

Hum.  I wonder if there is any better way to detect naked urls than what is
being done?  I suppose if there were smarter code, smarter spammers would
find a creative way around it.  :-(

How about the possibility of either an include list or an exclude list that
could exist in local.cf or similar that would list the tlds to include or
exclude?  Or maybe a little less formal than tlds, just a list of strings to
match against the tail of the putative url to include or exclude from
consideration.

Could probably start with com/biz/info/us in an include list, and let it
grow from there.
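
A minimal Python sketch of the include-list idea, assuming the four starting suffixes named above. The pattern, set name, and function name are hypothetical, and SpamAssassin has no such local.cf option as far as this thread shows.

```python
import re

# Starting include list suggested above; the variable name is an assumption.
INCLUDE_TLDS = {"com", "biz", "info", "us"}

# Assumed scheme-less candidate pattern, for illustration only.
CANDIDATE = re.compile(r"\b[a-z0-9-]{2,}\.[a-z]{2,}\b", re.IGNORECASE)

def filtered_candidates(text, include=INCLUDE_TLDS):
    """Keep only candidates whose final label is on the include list."""
    hits = []
    for match in CANDIDATE.findall(text):
        tld = match.rsplit(".", 1)[1].lower()
        if tld in include:
            hits.append(match.lower())
    return hits

print(filtered_candidates("Edit local.cf, then visit example.com"))
print(filtered_candidates("SpamAssassin 3.1 released.Info from the usual places"))
```

The first call drops local.cf and keeps example.com. The second still yields released.info, so an include list narrows the problem without removing it, which is the objection raised elsewhere in this thread.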






[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4449


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |WONTFIX




------- Additional Comments From felicity@apache.org  2006-08-01 15:19 -------
Ok, nothing's really changed with the ticket in a while, and though we still see
the same behavior, there are still really no suggestions about a good way of
fixing the issue short of not doing queries for domains parsed from the message
text, which isn't a good solution.

So I'm going to close this as WONTFIX as discussed in the ticket.




[Bug 4449] .cf files found by get_uri_list

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4449





------- Additional Comments From maddenj+spamassassin@skynet.ie  2005-07-01 15:43 -------
Subject: Re:  .cf files found by get_uri_list

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On (01/07/05 13:36), bugzilla-daemon@bugzilla.spamassassin.org didst pronounce:
> > Also, if you're worried about a single DNS query
> 
> i'm not sure how you came to this assumption?   at uribl.com, we are on the 
> other end of the thousands of queries coming in every minute...  so, it is my 
> responsibility to be "concerned".  i'm sure jeff would be interested in this as 
> well... they do like 50x more queries than we do per minute last i checked.
> 
My apologies. I didn't know you were on the receiving end of all these
queries! Though, I still can't think of an easy way to solve it. 

- -- 
Chat ya later,

John.
- --
BOFH excuse #24: network packets travelling uphill (use a carrier pigeon)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFCxckoQBw+ZtKOvTIRAhuqAJ9LJpsmS9aY6O/SKQfSsT9Mr7L0QwCeMcck
qsy71f8hZOi1F1rYtD86MkQ=
=Gfzj
-----END PGP SIGNATURE-----




