Posted to commits@spamassassin.apache.org by jq...@apache.org on 2014/06/17 21:59:50 UTC
svn commit: r1603281 - in /spamassassin/trunk: build/README
lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm t/uri_text.t
Author: jquinn
Date: Tue Jun 17 19:59:50 2014
New Revision: 1603281
URL: http://svn.apache.org/r1603281
Log:
Updated the TLD listing, added a better process for updating the TLD list in the future, and updated tests to account for the new TLDs and the changes to the update process
Modified:
spamassassin/trunk/build/README
spamassassin/trunk/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm
spamassassin/trunk/t/uri_text.t
Modified: spamassassin/trunk/build/README
URL: http://svn.apache.org/viewvc/spamassassin/trunk/build/README?rev=1603281&r1=1603280&r2=1603281&view=diff
==============================================================================
--- spamassassin/trunk/build/README (original)
+++ spamassassin/trunk/build/README Tue Jun 17 19:59:50 2014
@@ -66,6 +66,13 @@ SPAMASSASSIN RELEASE PROCEDURE
(ie., no "M" or "C" files; any files marked "M" have been locally
modified, and should be "svn revert"ed before you continue.)
+- consider updating the TLD list in
+ Mail/SpamAssassin/Util/RegistrarBoundaries.pm
+
+ Follow the documentation under %VALID_TLDS and $VALID_TLDS_RE for
+ updating the TLD list, make test, and do a commit if there are any
+ changes from the previous TLD list
+
- edit lib/Mail/SpamAssassin.pm and comment the $IS_DEVEL_BUILD
line. Ensure the correct version number is present in $VERSION
and @EXTRA_VERSION.
Modified: spamassassin/trunk/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm
URL: http://svn.apache.org/viewvc/spamassassin/trunk/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm?rev=1603281&r1=1603280&r2=1603281&view=diff
==============================================================================
--- spamassassin/trunk/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm (original)
+++ spamassassin/trunk/lib/Mail/SpamAssassin/Util/RegistrarBoundaries.pm Tue Jun 17 19:59:50 2014
@@ -35,10 +35,15 @@ use vars qw (
@ISA %TWO_LEVEL_DOMAINS %THREE_LEVEL_DOMAINS %US_STATES %VALID_TLDS $VALID_TLDS_RE
);
+# %VALID_TLDS
# The list of currently-valid TLDs for the DNS system.
#
# When updating domain lists, also modify t/uri_text.t accordingly
#
+# bash line to generate a formatted list of domains
+# Fetches domains, drops the top comment line, then joins domains with spaces in between
+# wget http://data.iana.org/TLD/tlds-alpha-by-domain.txt -O - | tail -n+2 | perl -e 'chomp && s/$/ / && print lc while <>' && echo
+#
# http://data.iana.org/TLD/tlds-alpha-by-domain.txt
# Version 2008020601, Last Updated Thu Feb 7 09:07:00 2008 UTC
# The following have been removed from the list because they are
@@ -56,37 +61,77 @@ use vars qw (
# Remember to also change regexp below when updating!
foreach (qw/
- ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az
- ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bw by bz ca cat cc
- cd cf cg ch ci ck cl club cm cn co com coop cr cu cv cw cx cy cz de dj dk dm
- do dz ec edu ee eg er es et eu fi fj fk fm fo fr ga gd ge gf gg gh
- gi gl gm gn gov gp gq gr gs gt gu gw gy hk hm hn hr ht hu id ie il im
- in info int io iq ir is it je jm jo jobs jp ke kg kh ki km kn kp kr kw
- ky kz la lb lc li lk lr ls lt lu lv ly ma mc md me mg mh mil mk ml mm
- mn mo mobi mp mq mr ms mt mu museum mv mw mx my mz na name nc ne net
- nf ng ni nl no np nr nu nz om org pa pe pf pg ph pk pl pm pn pr pro ps
- pt pw py qa re ro rs ru rw sa sb sc sd se sg sh si sk sl sm sn so
- sr st su sv sx sy sz tc td tel tf tg th tj tk tl tm tn to tp tr travel tt
- tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xxx ye yt za
- zm zw
+ac academy accountants actor ad ae aero af ag agency ai airforce al am an ao aq ar archi army arpa as asia associates
+at attorney au audio autos aw ax axa az ba bar bargains bayern bb bd be beer berlin best bf bg bh bi bid bike bio biz
+bj black blackfriday blue bm bn bo boutique br bs bt build builders buzz bv bw by bz ca cab camera camp capital cards
+care career careers cash cat catering cc cd center ceo cf cg ch cheap christmas church ci citic ck cl claims cleaning
+clinic clothing club cm cn co codes coffee college cologne com community company computer condos construction
+consulting contractors cooking cool coop country cr credit creditcard cruises cu cv cw cx cy cz dance dating de degree
+democrat dental dentist desi diamonds digital directory discount dj dk dm dnp do domains dz ec edu education ee eg
+email engineer engineering enterprises equipment er es estate et eu eus events exchange expert exposed fail farm
+feedback fi finance financial fish fishing fitness fj fk flights florist fm fo foo foundation fr frogans fund furniture
+futbol ga gal gallery gb gd ge gf gg gh gi gift gives gl glass global globo gm gmo gn gop gov gp gq gr graphics gratis
+gripe gs gt gu guide guitars guru gw gy hamburg haus hiphop hiv hk hm hn holdings holiday homes horse host house hr ht
+hu id ie il im immobilien in industries info ink institute insure int international investments io iq ir is it je jetzt
+jm jo jobs jp juegos kaufen ke kg kh ki kim kitchen kiwi km kn koeln kp kr kred kw ky kz la land lawyer lb lc lease li
+life lighting limited limo link lk loans london lr ls lt lu luxe luxury lv ly ma maison management mango market
+marketing mc md me media meet menu mg mh miami mil mk ml mm mn mo mobi moda moe monash mortgage moscow motorcycles mp
+mq mr ms mt mu museum mv mw mx my mz na nagoya name navy nc ne net neustar nf ng nhk ni ninja nl no np nr nu nyc nz
+okinawa om onl org organic pa paris partners parts pe pf pg ph photo photography photos pics pictures pink pk pl
+plumbing pm pn post pr press pro productions properties ps pt pub pw py qa qpon quebec re recipes red rehab reise
+reisen ren rentals repair report republican rest reviews rich rio ro rocks rodeo rs ru ruhr rw ryukyu sa saarland sb
+sc schule scot sd se services sexy sg sh shiksha shoes si singles sj sk sl sm sn so social software sohu solar
+solutions soy space sr st su supplies supply support surgery sv sx sy systems sz tattoo tax tc td technology tel tf
+tg th tienda tips tirol tj tk tl tm tn to today tokyo tools town toys tp tr trade training travel tt tv tw tz ua ug
+uk university uno us uy uz va vacations vc ve vegas ventures versicherung vet vg vi viajes villas vision vn vodka
+vote voting voto voyage vu wang watch webcam website wed wf wien wiki works ws wtc wtf xn--3bst00m xn--3ds443g
+xn--3e0b707e xn--45brj9c xn--4gbrim xn--55qw42g xn--55qx5d xn--6frz82g xn--6qq986b3xl xn--80adxhks xn--80ao21a
+xn--80asehdb xn--80aswg xn--90a3ac xn--c1avg xn--cg4bki xn--clchc0ea0b2g2a9gcd xn--czr694b xn--czru2d xn--d1acj3b
+xn--fiq228c5hs xn--fiq64b xn--fiqs8s xn--fiqz9s xn--fpcrj9c3d xn--fzc2c9e2c xn--gecrj9c xn--h2brj9c xn--i1b6b1a6a2e
+xn--io0a7i xn--j1amh xn--j6w193g xn--kprw13d xn--kpry57d xn--l1acc xn--lgbbat1ad8j xn--mgb9awbf xn--mgba3a4f16a
+xn--mgbaam7a8h xn--mgbab2bd xn--mgbayh7gpa xn--mgbbh1a71e xn--mgbc0a9azcg xn--mgberp4a5d4ar xn--mgbx4cd0ab xn--ngbc5azd
+xn--nqv7f xn--nqv7fs00ema xn--o3cw4h xn--ogbpf8fl xn--p1ai xn--pgbs0dh xn--q9jyb4c xn--rhqv96g xn--s9brj9c xn--ses554g
+xn--unup4y xn--wgbh1c xn--wgbl6a xn--xkc2al3hye2a xn--xkc2dl3a5ee0h xn--yfro4i67o xn--ygbi2ammx xn--zfr164b xxx xyz
+yachts ye yokohama yt za zm zone zw
/) {
$VALID_TLDS{$_} = 1;
}
+# $VALID_TLDS_RE
# %VALID_TLDS as Regexp::List optimized regexp, for use in Plugins etc
-# Paste above list to:
-# perl -MRegexp::List -e '$/=undef; $_=<>; $r = Regexp::List->new; push @l, $_ for (split); print $r->list2re(@l)'
+# bash line to generate regex from TLD list
+# Fetches domains, drops the top comment line, builds a regex from the list of domains, then formats it to remove (?-xsim:) regex modifier flags
+# wget http://data.iana.org/TLD/tlds-alpha-by-domain.txt -O - | tail -n+2 | perl -MRegexp::List -e '$/=undef; $_=<>; $r = Regexp::List->new; push @l, $_ for (split); print $r->list2re(@l)' | perl -pe 's/^\(\?[^:]*:(.*)\)$/$1/' && echo
# Verified up to date 20120401
$VALID_TLDS_RE = qr/
- (?=[abcdefghijklmnopqrstuvwxyz])
- (?:a(?:e(?:ro)?|r(?:pa)?|s(?:ia)?|[cdfgilmnoqtuwxz])|b(?:iz?|[abdefghjmnorstwyz])
- |c(?:at?|o(?:m|op)?|(?:l(?:ub)?)|[cdfghikmnruvwxyz])|d[ejkmoz]|e(?:[cegrst]|d?u)|f[ijkmor]
- |g(?:[adefghilmnpqrstuwy]|ov)|h[kmnrtu]|i(?:n(?:fo|t)?|[delmoqrst])|j(?:o(?:bs)?|[emp])
- |k[eghimnprwyz]|l[abcikrstuvy]|m(?:o(?:bi)?|u(?:seum)?|[acdeghkmnpqrstvwxyz]|i?l)
- |n(?:a(?:me)?|et?|[cfgilopruz])|o(?:m|rg)|p(?:ro?|[aefghklmnstwy])|r[eosuw]
- |s[abcdeghiklmnortuvxyz]|t(?:r(?:avel)?|[cdfghjkmnoptvwz]|e?l)|u[agksyz]
- |v[aceginu]|w[fs]|y[et]|z[amw]|qa|xxx
- )/ix;
+(?:X(?:N--(?:MGB(?:A(?:(?:3A4F16|YH7GP)A|AM7A8H|B2BD)|ERP4A5D4AR|C0A9AZCG|BH1A71E|X4CD0AB|9AWBF)|F(?:IQ(?:(?:228C5H|
+S8|Z9)S|64B)|PCRJ9C3D|ZC2C9E2C)|C(?:LCHC0EA0B2G2A9GCD|ZR(?:694B|U2D)|G4BKI|1AVG)|(?:(?:GEC|H2B)RJ9|Q9JYB4|90A3A)C|
+80A(?:S(?:EHDB|WG)|DXHKS|O21A)|N(?:QV7F(?:S00EMA)?|GBC5AZD)|3(?:E0B707E|BST00M|DS443G)|XKC2(?:DL3A5EE0H|AL3HYE2A)|
+Y(?:FRO4I67O|GBI2AMMX)|6(?:QQ986B3XL|FRZ82G)|I(?:1B6B1A6A2E|O0A7I)|L(?:GBBAT1AD8J|1ACC)|(?:D1ACJ3|ZFR164)B|O(?:GBPF8FL|
+3CW4H)|S(?:9BRJ9C|ES554G)|4(?:5BRJ9C|GBRIM)|J(?:6W193G|1AMH)|55Q(?:W42G|X5D)|KPR(?:W13|Y57)D|P(?:GBS0DH|1AI)|WGB(?:H1C|
+L6A)|RHQV96G|UNUP4Y)|XX|YZ)|C(?:[CDFGKMNUVWXYZ]|O(?:N(?:S(?:TRUCTION|ULTING)|(?:TRACTOR|DO)S)|M(?:P(?:UTER|ANY)|MUNITY)?|
+(?:L(?:LEG|OGN)|FFE)E|O(?:[LP]|KING)|UNTRY|DES)?|A(?:R(?:E(?:ERS?)?|DS)|T(?:ERING)?|M(?:ERA|P)|PITAL|SH|B)?|L(?:(?:EAN|
+OTH)ING|AIMS|INIC|UB)?|R(?:EDIT(?:CARD)?|UISES)?|H(?:RISTMAS|URCH|EAP)?|E(?:NTER|O)|I(?:TIC)?)|S(?:[BDGJKLMNRTVXZ]|
+O(?:L(?:UTIONS|AR)|FTWARE|CIAL|HU|Y)?|U(?:PP(?:L(?:IES|Y)|ORT)|RGERY)?|E(?:RVICES|XY)?|H(?:IKSHA|OES)?|C(?:HULE|OT)?|
+A(?:ARLAND)?|I(?:NGLES)?|Y(?:STEMS)?|PACE)|M(?:[CDGHKLMNPQRSTVWXYZ]|O(?:(?:RTGAG)?E|TORCYCLES|NASH|SCOW|BI|DA)?|
+A(?:N(?:AGEMENT|GO)|RKET(?:ING)?|ISON)?|E(?:DIA|ET|NU)?|I(?:AMI|L)|U(?:SEUM)?)|P(?:[EFGKMNSWY]|R(?:O(?:(?:DUCTION|
+PERTIE)S)?|ESS)?|A(?:R(?:T(?:NER)?|I)S)?|H(?:OTO(?:GRAPHY|S)?)?|I(?:C(?:TURE)?S|NK)|L(?:UMBING)?|(?:OS)?T|UB)|
+A(?:[DFLMNOQWZ]|C(?:COUNTANTS|ADEMY|TOR)?|S(?:SOCIATES|IA)?|R(?:CHI|MY|PA)?|U(?:DIO|TOS)?|I(?:RFORCE)?|T(?:TORNEY)?|
+G(?:ENCY)?|E(?:RO)?|XA?)|F(?:[JM]|I(?:NANC(?:IAL|E)|SH(?:ING)?|TNESS)?|U(?:RNITURE|TBOL|ND)|L(?:IGHTS|ORIST)|O(?:UNDATION|
+O)?|(?:EEDBAC)?K|R(?:OGANS)?|A(?:IL|RM))|B(?:[BDFGHJMNRSTVWYZ]|A(?:R(?:GAINS)?|YERN)?|L(?:ACK(?:FRIDAY)?|UE)|
+U(?:ILD(?:ERS)?|ZZ)|E(?:RLIN|ER|ST)?|I(?:[DOZ]|KE)?|O(?:UTIQUE)?)|G(?:[BDEFGHNPQSTWY]|R(?:A(?:PHIC|TI)S|IPE)?|U(?:I(?:TARS|
+DE)|RU)?|L(?:OB(?:AL|O)|ASS)?|A(?:L(?:LERY)?)?|I(?:VES|FT)?|O[PV]|MO?)|E(?:[CEGR]|N(?:GINEER(?:ING)?|TERPRISES)|
+X(?:P(?:OSED|ERT)|CHANGE)|(?:QUIPMEN)?T|DU(?:CATION)?|S(?:TATE)?|VENTS|MAIL|US?)|T(?:[CDFGHJKLMNPTVWZ]|O(?:(?:OL|Y)S|
+DAY|KYO|WN)?|R(?:A(?:INING|VEL|DE))?|I(?:ENDA|ROL|PS)|E(?:CHNOLOGY|L)|A(?:TTOO|X))|R(?:[SW]|E(?:P(?:UBLICAN|AIR|ORT)|
+(?:CIPE|VIEW)S|N(?:TALS)?|ISEN?|HAB|ST|D)?|O(?:CKS|DEO)?|I(?:CH|O)|U(?:HR)?|YUKYU)|V(?:[CGNU]|E(?:(?:NTURE|GA)S|
+RSICHERUNG|T)?|O(?:T(?:[EO]|ING)|YAGE|DKA)|I(?:(?:AJE|LLA)S|SION)?|A(?:CATIONS)?)|D(?:[JKMZ]|E(?:NT(?:IST|AL)|MOCRAT|
+GREE|SI)?|I(?:RECTORY|AMONDS|SCOUNT|GITAL)|A(?:TING|NCE)|O(?:MAINS)?|NP)|L(?:[BCKRSTVY]|I(?:M(?:ITED|O)|GHTING|FE|NK)?|
+U(?:X(?:URY|E))?|A(?:WYER|ND)?|O(?:NDON|ANS)|EASE)|I(?:[DELOQRST]|N(?:(?:VESTMENT|DUSTRIE)S|T(?:ERNATIONAL)?|S(?:TITUT|
+UR)E|FO|K)?|M(?:MOBILIEN)?)|H(?:[KMNRTU]|O(?:L(?:DINGS|IDAY)|[RU]SE|MES|ST)|A(?:MBURG|US)|I(?:PHOP|V))|W(?:E(?:B(?:SITE|
+CAM)|D)|A(?:TCH|NG)|I(?:EN|KI)|(?:ORK)?S|T[CF]|F)|N(?:[FGLOPRUZ]|A(?:GOYA|ME|VY)?|E(?:USTAR|T)?|I(?:NJA)?|Y?C|HK)|
+K(?:[EGHMPWYZ]|I(?:TCHEN|WI|M)?|(?:AUFE|OEL)?N|R(?:ED)?)|J(?:[MP]|E(?:TZT)?|O(?:BS)?|UEGOS)|U(?:[AGKSYZ]|N(?:IVERSITY|
+O))|O(?:RG(?:ANIC)?|KINAWA|NL|M)|Y(?:[ET]|OKOHAMA|ACHTS)|Q(?:UEBEC|PON|A)|Z(?:[AMW]|ONE))
+/ix;
# Two-Level TLDs
#
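The two one-liners documented in the comments above can be split into small, offline-testable pieces. The following is only a sketch under stated assumptions, not the committed tooling: the function names format_tld_list and strip_regex_wrapper are hypothetical, awk and sed stand in for the perl steps, and unlike the committed one-liner the list formatter emits no trailing space. The Regexp::List step itself is not reproduced here.

```shell
#!/bin/sh
# Sketch only: both function names are hypothetical and are not part of
# SpamAssassin. Each function reads stdin, so it can be tested without a network.

# Read an IANA-style TLD file (the first line is a "# Version ..." comment),
# drop that line, lowercase every entry, and join the entries with single
# spaces on one output line.
format_tld_list() {
    awk 'NR > 1 && NF { printf "%s%s", sep, tolower($1); sep = " " }
         END { print "" }'
}

# Remove the outer "(?FLAGS:" ... ")" wrapper that Perl adds when a compiled
# regex is printed, leaving only the pattern body. This mirrors the
# substitution s/^\(\?[^:]*:(.*)\)$/$1/ from the comment above, using sed -E.
strip_regex_wrapper() {
    sed -E 's/^\(\?[^:]*:(.*)\)$/\1/'
}

# Example (needs network access, so it is left commented out):
#   wget -q -O - http://data.iana.org/TLD/tlds-alpha-by-domain.txt | format_tld_list
```

Keeping the fetch separate from the formatting makes it possible to check the formatting against a saved copy of the IANA file before pasting the result into %VALID_TLDS.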
Modified: spamassassin/trunk/t/uri_text.t
URL: http://svn.apache.org/viewvc/spamassassin/trunk/t/uri_text.t?rev=1603281&r1=1603280&r2=1603281&view=diff
==============================================================================
--- spamassassin/trunk/t/uri_text.t (original)
+++ spamassassin/trunk/t/uri_text.t Tue Jun 17 19:59:50 2014
@@ -126,7 +126,7 @@ foo.Cahl1goo.php !Cahl1goo
www5.mi1coozu.php !mi1coozu
www.mezeel0P.php !mezeel0P
bar.neih6fee.com.php !neih6fee
-www.zai6Vuwi.com.bar !zai6Vuwi
+www.zai6Vuwi.com.blah !zai6Vuwi
=www.deiJ1pha.com www.deiJ1pha.com
@www.Te0xohxu.com www.Te0xohxu.com
@@ -194,13 +194,13 @@ WWW.Kiox3phi.nz WWW.Kiox3phi.nz
WWW.jong3Xou.cn WWW.jong3Xou.cn
WWW.waeShoe0.tw WWW.waeShoe0.tw
-invalid_ltd.foo !invalid_tld
-invalid_ltd.bar !invalid_tld
+invalid_ltd.notword !invalid_tld
+invalid_ltd.blah !invalid_tld
invalid_ltd.xyzzy !invalid_tld
invalid_ltd.co.zz !invalid_tld
-www.invalid_ltd.foo !invalid_tld
-www.invalid_ltd.bar !invalid_tld
+www.invalid_ltd.notword !invalid_tld
+www.invalid_ltd.blah !invalid_tld
www.invalid_ltd.xyzzy !invalid_tld
www.invalid_ltd.co.zz !invalid_tld
@@ -289,7 +289,7 @@ donotignorethiswww.delimtest14.com donot
# the inactive TLDs have negative checks
# first confirm that it will not match on not a TLD
-example.foo !^http://example.foo$
+example.blah !^http://example.blah$
example.zzf !^http://example.zzf$
example.ac ^http://example.ac$
@@ -329,7 +329,7 @@ example.bo ^http://example.bo$
example.br ^http://example.br$
example.bs ^http://example.bs$
example.bt ^http://example.bt$
-example.bv !^http://example.bv$
+example.bv ^http://example.bv$
example.bw ^http://example.bw$
example.by ^http://example.by$
example.bz ^http://example.bz$
@@ -375,7 +375,7 @@ example.fm ^http://example.fm$
example.fo ^http://example.fo$
example.fr ^http://example.fr$
example.ga ^http://example.ga$
-example.gb !^http://example.gb$
+example.gb ^http://example.gb$
example.gd ^http://example.gd$
example.ge ^http://example.ge$
example.gf ^http://example.gf$
@@ -509,7 +509,7 @@ example.se ^http://example.se$
example.sg ^http://example.sg$
example.sh ^http://example.sh$
example.si ^http://example.si$
-example.sj !^http://example.sj$
+example.sj ^http://example.sj$
example.sk ^http://example.sk$
example.sl ^http://example.sl$
example.sm ^http://example.sm$
@@ -565,7 +565,7 @@ example.zw ^http://example.zw$
# with www. prefix tests a different table of TLDs
-www.example.foo !^http://www.example.foo$
+www.example.foo ^http://www.example.foo$
www.example.zzf !^http://www.example.zzf$
www.example.ac ^http://www.example.ac$
@@ -605,7 +605,7 @@ www.example.bo ^http://www.example.bo$
www.example.br ^http://www.example.br$
www.example.bs ^http://www.example.bs$
www.example.bt ^http://www.example.bt$
-www.example.bv !^http://www.example.bv$
+www.example.bv ^http://www.example.bv$
www.example.bw ^http://www.example.bw$
www.example.by ^http://www.example.by$
www.example.bz ^http://www.example.bz$
@@ -651,7 +651,7 @@ www.example.fm ^http://www.example.fm$
www.example.fo ^http://www.example.fo$
www.example.fr ^http://www.example.fr$
www.example.ga ^http://www.example.ga$
-www.example.gb !^http://www.example.gb$
+www.example.gb ^http://www.example.gb$
www.example.gd ^http://www.example.gd$
www.example.ge ^http://www.example.ge$
www.example.gf ^http://www.example.gf$
@@ -785,7 +785,7 @@ www.example.se ^http://www.example.se$
www.example.sg ^http://www.example.sg$
www.example.sh ^http://www.example.sh$
www.example.si ^http://www.example.si$
-www.example.sj !^http://www.example.sj$
+www.example.sj ^http://www.example.sj$
www.example.sk ^http://www.example.sk$
www.example.sl ^http://www.example.sl$
www.example.sm ^http://www.example.sm$