You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2008/02/07 14:45:56 UTC

[Bug 5813] New: several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813

           Summary: several TLDs are not parsed by URI text scanner in
                    PerMsgStatus.pm
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: spamassassin
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: sidney@sidney.com


"Schemeless" URI text strings such as example.com which do not begin with http:
or https: or ftp: or mailto:  www. or ftp. are checked to see if they have a
domain that is a valid TLD. That check is done in PerMsgStatus.pm with a regexp
that has a comment saying that it was generated from a list of TLDs in
Util/RegistrarBoundaries.pm

The following URIs are not detected by that regexp, even though they are in the
table in Util/RegistrarBoundaries.pm.

Prefixing these URIs with 'www.' so they are processed through that table does
cause them to be detected:

example.asia
example.cat
example.cv
example.cx
example.cy
example.cz
example.jobs
example.mobi
example.tel
example.travel
example.xxx



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From jeffc@surbl.org  2008-02-08 00:40 -------
IMO using the official TLD table should usually be preferred.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From sidney@sidney.com  2008-02-08 01:10 -------
(reply to comment 4)

> using the official TLD table should usually be preferred

I see I didn't make explicit in comment 2 what I did for the attachment 4257.
That patch does what I talked about in comment 1, i.e., it fixes the table in
Util/RegistrarBoundaries.pm to match the official TLD table at
http://data.iana.org/TLD/tlds-alpha-by-domain.txt with the exception of the IDN
test entries, and it redoes the regexp in PerMsgStatus.pm to be the Regexp::List
optimized form of that table.

However, I should mention the discussion from the sa-dev list in which it was
pointed out that there are some TLDs that are defunct, with no domains or active
registrar. These are bv gb pm sj so um yt which perhaps should be rmoved from
the list because they cannot point to a valid URL or mail address, so could only
cause FPs.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
Attachment #4257 is|0                           |1
           obsolete|                            |




------- Additional Comments From sidney@sidney.com  2008-02-08 14:09 -------
Created an attachment (id=4258)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4258&action=view)
Combined patch against 3.2 for bugs 5780 and 5813

This patch includes the fix for bug 5780 as well as this bug, and removes the
unused TLDs bv gb pm sj so um yt. Please use this to review both bug 5780 and
5813 for branch 3.2.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  BugsThisDependsOn|                            |5780
   Target Milestone|Undefined                   |3.2.5






------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From sidney@sidney.com  2008-02-08 12:53 -------
I have updated trunk to remove bv gb pm sj so um yt from the table and regexp
and to change the test cases to endure that they are not parsed as TLDs.

Committed revision 620000




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From jm@jmason.org  2008-03-05 04:29 -------
testing new BZ instance



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From jm@jmason.org  2008-02-10 15:46 -------
+1, why not ;)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From sidney@sidney.com  2008-02-07 19:30 -------
Created an attachment (id=4257)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=4257&action=view)
Fix both table and regexp to have correct list of TLDs and add all of them to
uri_text.t tests

This has been committed to trunk revision 619753.

I've attached the patch here and I'm targtting this bug to 3.2.5 to make it
easy to apply to branch 3.2 if we decide to apply the fix for bug 5780 to
branch 3.2. If we do not apply 5780 to branch 3.2, this bug will be in the
branch but a new patch against 3.2 would have to be worked up.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From sidney@sidney.com  2008-02-08 02:58 -------
Here is a summary of comments from sa-dev that should be here for the record:

  -----------

Michael Parker said:

Would anyone object to removing .so from this list?  The .so TLD is  
basically dead and we've found that lots of bogus domains like lib*.so  
are being caught by this.  Also sometimes you'll have spammers who are  
putting in gibberish or funny punctuation and you'll get sentences  
like 'blah blah.So this is'.  It also occurs with a couple of other  
domains but .so is by far the worst.

For more info on the .so domain you can read about it here:
http://en.wikipedia.org/wiki/.so_%28domain_name%29

 ------------

To which I replied:

After reading the Wikipedia entries for the TLDs that they list as unused, I
would be in 
favor of removing all of them

  bv gb pm sj so um yt

If spammers can't use them as domains, then all including them could do would be to 
generate false positives.

The three listed as "phaseout", su tp yu, on the other hand, I would keep,
because it 
appears from the Wikipedia articles that they are in active use either during a
transition 
period or because their phaseout is being resisted.

If we can reach consensus on this, I'll make the changes to the code I committed.

 ---------------------

And then Justin Mason said:

I'm happy enough with that. btw this discussion should be on
a bug...


  -----------------
And here we are




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|several TLDs are not parsed |[review] several TLDs are
                   |by URI text scanner in      |not parsed by URI text
                   |PerMsgStatus.pm             |scanner in PerMsgStatus.pm
  Status Whiteboard|                            |needs 2 votes for 3.2




------- Additional Comments From sidney@sidney.com  2008-02-07 19:46 -------
I've decided to flag this for review and voting, but please look at bug 5780
first, as it has to be voted and committed before this one can.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] [review] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813


sidney@sidney.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|needs 2 votes for 3.2       |needs 1 votes for 3.2






------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5813] several TLDs are not parsed by URI text scanner in PerMsgStatus.pm

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5813





------- Additional Comments From sidney@sidney.com  2008-02-07 06:49 -------
Looking at the list I posted in the description I noticed that .xxx is there
even though it is not a valid TLD. That led me to compare the table in
Util/RegistrarBoundaries.pm with the official list in
http://data.iana.org/TLD/tlds-alpha-by-domain.txt

I haven't done a careful diff, but I do notice that we include .cs which is not
an official TLD, as wel as xxx, so the table is not correct. Does anyone have
any objection to my simply snarfing the official list into
Util/RegistrarBoundaries.pm, minus the 11 IDN .test TLDS (Those are the ones
that look like XN--0ZWM56D and are in the root table only for an IDN test that
was recently run)?

Also, the comment in PerMsgStatus says that the regexp was generated using
Regexp::Optimizer, but the Regexp::Optimizer suggests using Regexp::List when
you are dealing with a plain list of words with no metacharacters, which is what
we have here. Does anyone know a reason why we should stick with Regexp::Optimizer?




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.