Posted to users@spamassassin.apache.org by Alex <my...@gmail.com> on 2017/12/05 21:25:28 UTC

URI parser problems

Hi, I have the following rule that is used to detect some of the less
common URIs:

uri        URI_RARE_TLD     m;://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$);i
describe   URI_RARE_TLD     URI refers to rarely-nonspam TLD

The problem is that it is hitting patterns that aren't necessarily
URIs. This one matches on ".SPACE"

TIX400 ROH B.W.SPACE SHUTTLE IN

Dec  4 22:14:43.126 [15338] dbg: rules: ran uri rule URI_RARE_TLD
======> got hit: "://B.W.SPACE"
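
To illustrate the hit above, here is a minimal Python sketch of the same
pattern. It assumes the parser hands the rule a canonicalized string such as
"http://B.W.SPACE" (the added scheme is an assumption based on the "://" in
the debug output), and it uses Python's re module rather than SpamAssassin's
Perl regex engine. The name RARE_TLD is only for illustration.

```python
import re

# Same alternation as the URI_RARE_TLD rule; IGNORECASE mirrors the /i flag.
RARE_TLD = re.compile(
    r"://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol"
    r"|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid"
    r"|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$)",
    re.IGNORECASE,
)

# Assumed form of the URI after the parser prepends a scheme to "B.W.SPACE".
hit = RARE_TLD.search("http://B.W.SPACE")
print(hit.group(0) if hit else None)
```

The pattern itself has no way to tell a real hostname from body text that
merely ends in a listed TLD; it only sees what the parser extracted.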

I asked John Hardin, the author of the rule, about this off-list. He wrote
the following and thought I should open it up to the list:

It looks like the parser knows about TLDs, and it's looking for stuff
that looks like hostnames even if there is not a protocol spec. It
would, for example, treat "B.W.com" in the body as a URI. It might be
a bit too eager.

It's possible that the aggressive URI parsing is risky now that IANA
has crapped all over the TLD list and made it a lot harder to
recognize text that looks like valid domain names and hostnames, and
that the consensus would be to open a bug to modify the behavior of
the parser.

Should I submit a bug, or does someone have other suggestions on how
to handle this?

Re: URI parser problems

Posted by "Luis E. Muñoz" <sa...@lem.click>.

On 5 Dec 2017, at 14:59, John Hardin wrote:

> How often would we see a valid registered domain name like "x.info" 
> for example?

This is not as rare as you would think. Those names are more expensive, 
but not insanely so.

https://uniregistry.link/premium-domain-names/

Best regards

-lem

Re: URI parser problems

Posted by Pedro David Marco <pe...@yahoo.com>.


> Perhaps a smaller step that would be useful would be to have the parser
> require the second-level domain name have > 1 character.
> How often would we see a valid registered domain name like "x.info" for example?

Maybe the best way to know whether it is a URI or not is to ask the DNS...
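
A minimal Python sketch of that idea, assuming a plain address lookup is an
acceptable stand-in for "asking the DNS". The function name and the
injectable resolver parameter are hypothetical, and a real filter would need
timeouts and caching, since every candidate hostname costs a lookup.

```python
import socket

def looks_resolvable(hostname, resolver=socket.getaddrinfo):
    """Return True if the resolver finds at least one address for hostname."""
    try:
        return bool(resolver(hostname, None))
    except (socket.gaierror, UnicodeError):
        # Lookup failed, or the name could not be encoded: treat as not a host.
        return False
```

Passing a different resolver makes the check testable without network
access; by default it uses the system resolver via socket.getaddrinfo.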

-------Pedro

Re: URI parser problems

Posted by John Hardin <jh...@impsec.org>.

On Tue, 5 Dec 2017, RW wrote:

> On Tue, 5 Dec 2017 16:25:28 -0500
> Alex wrote:
>
>> Hi, I have the following rule that is used to detect some of the less
>> common URIs:
>>
>> uri        URI_RARE_TLD
>> m;://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$);i
>> describe   URI_RARE_TLD     URI refers to rarely-nonspam TLD
>>
>> The problem is that it is hitting patterns that aren't necessarily
>> URIs. This one matches on ".SPACE"
>>
>> TIX400 ROH B.W.SPACE SHUTTLE IN
> ...
>> Should I submit a bug,
>
> It's been discussed before. Not doing that would mean that spammers
> could just leave off the protocol and avoid URI lists.

That's obviously a nonstarter.

Perhaps a smaller step that would be useful would be to have the parser
require that the second-level domain name have more than one character.

How often would we see a valid registered domain name like "x.info" for 
example?

>> or does someone have other suggestions on how
>> to handle this?
>
> It's a reason to exercise caution in scoring such rules.

Agreed. The rule in question could also require two characters before the
final period, but that doesn't address the underlying issue with recognizing
non-protocol domain names in body text.
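
As a rough sketch of the two ideas above (written with Python's re module
for illustration, not as a tested SpamAssassin rule): the pattern requires
two non-dot characters in the label right before the TLD, and the helper
applies the "more than one character" test to a hostname's second-level
label. The shortened TLD list and the names STRICT_RARE_TLD and
sld_is_plausible are hypothetical.

```python
import re

# Shortened TLD list for readability; the real rule lists many more.
TLDS = r"(?:work|space|club|link|xyz|top)"

# Require two non-dot, non-slash characters right before the final period.
STRICT_RARE_TLD = re.compile(
    r"://[^/]*[^/.]{2}\." + TLDS + r"(?:/|$)", re.IGNORECASE
)

def sld_is_plausible(hostname: str) -> bool:
    """Return True if the second-level label has more than one character.

    Naive: treats the last label as the TLD and ignores multi-label
    public suffixes such as "co.uk".
    """
    labels = hostname.rstrip(".").split(".")
    return len(labels) >= 2 and len(labels[-2]) > 1

for uri in ("http://B.W.SPACE", "http://example.space/"):
    print(uri, bool(STRICT_RARE_TLD.search(uri)))
```

Either check would also reject genuine single-character registrations such
as "x.info".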


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79

Re: URI parser problems

Posted by RW <rw...@googlemail.com>.

On Tue, 5 Dec 2017 16:25:28 -0500
Alex wrote:

> Hi, I have the following rule that is used to detect some of the less
> common URIs:
> 
> uri        URI_RARE_TLD
> m;://[^/]+\.(?:work|space|club|science|pub|red|blue|green|link|ninja|lol|xyz|faith|review|download|top|global|(?:web)?site|tech|party|pro|bid|trade|win|moda|news|online|xxx|health|bot|cw|date)(?:/|$);i
> describe   URI_RARE_TLD     URI refers to rarely-nonspam TLD
> 
> The problem is that it is hitting patterns that aren't necessarily
> URIs. This one matches on ".SPACE"
> 
> TIX400 ROH B.W.SPACE SHUTTLE IN
...
> Should I submit a bug, 

It's been discussed before. Not doing that would mean that spammers
could just leave off the protocol and avoid URI lists.


> or does someone have other suggestions on how
> to handle this?


It's a reason to exercise caution in scoring such rules. It's one of the
reasons why, when I suggested rewriting his rules as metarules, I
suggested this:


meta  ADDR_RARE_TLD     __REPTO_RARE_TLD || __FROM_RARE_TLD

meta  URI_RARE_TLD      __URI_RARE_TLD && !ADDR_RARE_TLD
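
Read as boolean logic, the second meta fires only when the URI sub-rule hits
and neither address sub-rule does. A small Python sketch of that logic (the
function name is illustrative, and this is not how SpamAssassin evaluates
metas internally):

```python
def uri_rare_tld(uri_hit: bool, repto_hit: bool, from_hit: bool) -> bool:
    """Mirror of: URI_RARE_TLD = __URI_RARE_TLD && !ADDR_RARE_TLD."""
    addr_rare_tld = repto_hit or from_hit  # the ADDR_RARE_TLD meta
    return uri_hit and not addr_rare_tld

print(uri_rare_tld(True, False, False))  # URI sub-rule hit only
print(uri_rare_tld(True, False, True))   # URI hit, From also in a rare TLD
```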