Posted to dev@spamassassin.apache.org by Sidney Markowitz <si...@sidney.com> on 2009/11/04 02:42:06 UTC

Re: Non-Roman characters in TLDs and domain names

[This is a repost excerpted from two messages I sent to the list. I just
discovered that my email settings were left incorrect after I recovered
from a hard disk crash. I apologize for the redundancy if the other two
messages are just stuck instead of lost and you end up seeing them.]

I'm bringing this up on dev list to get some discussion of the technical
issues involved before opening a Bugzilla issue for it.

News of an ICANN decision to allow international character
sets in domain names was reported last week, for example, in this article:

  http://www.voanews.com/english/2009-10-30-voa14.cfm

The article doesn't have much technical detail, but does say that there 
will be new TLDs "by the end of the year" which is less than two months 
away.

I'm concerned that it might have a big impact on SpamAssassin's parsing
of headers and URLs.

Further digging found this:

http://idn.icann.org/E-mail_test

which seems to imply that email will use the A-label encoding of IDN for
email addresses. A-label encoding converts charset-encoded characters
into ASCII strings drawn from the letters a through z, the digits 0
through 9, and the hyphen, with a prefix of "xn--". As far as I can tell
from the examples, there will be new TLDs that will have to be A-label
encoded.

I think this means that there will not need to be a major change to
SpamAssassin regarding parsing of headers in which A-label encoding is
required. Where we now have routines that check for valid TLDs by
looking for .com, .org, .us, .kr, etc., we will simply have to add some
new TLDs to the list. They will still be specific fixed ASCII strings;
it's just that there will be new TLDs that look like ".xn--deba0ad".
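
As a concrete illustration, here is a minimal sketch of that A-label
conversion, assuming the CPAN module Net::IDN::Encode is installed (one
of several punycode options, not necessarily what SpamAssassin would
adopt):

  #!/usr/bin/perl
  # Minimal sketch (not SpamAssassin code): convert a Unicode (U-label)
  # domain to its ASCII-compatible A-label form.
  use strict;
  use warnings;
  use utf8;                                  # UTF-8 literals in this file
  use Net::IDN::Encode qw(domain_to_ascii);  # assumed installed from CPAN

  my $unicode_domain = '例え.テスト';        # IDN test domain used below
  my $alabel = domain_to_ascii($unicode_domain);

  # Each label is punycode-encoded separately and prefixed with "xn--",
  # so the new TLDs are still fixed ASCII strings a rule can match.
  binmode STDOUT, ':utf8';
  print "$unicode_domain => $alabel\n";      # e.g. xn--r8jz45g.xn--zckzah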

However, what does this mean for detecting URLs in plain text messages
in which a URL string can be in a non-ASCII charset and MUAs might 
(eventually) parse them as URLs?

   -- sidney


Re: Non-Roman characters in TLDs and domain names

Posted by Mark Martinec <Ma...@ijs.si>.
Sidney,

> News of an ICANN decision to allow international character
> sets in domain names was reported last week, for example

IDN and punycode have been around for a while below the TLD level, but so far
the few TLDs were only for testing. We came across it in:

  http://marc.info/?t=123928717600002

> I'm concerned that it might have a big impact on SpamAssassin's parsing
> of headers and URLs.

It is quite possible there is still some too-strict regexp lying
around. I know I fixed some in the DKIM plugin.

> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?

Slippery road ahead...
Can't hurt to open a PR as a placeholder for concerns and ideas.

  Mark

Re: Non-Roman characters in TLDs and domain names

Posted by Benny Pedersen <me...@junc.org>.
On Wed 04 Nov 2009 04:20:25 CET, Warren Togami wrote
> http://日本語.テスト/
> Did your MUA turn that into a clickable link?
>
> Thunderbird	yes
> Evolution	no
> GMail		yes
> Squirrelmail	no
> Roundcubemail	no

Horde IMP	yes


-- 
xpoint


Re: Non-Roman characters in TLDs and domain names

Posted by Benny Pedersen <me...@junc.org>.
On Wed 04 Nov 2009 06:21:17 CET, Sidney Markowitz wrote

> http://例え.テスト/

works

> http://例え.テスト/メインページ

works partly; the subdirectory part is not made into a link


-- 
xpoint


Re: Non-Roman characters in TLDs and domain names

Posted by Warren Togami <wt...@redhat.com>.
On 11/04/2009 12:34 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 7:17 PM:
>> My point was lost here. I pasted these URLs as an example of what the
>> SpamAssassin URI parser might see without decoding
>
> Let's see if I understand this correctly: The message consists of a
> sequence of bytes which encode characters in a certain charset. When
> host names and domain names were restricted to 7-bit ASCII then they
> could be parsed out by SpamAssassin by looking at the raw bytes without
> regard to the charset. Now we would have to convert the entire byte
> stream from the raw bytes to wide characters according to the charset of
> the message before we could be sure to parse and handle text URLs
> correctly. Does that sum it up?
>
> I haven't paid attention to the issue of charset encoding and wide
> characters. How much are we getting away with assuming that most emails
> are in one-byte character codes or at least in codes that represent the
> ASCII set as one byte and so we can just apply rules to the raw byte
> strings and it works most of the time? How badly does SpamAssassin fall
> down if mail is encoded in a charset that violates that assumption?
>
>> Whether a link is clickable today is not relevant. MUAs and browsers
>> will adapt in the future to support these international TLDs.
>
> It is relevant to what we should handle right now versus what we plan to
> handle in the future when MUAs are changed.
>
> -- sidney
>

What SpamAssassin handles right now is fine.  Punycode domain names 
(without the soon-to-be-ratified IDN TLDs) are rare because clients do 
not support them.

I thought this thread was about the future, after clients begin 
supporting IDN TLDs.

Warren

Re: Non-Roman characters in TLDs and domain names

Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 7:17 PM:
> My point was lost here.  I pasted these URLs as an example of what the 
> SpamAssassin URI parser might see without decoding

Let's see if I understand this correctly: The message consists of a 
sequence of bytes which encode characters in a certain charset. When 
host names and domain names were restricted to 7-bit ASCII then they 
could be parsed out by SpamAssassin by looking at the raw bytes without 
regard to the charset. Now we would have to convert the entire byte 
stream from the raw bytes to wide characters according to the charset of 
the message before we could be sure to parse and handle text URLs 
correctly. Does that sum it up?

I haven't paid attention to the issue of charset encoding and wide 
characters. How much are we getting away with assuming that most emails 
are in one-byte character codes or at least in codes that represent the 
ASCII set as one byte and so we can just apply rules to the raw byte 
strings and it works most of the time? How badly does SpamAssassin fall 
down if mail is encoded in a charset that violates that assumption?
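
To make the decode-first question concrete, here is a minimal sketch
using only core Perl modules; the crude URI regex is a stand-in for
illustration, not SpamAssassin's actual parser:

  #!/usr/bin/perl
  # Minimal sketch: decode raw ISO-2022-JP bytes into characters before
  # looking for a URL.  Core modules only; the regex is illustrative.
  use strict;
  use warnings;
  use Encode qw(decode);

  # The ISO-2022-JP example from this thread, with the unprintable ESC
  # bytes (\x{1b}) written out explicitly.
  my $raw = "http://\x{1b}\$BF|K\\8l\x{1b}(B.\x{1b}\$B%F%9%H\x{1b}(B/";

  # An ASCII-oriented scanner working on the raw bytes stops at the
  # first ESC byte; after decoding, the same scan crosses the non-ASCII
  # labels without trouble.
  my $text = decode('iso-2022-jp', $raw);

  binmode STDOUT, ':utf8';
  if ($text =~ m{(https?://\S+)}) {
      print "found in decoded text: $1\n";   # http://日本語.テスト/
  }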

> Whether a link is clickable today is not relevant.  MUAs and browsers 
> will adapt in the future to support these international TLDs.

It is relevant to what we should handle right now versus what we plan to 
handle in the future when MUAs are changed.

  -- sidney


Re: Non-Roman characters in TLDs and domain names

Posted by Warren Togami <wt...@redhat.com>.
On 11/04/2009 12:21 AM, Sidney Markowitz wrote:
>
>> The following examples are not correct, but they demonstrate the problem:
>>
>> ASCII without decoding the domain sent as UTF-8
>> http://日本語.テスト/
>>
>> ASCII without decoding the domain sent as ISO-2022-JP
>> http://$BF|K\8l(B.$B%F%9%H(B/
>
> My Thunderbird interprets only the first one completely as a URL string;
> the second one ends at the pipe character, making it useless for a
> spammer. The first one is clickable, but I don't see that Firefox, at

My point was lost here.  I pasted these URLs as an example of what the 
SpamAssassin URI parser might see without decoding.  The above two 
examples are http://日本語.テスト/ in two common encodings of Japanese 
e-mail.  Since they are not decoded by SpamAssassin, they might become 
two different punycode strings and two different URIBL lookups.  This is 
why we may need to always decode before punycode encoding.
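
A sketch of that point, assuming Net::IDN::Encode from CPAN for the
punycode step: the same domain arriving in two encodings collapses to
one lookup key only if both are decoded first.

  #!/usr/bin/perl
  # Sketch: the same domain sent as UTF-8 and as ISO-2022-JP collapses
  # to a single lookup key only if both are decoded before punycode
  # flattening.  Assumes Net::IDN::Encode (CPAN); illustrative only.
  use strict;
  use warnings;
  use Encode qw(decode);
  use Net::IDN::Encode qw(domain_to_ascii);

  # 日本語.テスト as raw bytes in two encodings.
  my $utf8_bytes = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e.\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88";
  my $jis_bytes  = "\x{1b}\$BF|K\\8l\x{1b}(B.\x{1b}\$B%F%9%H\x{1b}(B";

  my $key1 = domain_to_ascii(decode('utf-8',       $utf8_bytes));
  my $key2 = domain_to_ascii(decode('iso-2022-jp', $jis_bytes));

  # Without the decode step the two byte streams would never compare
  # equal, and a URIBL would be queried with two unrelated strings.
  print $key1 eq $key2 ? "same URIBL key: $key1\n"
                       : "different keys: $key1 vs $key2\n";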

>
> Can you show me the equivalent for the following URL, which is a real
> site? That way we can easily answer the question "If the MUA makes it a
> hot link, is it a link that works?"

Whether a link is clickable today is not relevant.  MUAs and browsers 
will adapt in the future to support these international TLDs.  Prominent 
clients like Thunderbird and Gmail already make them clickable today.  I 
suspect the other clients don't make them clickable today only because 
these are unknown TLDs, or because they don't recognize non-ASCII 
domains as valid URIs.  Yet.

Warren Togami
wtogami@redhat.com

Re: Non-Roman characters in TLDs and domain names

Posted by Sidney Markowitz <si...@sidney.com>.
Sidney Markowitz wrote, On 4/11/09 6:21 PM:
> http://例え.テスト/
> 
> And what about this?
> 
> http://例え.テスト/メインページ

This is interesting... With Thunderbird and Firefox, if I click on those 
links Firefox ends up sticking in a "www." prefix and saying that it 
can't find the server. But if I right-click, copy the link, and paste it 
into Firefox then it works fine.

  -- sidney


Re: Non-Roman characters in TLDs and domain names

Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 4:20 PM:
> http://日本語.テスト/
> 
> Did your MUA turn that into a clickable link?
> 
> Thunderbird	yes
> Evolution	no
> GMail		yes
> Squirrelmail	no
> Roundcubemail	no

To me this says that SpamAssassin should see it as a URL and check the 
domain when doing URIBL testing, especially if Outlook and/or Outlook 
Express parse it as a URL. But only if clicking on the hot link in the 
MUA results in a browser actually going to the site. In other words, if 
it is useful to spammers in getting people to go to the URL, then we 
should recognize it as a URL.

> The following examples are not correct, but they demonstrate the problem:
> 
> ASCII without decoding the domain sent as UTF-8
> http://日本語.テスト/
> 
> ASCII without decoding the domain sent as ISO-2022-JP
> http://$BF|K\8l(B.$B%F%9%H(B/

My Thunderbird interprets only the first one completely as a URL string; 
the second one ends at the pipe character, making it useless for a 
spammer. The first one is clickable, but I don't see that Firefox, at 
least, is interpreting it as the proper domain, though I don't 
understand the encodings well enough to say what it is doing.

Can you show me the equivalent for the following URL, which is a real 
site? That way we can easily answer the question "If the MUA makes it a 
hot link, is it a link that works?"

http://例え.テスト/

And what about this?

http://例え.テスト/メインページ

  -- sidney



Re: Non-Roman characters in TLDs and domain names

Posted by Greg Troxel <gd...@ir.bbn.com>.
Warren Togami <wt...@redhat.com> writes:

> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird	yes
> Evolution	no
> GMail		yes
> Squirrelmail	no
> Roundcubemail	no

Gnus (trunk, Emacs 22): yes (and it looked OK), but when handed to
Firefox it did not work ("the URL is not valid and cannot be loaded").


Re: Non-Roman characters in TLDs and domain names

Posted by Per Jessen <pe...@computer.org>.
Warren Togami wrote:

> On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
>> Warren Togami wrote, On 4/11/09 3:27 PM:
>>> It seems clear that we will need to flatten/encode any URI domain to
>>> punycode for URIBL lookups.
>>
>> I agree with that -- if something has non-ASCII characters then
>> punycode is the canonical form to use to look it up.
>>
>>> The unclear part is whether we will need to decode URIs prior to
>>> punycode encoding. I suspect we will be forced to decode.
>>
>> I'm not sure exactly what you mean, but the big issue that I see is
>> how to determine that a string is a URL (where it starts and where it
>> stops) that needs to be encoded to punycode. Is that what you are
>> talking about? The rule of thumb that I used when working on code to
>> extract URLs from plain text is that if some common MUA hot links it,
>> then we want to treat it as a URL. Perhaps the answer is to wait
>> until MUAs support these URLs and then follow that rule of thumb.
>>
>> -- sidney
> 
> http://日本語.テスト/
> 
> Did your MUA turn that into a clickable link?
> 
> Thunderbird   yes
> Evolution     no
> GMail         yes
> Squirrelmail  no
> Roundcubemail no

KNode	yes


/Per Jessen, Zürich


Re: Non-Roman characters in TLDs and domain names

Posted by Warren Togami <wt...@redhat.com>.
On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 3:27 PM:
>> It seems clear that we will need to flatten/encode any URI domain to
>> punycode for URIBL lookups.
>
> I agree with that -- if something has non-ASCII characters then punycode
> is the canonical form to use to look it up.
>
>> The unclear part is whether we will need to decode URIs prior to punycode
>> encoding. I suspect we will be forced to decode.
>
> I'm not sure exactly what you mean, but the big issue that I see is how
> to determine that a string is a URL (where it starts and where it stops)
> that needs to be encoded to punycode. Is that what you are talking
> about? The rule of thumb that I used when working on code to extract
>> URLs from plain text is that if some common MUA hot links it, then we
> want to treat it as a URL. Perhaps the answer is to wait until MUAs
> support these URLs and then follow that rule of thumb.
>
> -- sidney

http://日本語.テスト/

Did your MUA turn that into a clickable link?

Thunderbird	yes
Evolution	no
GMail		yes
Squirrelmail	no
Roundcubemail	no

Yes, it might be hard to figure out the beginning and end of a URL 
without decoding the entire message.  Determining whether we can do it 
without full body decoding will be an important first step before 
deciding what else we will do.
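
One possible shape for that first step, sketched under the assumption
that the declared MIME charsets can be trusted: decode only the text
parts, each according to its own Content-Type charset, rather than the
whole raw message.  This uses the CPAN module Email::MIME purely for
illustration; it is not a proposal for SpamAssassin's internals.

  #!/usr/bin/perl
  # Sketch: decode each text part by its declared charset instead of
  # decoding the entire message.  Assumes Email::MIME (CPAN); whether
  # declared charsets can be trusted in spam is an open question.
  use strict;
  use warnings;
  use Email::MIME;

  my $raw  = do { local $/; <STDIN> };       # whole message on stdin
  my $mail = Email::MIME->new($raw);

  binmode STDOUT, ':utf8';
  $mail->walk_parts(sub {
      my ($part) = @_;
      return if $part->subparts;             # only leaf parts have bodies
      return unless ($part->content_type || '') =~ m{^text/}i;
      my $text = eval { $part->body_str } // return;  # decoded per charset
      while ($text =~ m{(https?://\S+)}g) {
          print "candidate URL: $1\n";
      }
  });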

I suspect we will be forced to decode arbitrary encodings before 
punycode flattening because URI domains can be encoded in different ways.

The following examples are not correct, but they demonstrate the problem:

ASCII without decoding the domain sent as UTF-8
http://日本語.テスト/

ASCII without decoding the domain sent as ISO-2022-JP
http://$BF|K\8l(B.$B%F%9%H(B/

Both of these strings are the same domain name.  But if they are not 
decoded before punycode flattening, they will be queried as different 
strings in the URIBL lookup.

Warren Togami
wtogami@redhat.com

Re: Non-Roman characters in TLDs and domain names

Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 3:27 PM:
> It seems clear that we will need to flatten/encode any URI domain to 
> punycode for URIBL lookups.

I agree with that -- if something has non-ASCII characters then punycode 
is the canonical form to use to look it up.

> The unclear part is whether we will need to decode URIs prior to punycode 
> encoding.  I suspect we will be forced to decode.

I'm not sure exactly what you mean, but the big issue that I see is how 
to determine that a string is a URL (where it starts and where it stops) 
that needs to be encoded to punycode. Is that what you are talking 
about? The rule of thumb that I used when working on code to extract 
URLs from plain text is that if some common MUA hot links it, then we 
want to treat it as a URL. Perhaps the answer is to wait until MUAs 
support these URLs and then follow that rule of thumb.

  -- sidney

Re: Non-Roman characters in TLDs and domain names

Posted by Warren Togami <wt...@redhat.com>.
On 11/03/2009 08:42 PM, Sidney Markowitz wrote:
>
> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?
>

It seems clear that we will need to flatten/encode any URI domain to 
punycode for URIBL lookups.

http://search.cpan.org/search?query=punycode&mode=all
There are several Punycode-handling libraries on CPAN.  We might be best 
off standardizing on a particular one so URIBLs can use the same 
methodology for encoding their punycode listings.
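
For comparison purposes, a hedged sketch of the difference between raw
Punycode (RFC 3492), which encodes a single label with no "xn--" prefix,
and a whole-domain IDNA ToASCII conversion; it assumes the
Net::IDN::Punycode and Net::IDN::Encode modules from CPAN:

  #!/usr/bin/perl
  # Sketch of raw Punycode vs. full IDNA ToASCII; assumes the CPAN
  # modules Net::IDN::Punycode and Net::IDN::Encode are installed.
  use strict;
  use warnings;
  use utf8;
  use Net::IDN::Punycode qw(encode_punycode);
  use Net::IDN::Encode   qw(domain_to_ascii);

  binmode STDOUT, ':utf8';

  # Raw punycode of a single label: no "xn--" prefix, no dot handling.
  print encode_punycode('テスト'), "\n";         # e.g. zckzah

  # Whole-domain conversion: per-label encoding plus the prefix.
  print domain_to_ascii('日本語.テスト'), "\n";  # e.g. xn--wgv71a119e.xn--zckzah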

The unclear part is whether we will need to decode URIs prior to punycode 
encoding.  I suspect we will be forced to decode.  Why?

* Encoding punycode from binary garbage input might be poorly defined 
and unstandardized?
* Some SpamAssassin-using sites decode everything by preference while 
most others do not decode.  This means you could be querying URIBLs 
with two different flattened punycode strings?

Please correct me if my understanding is incorrect.

Warren Togami
wtogami@redhat.com