Posted to dev@spamassassin.apache.org by Sidney Markowitz <si...@sidney.com> on 2009/11/04 02:42:06 UTC
Re: Non-Roman characters in TLDs and domain names
[This is a repost excerpted from two messages I sent to the list. I just
discovered that my email settings were left incorrect after I recovered
from a hard disk crash. I apologize for the redundancy if the other two
messages are just stuck instead of lost and you end up seeing them.]
I'm bringing this up on dev list to get some discussion of the technical
issues involved before opening a Bugzilla issue for it.
News of an ICANN decision to allow international character
sets in domain names was reported last week, for example, in this article:
http://www.voanews.com/english/2009-10-30-voa14.cfm
The article doesn't have much technical detail, but does say that there
will be new TLDs "by the end of the year" which is less than two months
away.
I'm concerned that it might have a big impact on SpamAssassin's parsing
of headers and URLs.
Further digging found this:
http://idn.icann.org/E-mail_test
which seems to imply that email will use the A-label encoding of IDN for
email addresses, which converts charset-encoded characters into ASCII
strings drawn from letters, digits, and the hyphen character, with a
prefix of "xn--". As far as I can tell from the examples there
will be new TLDs that will have to be A-label encoded.
I think this means that there will not need to be a major change to
SpamAssassin regarding parsing of headers in which A-label encoding is
required. Where we now have routines that check for valid TLDs by
looking for .com, .org, .us, .kr, etc., we will simply have to add some
new TLDs to the list. They will still be specific fixed ASCII strings;
it's just that there will be new TLDs that look like ".xn--deba0ad".
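As a concrete sketch of the A-label form (Python shown purely for illustration; SpamAssassin itself is Perl, where CPAN modules offer the same conversion):

```python
# Sketch: converting a Unicode (U-label) domain to its ASCII A-label form.
# Python's built-in "idna" codec applies the IDNA 2003 ToASCII operation
# to each dot-separated label.
domain = "日本語.テスト"  # the Japanese test domain from idn.icann.org

alabel = domain.encode("idna").decode("ascii")
print(alabel)  # a fixed ASCII string, each label here prefixed with "xn--"

# The conversion round-trips, so a TLD list only needs the ASCII forms.
assert alabel.encode("ascii").decode("idna") == domain
```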
However, what does this mean for detecting URLs in plain text messages
in which a URL string can be in a non-ASCII charset and MUAs might
(eventually) parse them as URLs?
-- sidney
Re: Non-Roman characters in TLDs and domain names
Posted by Mark Martinec <Ma...@ijs.si>.
Sidney,
> News of an ICANN decision to allow international character
> sets in domain names was reported last week, for example
IDN and punycode have been around for a while below the TLD level, but
so far the few IDN TLDs were only for testing. We came across it in:
http://marc.info/?t=123928717600002
> I'm concerned that it might have a big impact on SpamAssassin's parsing
> of headers and URLs.
It is quite possible there are still some too-strict regexps lying
around. I know I fixed some in the DKIM plugin.
> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?
Slippery road ahead...
Can't hurt to open a PR as a placeholder for concerns and ideas.
Mark
Re: Non-Roman characters in TLDs and domain names
Posted by Benny Pedersen <me...@junc.org>.
On Wed 04 Nov 2009 04:20:25 CET, Warren Togami wrote
> http://日本語.テスト/
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no
horde imp yes
--
xpoint
Re: Non-Roman characters in TLDs and domain names
Posted by Benny Pedersen <me...@junc.org>.
On Wed 04 Nov 2009 06:21:17 CET, Sidney Markowitz wrote
> http://例え.テスト/
works
> http://例え.テスト/メインページ
works partly; the subdir is not a link
--
xpoint
Re: Non-Roman characters in TLDs and domain names
Posted by Warren Togami <wt...@redhat.com>.
On 11/04/2009 12:34 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 7:17 PM:
>> My point was lost here. I pasted these URL's as an example of what the
>> spamassassin URI parser might see without decoding
>
> Let's see if I understand this correctly: The message consists of a
> sequence of bytes which encode characters in a certain charset. When
> host names and domain names were restricted to 7-bit ASCII then they
> could be parsed out by SpamAssassin by looking at the raw bytes without
> regard to the charset. Now we would have to convert the entire byte
> stream from the raw bytes to wide characters according to the charset of
> the message before we could be sure to parse and handle text URLs
> correctly. Does that sum it up?
>
> I haven't paid attention to the issue of charset encoding and wide
> characters. How much are we getting away with assuming that most emails
> are in one-byte character codes or at least in codes that represent the
> ASCII set as one byte and so we can just apply rules to the raw byte
> strings and it works most of the time? How badly does SpamAssassin fall
> down if mail is encoded in a charset that violates that assumption?
>
>> Clickable link today is not relevant. MUA and browsers in the future
>> will adapt to support these international TLD's.
>
> It is relevant to what we should handle right now versus what we plan to
> handle in the future when MUAs are changed.
>
> -- sidney
>
What spamassassin handles right now is fine. Punycode domain names
(without the soon-to-be-ratified IDN TLDs) are rare because clients do
not support them.
I thought this thread was about the future, after clients begin
supporting IDN TLDs.
Warren
Re: Non-Roman characters in TLDs and domain names
Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 7:17 PM:
> My point was lost here. I pasted these URL's as an example of what the
> spamassassin URI parser might see without decoding
Let's see if I understand this correctly: The message consists of a
sequence of bytes which encode characters in a certain charset. When
host names and domain names were restricted to 7-bit ASCII then they
could be parsed out by SpamAssassin by looking at the raw bytes without
regard to the charset. Now we would have to convert the entire byte
stream from the raw bytes to wide characters according to the charset of
the message before we could be sure to parse and handle text URLs
correctly. Does that sum it up?
I haven't paid attention to the issue of charset encoding and wide
characters. How much are we getting away with assuming that most emails
are in one-byte character codes or at least in codes that represent the
ASCII set as one byte and so we can just apply rules to the raw byte
strings and it works most of the time? How badly does SpamAssassin fall
down if mail is encoded in a charset that violates that assumption?
> Clickable link today is not relevant. MUA and browsers in the future
> will adapt to support these international TLD's.
It is relevant to what we should handle right now versus what we plan to
handle in the future when MUAs are changed.
-- sidney
Re: Non-Roman characters in TLDs and domain names
Posted by Warren Togami <wt...@redhat.com>.
On 11/04/2009 12:21 AM, Sidney Markowitz wrote:
>
>> The following examples are not correct, but they demonstrate the problem:
>>
>> ASCII without decoding the domain sent as UTF-8
>> http://日本語.テスト/
>>
>> ASCII without decoding the domain sent as ISO-2022-JP
>> http://$BF|K\8l(B.$B%F%9%H(B/
>
> My Thunderbird only interprets the first one completely as a URL string,
> the second one it ends at the pipe character, making it useless for a
> spammer. The first one is clickable, but I don't see that Firefox, at
My point was lost here. I pasted these URL's as an example of what the
spamassassin URI parser might see without decoding. The above two
examples are http://日本語.テスト/ in two common encodings of Japanese
e-mail. Since they are not decoded by spamassassin, they might become
two different punycode strings and two different URIBL lookups. This is
why we may need to always decode before punycode encoding.
>
> Can you show me the equivalent for the following URL, which is a real
> site? That way we can easily answer the question "If the MUA makes it a
> hot link, is it a link that works?"
Whether a link is clickable today is not relevant. MUAs and browsers in
the future will adapt to support these international TLDs. Prominent
clients like Thunderbird and GMail today already make them clickable. I
suspect the other clients don't make them clickable today only because
these are unknown TLDs or they don't recognize non-ASCII domains as
valid URIs. Yet.
Warren Togami
wtogami@redhat.com
Re: Non-Roman characters in TLDs and domain names
Posted by Sidney Markowitz <si...@sidney.com>.
Sidney Markowitz wrote, On 4/11/09 6:21 PM:
> http://例え.テスト/
>
> And what about this?
>
> http://例え.テスト/メインページ
This is interesting... With Thunderbird and Firefox, if I click on those
links Firefox ends up sticking in a "www." prefix and saying that it
can't find the server. But if I right-click, copy the link, and paste it
into Firefox then it works fine.
-- sidney
Re: Non-Roman characters in TLDs and domain names
Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 4:20 PM:
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no
To me this says that SpamAssassin should see it as a URL and check the
domain when doing URBL testing. Especially if Outlook and/or Outlook
Express parse it as a URL. But only if clicking on the hot link in the
MUA results in a browser actually going to the site. In other words, if
it is useful to spammers to get people to go to the URL then we should
recognize it as a URL.
> The following examples are not correct, but they demonstrate the problem:
>
> ASCII without decoding the domain sent as UTF-8
> http://日本語.テスト/
>
> ASCII without decoding the domain sent as ISO-2022-JP
> http://$BF|K\8l(B.$B%F%9%H(B/
My Thunderbird interprets only the first one completely as a URL string;
the second one it cuts off at the pipe character, making it useless for a
spammer. The first one is clickable, but I don't see that Firefox, at
least, is interpreting it as the proper domain. Except that I don't
understand the encodings enough to say what it is doing.
Can you show me the equivalent for the following URL, which is a real
site? That way we can easily answer the question "If the MUA makes it a
hot link, is it a link that works?"
http://例え.テスト/
And what about this?
http://例え.テスト/メインページ
-- sidney
Re: Non-Roman characters in TLDs and domain names
Posted by Greg Troxel <gd...@ir.bbn.com>.
Warren Togami <wt...@redhat.com> writes:
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no
gnus (trunk, emacs22): yes (and it looked ok), but when handed to
Firefox it did not work (url is not valid and cannot be loaded).
Re: Non-Roman characters in TLDs and domain names
Posted by Per Jessen <pe...@computer.org>.
Warren Togami wrote:
> On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
>> Warren Togami wrote, On 4/11/09 3:27 PM:
>>> It seems clear that we will need to flatten/encode any URI domain to
>>> punycode for URIBL lookups.
>>
>> I agree with that -- if something has non-ASCII characters then
>> punycode is the canonical form to use to look it up.
>>
>>> The unclear part is if we will need to decode URI's prior to
>>> punycode encoding. I suspect we will be forced to decode.
>>
>> I'm not sure exactly what you mean, but the big issue that I see is
>> how to determine that a string is a URL (where it starts and where it
>> stops) that needs to be encoded to punycode. Is that what you are
>> talking about? The rule of thumb that I used when working on code to
>> extract URLs from plain text is that if some common MUA hot-links it,
>> then we want to treat it as a URL. Perhaps the answer is to wait
>> until MUAs support these URLs and then follow that rule of thumb.
>>
>> -- sidney
>
> http://日本語.テスト/
>
> Did your MUA turn that into a clickable link?
>
> Thunderbird yes
> Evolution no
> GMail yes
> Squirrelmail no
> Roundcubemail no
knode yes.
/Per Jessen, Zürich
Re: Non-Roman characters in TLDs and domain names
Posted by Warren Togami <wt...@redhat.com>.
On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
> Warren Togami wrote, On 4/11/09 3:27 PM:
>> It seems clear that we will need to flatten/encode any URI domain to
>> punycode for URIBL lookups.
>
> I agree with that -- if something has non-ASCII characters then punycode
> is the canonical form to use to look it up.
>
>> The unclear part is if we will need to decode URI's prior to punycode
>> encoding. I suspect we will be forced to decode.
>
> I'm not sure exactly what you mean, but the big issue that I see is how
> to determine that a string is a URL (where it starts and where it stops)
> that needs to be encoded to punycode. Is that what you are talking
> about? The rule of thumb that I used when working on code to extract
> URLs from plain text is that if some common MUA hot-links it, then we
> want to treat it as a URL. Perhaps the answer is to wait until MUAs
> support these URLs and then follow that rule of thumb.
>
> -- sidney
http://日本語.テスト/
Did your MUA turn that into a clickable link?
Thunderbird yes
Evolution no
GMail yes
Squirrelmail no
Roundcubemail no
Yes, it might be hard to figure out the beginning and end of a URL
without decoding the entire message. Determining whether we can do it
without full body decoding will be an important first step before
deciding what else we will do.
I suspect we will be forced to decode arbitrary encodings before
punycode flattening because URI domains can be encoded in different ways.
The following examples are not correct, but they demonstrate the problem:
ASCII without decoding the domain sent as UTF-8
http://日本語.テスト/
ASCII without decoding the domain sent as ISO-2022-JP
http://$BF|K\8l(B.$B%F%9%H(B/
Both of these strings are the same domain name, but if they are not
decoded before punycode flattening they will be queried as different
strings in the URIBL lookup.
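That can be sketched as follows (Python purely for illustration; the ISO-2022-JP bytes include the ESC characters that the raw mail body would carry):

```python
# Sketch: the same domain arriving in two common Japanese mail encodings.
# The raw byte strings differ, but decoding each with its charset and then
# flattening to punycode yields a single canonical URIBL lookup key.
utf8_bytes = "日本語.テスト".encode("utf-8")
jis_bytes = b"\x1b$BF|K\\8l\x1b(B.\x1b$B%F%9%H\x1b(B"  # ISO-2022-JP

assert utf8_bytes != jis_bytes  # the raw bytes are different strings

u1 = utf8_bytes.decode("utf-8")
u2 = jis_bytes.decode("iso-2022-jp")
assert u1 == u2  # after charset decoding, it is one and the same domain

canonical = u1.encode("idna").decode("ascii")  # one punycode lookup key
print(canonical)
```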
Warren Togami
wtogami@redhat.com
Re: Non-Roman characters in TLDs and domain names
Posted by Sidney Markowitz <si...@sidney.com>.
Warren Togami wrote, On 4/11/09 3:27 PM:
> It seems clear that we will need to flatten/encode any URI domain to
> punycode for URIBL lookups.
I agree with that -- if something has non-ASCII characters then punycode
is the canonical form to use to look it up.
> The unclear part is if we will need to decode URI's prior to punycode
> encoding. I suspect we will be forced to decode.
I'm not sure exactly what you mean, but the big issue that I see is how
to determine that a string is a URL (where it starts and where it stops)
that needs to be encoded to punycode. Is that what you are talking
about? The rule of thumb that I used when working on code to extract
URLs from plain text is that if some common MUA hot-links it, then we
want to treat it as a URL. Perhaps the answer is to wait until MUAs
support these URLs and then follow that rule of thumb.
-- sidney
Re: Non-Roman characters in TLDs and domain names
Posted by Warren Togami <wt...@redhat.com>.
On 11/03/2009 08:42 PM, Sidney Markowitz wrote:
>
> However, what does this mean for detecting URLs in plain text messages
> in which a URL string can be in a non-ASCII charset and MUAs might
> (eventually) parse them as URLs?
>
It seems clear that we will need to flatten/encode any URI domain to
punycode for URIBL lookups.
http://search.cpan.org/search?query=punycode&mode=all
There are some Punycode handling libs on CPAN. We might be best off
standardizing on a particular library so URIBLs can use the same
methodology for encoding their punycode listings.
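A minimal sketch of how the flattening could feed a URIBL query (Python purely for illustration; the zone name multi.example-uribl.org is hypothetical, not a real blocklist):

```python
# Sketch: building a URIBL-style DNS query name from an IDN domain.
# The domain is flattened to its punycode (A-label) form first, so that
# every site queries the blocklist with the same canonical ASCII key.
def uribl_query_name(domain: str, zone: str = "multi.example-uribl.org") -> str:
    """Return the DNS name to look up for `domain` in the given zone."""
    alabel = domain.encode("idna").decode("ascii")  # canonical ASCII form
    return alabel + "." + zone

print(uribl_query_name("日本語.テスト"))
```

An A-record hit on that name (typically in 127.0.0.x) would then mean the domain is listed, regardless of which charset the message used.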
The unclear part is whether we will need to decode URIs prior to
punycode encoding. I suspect we will be forced to decode. Why?
* Encoding punycode with binary garbage input might be poorly defined
and unstandardized?
* Some spamassassin-using sites decode everything by preference while
most others do not decode. This means you could be querying URIBLs
with two different flattened punycode strings?
Please correct me if my understanding is incorrect.
Warren Togami
wtogami@redhat.com