Posted to dev@spamassassin.apache.org by Simon McCorkindale <si...@unixinside.com> on 2005/08/08 08:50:31 UTC

problems detecting URIs embedded in JIS encoding

Platform: FreeBSD 5.4-RC3
Perl: 5.8.6
SpamAssassin: 3.0.4

I'm a volunteer for the www.rbl.jp project and I think I've come across
a bug in SA. I searched for any previous posts of this bug but couldn't
find anything. I know this isn't the right place to post bugs but I want
to discuss my attempts to fix it.

The problem is that when certain characters from the JIS character set
immediately follow a URI, the URI is not detected properly.

The URL I used for testing is listed in our url.rbl.jp black list and
numerous others. It is http://www.j-*sine.com but with the * removed
(just to make sure this mail gets through the mailing list :-)

If there are any JIS characters immediately following the m at the end
of j-sine.com then what is extracted will be http://www.j-*sine.com
plus a chunk of the JIS characters.

Hence, when SpamAssassin queries url.rbl.jp to see if this URL is
registered it gets a not-registered reply.

I had a hunt through the Perl code and did many test simulations and
managed to track the source of the problem down to PerMsgStatus.pm.
Between lines 1733 and 1745 of this file the regular expressions for
detecting URIs are defined. I'm not a wizard with regular expressions so a
lot of it is over my head.

Using my old friend od I tracked the culprit JIS character down. It
seems to be the ESC (hex 1B) character. I don't know much about JIS but
I'm guessing this is used to define the start of a string of JIS
characters.
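
For reference, in ISO-2022-JP (the usual mail encoding for JIS text) the
sequence ESC $ B switches to the double-byte JIS X 0208 set and ESC ( B
switches back to ASCII. The sketch below builds such a string by hand and
prints its bytes in hex. The host name www.example.com, the helper hex_dump
and the two sample characters are placeholders chosen for illustration, not
data from the real message.

```perl
use strict;
use warnings;

# Hand-written ISO-2022-JP bytes (illustrative sample, not the real spam).
my $to_jis   = "\x1b\x24\x42";       # ESC $ B : switch to JIS X 0208
my $kana     = "\x24\x33\x24\x73";   # two double-byte characters
my $to_ascii = "\x1b\x28\x42";       # ESC ( B : switch back to ASCII

# A URI immediately followed by JIS text, as in the problem report.
my $sample = "http://www.example.com" . $to_jis . $kana . $to_ascii;

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

print hex_dump($to_jis . $kana . $to_ascii), "\n";
# prints: 1b 24 42 24 33 24 73 1b 28 42
```

Note that every byte after the ESC is an ordinary printable ASCII value, so
once the ESC itself is accepted as part of a URI, the bytes that follow it
can look like URI characters too.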

On line 1735 of PerMsgStatus.pm there is the line:

my $unreserved = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1f";

so I modified it to:

my $unreserved = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1a\x1c-\x1f";

so that \x1b isn't included and this seems to have solved the problem.
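
The effect of the change can be sketched with a much simplified pattern.
This is not the regular expression from PerMsgStatus.pm: the values of $mark
and $reserved are assumed here to be the RFC 2396 "mark" and "reserved"
characters, the pattern only handles http:// URIs, and extract_uri and
www.example.com are made up for the illustration.

```perl
use strict;
use warnings;

# Assumptions: RFC 2396 "mark" and "reserved" sets; the real definitions
# in PerMsgStatus.pm may differ.
my $mark     = "-_.!~*'()";
my $reserved = quotemeta(';/?:@&=+$,');

# The character class before and after the change discussed above.
my $with_esc    = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1f";
my $without_esc = "A-Za-z0-9\Q$mark#\E\x00-\x08\x0b\x0c\x0e-\x1a\x1c-\x1f";

# Hypothetical, simplified extractor: "http://" followed by any run of
# characters from the given class plus the reserved characters.
sub extract_uri {
    my ($class, $text) = @_;
    return $text =~ m{(http://[$class$reserved]+)} ? $1 : undef;
}

# Placeholder URI followed directly by ISO-2022-JP bytes (ESC $ B ... ESC ( B).
my $uri  = "http://www.example.com";
my $body = $uri . "\x1b\x24\x42\x24\x33\x24\x73\x1b\x28\x42";

my $before = extract_uri($with_esc,    $body);
my $after  = extract_uri($without_esc, $body);

print length($before), " bytes matched with \\x1b in the class\n";
print length($after),  " bytes matched without it\n";
```

With \x1b in the class the match runs on through the escape sequence and the
JIS bytes; without it the match stops at the end of the host name, which is
the string a blacklist lookup needs.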

I think this is an ugly hack that is probably breaking other stuff or going
against certain rules, but I would like to hear anybody's ideas on
this dilemma.

Thanks in advance,
Simon.





Re: problems detecting URIs embedded in JIS encoding

Posted by Loren Wilton <lw...@earthlink.net>.
> Could you please point this thread at the two bug numbers?  I'd like to 
> target these for a future 3.0.5 bug-fix release, because we are very
> unlikely to be able to upgrade our Enterprise distro to 3.1 in the short to
> medium term.  (I am hoping in the long term to have both RHEL4 and RHEL5 
> on spamassassin-3.1.x after 3.1 has proven itself, but do not count on 
> this as a promise.)

This seems at least conceptually related to the following two closed bugs:

4337
4247


Re: problems detecting URIs embedded in JIS encoding

Posted by Warren Togami <wt...@redhat.com>.
Loren Wilton wrote:
> This is quite similar to two recent bugs that caused similar problems if
> certain ASCII characters immediately followed the URI.  Spammers had
> exploited at least one of those cases.  I don't know what the fix was for
> those bugs, but it may have been similar to the change you propose.
> 
>         Loren

Hi Loren,

Could you please point this thread at the two bug numbers?  I'd like to 
target these for a future 3.0.5 bug-fix release, because we are very
unlikely to be able to upgrade our Enterprise distro to 3.1 in the short to
medium term.  (I am hoping in the long term to have both RHEL4 and RHEL5 
on spamassassin-3.1.x after 3.1 has proven itself, but do not count on 
this as a promise.)

Warren Togami
wtogami@redhat.com

Re: problems detecting URIs embedded in JIS encoding

Posted by Loren Wilton <lw...@earthlink.net>.
This is quite similar to two recent bugs that caused similar problems if
certain ASCII characters immediately followed the URI.  Spammers had
exploited at least one of those cases.  I don't know what the fix was for
those bugs, but it may have been similar to the change you propose.

        Loren