Posted to users@spamassassin.apache.org by Chris Conn <cc...@abacom.com> on 2010/06/01 17:56:02 UTC

Malformed UTF-8 character

Hello,

I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I 
occasionally get this error;

Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xc3) in pattern match (m//) at 
/var/lib/spamassassin/3.003001/updates_spamassassin_org/72_active.cf, 
rule __HUSH_HUSH, line 1, <GEN272> line 528.

Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
immediately after start byte 0xe9) in pattern match (m//) at 
/var/lib/spamassassin/3.003001/updates_spamassassin_org/72_active.cf, 
rule __HUSH_HUSH, line 1, <GEN24> line 462.
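
For anyone decoding the message: in UTF-8, a start byte of 0xc3 must be followed by one continuation byte and 0xe9 by two, and continuation bytes always lie in the range 0x80-0xbf, so 0x00 can never be one. A small sketch of that check follows, in Python purely for illustration. The function names are hypothetical and nothing here comes from SpamAssassin or Perl.

```python
def expected_continuations(start_byte):
    """Return how many continuation bytes a UTF-8 start byte calls for.

    Returns None when the byte is not a valid multibyte start byte.
    """
    if 0xC2 <= start_byte <= 0xDF:
        return 1  # two-byte sequence
    if 0xE0 <= start_byte <= 0xEF:
        return 2  # three-byte sequence
    if 0xF0 <= start_byte <= 0xF4:
        return 3  # four-byte sequence
    return None


def is_continuation(byte):
    """Continuation bytes have the bit pattern 10xxxxxx (0x80-0xBF)."""
    return 0x80 <= byte <= 0xBF


def decodes_as_utf8(data):
    """Report whether a byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With these helpers, the byte pair from the first warning (0xc3 followed by 0x00) fails the continuation check, while 0xc3 followed by 0xa9 passes it.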

I built a rpm using rpmbuild on the system in question, is my 
installation broken?  I have found similar instances in previous versions

http://wiki.apache.org/spamassassin/RedHatMalformedUtf8
http://www.gossamer-threads.com/lists/spamassassin/users/100450

mostly old stuff.

What can I check to correct this?

Thanks in advance,

C.




Re: Malformed UTF-8 character

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2010-06-01 at 12:31 -0400, Michael Scheidell wrote:
> I believe the minimum recommended is 5.8.8  with 5.10.1 STRONGLY 
> recommended.
> 
> there was even talk on this list of disabling anything under 5.8.8, or 
> strongly warning against it.
> 
> (and I think there was some talk about requiring 5.10.1.)

No, this definitely has never been considered.

Since 3.3.0 the SA team has dropped /official/ support for Perl 5.6. That
means we do not guarantee it will continue to work with 5.6, which it
currently does. We are even open to accepting Perl 5.6-specific patches, if
provided by the community. However, we are unlikely to fix any issues
with 5.6 ourselves.

The discussion and decision to drop official Perl 5.6 support was hard
enough already.


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Malformed UTF-8 character

Posted by Michael Scheidell <sc...@secnap.net>.
On 6/1/10 11:56 AM, Chris Conn wrote:
> Hello,
>
> I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I 
> occasionally get this error;
>
I believe the minimum recommended is 5.8.8  with 5.10.1 STRONGLY 
recommended.

there was even talk on this list of disabling anything under 5.8.8, or 
strongly warning against it.

(and I think there was some talk about requiring 5.10.1.)

-- 
Michael Scheidell, CTO
Phone: 561-999-5000, x 1259
SECNAP Network Security Corporation

    * Certified SNORT Integrator
    * 2008-9 Hot Company Award Winner, World Executive Alliance
    * Five-Star Partner Program 2009, VARBusiness
    * Best Anti-Spam Product 2008, Network Products Guide
    * King of Spam Filters, SC Magazine 2008


Re: Malformed UTF-8 character

Posted by John Hardin <jh...@impsec.org>.
On Tue, 1 Jun 2010, Chris Conn wrote:

> John Hardin wrote:
>>  On Tue, 1 Jun 2010, Chris Conn wrote:
>> 
>> >  I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I 
>> >  occasionally get this error;
>> > 
>> >  Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
>> >  immediately after start byte 0xc3) in pattern match (m//) at 
>> >  /var/lib/spamassassin/3.003001/updates_spamassassin_org/72_active.cf, 
>> >  rule __HUSH_HUSH, line 1, <GEN272> line 528.
>> > 
>> >  What can I check to correct this?
>>
>>  I'll fix that, thanks for mentioning it.
>>
>>  SA is somewhat inconsistent about whether or not it complains about
>>  malformed UTF-8 characters, as illustrated by your only occasionally
>>  getting that error. I get no complaints about that rule here when testing
>>  my sandbox...
>
> Hopefully it's the regexp that can be modified and not that it will 
> consistently error out on the few RH4/CentOS4 boxes I run ;)  RH 
> maintains the same version for the entire life of the distro for 
> dependencies so upgrading out of RedHat is most often painful.

Yes, it's a fairly simple modification to the regex that contains the 
UTF-8 multibyte character sequence. Perl just fails to handle it 
properly when the byte sequence is bare (e.g. \xc3\xa9), so making it a 
sequence of one-character character sets ([\xc3][\xa9]) fixes that problem 
without materially altering the RE.
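
As a rough illustration of why the two forms are interchangeable, here is a minimal sketch in Python rather than Perl. It does not reproduce the Perl 5.8 warning. It only checks that a bare byte sequence and the same bytes wrapped in one-character sets match the same inputs. The pattern fragment and the sample inputs are made up for the example and are not taken from the __HUSH_HUSH rule.

```python
import re

# Hypothetical rule fragment: "caf" followed by the UTF-8 bytes for e-acute.
bare = re.compile(rb"caf\xc3\xa9")
# Same bytes, each wrapped in a one-character character set.
wrapped = re.compile(rb"caf[\xc3][\xa9]")

samples = [
    b"un caf\xc3\xa9 noir",  # well-formed sequence, should match
    b"un caf\xc3\x00 noir",  # start byte followed by 0x00, should not match
    b"un cafe noir",         # plain ASCII, should not match
]


def matches(pattern, data):
    """Return True when the compiled byte pattern occurs anywhere in data."""
    return pattern.search(data) is not None


# One (bare, wrapped) pair of results per sample.
results = [(matches(bare, s), matches(wrapped, s)) for s in samples]

# The two spellings agree on every sample.
agree = all(b == w for b, w in results)
```

Both patterns accept only the first sample, which is the sense in which the rewrite leaves the RE's behaviour unchanged.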

I had to fix this for _some_ of the UTF-8 sequences here, but others were 
being handled properly so I was lazy and didn't change them all. For that 
I apologize.

I've committed the fix, it will go out with the next sa-update.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   North Korea: the only country in the world where people would risk
   execution to flee to communist China.                  -- Ride Fast
-----------------------------------------------------------------------
  5 days until the 66th anniversary of D-Day

Re: Malformed UTF-8 character

Posted by Chris Conn <cc...@abacom.com>.
John Hardin wrote:
> On Tue, 1 Jun 2010, Chris Conn wrote:
> 
>> I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I 
>> occasionally get this error;
>>
>> Malformed UTF-8 character (unexpected non-continuation byte 0x00, 
>> immediately after start byte 0xc3) in pattern match (m//) at 
>> /var/lib/spamassassin/3.003001/updates_spamassassin_org/72_active.cf, 
>> rule __HUSH_HUSH, line 1, <GEN272> line 528.
>>
>> What can I check to correct this?
> 
> I'll fix that, thanks for mentioning it.
> 
> SA is somewhat inconsistent about whether or not it complains about 
> malformed UTF-8 characters, as illustrated by your only occasionally 
> getting that error. I get no complaints about that rule here when 
> testing my sandbox...
> 

Hello,

Hopefully it's the regexp that can be modified and not that it will 
consistently error out on the few RH4/CentOS4 boxes I run ;)  RH 
maintains the same version for the entire life of the distro for 
dependencies so upgrading out of RedHat is most often painful.

Thanks again,

C.

Re: Malformed UTF-8 character

Posted by John Hardin <jh...@impsec.org>.
On Tue, 1 Jun 2010, Chris Conn wrote:

> I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I occasionally 
> get this error;
>
> Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately 
> after start byte 0xc3) in pattern match (m//) at 
> /var/lib/spamassassin/3.003001/updates_spamassassin_org/72_active.cf, rule 
> __HUSH_HUSH, line 1, <GEN272> line 528.
>
> What can I check to correct this?

I'll fix that, thanks for mentioning it.

SA is somewhat inconsistent about whether or not it complains about 
malformed UTF-8 characters, as illustrated by your only occasionally 
getting that error. I get no complaints about that rule here when 
testing my sandbox...

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79

Re: Malformed UTF-8 character

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2010-06-01 at 11:56 -0400, Chris Conn wrote:
> I upgraded to SA 3.3.1 on a CentOS system using Perl 5.8.5 and I 
> occasionally get this error;

> I built a rpm using rpmbuild on the system in question, is my 
> installation broken?  I have found similar instances in previous versions
> 
> http://wiki.apache.org/spamassassin/RedHatMalformedUtf8
> http://www.gossamer-threads.com/lists/spamassassin/users/100450
> 
> mostly old stuff.

Without thoroughly checking the details...

Yes, mostly old stuff. Just like your Perl version. ;)  Both references
point at issues with Perl handling UTF-8 in 5.8.x versions. Since your
5.8.5 is quite old, and there have even been a couple of later 5.8.x
releases -- any chance that's it?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}