Posted to users@spamassassin.apache.org by da...@chaosreigns.com on 2014/09/26 21:38:28 UTC

UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

I wrote a script that takes a list of words with UTF-8 characters, and
generates rules matching them:

http://chaosreigns.com/code/dl/sawordrule.pl

For example:

$ echo "análisis" | perl ./sawordrule.pl SPANISH_
body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

(Each character class holds the upper and lower case forms of the accented
character, because /i apparently doesn't apply to these.)
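
For readers who want to see the transformation spelled out, here is a minimal sketch in Python. It is not sawordrule.pl (that script is Perl and is not reproduced in this thread); the function names `strip_accents` and `word_to_rule` are made up for illustration, and the only thing it tries to reproduce is the shape of the example output above.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks so the rule name is (mostly) plain ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def word_to_rule(word: str, prefix: str) -> str:
    """Build one SpamAssassin-style body rule line for a single word.

    Hypothetical helper: each non-ASCII character becomes a class holding
    its upper and lower case code points, written as \\x{HEX}.
    """
    parts = []
    for ch in word:
        if ord(ch) < 128:
            parts.append(re.escape(ch))
            continue
        # Keep only single-character case forms (e.g. skip "SS" for sharp s).
        forms = sorted({f for f in (ch.upper(), ch.lower()) if len(f) == 1})
        if not forms:
            forms = [ch]
        parts.append("[" + "".join("\\x{%X}" % ord(f) for f in forms) + "]")
    # Anything still outside A-Z, 0-9 and _ is replaced in the rule name.
    name = re.sub(r"[^A-Za-z0-9_]", "_", strip_accents(word)).upper()
    return "body %s%s /\\b%s\\b/i # %s" % (prefix, name, "".join(parts), word)


print(word_to_rule("análisis", "SPANISH_"))
```

Run on the word from the example, this prints the same rule line as shown above. Whether SpamAssassin then matches `\x{C1}` against the message depends on how the message bytes are decoded, which this sketch does not address.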

For a bigger example:
cat spanish.txt | tr -d ',;.:"()-' | tr ' ' '\n' | sort -f | uniq -i | ./sawordrule.pl SPANISH_ > spanish.cf

A couple untested results:
http://www.chaosreigns.com/sa/spanish.cf
http://www.chaosreigns.com/sa/polish.cf

To be clear, these files will likely flag ALL Polish or Spanish emails as
spam.

By default, rules have a score of 1, so without a corresponding "score"
line, each of these has a score of 1.

The output is going to include some garbage rules you're going to need to
manually delete.  It's also probably going to include occasional rules
which will match English words.  I'm sure I missed a couple of these in the
.cf files I provided.

To use the .cf files, add something like this to your local.cf:

include /etc/spamassassin/spanish.cf
include /etc/spamassassin/polish.cf

On 09/26, John Hardin wrote:
> On Fri, 26 Sep 2014, darxus@chaosreigns.com wrote:
> 
> >I created some rules to match Polish text:
> >http://www.chaosreigns.com/sa/polish.txt
> >
> >The rules with only ascii characters work, the ones with utf8 characters
> >don't.  According to hexedit, they're identical in my maildir and in my
> >/etc/spamassassin/local.cf.
> 
> Put the hex strings for the accented characters into the RE.
> 
> I've had the best reliability from placing each byte in its own
> character class:  [\xd0][\x80]

Thanks.

Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by John Hardin <jh...@impsec.org>.
On Fri, 26 Sep 2014, darxus@chaosreigns.com wrote:

> I wrote a script that takes a list of words with UTF-8 characters, and
> generates rules matching them:
>
> http://chaosreigns.com/code/dl/sawordrule.pl
>
> For example:
>
> $ echo "análisis" | perl ./sawordrule.pl SPANISH_
> body SPANISH_ANALISIS /\ban[\x{C1}\x{E1}]lisis\b/i # análisis

How do you get a one byte match for two-byte-long UTF-8-encoded accented 
characters? Shouldn't it generate this:

    /\ban[\xc3][\xa1]lisis\b/i

I didn't think normalization had been implemented yet.

Your rule doesn't hit in my test environment (though I just pasted that 
word into an existing message to test...)
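
To make the byte-count point concrete, here is a small sketch using Python's `re` module on raw bytes. It only illustrates the encoding arithmetic; it is not a claim about how SpamAssassin's Perl regex engine treats `\x{E1}`, and `utf8_byte_classes` is a made-up helper name.

```python
import re


def utf8_byte_classes(ch: str) -> str:
    """Render one character as one single-byte class per UTF-8 byte."""
    return "".join("[\\x%02x]" % b for b in ch.encode("utf-8"))


# Build the byte-wise pattern for the word in the example.
pattern_text = r"\ban" + utf8_byte_classes("á") + r"lisis\b"
print(pattern_text)

# A raw UTF-8 message body, as it would arrive on the wire.
raw_message = "Un análisis completo".encode("utf-8")

# Two single-byte classes, one per UTF-8 byte of the accented letter.
byte_rule = re.compile(pattern_text.encode("ascii"), re.IGNORECASE)
# One single-byte class holding the Latin-1 code points for the letter.
single_byte_rule = re.compile(rb"\ban[\xc1\xe1]lisis\b", re.IGNORECASE)

print(bool(byte_rule.search(raw_message)))
print(bool(single_byte_rule.search(raw_message)))
```

Against undecoded UTF-8 bytes, the two-class pattern hits and the one-class pattern does not, which matches the behaviour described above when no charset normalization is applied.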

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   How do you argue with people to whom math is an opinion? -- Unknown
-----------------------------------------------------------------------
  848 days since the first successful private support mission to ISS (SpaceX)

Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by Adi <ad...@gmail.com>.
Hello

> A couple untested results:
> http://www.chaosreigns.com/sa/polish.cf
> 
> To be clear, these files will likely flag ALL Polish or Spanish emails as
> spam

Your rules may be good for testing, or if you receive only spam
(no normal messages). But as you say above, these rules will probably
match 90% of normal messages (ham) in Polish :)


From your rule set, only:
kliknij  = click
wypisać  = unsubscribe
otrzymywać = receive

appear in some spam messages, but in normal messages too.
You should consider using longer phrases to reduce false matches.
Many Polish words have many meanings depending on the context.

Some people don't use Polish diacritics at all, only plain Latin
characters.

Polish words are inflected (different spellings, mainly suffixes)
by case and person.


BTW Polish is one of the most difficult languages in the world (with a
Latin-based alphabet).



I found some Polish-language rules, but they are very old.

http://svn.apache.org/repos/asf/spamassassin/branches/3.1/rules/25_body_tests_pl.cf



The author sidesteps Polish characters by replacing each one with ".".
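
A rough sketch of that "." approach, in Python on raw bytes. The word "więcej" ("more") is just an illustrative choice, not taken from that rules file, and `dotted_pattern` is a hypothetical helper. Under byte-wise matching a single "." stands for exactly one byte, so it covers the single-byte encodings and the diacritic-free spelling, but not a two-byte UTF-8 sequence in the middle of a word.

```python
import re


def dotted_pattern(word: str) -> bytes:
    """Replace every non-ASCII character of the word with a single '.'."""
    return "".join(
        re.escape(ch) if ord(ch) < 128 else "." for ch in word
    ).encode("ascii")


rule = re.compile(dotted_pattern("więcej"))

samples = {
    "iso-8859-2": "więcej".encode("iso-8859-2"),
    "cp1250": "więcej".encode("cp1250"),
    "ascii (no diacritics)": b"wiecej",
    "utf-8": "więcej".encode("utf-8"),
}
for label, raw in samples.items():
    print(label, bool(rule.search(raw)))
```

In this sketch only the UTF-8 sample fails to match, because the accented letter occupies two bytes there. How SpamAssassin behaves with such a rule depends on whether it matches on bytes or on decoded characters, which is the open question in this thread.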


Best Regards.

Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by da...@chaosreigns.com.
On 09/29, Jay Sekora wrote:
> Seems like it would be a huge convenience if either (1) turning on
> normalize_charset forced interpretation of rule files as UTF-8, (2)
> there were a similar setting to specify the encoding of rule files, or
> (3) there were a way on a file-by-file basis to say what charset the
> rules in the file were in (which is probably best since it would
> facilitate custom rule sharing across sites).  That's off the top of my
> head with no thought so it may be dumb. :-)

I think it's worth opening a bug.  If I can copy and paste UTF8, I feel
like I really should be able to paste it into a spamassassin rule.

Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by Jay Sekora <js...@csail.mit.edu>.
On 09/27/2014 01:16 PM, John Hardin wrote:
> On Fri, 26 Sep 2014, Adi wrote:
>> I don't know if SA converts the text on the fly.
> 
> In my experience it does not. There's been some discussion of charset
> normalization, but I don't think that's been implemented yet, so SA is
> still seeing whatever bytes are in the raw message.

normalize_charset is documented at least since 3.3.2.  I found some list
traffic expressing concerns about performance problems, but I've turned
it on on (low-to-medium-volume) mail servers I'm responsible for and
haven't seen problems.  (We get about 25K incoming messages a day at
work.)  Haven't made extensive use of it, though, and I just recently
figured out that my failed attempts to do so were because the rule files
themselves weren't being interpreted as UTF-8 (so I need to use Darxus'
preprocessing scripts or something similar).
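
As an illustration of what charset normalization buys you in principle, here is a sketch in Python. It is only the general idea (decode the body by its declared charset, then match one Unicode pattern); it makes no claim about how SpamAssassin implements normalize_charset, and `RULE` and `normalized_hit` are invented names.

```python
import re

# One pattern written in terms of code points (upper and lower case c-acute),
# instead of one pattern per byte encoding.
RULE = re.compile(r"\bwypisa[\u0106\u0107]\b", re.IGNORECASE)


def normalized_hit(raw_body: bytes, declared_charset: str) -> bool:
    """Decode the raw body to Unicode first, then apply the single rule."""
    text = raw_body.decode(declared_charset, errors="replace")
    return bool(RULE.search(text))


message = "Kliknij, aby się wypisać"
for charset in ("utf-8", "iso-8859-2", "cp1250"):
    print(charset, normalized_hit(message.encode(charset), charset))
```

The same rule hits for all three encodings once the text is decoded. The catch described above still applies: the rule text itself also has to be read as Unicode, otherwise the pattern and the message are in different representations.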

Seems like it would be a huge convenience if either (1) turning on
normalize_charset forced interpretation of rule files as UTF-8, (2)
there were a similar setting to specify the encoding of rule files, or
(3) there were a way on a file-by-file basis to say what charset the
rules in the file were in (which is probably best since it would
facilitate custom rule sharing across sites).  That's off the top of my
head with no thought so it may be dumb. :-)

Jay


Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by John Hardin <jh...@impsec.org>.
On Fri, 26 Sep 2014, Adi wrote:

> Another problem is that Polish messages are (usually) in one of three
> character encodings: UTF-8, ISO-8859-2, WINDOWS-1250 (CP-1250).

True, so the rule would need to cover all those possibilities: One-byte 
characters (upper and lower) for non-UTF-8 character sets, and two-byte 
characters (upper and lower) for UTF-8.
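
A minimal sketch of generating such a pattern, in Python. The helper names `char_alternation` and `word_pattern` are hypothetical, and the output is a plain byte-level regex fragment rather than a finished SpamAssassin rule. Word boundaries are left out on purpose, since `\b` next to high bytes is unreliable.

```python
import re

# The three encodings mentioned for Polish mail.
ENCODINGS = ("utf-8", "iso-8859-2", "cp1250")


def char_alternation(ch: str) -> str:
    """Alternation of every byte sequence for ch, upper and lower case."""
    seen = []
    for variant in (ch.upper(), ch.lower()):
        for enc in ENCODINGS:
            try:
                raw = variant.encode(enc)
            except UnicodeEncodeError:
                continue  # character does not exist in this charset
            seq = "".join("\\x%02x" % b for b in raw)
            if seq not in seen:
                seen.append(seq)
    return "(?:" + "|".join(seen) + ")"


def word_pattern(word: str) -> str:
    """ASCII characters stay literal; others expand to an alternation."""
    return "".join(
        re.escape(ch) if ord(ch) < 128 else char_alternation(ch) for ch in word
    )


pattern = word_pattern("wypisać")
print(pattern)

compiled = re.compile(pattern.encode("ascii"))
for enc in ENCODINGS:
    print(enc, bool(compiled.search("wypisać".encode(enc))))
```

Where two encodings share the same byte for a letter, the duplicate is dropped, so the alternation stays as short as the encodings allow.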

> I don't know if SA converts the text on the fly.

In my experience it does not. There's been some discussion of charset 
normalization, but I don't think that's been implemented yet, so SA is 
still seeing whatever bytes are in the raw message.

Re: UTF-8 rule generator script Re: UTF-8 rules, what am I missing?

Posted by Adi <ad...@gmail.com>.
Hello

Another problem is that Polish messages are (usually) in one of three
character encodings: UTF-8, ISO-8859-2, WINDOWS-1250 (CP-1250).

I don't know if SA converts the text on the fly.

Best Regards.