Posted to users@spamassassin.apache.org by Giampaolo Tomassoni <g....@libero.it> on 2007/08/29 18:06:27 UTC

R: YAGI: Yet Another Great Idea - Some findings

To whom it may concern.

I ran some tests with this Great Idea and the TextCat plugin. I found that
TextCat is a YAGP (Yet Another Great Plugin): it can easily be configured to
detect the language of a text from among up to 47 languages, it can report
the most probable ones, and it can even report nothing at all when too many
languages are probable.

I enabled the TextCat plugin, then put this stuff in my local.cf:

	ok_languages all
	textcat_optimal_ngrams 0
	textcat_max_languages 2

	describe T_SCRAMBLED Some words are probably scrambled
	header   T_SCRAMBLED X-Languages =~ /^\s*$/
	score    T_SCRAMBLED 0.001

Basically, this checks against all 47 languages (ok_languages all),
disables an optimization which TextCat's author found not very sound
(textcat_optimal_ngrams 0), allows up to two languages to be reported
(textcat_max_languages 2) and finally triggers the T_SCRAMBLED rule when
more than textcat_max_languages languages are detected as probable. In that
case TextCat reports no language at all, so X-Languages is empty and the
regular expression matches.

In fact, it seems to me that when too many languages are probable, the text
is likely to contain scrambled words. This is not conclusive, however: the
text could be written in several languages, contain code snippets or
whatever. This is why I set textcat_max_languages to 2 instead of 1.
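
As an illustration only, the decision step described above can be sketched
in a few lines of Python. This is not TextCat's code and it does no n-gram
scoring: it assumes each language already has a distance (lower means a
closer match), and the names report_languages, is_scrambled and
acceptable_ratio, as well as the sample numbers, are made up for the example.

```python
# Sketch only: mimics the decision taken after scoring, not the scoring itself.
# Assumption: lower distance = closer match to that language's profile.

def report_languages(scores, max_languages=2, acceptable_ratio=1.05):
    """Return the probable languages, or [] when too many are plausible.

    scores: dict mapping language code -> distance (lower is better).
    acceptable_ratio: hypothetical cutoff; a language counts as probable
    when its distance is within this factor of the best distance.
    """
    if not scores:
        return []
    best = min(scores.values())
    probable = sorted(
        (lang for lang, dist in scores.items() if dist <= best * acceptable_ratio),
        key=lambda lang: scores[lang],
    )
    # Too many near-ties: treat the language as unknown and report nothing.
    if len(probable) > max_languages:
        return []
    return probable


def is_scrambled(scores, **kwargs):
    """Rough equivalent of T_SCRAMBLED: true when nothing would be reported."""
    return not report_languages(scores, **kwargs)


# Made-up distances: one clear winner versus four near-ties.
clean = {"en": 1000, "it": 1400, "fr": 1450, "de": 1500}
garbled = {"en": 1480, "it": 1500, "fr": 1490, "de": 1510}

print(report_languages(clean))   # only "en" is within the cutoff
print(is_scrambled(garbled))     # four candidates exceed max_languages=2
```

With max_languages=1 a genuinely bilingual message would already come out
empty, which is the reason for allowing two.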

Anyway, I checked it against a couple of short *SEX* spams, one short ham
and a couple of longer hams. It *seems* to work. In particular, the short
spam gets tagged while no ham does.

My ham/spam test set is surely far from optimal, but I'm going to enable
this (with the low score above) on a production system of mine to see
whether anything else turns up.

Nevertheless, since this whole kludge is very easy to test (no code to add),
is there anybody with a good SA testbed who would like to try to work out a
reasonable score for T_SCRAMBLED?

What a Great Idea I had... :)

Giampaolo

> -----Original Message-----
> From: Giampaolo Tomassoni [mailto:g.tomassoni@libero.it]
> Sent: Tuesday, 28 August 2007 15:50
> To: users@spamassassin.apache.org
> Subject: YAGI: Yet Another Great Idea
> 
> Hello everybody!
> 
> I'm going to propose to you another great idea which will probably
> radically change spam-detection techniques.
> 
> No, come on: I'm just kidding. :) I think this "idea" could possibly help
> to better detect the kind of spam in which some words are "garbled" in
> order to evade detection.
> 
> Some of you probably already know that there are algorithms devoted to
> detecting the language in which a text is written. I just discovered the
> paper at http://www.sfs.uni-tuebingen.de/iscl/Theses/kranig.pdf , which by
> the way says that such detectors are already available as Perl modules on
> CPAN (see chapter 7).
> 
> The idea is that, by applying these algorithms to the text of a message,
> one could obtain the probability that the text is written in a given
> language. Let's say that a text is written in English: these Perl
> routines should then yield a high probability that the text is English.
> Now, say that some of the words in that text are somehow "scrambled". The
> language detectors would probably lower the probability that the text is
> in English but, assuming the words are randomly scrambled, the
> probability that the text is in another language wouldn't increase
> either. We could then apply some thresholding to the language scores:
> when the score of the most probable language does not exceed the mean of
> the language scores by at least a given threshold, we could say that the
> message contains some "scrambled words" and apply a penalty score to it.
> 
> I know there are scores for scrambled versions of words like "cialis",
> but this method would be more robust with respect to non-English
> languages: I'm from Italy, and I'm used to seeing some FPs on Italian
> words like "via galileo" being taken as a scrambled version of "viagra".
> Also, attempting to collect all the good versions of spam words is
> expensive in terms of effort.
> 
> Please note that:
> 
>  - language detection doesn't (currently) work for ideographic languages
>    (Chinese, Japanese, Korean and such);
> 
>  - I haven't even tried running the language detection modules;
> 
>  - a message written in many (> 3, 4?) languages would probably trigger
>    the penalty score.
> 
> I'm just trying to find out whether such an idea seems definitely
> "broken" to you, and whether anybody has already tried something like
> this.
> 
> Regards,
> 
> Giampaolo