Posted to users@spamassassin.apache.org by Mauricio Tavares <ra...@gmail.com> on 2013/11/27 19:38:16 UTC

TextCat triggering

Let's say I have

ok_languages en

and I get an email from Canada that is mostly in English except for the
little disclaimer at the bottom. How can I tell TextCat to flag an email
only if more than some percentage of the body text is not in one of the
ok_languages?

Re: TextCat triggering

Posted by Mauricio Tavares <ra...@gmail.com>.
On Thu, Dec 5, 2013 at 5:29 PM, Mauricio Tavares <ra...@gmail.com> wrote:
> On Wed, Nov 27, 2013 at 7:48 PM, Karsten Bräckelmann
> <gu...@rudersport.de> wrote:
>> On Wed, 2013-11-27 at 13:38 -0500, Mauricio Tavares wrote:
>>> Let's say I have
>>>
>>> ok_languages en
>>>
>>> and I get an email from Canada that is mostly in English but for the
>>> little disclaimer on the bottom. How can I tell textcat to only flag
>>> an email if more than some percentage of the body text is not in a
>>> ok_languages?
>>
>> I haven't actually used the TextCat plugin, but according to the
>> documentation [1]
>>
>>  "The rule UNWANTED_LANGUAGE_BODY is triggered if none of the languages
>>   detected are in the "ok" list."
>>
>> English is NOT one of the languages recognized. Given it fired the
>> unwanted language rule, at least one language has been recognized with
>> an acceptable score above the threshold.
>>
>> Your problem is not TextCat recognizing the other language (probably
>> French), but TextCat failing to recognize English in that message.
>>
>>
>> [1] http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_TextCat.html
>>
>       I am starting to think the issue is more interesting than I
> originally thought. I removed the caption in French and fed the
> message manually to spamassassin
>
> spamassassin -D -t  < spam2.eml
>
> I am still getting the
>
>  4.5 UNWANTED_LANGUAGE_BODY BODY: Message written in an undesired language
>
> message. Is there a way I can be a bit more verbose so that it tells
> me what part of the body caused it to give that message?
>
      I see what you mean about my

ok_languages en fr

possibly being cheerfully ignored for English. But I thought that

   Accept-Language: en-US
   Content-Language: en-US
   X-MS-Has-Attach:
   X-MS-TNEF-Correlator:

would indicate it saw English there. I am not ignoring what you
suggested; I am just trying to figure out what is happening here,
especially since most of our emails do not seem to exhibit this
problem.
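
To illustrate the distinction in question: headers such as
Content-Language only declare a language, while TextCat (as far as I
understand its documentation) guesses the language from the body text
itself. A minimal Python sketch, using a made-up message and a
hypothetical helper name, of how the declared language and the body are
two separate things:

```python
from email import message_from_string


def split_declared_and_body(raw_message):
    """Return (declared language header, body text) of a raw message.

    Hypothetical helper for illustration only; this is not how the
    TextCat plugin is implemented.
    """
    msg = message_from_string(raw_message)
    declared = msg["Content-Language"]  # what the sender's software claims
    body = msg.get_payload()            # the text a detector would examine
    return declared, body


# Made-up example: the headers claim English, the body is French.
raw = (
    "Accept-Language: en-US\n"
    "Content-Language: en-US\n"
    "Subject: example\n"
    "\n"
    "Ceci est un exemple de texte.\n"
)

declared, body = split_declared_and_body(raw)
```

Under that assumption, a header claiming en-US says nothing about which
languages are detected in the body.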


Re: TextCat triggering

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Wed, 2013-11-27 at 13:38 -0500, Mauricio Tavares wrote:
> Let's say I have
> 
> ok_languages en
> 
> and I get an email from Canada that is mostly in English but for the
> little disclaimer on the bottom. How can I tell textcat to only flag
> an email if more than some percentage of the body text is not in a
> ok_languages?

I haven't actually used the TextCat plugin, but according to the
documentation [1]

 "The rule UNWANTED_LANGUAGE_BODY is triggered if none of the languages
  detected are in the "ok" list."

So English is NOT one of the languages recognized. Given that it fired
the unwanted language rule, at least one other language has been
recognized with an acceptable score above the threshold.

Your problem is not TextCat recognizing the other language (probably
French), but TextCat failing to recognize English in that message.


[1] http://spamassassin.apache.org/doc/Mail_SpamAssassin_Plugin_TextCat.html
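
The documented behaviour quoted above can be sketched in a few lines.
This is only an illustration in Python with hypothetical names (the
plugin itself is written in Perl), and it assumes the rule stays silent
when no language is detected at all, as the reasoning above implies:

```python
def unwanted_language_body(detected, ok_languages):
    """Return True if the rule would fire under the documented wording.

    Hypothetical sketch, not the plugin's actual code: the rule fires
    when at least one language was detected and none of the detected
    languages appears in the ok list.
    """
    if not detected:
        # Assumption: nothing detected means nothing to object to.
        return False
    ok = set(ok_languages)
    return not any(lang in ok for lang in detected)


# Only French detected, only English allowed: the rule fires.
only_french = unwanted_language_body(["fr"], ["en"])

# English detected alongside French: the rule does not fire.
english_and_french = unwanted_language_body(["en", "fr"], ["en"])
```

Seen this way, a French disclaimer alone cannot trigger the rule as long
as English is also among the detected languages.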

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}