Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2007/10/10 14:55:14 UTC

[Bug 5675] New: TextCat sidesteps 'what if I DON'T like language X?'

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5675

           Summary: TextCat sidesteps 'what if I DON'T like language X?'
           Product: Spamassassin
           Version: 3.2.3
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: trivial
          Priority: P5
         Component: Translations and Languages
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jidanni@jidanni.org


2007Mail::SpamAssassin::Plugin::TextCat's man page totally avoids the issue of
'what if I just hate language X?'
It assumes one only wishes to whitelist languages, and never wants to blacklist
a language.
There is no way to do a "not_ok_languages ru".

It should mention how to do it, or if there's no way to do it.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'






------- Additional Comments From jidanni@jidanni.org  2007-10-13 19:22 -------
By the way, I just realized your two whitelists, ok_locales and
ok_languages, with no corresponding blacklists offered, will create a
big problem for the user who uses my lengthy whitelist examples above
in order to blacklist one or two items: if one day, e.g., English
is split into British and American, the user won't be alerted that he
has now inadvertently blacklisted English, unless you grandfather the
pre-split one, etc...




[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-10-14 05:10 -------
------------------

I got as far as
loadplugin Mail::SpamAssassin::Plugin::TextCat
ok_languages af am ar...
add_header all languages _LANGUAGES_
 but the fun ended with 
score UNWANTED_LANGUAGE_BODY 11

--------------------

Don't put loadplugin statements into your .cf files, find it in the appropriate
.pre file and uncomment it there. This one should be in v310.pre.

The problem with putting a loadplugin in your local.cf is that the file gets
read after the stock rule files, thus those files detect the plugin as not
loaded, and the textcat rules get skipped.
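
The ordering problem can be illustrated with a toy model. The Python sketch
below is not SpamAssassin code. It assumes, purely for illustration, that
files are read in a fixed order (.pre files, then the stock rule files, then
local.cf) and that the stock TextCat rules are only defined if the plugin is
already loaded at the moment they are read. The function name, the
"rule_if_plugin" directive and the file contents are all hypothetical.

```python
def read_config(files):
    """Toy model: process (name, lines) pairs in order; return rules and warnings."""
    loaded_plugins = set()
    rules = set()
    warnings = []
    for _name, lines in files:
        for line in lines:
            words = line.split()
            if words[0] == "loadplugin":
                loaded_plugins.add(words[1])
            elif words[0] == "rule_if_plugin":
                # Hypothetical directive: define the rule only if the
                # plugin is already loaded when this line is read.
                plugin, rule = words[1], words[2]
                if plugin in loaded_plugins:
                    rules.add(rule)
            elif words[0] == "score":
                if words[1] not in rules:
                    warnings.append("score set for non-existent rule " + words[1])
    return rules, warnings


PLUGIN = "Mail::SpamAssassin::Plugin::TextCat"
STOCK = ("stock rules", ["rule_if_plugin " + PLUGIN + " UNWANTED_LANGUAGE_BODY"])

# loadplugin in local.cf: the stock rules were already read and skipped.
late = [("v310.pre", []), STOCK,
        ("local.cf", ["loadplugin " + PLUGIN, "score UNWANTED_LANGUAGE_BODY 11"])]

# loadplugin uncommented in v310.pre: the plugin is known before the stock rules.
early = [("v310.pre", ["loadplugin " + PLUGIN]), STOCK,
         ("local.cf", ["score UNWANTED_LANGUAGE_BODY 11"])]

late_rules, late_warnings = read_config(late)
early_rules, early_warnings = read_config(early)
```

In this model the "late" layout yields the same "non-existent rule" complaint
reported in this bug, while the "early" layout defines the rule before the
score line is reached.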

If you have usage questions, please direct them to the users list first, then
put them into the bug if they happen to be actual bugs.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-10-14 05:05 -------
-------------------------------
With my long whitelist approach, it is not clear what

       textcat_max_languages N (default: 5)
           The maximum number of languages before the classification is
considered unknown.

means. 
-------------------------------


It means exactly what the documentation says. Read that sentence carefully, and
pay attention to the word "classification" that appears in it. This setting only
applies to how SA classifies messages, it has nothing to do with config options.

A message can potentially contain more than one language, thus match multiple
languages during classification. This threshold tells SA how many languages can
appear in one message before textcat should just decide that it is confused and
classify the language of the message as unknown.
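
As a rough Python sketch of that threshold (not the plugin's actual code; the
function name is made up, and treating "more than N" as the cut-off is an
assumption about the exact boundary):

```python
def classify(matched_languages, max_languages=5):
    """Return the matched languages, or [] (unknown) if too many matched at once."""
    # Assumption: the classification is given up when MORE than
    # max_languages languages match the same message.
    if len(matched_languages) > max_languages:
        return []  # too ambiguous: treat the message language as unknown
    return list(matched_languages)


few = classify(["en", "fr"])
many = classify(["ja", "zh", "ru", "pl", "uk", "ko"])
```

Note that the length of the ok_languages list never enters into this check;
only the number of languages matched by one message does.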

IMHO, the documentation of that feature is sufficient. However, I can see how
someone glancing through the docs could get confused, so if one of the devs
wants to expand it, go for it.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-12-07 11:03 -------
The above applies to ok_languages too.
Sorry I did not properly comment the correct bug report, but as the
representative of the simple user in the street, I cannot get too detailed. :-)



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-10-22 04:41 -------
You mean like the note in the README file?

From the "Customising SpamAssassin" section:
------------------------

  - /etc/mail/spamassassin/*.pre:

        Plugin control files, installed from the distribution. These are
        used to control what plugins are loaded.  Modifications here will
        be loaded before any configuration loaded from the above
        directories.
        
        You want to modify these files if you want to load additional
        plugins, or inhibit loading a plugin that is enabled by default.
        If the files exist in /etc/mail/spamassassin, they will not
        be overwritten during future installs.
------------------------



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-13 18:59 -------
Mail::SpamAssassin::Conf at ok_locales should say SEE ALSO
Mail::SpamAssassin::Plugin::TextCat.

Mail::SpamAssassin::Plugin::TextCat should mention SEE ALSO ok_locales on
Mail::SpamAssassin::Conf.

Else people will think... "I thought I saw this earlier" never realizing that
there are now two similarly named ok_ things.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-15 23:05 -------
Yes please document it better as one would think it means "Japanese? No.
Chinese? No. Russian? No. Polish? No. Ukrainian? No. That's five, giving up."

> Don't put loadplugin statements into your .cf files
.cf files? I put it in user_prefs! -- my best guess as to how to use this jazz.
SpamAssassin version 3.2.1
  running on Perl version 5.8.8 ... Debian sid.




[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-19 16:08 -------
OK, I have now learned from the mailing list about .pre files.

Still, on Mail::SpamAssassin::Conf,
where it says
See "Mail::SpamAssassin::Plugin" for more details on writing plugins,
please add an additional reference "for more details on USING plugins."





[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-10 05:57 -------
Sorry I said "2007Mail::SpamAssassin:...", you see I was just cutting and pasting
from the man page whose headers and trailers look like
Mail::SpamAssassin::PlUser:ContributedMail::SpamAssassin::Plugin::TextCat(3pm)
perl v5.8.8                       2007Mail::SpamAssassin::Plugin::TextCat(3pm)
here on Debian Sid.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-13 18:51 -------
Mention on the man page if one can use two lines, like whitelist_from:
ok_languages af am ar be bg ca cs da de el en es fa fi fr he hi hr hu hy id it
ok_languages ja ka ko mr ms ne nl no pl pt qu ro sk sq sr sv sw ta th tl tr uk vi zh
or if they all must be on one line.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jm@jmason.org  2007-10-17 03:07 -------
(In reply to comment #13)
> However, I'm also going to suggest this bug be closed WONTFIX. 
> 
> Currently it is too cluttered with too many different issues to be usable for
> development purposes, and half aren't really bugs. I'd rather see this moved to
> a users list email discussion, and then separate bugs created for each of the
> problems that aren't configuration errors.
> 
> Any devs have a preference?

+1 agreed.



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-12-07 18:04 -------
Re-mentioning all these with the bug prefix so they get hyperlinked:

bug 5742: (documentation clarity on ok_locales)
bug 5743: (feature request for adding not_ok_locales)
bug 5744: (textcat documentation clarity)
bug 5745: (feature request for adding not_ok_languages to textcat)

Hope that works



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-13 20:05 -------
(Sorry that a mail is sent for each of my discoveries. But if I save
them up into one big posting, the ice cream truck will come by and I
will never end up posting these.)

With my long whitelist approach, it is not clear what

       textcat_max_languages N (default: 5)
           The maximum number of languages before the classification is
considered unknown.

means. Does it mean whitelists longer than 5 are meaningless by
default? Document it please, don't just answer me.

       textcat_optimal_ngrams N (default: 0)

Do say what ngrams means here, even though it is in Wikipedia.
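
For reference, an n-gram in this context is a run of N consecutive characters
taken from the text; language guessers in the TextCat family compare the most
frequent n-grams of a message against per-language profiles. A minimal Python
sketch of extracting character n-grams follows. The function name is made up
for illustration, and this is not the plugin's code.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# All 3-character runs of a short word.
trigrams = char_ngrams("spam", 3)

# Frequency count of 2-character runs, the raw material of a language profile.
counts = Counter(char_ngrams("banana", 2))
```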

       textcat_acceptable_score N (default: 1.05)
           Include any language that scores at least "textcat_acceptable_score"
in the returned list of languages

More mystery. Maybe one is supposed to use this instead of
UNWANTED_LANGUAGE_BODY or something... all unclear. Add examples.
"Jimmy hates everything Russian, so he does the following..."



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-13 19:49 -------
DESCRIPTION
       This plugin will try to guess the language used in the message text.

Say body, not text. That would be clearer.

P.S., today I actually tried to use this plugin, but got
score set for non-existent rule UNWANTED_LANGUAGE_BODY

I got as far as
loadplugin Mail::SpamAssassin::Plugin::TextCat
ok_languages af am ar...
add_header all languages _LANGUAGES_
 but the fun ended with 
score UNWANTED_LANGUAGE_BODY 11
 so please add some examples. Also add my add_header example above, lest the
user spend an extra 15 minutes trying to figure it out.





[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-10-10 06:31 -------
Realistically, that's how TextCat works. If a language isn't "ok" then it gets
points applied via the  UNWANTED_LANGUAGE_BODY rule.

So, it is not a whitelist, but rather a list of exceptions to a blacklist.

I guess syntactically it would be easier if you could configure it using
something like:

"ok_languages all except ru"

Rather than:
ok_languages af am ar be bg ca cs da de el en es fa fi fr he hi hr hu hy id it
ja ka ko mr ms ne nl no pl pt qu ro sk sq sr sv sw ta th tl tr uk vi zh

Which is functionally the same assuming you haven't changed the inactive
languages list.
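
Until such a shorthand exists, the long form can be generated rather than
typed. A small Python sketch, assuming the set of active languages is exactly
the list quoted above plus ru (check the plugin's man page for the real list);
the helper name is hypothetical:

```python
# Assumption: the active languages are the ones quoted in this comment, plus ru.
ACTIVE_LANGUAGES = (
    "af am ar be bg ca cs da de el en es fa fi fr he hi hr hu hy id it "
    "ja ka ko mr ms ne nl no pl pt qu ro sk sq sr sv sw ta th tl tr uk vi zh ru"
).split()


def ok_languages_except(unwanted, active=ACTIVE_LANGUAGES):
    """Build an ok_languages line listing every active language except the unwanted ones."""
    unwanted = set(unwanted)
    kept = [lang for lang in active if lang not in unwanted]
    return "ok_languages " + " ".join(kept)


line = ok_languages_except(["ru"])
```

The resulting line can then be pasted into the configuration; it has to be
regenerated whenever the list of active languages changes.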

That said, I think most folks would be better off setting ok_languages to a list
of languages that are really acceptable to them (ie: they're capable of reading
it). This has some possibility of false-positives, but if the FP is in a
language you can't read, does it really matter?

 





[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-12-07 10:59 -------
Perhaps start afresh with white_ black_, or good_ bad_, as ok and not_ok look
nonsymmetrical. Of course this would be a simple whitelist/blacklist model.
You could leave ok_locales intact for compatibility independent of the new
additional simple white/black lists.
(But white/black imply 100 points to some users perhaps.)
Anyway, the user should be able to find the simple white and black lists
directives he is looking for. Leave the fancy 'email SPF style syntax' stuff or
whatever proposals for the third expert "ok_locales" directive.





[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-13 19:54 -------
From _LANGUAGES_ one learns it detects to such detail as ru.windows-1251,
but except for the two zh locales mentioned, nothing is documented about anything
more than the two-letter abbreviations. So mention one can match in greater
detail...



[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-12-07 06:18 -------
Created bugs:
5742 (documentation clarity on ok_locales)
5743 (feature request for adding not_ok_locales)
5744 (textcat documentation clarity)
5745 (feature request for adding not_ok_languages to textcat)

I *think* that's all the valid bits of this bug. The rest are either config
errors, or invalid. 

(e.g., in comment 7 the request for a reference to textcat in ok_locales is
invalid: ok_locales doesn't have anything to do with textcat. Adding the
reference would only further the confusion. If anything, we should add a note
indicating the two are not related.)





[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-10-16 21:22 -------
(In reply to comment #12)
> Yes please document it better as one would think it means "Japanese? No.
> Chinese? No. Russian? No. Polish? No. Ukrainian? No. That's five, giving up."

Ok, I suggest rewriting it to:
-------------

textcat_max_languages N (default: 5)
           The maximum number of languages any one message can simultaneously
match before its language classification is considered unknown.

-------------

> 
> > Don't put loadplugin statements into your .cf files
> .cf files? I put it in user_prefs! -- my best guess as to how to use this jazz.

*DEFINITELY* not in your user_prefs. 

The Mail::SpamAssassin::Conf document is quite clear that loadplugin is an
administrator setting. No administrator settings should be in your user_prefs,
as they will be ignored by spamd for security reasons, although a normal call to
"spamassassin" will run them.



However, I'm also going to suggest this bug be closed WONTFIX. 

Currently it is too cluttered with too many different issues to be usable for
development purposes, and half aren't really bugs. I'd rather see this moved to
a users list email discussion, and then separate bugs created for each of the
problems that aren't configuration errors.

Any devs have a preference?






[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From mkettler_sa@verizon.net  2007-12-07 18:01 -------
But they're definitively *NOT* white_ and black_. There's no white about it.

No configuration of either feature results in negative points being applied, or
anything else that would inhibit the message from being tagged as spam by other
rules.

If anything, it's black_ and notblack_.
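
Put differently, as a deliberately simplified Python sketch: it assumes one
detected language per message, uses a hypothetical function name, and borrows
the score of 11 from the example earlier in this bug. It is not the plugin's
code.

```python
def language_points(language, ok_languages, rule_score):
    """Points added by the language check: rule_score or 0, never negative."""
    if language in ok_languages:
        return 0.0  # an "ok" language only avoids the penalty; it earns no bonus
    return float(rule_score)  # anything not listed gets the penalty


ok = {"en", "de"}
russian = language_points("ru", ok, 11)
english = language_points("en", ok, 11)
```

An English message scores 0 here, not a negative number, so other rules can
still tag it as spam.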




[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-12-07 21:45 -------
> But they're definitively *NOT* white_ and black_.

OK. Perhaps add them in case users are looking for them or their functionality.
They could exist in parallel with the present commands.
The black one could even be an alias for the present one...

Anyway, users perhaps think in terms of simple operators: black, white, and
that's what they look for if they don't have time to "hunker down with the man
page".

Of course languages and locales would be a blur to those users too.

Anyway, thanks from the "glue sniffer crowd" for making this all a little simpler.




[Bug 5675] TextCat sidesteps 'what if I DON'T like language X?'





------- Additional Comments From jidanni@jidanni.org  2007-10-10 07:14 -------
OK, do document the ok_languages af am ar be bg ... way to blacklist one
language... ah, so that's how one (painfully) does it!

> if the FP is in a language you can't read, does it really matter?
9 times out of 10 a pal's mail will sail in in an unexpected language due to
some odd character falling into his message or who knows, so one wishes not to
risk it.

Also new customers/contacts would not think to turn off their native language
signatures, or default encoding Mail User Agent settings for an otherwise ASCII
message... so "odd, I never have any new customers / comments from abroad"... of
course: you blocked all their mail!

So don't shoot first and ask questions later.


