Posted to dev@spamassassin.apache.org by da...@chaosreigns.com on 2011/09/22 19:59:50 UTC

Non-English accuracy Re: Rescore Masscheck for 3.4.x?

On 09/22, Warren Togami Jr. wrote:
>    On a separate note, I have a volunteer at school willing to help us build
>    a Mandarin language ham corpus a few months from now.  That will be
>    interesting to see how that throws off our statistics. =)

I've been wondering about SA's accuracy on other languages.  It looks like
the only corpus we have is your wt-jp1?  What's the accuracy like on that?
Is the accuracy available somewhere on ruleqa?  I'm actually more curious
about accuracy of *spam* in non-English, because I'd say a very
significant portion of my missed spam is in a non-Latin alphabet.
And I don't want to just tell SA to classify non-English as spam because
it would be nice if SA was actually usable for people who speak these
languages.

75 out of the 113 spams SA has missed so far this month have subjects in a
non-Latin alphabet.  66.4%.  That doesn't even include a bunch of the
non-English stuff.

(I'm also not using bayes.)
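
A rough sketch of how a count like the 75-of-113 figure above could be
reproduced from a folder of missed spam. This is not how the number in
this message was produced; the function names and the 0.5 threshold are
assumptions made up for illustration, and only the Python standard
library is used.

```python
import unicodedata
from email.header import decode_header, make_header


def decode_subject(raw):
    """Decode an RFC 2047 encoded Subject header into a Unicode string."""
    return str(make_header(decode_header(raw)))


def is_non_latin(text, threshold=0.5):
    """True if more than `threshold` of the letters are not Latin script.

    The threshold is an arbitrary assumption; tune it for your own mail.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    non_latin = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters) > threshold


def non_latin_share(subjects):
    """Return (hits, total, percent) for a list of raw Subject headers."""
    total = len(subjects)
    hits = sum(1 for s in subjects if is_non_latin(decode_subject(s)))
    percent = 100.0 * hits / total if total else 0.0
    return hits, total, percent


if __name__ == "__main__":
    # Hypothetical sample subjects, not taken from any real corpus.
    sample = ["Cheap meds", "=?UTF-8?B?0J/RgNC40LLQtdGC?=", "Re: invoice"]
    hits, total, percent = non_latin_share(sample)
    print(f"{hits} of {total} subjects are non-Latin ({percent:.1f}%)")
```

Note that this only looks at the script of the Subject line, so
non-English spam written in a Latin alphabet is not counted, which is
the same limitation as the figure above.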

-- 
"Some people will tell you that slow is good - and it may be, on some
days - but I am here to tell you that fast is better....
That is why God made fast motorcycles...." - Hunter S. Thompson
http://www.ChaosReigns.com

Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

Posted by Axb <ax...@gmail.com>.
On 2011-09-23 9:30, Jari Fredriksson wrote:
> My smallish corpus (mostly ham) is Finnish language, but also English in
> it. Spam is of course English and other languages, there is no Finnish
> spam available ;)

There are a couple of Finnish wannabe ESPs spamming purchased lists, in
Finnish, to "non-Finnish-speaking traps".
Very low volume, but it exists.



Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

Posted by Jari Fredriksson <ja...@iki.fi>.
On 23.9.2011 10:37, Henrik Krohns wrote:
> 
> There isn't any Finnish spam per se, but there are loads of that "badly
> autotranslated" Finnish language spam/phishing coming in daily.
> 

I have yet to see one. I know they exist, from banks and such, but somehow
they have evaded my email :( I want as much of that kind as possible.

-- 

It is a wise father that knows his own child.
		-- William Shakespeare, "The Merchant of Venice"


Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

Posted by Henrik Krohns <he...@hege.li>.
On Fri, Sep 23, 2011 at 10:30:18AM +0300, Jari Fredriksson wrote:
> On 22.9.2011 20:59, darxus@chaosreigns.com wrote:
> > On 09/22, Warren Togami Jr. wrote:
> >>    On a separate note, I have a volunteer at school willing to help us build
> >>    a Mandarin language ham corpus a few months from now.  That will be
> >>    interesting to see how that throws off our statistics. =)
> > 
> > I've been wondering about SA's accuracy on other languages.  It looks like
> > the only corpus we have is your wt-jp1?  What's the accuracy like on that?
> > Is the accuracy available somewhere on ruleqa?  I'm actually more curious
> > about accuracy of *spam* in non-English, because I'd say a very
> > significant portion of my missed spam is in a non-Latin alphabet.
> > And I don't want to just tell SA to classify non-English as spam because
> > it would be nice if SA was actually usable for people who speak these
> > languages.
> > 
> > 75 out of the 113 spams SA has missed so far this month have subjects in a
> > non-Latin alphabet.  66.4%.  That doesn't even include a bunch of the
> > non-English stuff.
> > 
> > (I'm also not using bayes.)
> > 
> 
> My smallish corpus (mostly ham) is Finnish language, but also English in
> it. Spam is of course English and other languages, there is no Finnish
> spam available ;)

There isn't any Finnish spam per se, but there are loads of that "badly
autotranslated" Finnish language spam/phishing coming in daily.


Re: Non-English accuracy Re: Rescore Masscheck for 3.4.x?

Posted by Jari Fredriksson <ja...@iki.fi>.
On 22.9.2011 20:59, darxus@chaosreigns.com wrote:
> On 09/22, Warren Togami Jr. wrote:
>>    On a separate note, I have a volunteer at school willing to help us build
>>    a Mandarin language ham corpus a few months from now.  That will be
>>    interesting to see how that throws off our statistics. =)
> 
> I've been wondering about SA's accuracy on other languages.  It looks like
> the only corpus we have is your wt-jp1?  What's the accuracy like on that?
> Is the accuracy available somewhere on ruleqa?  I'm actually more curious
> about accuracy of *spam* in non-English, because I'd say a very
> significant portion of my missed spam is in a non-Latin alphabet.
> And I don't want to just tell SA to classify non-English as spam because
> it would be nice if SA was actually usable for people who speak these
> languages.
> 
> 75 out of the 113 spams SA has missed so far this month have subjects in a
> non-Latin alphabet.  66.4%.  That doesn't even include a bunch of the
> non-English stuff.
> 
> (I'm also not using bayes.)
> 

My smallish corpus (mostly ham) is Finnish language, but also English in
it. Spam is of course English and other languages, there is no Finnish
spam available ;)

-- 

"I wonder", he said to himself, "what's in a book while it's closed.  Oh, I
know it's full of letters printed on paper, but all the same, something must
be happening, because as soon as I open it, there's a whole story with people
I don't know yet and all kinds of adventures and battles."
		-- Bastian B. Bux