Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2009/10/19 21:39:58 UTC
Dirty ham corpora?
Checking the latest ruleqa results, there are some ham hits I believe
should not exist. Calling the respective corpus owners to the phone. :)
bb-jm -- Justin, is this really ham?
bbmass/uploadedcorpora/jm/ham/pub.20070118/96
bb-trec_enron -- No idea who/what that is, and the logs are broken.
However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
worse KB_RATWARE_MSGID are false. Who can validate those?
--
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Re: Dirty ham corpora?
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2009-10-20 at 12:02 +0100, Justin Mason wrote:
> 2009/10/20 Karsten Bräckelmann <gu...@rudersport.de>:
> >> > bb-jm -- Justin, is this really ham?
> >> > bbmass/uploadedcorpora/jm/ham/pub.20070118/96
> >>
> >> a quoted spam sent to the users list. deleted.
> >
> > Quoted? How exactly -- the rule that identified it is a header rule.
>
> um, I missed that. here's the mail:
Thanks. Amazing, this is indeed a FP hit on the 08 variant. Wonder how
fast that guy can type...
> > Does this also have an impact on the GA re-scoring?
>
> no, bb-trec_enron isn't included in that. (the first message would be, though.)
Going to propose one of the stricter variants to keep.
Re: Dirty ham corpora?
Posted by Justin Mason <jm...@jmason.org>.
2009/10/20 Karsten Bräckelmann <gu...@rudersport.de>:
> On Tue, 2009-10-20 at 11:43 +0100, Justin Mason wrote:
>> 2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
>> > Checking the latest ruleqa results, there are some ham hits I believe
>> > should not exist. Calling the respective corpus owners to the phone. :)
>> >
>> > bb-jm -- Justin, is this really ham?
>> > bbmass/uploadedcorpora/jm/ham/pub.20070118/96
>>
>> a quoted spam sent to the users list. deleted.
>
> Quoted? How exactly -- the rule that identified it is a header rule.
um, I missed that. here's the mail:
Return-Path: <us...@spamassassin.apache.org>
X-Original-To: jm@localhost
Delivered-To: jm@localhost.jmason.org
Received: from radish.jmason.org (localhost [127.0.0.1])
by radish.jmason.org (Postfix) with ESMTP id 8C4AF32CDD
for <jm...@localhost>; Wed, 17 Jan 2007 14:36:48 +0000 (GMT)
X-Spam-Virus: No
X-Spam-Checker-Version: SpamAssassin 3.2.0-r492202 (2007-01-03) on
dogma.boxhost.net
X-Spam-Relays-External: [ ip=140.211.11.2 rdns=hermes.apache.org
helo=mail.apache.org by=dogma.boxhost.net ident= envfrom= intl=0
id=B2DC3310052 auth= ] [ ip=140.211.11.133 rdns=herse.apache.org
helo=herse.apache.org by=apache.org ident= envfrom= intl=0 id= auth= ] [
ip=194.131.70.131 rdns=!194.131.70.131! helo=mail.mh.omnis.net
by=apache.org
ident= envfrom= intl=0 id= auth= ] [ ip=194.131.70.197 rdns=
helo=rwhiting
by=mail.mh.omnis.net ident= envfrom= intl=0 id=1H7C3z-0004NR-6u auth= ]
X-Spam-Status: No, score=1.8 required=5.0 tests=BAYES_50,DK_POLICY_SIGNSOME,
HTML_IMAGE_ONLY_24,HTML_MESSAGE,SPF_HELO_PASS,SPF_PASS shortcircuit=no
autolearn=no version=3.2.0-r492202
X-Spam-Relays-Internal:
X-Spam-Relays-Untrusted: [ ip=140.211.11.2 rdns=hermes.apache.org
helo=mail.apache.org by=dogma.boxhost.net ident= envfrom= intl=0
id=B2DC3310052 auth= ] [ ip=140.211.11.133 rdns=herse.apache.org
helo=herse.apache.org by=apache.org ident= envfrom= intl=0 id= auth= ] [
ip=194.131.70.131 rdns=!194.131.70.131! helo=mail.mh.omnis.net
by=apache.org
ident= envfrom= intl=0 id= auth= ] [ ip=194.131.70.197 rdns=
helo=rwhiting
by=mail.mh.omnis.net ident= envfrom= intl=0 id=1H7C3z-0004NR-6u auth= ]
X-Spam-Relays-Trusted:
X-Spam-Level: *
X-Spam-RBL: <dns:spamassassin.apache.org> [140.211.11.130]
<dns:spamassassin.apache.org?type=MX> [10 herse.apache.org.,
20 mail.apache.org.]
X-Original-To: jm@jmason.org
Delivered-To: jm@dogma.boxhost.net
Received: from localhost [127.0.0.1]
by radish.jmason.org with IMAP (fetchmail-6.3.2)
for <jm...@localhost> (single-drop); Wed, 17 Jan 2007 14:36:48 +0000 (GMT)
Received: from mail.apache.org (hermes.apache.org [140.211.11.2])
by dogma.boxhost.net (Postfix) with SMTP id B2DC3310052
for <jm...@jmason.org>; Wed, 17 Jan 2007 14:31:08 +0000 (GMT)
Received: (qmail 69274 invoked by uid 500); 17 Jan 2007 14:31:02 -0000
Mailing-List: contact users-help@spamassassin.apache.org; run by ezmlm
Precedence: bulk
list-help: <ma...@spamassassin.apache.org>
list-unsubscribe: <ma...@spamassassin.apache.org>
List-Post: <ma...@spamassassin.apache.org>
List-Id: <users.spamassassin.apache.org>
Delivered-To: mailing list users@spamassassin.apache.org
Received: (qmail 69265 invoked by uid 99); 17 Jan 2007 14:31:02 -0000
Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133)
by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Jan 2007 06:31:02 -0800
X-ASF-Spam-Status: No, hits=2.9 required=10.0
tests=HTML_IMAGE_ONLY_24,HTML_MESSAGE
Received-SPF: pass (herse.apache.org: local policy)
Received: from [194.131.70.131] (HELO mail.mh.omnis.net) (194.131.70.131)
by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Jan 2007 06:30:52 -0800
Received: from [194.131.70.197] (helo=rwhiting)
by mail.mh.omnis.net with smtp (Exim 4.24)
id 1H7C3z-0004NR-6u
for users@spamassassin.apache.org; Wed, 17 Jan 2007 14:46:55 +0000
Message-ID: <00...@rwhiting>
From: "MIS" <mi...@omnis.net>
To: <us...@spamassassin.apache.org>
Subject: PLease help
Date: Wed, 17 Jan 2007 14:30:07 -0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00F1_01C73A43.FEA53C10"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1106
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
X-RainingData_UK_Ltd-MailScanner-Sophos: Found to be clean
X-RainingData_UK_Ltd-MailScanner-SpamCheck:
X-Virus-Checked: Checked by ClamAV on apache.org
X-IMAPbase: 1074397152 82723
Status: O
X-UID: 82720
X-Keywords:
This is a multi-part message in MIME format.
------=_NextPart_000_00F1_01C73A43.FEA53C10
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Hi,
I am newbie to this list and quite new to the difficulties in fight =
spam. Please accept my apologies if sending a copy spam message to the =
list is not acceptable etiquette
We have started to receive quite a lot of spam in this type of form with =
an embedded stock or meds image., None of my rules are hitting it. Can =
anyone please offer me some help so that I can get to the bottom of what =
is becoming a very time consuming process. I use Spam Assassin 2.64. =
Perhaps the answer is that it is time to upgrade.
Thanks
Bob
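To illustrate why a header rule can fire on this message no matter what spam is quoted in its body, here is a minimal Python sketch. It pulls out the top-level fields that the Message-Id and boundary rules discussed in this thread would plausibly look at. The function name, the sample text and the placeholder Message-IDs are assumptions made for illustration; this is not how SpamAssassin itself evaluates rules.

```python
from email import message_from_string


def header_rule_inputs(raw: str) -> dict:
    """Return the top-level header values a header rule would see.

    Hypothetical helper: only the message's own headers are read,
    so header-like lines quoted in the body are never consulted.
    """
    msg = message_from_string(raw)
    return {
        "message_id": msg.get("Message-ID"),
        "x_mailer": msg.get("X-Mailer"),
        "boundary": msg.get_boundary(),
    }


# Illustrative message: the Message-IDs are placeholders, the X-Mailer and
# boundary values are copied from the mail pasted above.
SAMPLE = (
    "Message-ID: <top-level@example.invalid>\n"
    "X-Mailer: Microsoft Outlook Express 6.00.2800.1106\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; '
    'boundary="----=_NextPart_000_00F1_01C73A43.FEA53C10"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "\n"
    "------=_NextPart_000_00F1_01C73A43.FEA53C10\n"
    "Content-Type: text/plain\n"
    "\n"
    "> Message-ID: <quoted-spam@example.invalid>\n"
    "> quoted spam text\n"
    "\n"
    "------=_NextPart_000_00F1_01C73A43.FEA53C10--\n"
)

if __name__ == "__main__":
    for name, value in header_rule_inputs(SAMPLE).items():
        print(f"{name}: {value}")
```

The quoted "Message-ID" line in the body never shows up in the result, which is the point of the question above: a hit from a header rule means the sender's own headers matched.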
>> > bb-trec_enron -- No idea, who/what that is, and the logs are broken.
>> > However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
>> > worse KB_RATWARE_MSGID are false. Who can validate those?
>>
>> they're false. it's a corpus with generated (synthetic) headers from
>> the TREC Enron corpus, only useful for body hits.
>
> Yay, back to zero FPs for my Ratware Message-Id and Boundary (Outlook
> variant, sadly) then. That's how it should be. ;)
>
> Does this also have an impact on the GA re-scoring?
no, bb-trec_enron isn't included in that. (the first message would be, though.)
--
--j.
Re: Dirty ham corpora?
Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2009-10-20 at 11:43 +0100, Justin Mason wrote:
> 2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
> > Checking the latest ruleqa results, there are some ham hits I believe
> > should not exist. Calling the respective corpus owners to the phone. :)
> >
> > bb-jm -- Justin, is this really ham?
> > bbmass/uploadedcorpora/jm/ham/pub.20070118/96
>
> a quoted spam sent to the users list. deleted.
Quoted? How exactly -- the rule that identified it is a header rule.
> > bb-trec_enron -- No idea, who/what that is, and the logs are broken.
> > However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
> > worse KB_RATWARE_MSGID are false. Who can validate those?
>
> they're false. it's a corpus with generated (synthetic) headers from
> the TREC Enron corpus, only useful for body hits.
Yay, back to zero FPs for my Ratware Message-Id and Boundary (Outlook
variant, sadly) then. That's how it should be. ;)
Does this also have an impact on the GA re-scoring?
Re: Dirty ham corpora?
Posted by Justin Mason <jm...@jmason.org>.
2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
> Checking the latest ruleqa results, there are some ham hits I believe
> should not exist. Calling the respective corpus owners to the phone. :)
>
> bb-jm -- Justin, is this really ham?
> bbmass/uploadedcorpora/jm/ham/pub.20070118/96
a quoted spam sent to the users list. deleted.
> bb-trec_enron -- No idea, who/what that is, and the logs are broken.
> However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
> worse KB_RATWARE_MSGID are false. Who can validate those?
they're false. it's a corpus with generated (synthetic) headers from
the TREC Enron corpus, only useful for body hits.
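The consequence of that statement can be sketched in a few lines of Python: when reviewing ham hits, header-rule hits from a corpus with synthetic headers are set aside, while its body-rule hits still count. The `Hit` record, the `is_header_rule` flag and the body rule name are hypothetical and chosen for illustration only; the actual ruleqa and mass-check tooling may be organized quite differently.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    corpus: str           # corpus label as used in this thread, e.g. "bb-jm"
    rule: str             # rule name
    is_header_rule: bool  # assumption: the rule type is known to the caller


# Corpora whose headers are generated and therefore say nothing about real mail.
SYNTHETIC_HEADER_CORPORA = {"bb-trec_enron"}


def meaningful_ham_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Drop header-rule hits that come from synthetic-header corpora."""
    return [
        h for h in hits
        if not (h.is_header_rule and h.corpus in SYNTHETIC_HEADER_CORPORA)
    ]


if __name__ == "__main__":
    hits = [
        Hit("bb-trec_enron", "KB_RATWARE_MSGID", True),
        Hit("bb-jm", "KB_RATWARE_OUTLOOK_08", True),
        Hit("bb-trec_enron", "HYPOTHETICAL_BODY_RULE", False),
    ]
    for h in meaningful_ham_hits(hits):
        print(h.corpus, h.rule)
```

With the sample input, only the bb-jm header hit and the bb-trec_enron body hit remain to be checked by hand.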
--
--j.