Posted to dev@spamassassin.apache.org by Karsten Bräckelmann <gu...@rudersport.de> on 2009/10/19 21:39:58 UTC

Dirty ham corpora?

Checking the latest ruleqa results, there are some ham hits I believe
should not exist. Calling the respective corpus owners to the phone. :)

bb-jm -- Justin, is this really ham?
  bbmass/uploadedcorpora/jm/ham/pub.20070118/96

bb-trec_enron -- No idea who or what that is, and the logs are broken.
However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
worse KB_RATWARE_MSGID are false. Who can validate those?


-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}


Re: Dirty ham corpora?

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2009-10-20 at 12:02 +0100, Justin Mason wrote:
> 2009/10/20 Karsten Bräckelmann <gu...@rudersport.de>:

> >> > bb-jm -- Justin, is this really ham?
> >> >  bbmass/uploadedcorpora/jm/ham/pub.20070118/96
> >>
> >> a quoted spam sent to the users list.  deleted.
> >
> > Quoted?  How exactly -- the rule that identified it is a header rule.
> 
> um, I missed that.  here's the mail:

Thanks. Amazing, this is indeed a FP hit on the 08 variant. Wonder how
fast that guy can type...
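
For readers following along: the thread does not spell out what the 08 variant actually matches, so the following is only a minimal sketch for inspecting the boundary of the quoted mail. It assumes that the trailing hex pair in an Outlook Express style boundary is a Windows FILETIME (100 ns ticks since 1601-01-01 UTC). That is an assumption, not something stated here, and boundary_timestamp() is a hypothetical helper name. The boundary and Date values are copied from the mail quoted in this thread.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Values copied from the quoted mail.
BOUNDARY = "----=_NextPart_000_00F1_01C73A43.FEA53C10"
DATE_HEADER = "Wed, 17 Jan 2007 14:30:07 -0000"

# Assumption: the two hex groups form a 64-bit Windows FILETIME.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def boundary_timestamp(boundary):
    """Decode the trailing HHHHHHHH.LLLLLLLL pair, or return None."""
    m = re.search(r"_([0-9A-Fa-f]{8})\.([0-9A-Fa-f]{8})$", boundary)
    if m is None:
        return None
    ticks = int(m.group(1) + m.group(2), 16)  # 100 ns units
    # Integer division keeps the arithmetic exact down to microseconds.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


stamp = boundary_timestamp(BOUNDARY)
sent = parsedate_to_datetime(DATE_HEADER)
if sent.tzinfo is None:
    # A "-0000" zone yields a naive datetime; treat it as UTC.
    sent = sent.replace(tzinfo=timezone.utc)

delta = (stamp - sent).total_seconds()
print("boundary:", stamp.isoformat())
print("date:    ", sent.isoformat())
print("delta s: ", delta)
```

Under that assumption the boundary decodes to a moment a few seconds after the Date header, which at least makes the FILETIME reading plausible for this message.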


> > Does this also have an impact on the GA re-scoring?
> 
> no, bb-trec_enron isn't included in that.  (the first message would be, though.)

Going to propose one of the stricter variants to keep.




Re: Dirty ham corpora?

Posted by Justin Mason <jm...@jmason.org>.
2009/10/20 Karsten Bräckelmann <gu...@rudersport.de>:
> On Tue, 2009-10-20 at 11:43 +0100, Justin Mason wrote:
>> 2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
>> > Checking the latest ruleqa results, there are some ham hits I believe
>> > should not exist. Calling the respective corpus owners to the phone. :)
>> >
>> > bb-jm -- Justin, is this really ham?
>> >  bbmass/uploadedcorpora/jm/ham/pub.20070118/96
>>
>> a quoted spam sent to the users list.  deleted.
>
> Quoted?  How exactly -- the rule that identified it is a header rule.

um, I missed that.  here's the mail:

Return-Path: <us...@spamassassin.apache.org>
X-Original-To: jm@localhost
Delivered-To: jm@localhost.jmason.org
Received: from radish.jmason.org (localhost [127.0.0.1])
        by radish.jmason.org (Postfix) with ESMTP id 8C4AF32CDD
        for <jm...@localhost>; Wed, 17 Jan 2007 14:36:48 +0000 (GMT)
X-Spam-Virus: No
X-Spam-Checker-Version: SpamAssassin 3.2.0-r492202 (2007-01-03) on
        dogma.boxhost.net
X-Spam-Relays-External: [ ip=140.211.11.2 rdns=hermes.apache.org
        helo=mail.apache.org by=dogma.boxhost.net ident= envfrom= intl=0
        id=B2DC3310052 auth= ] [ ip=140.211.11.133 rdns=herse.apache.org
        helo=herse.apache.org by=apache.org ident= envfrom= intl=0 id= auth= ] [
        ip=194.131.70.131 rdns=!194.131.70.131! helo=mail.mh.omnis.net
        by=apache.org ident= envfrom= intl=0 id= auth= ] [ ip=194.131.70.197
        rdns= helo=rwhiting by=mail.mh.omnis.net ident= envfrom= intl=0
        id=1H7C3z-0004NR-6u auth= ]
X-Spam-Status: No, score=1.8 required=5.0 tests=BAYES_50,DK_POLICY_SIGNSOME,
        HTML_IMAGE_ONLY_24,HTML_MESSAGE,SPF_HELO_PASS,SPF_PASS shortcircuit=no
        autolearn=no version=3.2.0-r492202
X-Spam-Relays-Internal:
X-Spam-Relays-Untrusted: [ ip=140.211.11.2 rdns=hermes.apache.org
        helo=mail.apache.org by=dogma.boxhost.net ident= envfrom= intl=0
        id=B2DC3310052 auth= ] [ ip=140.211.11.133 rdns=herse.apache.org
        helo=herse.apache.org by=apache.org ident= envfrom= intl=0 id= auth= ] [
        ip=194.131.70.131 rdns=!194.131.70.131! helo=mail.mh.omnis.net
        by=apache.org ident= envfrom= intl=0 id= auth= ] [ ip=194.131.70.197
        rdns= helo=rwhiting by=mail.mh.omnis.net ident= envfrom= intl=0
        id=1H7C3z-0004NR-6u auth= ]
X-Spam-Relays-Trusted:
X-Spam-Level: *
X-Spam-RBL: <dns:spamassassin.apache.org> [140.211.11.130]
        <dns:spamassassin.apache.org?type=MX> [10 herse.apache.org.,
        20 mail.apache.org.]
X-Original-To: jm@jmason.org
Delivered-To: jm@dogma.boxhost.net
Received: from localhost [127.0.0.1]
        by radish.jmason.org with IMAP (fetchmail-6.3.2)
        for <jm...@localhost> (single-drop); Wed, 17 Jan 2007 14:36:48 +0000 (GMT)
Received: from mail.apache.org (hermes.apache.org [140.211.11.2])
        by dogma.boxhost.net (Postfix) with SMTP id B2DC3310052
        for <jm...@jmason.org>; Wed, 17 Jan 2007 14:31:08 +0000 (GMT)
Received: (qmail 69274 invoked by uid 500); 17 Jan 2007 14:31:02 -0000
Mailing-List: contact users-help@spamassassin.apache.org; run by ezmlm
Precedence: bulk
list-help: <ma...@spamassassin.apache.org>
list-unsubscribe: <ma...@spamassassin.apache.org>
List-Post: <ma...@spamassassin.apache.org>
List-Id: <users.spamassassin.apache.org>
Delivered-To: mailing list users@spamassassin.apache.org
Received: (qmail 69265 invoked by uid 99); 17 Jan 2007 14:31:02 -0000
Received: from herse.apache.org (HELO herse.apache.org) (140.211.11.133)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Jan 2007 06:31:02 -0800
X-ASF-Spam-Status: No, hits=2.9 required=10.0
        tests=HTML_IMAGE_ONLY_24,HTML_MESSAGE
Received-SPF: pass (herse.apache.org: local policy)
Received: from [194.131.70.131] (HELO mail.mh.omnis.net) (194.131.70.131)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 17 Jan 2007 06:30:52 -0800
Received: from [194.131.70.197] (helo=rwhiting)
        by mail.mh.omnis.net with smtp (Exim 4.24)
        id 1H7C3z-0004NR-6u
        for users@spamassassin.apache.org; Wed, 17 Jan 2007 14:46:55 +0000
Message-ID: <00...@rwhiting>
From: "MIS" <mi...@omnis.net>
To: <us...@spamassassin.apache.org>
Subject: PLease help
Date: Wed, 17 Jan 2007 14:30:07 -0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_00F1_01C73A43.FEA53C10"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1106
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
X-RainingData_UK_Ltd-MailScanner-Sophos: Found to be clean
X-RainingData_UK_Ltd-MailScanner-SpamCheck:
X-Virus-Checked: Checked by ClamAV on apache.org
X-IMAPbase: 1074397152 82723
Status: O
X-UID: 82720
X-Keywords:


This is a multi-part message in MIME format.

------=_NextPart_000_00F1_01C73A43.FEA53C10
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

Hi,

I am newbie to this list and quite new to the difficulties in fight =
spam. Please accept my apologies if sending a copy spam message to the =
list is not acceptable etiquette

We have started to receive quite a lot of spam in this type of form with =
an embedded stock or meds image., None of my rules are hitting it. Can =
anyone please offer me some help so that I can get to the bottom of what =
is becoming a very time consuming process. I use Spam Assassin 2.64. =
Perhaps the answer is that it is time to upgrade.

Thanks

Bob





>> > bb-trec_enron -- No idea, who/what that is, and the logs are broken.
>> > However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
>> > worse KB_RATWARE_MSGID are false. Who can validate those?
>>
>> they're false.  it's a corpus with generated (synthetic) headers from
>> the TREC Enron corpus, only useful for body hits.
>
> Yay, back to zero FPs for my Ratware Message-Id and Boundary (Outlook
> variant, sadly) then. That's how it should be. ;)
>
> Does this also have an impact on the GA re-scoring?

no, bb-trec_enron isn't included in that.  (the first message would be, though.)


-- 
--j.

Re: Dirty ham corpora?

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Tue, 2009-10-20 at 11:43 +0100, Justin Mason wrote:
> 2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
> > Checking the latest ruleqa results, there are some ham hits I believe
> > should not exist. Calling the respective corpus owners to the phone. :)
> >
> > bb-jm -- Justin, is this really ham?
> >  bbmass/uploadedcorpora/jm/ham/pub.20070118/96
> 
> a quoted spam sent to the users list.  deleted.

Quoted?  How exactly -- the rule that identified it is a header rule.

> > bb-trec_enron -- No idea, who/what that is, and the logs are broken.
> > However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
> > worse KB_RATWARE_MSGID are false. Who can validate those?
> 
> they're false.  it's a corpus with generated (synthetic) headers from
> the TREC Enron corpus, only useful for body hits.

Yay, back to zero FPs for my Ratware Message-Id and Boundary (Outlook
variant, sadly) then. That's how it should be. ;)

Does this also have an impact on the GA re-scoring?




Re: Dirty ham corpora?

Posted by Justin Mason <jm...@jmason.org>.
2009/10/19 Karsten Bräckelmann <gu...@rudersport.de>:
> Checking the latest ruleqa results, there are some ham hits I believe
> should not exist. Calling the respective corpus owners to the phone. :)
>
> bb-jm -- Justin, is this really ham?
>  bbmass/uploadedcorpora/jm/ham/pub.20070118/96

a quoted spam sent to the users list.  deleted.

> bb-trec_enron -- No idea, who/what that is, and the logs are broken.
> However, I still believe the hits on KB_RATWARE_OUTLOOK_08 and even
> worse KB_RATWARE_MSGID are false. Who can validate those?

they're false.  it's a corpus with generated (synthetic) headers from
the TREC Enron corpus, only useful for body hits.
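
The point above, that a corpus with generated headers can only vouch for body hits, could be sketched roughly as below. This is an illustration only: the HamHit record, the "header"/"body" labels, the SOME_BODY_RULE name and the credible_ham_hits() helper are all hypothetical and do not reflect the actual ruleqa or mass-check data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HamHit:
    corpus: str     # e.g. "bb-trec_enron"
    rule: str       # rule name
    rule_type: str  # "header" or "body" (hypothetical labels)


# Corpora whose headers are generated rather than real.
SYNTHETIC_HEADER_CORPORA = {"bb-trec_enron"}


def credible_ham_hits(hits):
    """Drop header-rule hits that come from synthetic-header corpora."""
    return [
        h for h in hits
        if not (h.rule_type == "header"
                and h.corpus in SYNTHETIC_HEADER_CORPORA)
    ]


hits = [
    HamHit("bb-trec_enron", "KB_RATWARE_MSGID", "header"),
    HamHit("bb-trec_enron", "SOME_BODY_RULE", "body"),  # made-up rule name
    HamHit("bb-jm", "KB_RATWARE_OUTLOOK_08", "header"),
]

for h in credible_ham_hits(hits):
    print(h.corpus, h.rule)
```

With such a filter the bb-trec_enron header hits drop out, while the bb-jm hit stays and still needs a look at the actual message.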

-- 
--j.