Posted to infrastructure-dev@apache.org by Paul Querna <pa...@querna.org> on 2009/04/01 12:32:01 UTC

Mail Archive Spam Cleanup

Hi,

First, this is kinda a call for a volunteer, I doubt I'll have time to
really get into fixing this any time soon.

I have access to the webmaster tools for mail-archives.apache.org, and
one of the most disturbing things is that for our top 20 queries by
traffic, none of them have to do with software at the ASF.

19 are porn related, and 1 is about hacking gmail.

It would be nice to clean this up, as mail moderation is never
perfect, but keeping these messages up on mail-archives.apache.org
forever is less than ideal.

Part of the problem is that there is not an easy way to remove things
from the mail archives, as they are just mbox files on disk.

I have two ideas on how we could solve this:
  1) Add a feature to mod_mbox: a Message-ID blacklist file.  If a
message-id is contained in the blacklist, it just 404s on the site
like it wasn't there.  This means many parts of mod_mbox need to be
modified to check this blacklist, and anyone who rsyncs our mbox
files will still get the spam.  This however is likely the easiest to
manage, as we could make a tiny webapp for adding message IDs to the
blacklist.  This is also an easily reversible step, so if we
accidentally blacklist something, it's easy to fix.

  2) Edit the raw mbox files themselves.  This is the hardest, as the
mbox files in archived format are gzip compressed, so any tool
would need to uncompress, edit, recompress....  Maybe a command line
(python?) tool that we could run from people.apache.org, but it would
be much harder to make web based.  It does have the advantage that all
3rd parties would get edited archives, but I doubt many will re-read
the edited files.  Without care in how this is done, however, this is
literally destroying data.
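
To make idea 1) concrete, here is a minimal sketch of the blacklist
lookup only. It is not mod_mbox code (mod_mbox is an httpd module
written in C). The one-ID-per-line file format, the '#' comment lines,
and the names load_blacklist and http_status_for are assumptions made
for illustration.

```python
import io


def normalize_message_id(raw):
    """Strip whitespace and surrounding angle brackets from a Message-ID."""
    return raw.strip().strip("<>").strip()


def load_blacklist(lines):
    """Build a set of blacklisted IDs from an iterable of text lines.

    Blank lines and lines starting with '#' are ignored (assumed format).
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(normalize_message_id(line))
    return ids


def http_status_for(message_id, blacklist):
    """Return 404 for a blacklisted message and 200 otherwise."""
    return 404 if normalize_message_id(message_id) in blacklist else 200


# Hypothetical blacklist contents; a real deployment would read a file.
sample = io.StringIO(
    "# spam reported 2009-04\n"
    "<spam-1@example.invalid>\n"
    "\n"
    "spam-2@example.invalid\n"
)
BLACKLIST = load_blacklist(sample)
```

Removing a line from the blacklist file undoes the change, which is
the reversibility mentioned above.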

Once we have the tool, I think the task of actually cleaning it up
becomes much easier, and one we can do in a reactive mode.

Thoughts?

Thanks,

Paul

Re: Mail Archive Spam Cleanup

Posted by Justin Mason <jm...@jmason.org>.
hi -- I suggest setting "use_auto_whitelist" to 0.  it wouldn't be of
much use in this case and requires file locking too.  also, if you
want to avoid locking slowdowns, turn off bayes autolearning
("bayes_auto_learn 0")... it probably isn't helping enough to justify
the slowdown it imposes.

Another idea: use SA's "mass-check" tool:
http://wiki.apache.org/spamassassin/MassCheck

it is nicely parallelized....

--j.

On Thu, Apr 2, 2009 at 21:20, chris <ch...@ia.gov> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>>  2) Edit the raw mbox files themselves.  This is the hardest, as the
>> mbox files in archived format are gzip compressed, so any tool
>> would need to uncompress, edit, recompress....  Maybe a command line
>> (python?) tool that we could run from people.apache.org, but it would
>> be much harder to make web based.  It does have the advantage that all
>> 3rd parties would get edited archives, but I doubt many will re-read
>> the edited files.  Without care in how this is done, however, this is
>> literally destroying data.
>
> Hi Paul,
>
> I have something similar to your 2) option above running right now on a local copy of the mail archives to see how it
> does.  So far I am encouraged by the results, but it is slow going even with multiple threads. I seem to be having locking
> issues with SA if I go over 5 threads on my machine.  When it is done I will have cleaner copies of the archive files that
> had spam, a list of pruned message-ids and the mbox each came from, and an mbox URL for each message-ID for quick viewing
> in a browser.
>
> I'm using a bit of perl that calls the perl SpamAssassin hooks to parse the mailbox and score it.  I'm not very
> experienced with SA, so I'd be open to suggestions as to what good settings would be for performing these kinds of tests on
> large archives like the ASF's.
>
> My SA user_prefs are as follows and I am catching a fair bit of spam.  Casual inspection has not revealed much of a
> problem with false positives, though I only reviewed a small sample.
>
> required_score          5.0
> report_safe             0
> use_bayes               1
> bayes_auto_learn        1
> skip_rbl_checks         1
> use_razor2              0
> use_dcc                 0
> use_pyzor               0
> ok_languages            all
> ok_locales              all
> trusted_networks        140.211.11.0/24
> score CTYPE_8SPACE_GIF 0
> score TVD_FW_GRAPHIC_NAME_LONG 0
> score TVD_FW_GRAPHIC_NAME_MID 0
> lock_method flock
>
>
> After my full run is done I will put the results up including the cleaned and gzip'ed files as well as the script I am
> using to do it all.
>
> good day!
> crr/arreyder
>
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAknVLAQACgkQPmaZdRmQd+aPEgCfajBc08ccymrj5rBQ4FpaStP3
> MtAAmgLwcYZzivaxbtVlSIEB7CDdqood
> =tpOV
> -----END PGP SIGNATURE-----
>
>

Re: Mail Archive Spam Cleanup

Posted by chris <ch...@ia.gov>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>  2) Edit the raw mbox files themselves.  This is the hardest, as the
> mbox files in archived format are gzip compressed, so any tool
> would need to uncompress, edit, recompress....  Maybe a command line
> (python?) tool that we could run from people.apache.org, but it would
> be much harder to make web based.  It does have the advantage that all
> 3rd parties would get edited archives, but I doubt many will re-read
> the edited files.  Without care in how this is done, however, this is
> literally destroying data.

Hi Paul,

I have something similar to your 2) option above running right now on a local copy of the mail archives to see how it
does.  So far I am encouraged by the results, but it is slow going even with multiple threads. I seem to be having locking
issues with SA if I go over 5 threads on my machine.  When it is done I will have cleaner copies of the archive files that
had spam, a list of pruned message-ids and the mbox each came from, and an mbox URL for each message-ID for quick viewing
in a browser.

I'm using a bit of perl that calls the perl SpamAssassin hooks to parse the mailbox and score it.  I'm not very
experienced with SA, so I'd be open to suggestions as to what good settings would be for performing these kinds of tests on
large archives like the ASF's.

My SA user_prefs are as follows and I am catching a fair bit of spam.  Casual inspection has not revealed much of a
problem with false positives, though I only reviewed a small sample.

required_score          5.0
report_safe             0
use_bayes               1
bayes_auto_learn        1
skip_rbl_checks         1
use_razor2              0
use_dcc                 0
use_pyzor               0
ok_languages            all
ok_locales              all
trusted_networks        140.211.11.0/24
score CTYPE_8SPACE_GIF 0
score TVD_FW_GRAPHIC_NAME_LONG 0
score TVD_FW_GRAPHIC_NAME_MID 0
lock_method flock


After my full run is done I will put the results up including the cleaned and gzip'ed files as well as the script I am
using to do it all.

good day!
crr/arreyder



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknVLAQACgkQPmaZdRmQd+aPEgCfajBc08ccymrj5rBQ4FpaStP3
MtAAmgLwcYZzivaxbtVlSIEB7CDdqood
=tpOV
-----END PGP SIGNATURE-----

Re: Mail Archive Spam Cleanup

Posted by Santiago Gala <sa...@gmail.com>.
On Wed, 01-04-2009 at 13:57 +0100, sebb wrote:
> On 01/04/2009, Paul Querna <pa...@querna.org> wrote:
> > Hi,
> >
> >  First, this is kinda a call for a volunteer, I doubt I'll have time to
> >  really get into fixing this any time soon.
> >
> >  I have access to the webmaster tools for mail-archives.apache.org, and
> >  one of the most disturbing things is that for our top 20 queries by
> >  traffic, none of them have to do with software at the ASF.
> >
> >  19 are porn related, and 1 is about hacking gmail.
> >
> >  It would be nice to clean this up, as mail moderation is never
> >  perfect, but keeping these messages up on mail-archives.apache.org
> >  forever is less than ideal.
> 
> And subscribed/allowed users may generate spam, e.g. if their system
> is compromised.
> 
> >  Part of the problem is that there is not an easy way to remove things
> >  from the mail archives, as they are just mbox files on disk.
> >
> >  I have two ideas on how we could solve this:
> >   1) Add a feature to mod_mbox: a Message-ID blacklist file.  If a
> >  message-id is contained in the blacklist, it just 404s on the site
> >  like it wasn't there.  This means many parts of mod_mbox need to be
> >  modified to check this blacklist, and anyone who rsyncs our mbox
> >  files will still get the spam.  This however is likely the easiest to
> >  manage, as we could make a tiny webapp for adding message IDs to the
> >  blacklist.  This is also an easily reversible step, so if we
> >  accidentally blacklist something, it's easy to fix.
> >
> >   2) Edit the raw mbox files themselves.  This is the hardest, as the
> >  mbox files in archived format are gzip compressed, so any tool
> >  would need to uncompress, edit, recompress....  Maybe a command line
> >  (python?) tool that we could run from people.apache.org, but it would
> >  be much harder to make web based.  It does have the advantage that all
> >  3rd parties would get edited archives, but I doubt many will re-read
> >  the edited files.  Without care in how this is done, however, this is
> >  literally destroying data.
> 
> I suggest that the tool should split the files into ham and spam; no
> need to delete the spam, as very little extra space will be used
> compared with keeping the original. This would allow recovery of
> messages if required.
> 
> The tool could also use message ids to drive the split.
> 
> Again, it would be nice if there was a web-app that one could use to
> browse messages and mark them as spam.
> 

For public lists, installing any webmail software (SquirrelMail or
Horde in PHP, ...) on p.a.o and pointing it to the uncompressed
mboxes would work, but for compressed files r/w access is way
trickier. Python has the gzip module for gzipped files, but a file is
opened either for reading or for writing, not r/w.

Not sure that would float the boat, even with small patches. Letting apps
manipulate the mboxes looks trickier than using sed or just emacs to cut
the offending messages.

If it helps, I have patched SquirrelMail already and can do it again if
needed. :)

I have also written a Python script that reads and parses mbox files,
which would work for not-so-big mboxes (it slurps the
whole thing into memory). In read-only mode it works straight from gzip
files. I never tried it in r/w mode, but I guess it should write to an
alternate file and use diff to check before risking any real data
manipulation.
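
The "write to an alternate file" approach can be sketched roughly as
below. This is an illustration under stated assumptions, not the
script mentioned above: it assumes a well-formed mbox in which every
message starts with a "From " separator line, and the names
prune_gzipped_mbox, iter_messages, and the sample file names are
hypothetical. It streams the archive one message at a time rather than
loading it all into memory, and it never touches the source file.

```python
import gzip
import os
import tempfile
from email import message_from_bytes


def iter_messages(fileobj):
    """Yield each mbox message as a list of raw lines, split on 'From ' lines."""
    current = []
    for line in fileobj:
        if line.startswith(b"From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def message_id(raw_lines):
    """Return the Message-ID without angle brackets ('' if absent)."""
    # Skip the 'From ' separator line, then parse the RFC 822 headers.
    msg = message_from_bytes(b"".join(raw_lines[1:]))
    return str(msg.get("Message-ID", "")).strip().strip("<>")


def prune_gzipped_mbox(src_path, dst_path, spam_ids):
    """Copy src_path to dst_path, skipping messages whose ID is in spam_ids.

    Returns (kept, dropped). The source archive is only read.
    """
    kept = dropped = 0
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        for raw_lines in iter_messages(src):
            if message_id(raw_lines) in spam_ids:
                dropped += 1
                continue
            dst.writelines(raw_lines)
            kept += 1
    return kept, dropped


# Hypothetical two-message archive used only to exercise the sketch.
SAMPLE = (
    b"From alice@example.invalid Wed Apr  1 12:00:00 2009\n"
    b"Message-ID: <ham-1@example.invalid>\n"
    b"Subject: patch review\n"
    b"\n"
    b"Looks good.\n"
    b"\n"
    b"From spammer@example.invalid Wed Apr  1 12:05:00 2009\n"
    b"Message-ID: <spam-1@example.invalid>\n"
    b"Subject: buy now\n"
    b"\n"
    b"Click here.\n"
    b"\n"
)

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "200904.gz")
dst = os.path.join(workdir, "200904.cleaned.gz")
with gzip.open(src, "wb") as f:
    f.write(SAMPLE)

counts = prune_gzipped_mbox(src, dst, {"spam-1@example.invalid"})
with gzip.open(dst, "rb") as f:
    cleaned = f.read()
```

Comparing the kept and dropped counts, or diffing the uncompressed old
and new files, before replacing anything is the safety check suggested
above.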

Regards
Santiago

> Keeping ham and spam might be useful in providing data for the
> SpamAssassin project?
> But maybe they have already processed all the ASF mail.
> 
> >  Once we have the tool, I think the task of actually cleaning it up
> >  becomes much easier, and one we can do in a reactive mode.
> >
> >  Thoughts?
> >
> >  Thanks,
> >
> >
> >  Paul
> >


Re: Mail Archive Spam Cleanup

Posted by Justin Mason <jm...@jmason.org>.
On Wed, Apr 1, 2009 at 13:57, sebb <se...@gmail.com> wrote:
> Keeping ham and spam might be useful in providing data for the
> SpamAssassin project?
> But maybe they have already processed all the ASF mail.

if we can get a list of "confirmed ham" or "confirmed spam" messages,
and we can gain access to them as unmunged (i.e. all Received hdrs
intact) messages in mbox/maildir format, then that would indeed be
useful ;)

--j.

Re: Mail Archive Spam Cleanup

Posted by sebb <se...@gmail.com>.
On 01/04/2009, Paul Querna <pa...@querna.org> wrote:
> Hi,
>
>  First, this is kinda a call for a volunteer, I doubt I'll have time to
>  really get into fixing this any time soon.
>
>  I have access to the webmaster tools for mail-archives.apache.org, and
>  one of the most disturbing things is that for our top 20 queries by
>  traffic, none of them have to do with software at the ASF.
>
>  19 are porn related, and 1 is about hacking gmail.
>
>  It would be nice to clean this up, as mail moderation is never
>  perfect, but keeping these messages up on mail-archives.apache.org
>  forever is less than ideal.

And subscribed/allowed users may generate spam, e.g. if their system
is compromised.

>  Part of the problem is that there is not an easy way to remove things
>  from the mail archives, as they are just mbox files on disk.
>
>  I have two ideas on how we could solve this:
>   1) Add a feature to mod_mbox: a Message-ID blacklist file.  If a
>  message-id is contained in the blacklist, it just 404s on the site
>  like it wasn't there.  This means many parts of mod_mbox need to be
>  modified to check this blacklist, and anyone who rsyncs our mbox
>  files will still get the spam.  This however is likely the easiest to
>  manage, as we could make a tiny webapp for adding message IDs to the
>  blacklist.  This is also an easily reversible step, so if we
>  accidentally blacklist something, it's easy to fix.
>
>   2) Edit the raw mbox files themselves.  This is the hardest, as the
>  mbox files in archived format are gzip compressed, so any tool
>  would need to uncompress, edit, recompress....  Maybe a command line
>  (python?) tool that we could run from people.apache.org, but it would
>  be much harder to make web based.  It does have the advantage that all
>  3rd parties would get edited archives, but I doubt many will re-read
>  the edited files.  Without care in how this is done, however, this is
>  literally destroying data.

I suggest that the tool should split the files into ham and spam; no
need to delete the spam, as very little extra space will be used
compared with keeping the original. This would allow recovery of
messages if required.

The tool could also use message ids to drive the split.
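
A minimal sketch of such a split, driven by a set of message IDs and
using Python's standard mailbox module on an uncompressed mbox. The
function name split_mbox, the file names, and the sample messages are
hypothetical; where the list of spam IDs comes from is left open.

```python
import mailbox
import os
import tempfile


def split_mbox(src_path, ham_path, spam_path, spam_ids):
    """Copy every message in src_path into either ham_path or spam_path.

    Nothing is deleted: each message ends up in exactly one output mbox.
    Returns a dict with the number of messages written to each.
    """
    src = mailbox.mbox(src_path, create=False)
    ham = mailbox.mbox(ham_path)
    spam = mailbox.mbox(spam_path)
    counts = {"ham": 0, "spam": 0}
    try:
        for msg in src:
            msg_id = str(msg.get("Message-ID", "")).strip().strip("<>")
            if msg_id in spam_ids:
                spam.add(msg)
                counts["spam"] += 1
            else:
                ham.add(msg)
                counts["ham"] += 1
        ham.flush()
        spam.flush()
    finally:
        src.close()
        ham.close()
        spam.close()
    return counts


# Hypothetical two-message mbox used only to exercise the sketch.
SAMPLE = (
    b"From alice@example.invalid Wed Apr  1 12:00:00 2009\n"
    b"Message-ID: <ham-1@example.invalid>\n"
    b"Subject: patch review\n"
    b"\n"
    b"Looks good.\n"
    b"\n"
    b"From spammer@example.invalid Wed Apr  1 12:05:00 2009\n"
    b"Message-ID: <spam-1@example.invalid>\n"
    b"Subject: buy now\n"
    b"\n"
    b"Click here.\n"
)

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "200904.mbox")
ham_path = os.path.join(workdir, "200904.ham.mbox")
spam_path = os.path.join(workdir, "200904.spam.mbox")
with open(src_path, "wb") as f:
    f.write(SAMPLE)

result = split_mbox(src_path, ham_path, spam_path, {"spam-1@example.invalid"})
with open(ham_path, "rb") as f:
    ham_bytes = f.read()
with open(spam_path, "rb") as f:
    spam_bytes = f.read()
```

Because the spam file keeps the full messages, a wrongly classified
message can be moved back by running the same split with a corrected
ID list.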

Again, it would be nice if there was a web-app that one could use to
browse messages and mark them as spam.

Keeping ham and spam might be useful in providing data for the
SpamAssassin project?
But maybe they have already processed all the ASF mail.

>  Once we have the tool, I think the task of actually cleaning it up
>  becomes much easier, and one we can do in a reactive mode.
>
>  Thoughts?
>
>  Thanks,
>
>
>  Paul
>