You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by James Keating <ja...@okeating.net> on 2005/12/26 17:03:45 UTC

Real-Time Stats Plugin Released

Morning All,

   I have released a real-time stats collection plugin 
(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. 
The plugin collects data into a MySQL DB based on the date and username 
that SpamAssassin was invoked with.  This creates a single row for each 
user, each day. Also, a row for the total number of messages for that 
day is created.  Each entry contains the date, user name, domain name, 
number of hams, number of spams and total number of messages processed 
for that user (or for the system).  This is an example of what you would 
see in your table -


day 		username 	domain 		spam 	ham 	total
2005-12-26 	$TOTALS 	$TOTALS 	106 	229 	335
2005-12-26 	user1 		domain.com 	11 	0 	11
2005-12-26 	user2 		domain.com 	25 	15 	40
2005-12-26 	user3	 	domain.com 	27 	3 	30
2005-12-26 	user4	 	domain.com 	0 	24 	24
2005-12-26 	user5 		domain.com 	1 	0 	1
2005-12-26 	user6 		domain.com 	0 	1 	1
etc...

A new entry for each user is created each day, so there is historical 
information for as long as you want to make it.  As I said before this 
is collected in real-time, so when a new message is processed by 
SpamAssassin the numbers are automatically increased for an existing 
entry or a new entry is made.  This elevates the need to continually 
have scripts parse log files.

  As always I cannot guarantee that this will work with your 
installation/configuration of SpamAssassin or that it will not make your 
system explode, but it has worked for me in a production environment for 
quiet some time.  Also, this can only be used with SpamAssassin 3.1+. 
The code and more details about installing/configuring it can be seen on 
the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin

Let me know if you run into problems or have suggestions for modifications.

- James

Re: Real-Time Stats Plugin Released

Posted by Jim Knuth <jk...@jkart.de>.
Heute (28.12.2005/19:45 Uhr) schrieb Gary V (mr88talent@hotmail.com),

>> > [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
>>
>>do you know, which (maybe) problems by change of MySQL 4.1
>>becomes? Or not? I mean database errors and so on. I`ve heard of
>>password and charset mistakes.
>>--
>>Viele Grueße, Kind regards,
>>  Jim Knuth

> Sorry Jim, I have not tried an upgrade. As you know my friend, I am new at
> MySQL also.

mmh, yes I know. It was only a try. ;)

> Gary V


-- 
Viele Gruesse, Kind regards,
 Jim Knuth
 jk@jkart.de
 ICQ #277289867 - VoIP: +49 (0) 322 212 044 67
 Key ID: 0x1F78066F
----------
Zufalls-Zitat
----------
Eskimos haben Hunderte von Worten für "Eis", aber keines 
für "Hallo".
----------
Der Text hat nichts mit dem Empfaenger der Mail zu tun
----------
Virus free. Checked by NOD32 Version 1.1343 Build 6524  28.12.2005

Re: Real-Time Stats Plugin Released

Posted by Gary V <mr...@hotmail.com>.
> > [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
>
>do you know, which (maybe) problems by change of MySQL 4.1
>becomes? Or not? I mean database errors and so on. I`ve heard of
>password and charset mistakes.
>--
>Viele Grüße, Kind regards,
>  Jim Knuth

Sorry Jim, I have not tried an upgrade. As you know my friend, I am new at 
MySQL also.

Gary V

_________________________________________________________________
Is your PC infected? Get a FREE online computer virus scan from McAfeeŽ 
Security. http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963


Re: Real-Time Stats Plugin Released

Posted by Jim Knuth <jk...@jkart.de>.
Heute (28.12.2005/05:41 Uhr) schrieb Gary V (mr88talent@hotmail.com),

>>Well the simplest fix is the one that I did not implement in the first 
>>place, using "ON DUPLICATE KEY".  However, I did not implement that because 
>>of its only being in version 4.1 of MySQL and I still use Debian stable for 
>>most production machines, which runs 4.0.x.
>>
>>Anyway, I will poke at it some more.
>>
>>- James

> [OT] FYI, both 4.0 and 4.1 are available in Debian stable.

do you know, which (maybe) problems by change of MySQL 4.1
becomes? Or not? I mean database errors and so on. I`ve heard of
password and charset mistakes.

> Package mysql-server
> stable (misc): mysql database server binaries
> 4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.0.24-10: amd64

> Package mysql-server-4.1
> stable (misc): mysql database server binaries
> 4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.1.11a-4: amd64

> Gary V

-- 
Viele Grüße, Kind regards,
 Jim Knuth
 jk@jkart.de
 ICQ #277289867 - VoIP: +49 (0) 322 212 044 67
 Key ID: 0x1F78066F
----------
Zufalls-Zitat
----------
Wenn ein Mann nicht bereit ist, für seine Überzeugungen 
Risiken einzugehen, dann taugen entweder seiner 
Überzeugungen oder er selbstnichts. (Ezra Pand)
----------
Der Text hat nichts mit dem Empfänger der Mail zu tun
----------
Virus free. Checked by NOD32 Version 1.1341 Build 6516  27.12.2005

Re: Real-Time Stats Plugin Released - UPDATE RELEASED

Posted by Duane Hill <d....@yournetplus.com>.
On Monday, January 2, 2006 at 6:50:15 PM, jamesk@okeating.net confabulated:

> You did have the latest version Duane.

> I guess MySQL 4.1 doesn't care if you have the "WHERE" clause after it;
> as I have been using the updated version for sometime and have not 
> received any errors.  But MySQL 5.0, apparently, doesn't like that 
> syntax.  I removed the WHERE clause and tested it with MySQL 4.1 and it
> still appears to function properly.
> I have updated the SpamAssassin Wiki and the code on my site.
> Thanks for catching that Duane.

Not  a problem. Glad I can be of some assistance on this list finally.
SA  was  completely new to me when I joined the list and usually I sit
back  in  the woodwork and let things kind of soak in for a bit before
attempting  to  aid  in  something. I usually like to make sure I have
most,  if  not  all,  the  ground work done in a solution to something
rather  than having to go back and forth a few times. That was hard at
first on this one because MySQL is still relatively new to me.

Anyhow, the plugin is working like a charm at this time. Thanks!

> Duane Hill wrote:
>> I  am  running  MySQL  5.0  and  WAS  receiving  an  error for the SQL
>> statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
>> investigation  and looking into the MySQL docs I have fixed the issue.
>> 
>> I  dropped  the  entire  WHERE  from the statement. According to MySQL
>> docs, there is no WHERE in the statement. If a record key exists, that
>> record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.
>> 
>> Or, maybe I downloaded the wrong version of the plugin.
>> 
>> On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:
>> 
>> 
>>>I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
>>>MySQL.  This feature insures that if the insert statement is executed
>>>and another insert has already occurred for the changing day, an update
>>>will take place instead.  Hopefully, this will alleviate the largest
>>>possibility of a race condition.  However, this is a feature of MySQL
>>>that is only available inside of MySQL 4.1 and greater.  If you are NOT
>>>running at least MySQL 4.1 you can get the old code from my website
>>>listed below.
>> 
>> 
>>>Where to get the changes -
>> 
>> 
>>>http://wiki.apache.org/spamassassin/StatsPlugin
>>>or
>>>http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
>> 
>> 
>>>Let me know if you see any other issues or problems.
>>>Thanks for everyone's input, more is always appreciated.
>> 
>> 
>>>Thanks,
>>>  James
>> 
>> 
>>>Jim C. Nasby wrote:
>>>
>>>>On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>>>>
>>>>
>>>>>James Keating wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Well the simplest fix is the one that I did not implement in the first 
>>>>>>place, using "ON DUPLICATE KEY".  However, I did not implement that 
>>>>>>because of its only being in version 4.1 of MySQL and I still use 
>>>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>>>
>>>>>>Anyway, I will poke at it some more.
>>>>>>
>>>>>
>>>>>Read on for a portable fix called optimistic locking. 
>>>>>
>>>>>
>>>>>Pseudocode follows.  I assume you are doing something like this:
>>>>
>>>>
>>>>You assume wrong. :) The code is doing
>>>>
>>>>UPDATE ... SET field = field + 1 ...
>>>>
>>>>which is almost always a much better idea than doing the increment in
>>>>code.
>>>>
>>>>The race condition is in checking to see if the row already exists, and
>>>>then using that information to decide if you should do an insert or an
>>>>update.
>>>>
>>>>The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>>>>is how you would do this in a function in PostgreSQL, though there's no
>>>>reason I can think of why you couldn't do that in perl instead (though
>>>>it would perform much slower).

--

"This message is made of 100% recycled electrons."


Re: Real-Time Stats Plugin Released - UPDATE RELEASED

Posted by James Keating <ja...@okeating.net>.
You did have the latest version Duane.

I guess MySQL 4.1 doesn't care if you have the "WHERE" clause after it; 
as I have been using the updated version for sometime and have not 
received any errors.  But MySQL 5.0, apparently, doesn't like that 
syntax.  I removed the WHERE clause and tested it with MySQL 4.1 and it 
still appears to function properly.
I have updated the SpamAssassin Wiki and the code on my site.
Thanks for catching that Duane.

- James

Duane Hill wrote:
> I  am  running  MySQL  5.0  and  WAS  receiving  an  error for the SQL
> statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
> investigation  and looking into the MySQL docs I have fixed the issue.
> 
> I  dropped  the  entire  WHERE  from the statement. According to MySQL
> docs, there is no WHERE in the statement. If a record key exists, that
> record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.
> 
> Or, maybe I downloaded the wrong version of the plugin.
> 
> On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:
> 
> 
>>I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
>>MySQL.  This feature insures that if the insert statement is executed
>>and another insert has already occurred for the changing day, an update
>>will take place instead.  Hopefully, this will alleviate the largest
>>possibility of a race condition.  However, this is a feature of MySQL
>>that is only available inside of MySQL 4.1 and greater.  If you are NOT
>>running at least MySQL 4.1 you can get the old code from my website
>>listed below.
> 
> 
>>Where to get the changes -
> 
> 
>>http://wiki.apache.org/spamassassin/StatsPlugin
>>or
>>http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
> 
> 
>>Let me know if you see any other issues or problems.
>>Thanks for everyone's input, more is always appreciated.
> 
> 
>>Thanks,
>>  James
> 
> 
>>Jim C. Nasby wrote:
>>
>>>On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>>>
>>>
>>>>James Keating wrote:
>>>>
>>>>
>>>>
>>>>>Well the simplest fix is the one that I did not implement in the first 
>>>>>place, using "ON DUPLICATE KEY".  However, I did not implement that 
>>>>>because of its only being in version 4.1 of MySQL and I still use 
>>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>>
>>>>>Anyway, I will poke at it some more.
>>>>>
>>>>
>>>>Read on for a portable fix called optimistic locking. 
>>>>
>>>>
>>>>Pseudocode follows.  I assume you are doing something like this:
>>>
>>>
>>>You assume wrong. :) The code is doing
>>>
>>>UPDATE ... SET field = field + 1 ...
>>>
>>>which is almost always a much better idea than doing the increment in
>>>code.
>>>
>>>The race condition is in checking to see if the row already exists, and
>>>then using that information to decide if you should do an insert or an
>>>update.
>>>
>>>The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>>>is how you would do this in a function in PostgreSQL, though there's no
>>>reason I can think of why you couldn't do that in perl instead (though
>>>it would perform much slower).
> 
> 
> --
> 
> "This message is made of 100% recycled electrons."
> 

Re: Real-Time Stats Plugin Released - UPDATE RELEASED

Posted by Duane Hill <d....@yournetplus.com>.
I  am  running  MySQL  5.0  and  WAS  receiving  an  error for the SQL
statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
investigation  and looking into the MySQL docs I have fixed the issue.

I  dropped  the  entire  WHERE  from the statement. According to MySQL
docs, there is no WHERE in the statement. If a record key exists, that
record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.

Or, maybe I downloaded the wrong version of the plugin.

On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:

> I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
> MySQL.  This feature insures that if the insert statement is executed
> and another insert has already occurred for the changing day, an update
> will take place instead.  Hopefully, this will alleviate the largest
> possibility of a race condition.  However, this is a feature of MySQL
> that is only available inside of MySQL 4.1 and greater.  If you are NOT
> running at least MySQL 4.1 you can get the old code from my website
> listed below.

> Where to get the changes -

> http://wiki.apache.org/spamassassin/StatsPlugin
> or
> http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update

> Let me know if you see any other issues or problems.
> Thanks for everyone's input, more is always appreciated.

> Thanks,
>   James

> Jim C. Nasby wrote:
>> On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>> 
>>>James Keating wrote:
>>>
>>>
>>>>Well the simplest fix is the one that I did not implement in the first 
>>>>place, using "ON DUPLICATE KEY".  However, I did not implement that 
>>>>because of its only being in version 4.1 of MySQL and I still use 
>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>
>>>>Anyway, I will poke at it some more.
>>>>
>>>
>>>Read on for a portable fix called optimistic locking. 
>>>
>>>
>>>Pseudocode follows.  I assume you are doing something like this:
>> 
>> 
>> You assume wrong. :) The code is doing
>> 
>> UPDATE ... SET field = field + 1 ...
>> 
>> which is almost always a much better idea than doing the increment in
>> code.
>> 
>> The race condition is in checking to see if the row already exists, and
>> then using that information to decide if you should do an insert or an
>> update.
>> 
>> The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>> is how you would do this in a function in PostgreSQL, though there's no
>> reason I can think of why you couldn't do that in perl instead (though
>> it would perform much slower).

--

"This message is made of 100% recycled electrons."


Re: Real-Time Stats Plugin Released - UPDATE RELEASED

Posted by James Keating <ja...@okeating.net>.
I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
MySQL.  This feature insures that if the insert statement is executed
and another insert has already occurred for the changing day, an update
will take place instead.  Hopefully, this will alleviate the largest
possibility of a race condition.  However, this is a feature of MySQL
that is only available inside of MySQL 4.1 and greater.  If you are NOT
running at least MySQL 4.1 you can get the old code from my website
listed below.

Where to get the changes -

http://wiki.apache.org/spamassassin/StatsPlugin
or
http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update

Let me know if you see any other issues or problems.
Thanks for everyone's input, more is always appreciated.

Thanks,
  James

Jim C. Nasby wrote:
> On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
> 
>>James Keating wrote:
>>
>>
>>>Well the simplest fix is the one that I did not implement in the first 
>>>place, using "ON DUPLICATE KEY".  However, I did not implement that 
>>>because of its only being in version 4.1 of MySQL and I still use 
>>>Debian stable for most production machines, which runs 4.0.x.
>>>
>>>Anyway, I will poke at it some more.
>>>
>>
>>Read on for a portable fix called optimistic locking. 
>>
>>
>>Pseudocode follows.  I assume you are doing something like this:
> 
> 
> You assume wrong. :) The code is doing
> 
> UPDATE ... SET field = field + 1 ...
> 
> which is almost always a much better idea than doing the increment in
> code.
> 
> The race condition is in checking to see if the row already exists, and
> then using that information to decide if you should do an insert or an
> update.
> 
> The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
> is how you would do this in a function in PostgreSQL, though there's no
> reason I can think of why you couldn't do that in perl instead (though
> it would perform much slower).


Re: Real-Time Stats Plugin Released

Posted by "Jim C. Nasby" <de...@decibel.org>.
On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
> James Keating wrote:
> 
> >Well the simplest fix is the one that I did not implement in the first 
> >place, using "ON DUPLICATE KEY".  However, I did not implement that 
> >because of its only being in version 4.1 of MySQL and I still use 
> >Debian stable for most production machines, which runs 4.0.x.
> >
> >Anyway, I will poke at it some more.
> >
> Read on for a portable fix called optimistic locking. 
> 
> 
> Pseudocode follows.  I assume you are doing something like this:

You assume wrong. :) The code is doing

UPDATE ... SET field = field + 1 ...

which is almost always a much better idea than doing the increment in
code.

The race condition is in checking to see if the row already exists, and
then using that information to decide if you should do an insert or an
update.

The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
is how you would do this in a function in PostgreSQL, though there's no
reason I can think of why you couldn't do that in perl instead (though
it would perform much slower).
-- 
Jim C. Nasby, Database Architect                decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Real-Time Stats Plugin Released

Posted by Chris Thielen <cm...@someone.dhs.org>.
James Keating wrote:

> Well the simplest fix is the one that I did not implement in the first 
> place, using "ON DUPLICATE KEY".  However, I did not implement that 
> because of its only being in version 4.1 of MySQL and I still use 
> Debian stable for most production machines, which runs 4.0.x.
>
> Anyway, I will poke at it some more.
>
Read on for a portable fix called optimistic locking. 


Pseudocode follows.  I assume you are doing something like this:

select ham from table where user = $user;
$ham++;
update table set ham = $ham where user = $user;



You can instead do something like this:

while (!success) {
  select ham from table where user = $user;
  $newham = $currentham + 1;
  update table set ham = $newham where user = $user and ham = $currentham;
  success = (getRowsUpdated() > 0);
}


What this does is update the row only if the value of table.ham has not 
changed.  The updated row count is checked.  If it is greater than 0 
(your row was indeed updated) then you set the success flag and 
continue.  If the updated row count is 0 you know the data has changed 
unexpectedly and must redo the entire read/modify/write cycle.



Usually this approach is done by adding a separate version column that 
is tested and updated, but in this case the ham or spam counters can be 
substituted.  It is called optimistic locking because you assume 
(optimistically) that your update will usually succeed; eg: that nobody 
else has updated the data without you knowing.  It's not really locking, 
but rather a concurrent update detection mechanism which the application 
then must handle programatically.

Re: Real-Time Stats Plugin Released

Posted by James Keating <ja...@okeating.net>.
Gary V wrote:
>> Well the simplest fix is the one that I did not implement in the first 
>> place, using "ON DUPLICATE KEY".  However, I did not implement that 
>> because of its only being in version 4.1 of MySQL and I still use 
>> Debian stable for most production machines, which runs 4.0.x.
>>
>> Anyway, I will poke at it some more.
>>
>> - James
> 
> 
> [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
> 
> Package mysql-server
> stable (misc): mysql database server binaries
> 4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc 
> s390 sparc
> 4.0.24-10: amd64
> 
> Package mysql-server-4.1
> stable (misc): mysql database server binaries
> 4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc 
> s390 sparc
> 4.1.11a-4: amd64
> 
> Gary V
> 
> _________________________________________________________________
> Express yourself instantly with MSN Messenger! Download today - it's 
> FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
> 

Thanks for pointing that out Gary.  I was actually unaware of that.  I 
will probably change the code to use the "ON DUPLICATE KEY" because that 
appears to be the best and only way I see right now.

- James

Re: Real-Time Stats Plugin Released

Posted by Gary V <mr...@hotmail.com>.
>Well the simplest fix is the one that I did not implement in the first 
>place, using "ON DUPLICATE KEY".  However, I did not implement that because 
>of its only being in version 4.1 of MySQL and I still use Debian stable for 
>most production machines, which runs 4.0.x.
>
>Anyway, I will poke at it some more.
>
>- James

[OT] FYI, both 4.0 and 4.1 are available in Debian stable.

Package mysql-server
stable (misc): mysql database server binaries
4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc 
s390 sparc
4.0.24-10: amd64

Package mysql-server-4.1
stable (misc): mysql database server binaries
4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc 
s390 sparc
4.1.11a-4: amd64

Gary V

_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today - it's FREE! 
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/


Re: Real-Time Stats Plugin Released

Posted by James Keating <ja...@okeating.net>.
Well the simplest fix is the one that I did not implement in the first 
place, using "ON DUPLICATE KEY".  However, I did not implement that 
because of its only being in version 4.1 of MySQL and I still use Debian 
stable for most production machines, which runs 4.0.x.

Anyway, I will poke at it some more.

- James


Jim C. Nasby wrote:
> On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:
> 
>>Indeed. My thinking behind storing both system totals ($TOTALS) and user 
>>totals in the database was for easy error checking.  Each day you can 
>>quickly/easily run through the database and look for potential errors 
>>(possibly resulting from the race condition that you mentioned). 
>>However, I have not seen any miscalculations yet.
> 
> 
> Well, there won't be any errors with an ACID database (ie: not MyISAM)
> and if the race condition is handled properly.
> 
> 
>>Regarding the race condition, I would be more than happy to fix it, and 
>>any input would be wonderful on that.  I will poke around and see what I 
>>can come up with on my own.
> 
> 
> I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
> person.
> 
> 
>>Jim C. Nasby wrote:
>>
>>>Neat plugin. I have two comments:
>>>
>>>I wouldn't store $TOTALS or the total column in the database, as both
>>>can easily be calculated when retrieving the data.
>>>
>>>There is a race condition, especially for $TOTALS. First you check for
>>>existence, then you try and do an insert or an update based on that. If
>>>more than one SA thread is running, you can end up with duplicated
>>>entries. I think MySQL has a merge command that would be more
>>>appropriate.
>>>
>>>Also, it appears that this would work on PostgreSQL with minimal
>>>modifications. If anyone's interested in that I'd be happy to help.
>>>
>>>On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
>>>
>>>
>>>>Morning All,
>>>>
>>>>I have released a real-time stats collection plugin 
>>>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. 
>>>>The plugin collects data into a MySQL DB based on the date and username 
>>>>that SpamAssassin was invoked with.  This creates a single row for each 
>>>>user, each day. Also, a row for the total number of messages for that 
>>>>day is created.  Each entry contains the date, user name, domain name, 
>>>>number of hams, number of spams and total number of messages processed 
>>>>for that user (or for the system).  This is an example of what you would 
>>>>see in your table -
>>>>
>>>>
>>>>day 		username 	domain 		spam 	ham 	total
>>>>2005-12-26 	$TOTALS 	$TOTALS 	106 	229 	335
>>>>2005-12-26 	user1 		domain.com 	11 	0 	11
>>>>2005-12-26 	user2 		domain.com 	25 	15 	40
>>>>2005-12-26 	user3	 	domain.com 	27 	3 	30
>>>>2005-12-26 	user4	 	domain.com 	0 	24 	24
>>>>2005-12-26 	user5 		domain.com 	1 	0 	1
>>>>2005-12-26 	user6 		domain.com 	0 	1 	1
>>>>etc...
>>>>
>>>>A new entry for each user is created each day, so there is historical 
>>>>information for as long as you want to make it.  As I said before this 
>>>>is collected in real-time, so when a new message is processed by 
>>>>SpamAssassin the numbers are automatically increased for an existing 
>>>>entry or a new entry is made.  This elevates the need to continually 
>>>>have scripts parse log files.
>>>>
>>>>As always I cannot guarantee that this will work with your 
>>>>installation/configuration of SpamAssassin or that it will not make your 
>>>>system explode, but it has worked for me in a production environment for 
>>>>quiet some time.  Also, this can only be used with SpamAssassin 3.1+. 
>>>>The code and more details about installing/configuring it can be seen on 
>>>>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
>>>>
>>>>Let me know if you run into problems or have suggestions for 
>>>>modifications.
>>>>
>>>>- James
>>>>
>>>
>>>
> 

Re: Real-Time Stats Plugin Released

Posted by "Jim C. Nasby" <de...@decibel.org>.
On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:
> Indeed. My thinking behind storing both system totals ($TOTALS) and user 
> totals in the database was for easy error checking.  Each day you can 
> quickly/easily run through the database and look for potential errors 
> (possibly resulting from the race condition that you mentioned). 
> However, I have not seen any miscalculations yet.

Well, there won't be any errors with an ACID database (ie: not MyISAM)
and if the race condition is handled properly.

> Regarding the race condition, I would be more than happy to fix it, and 
> any input would be wonderful on that.  I will poke around and see what I 
> can come up with on my own.

I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
person.

> Jim C. Nasby wrote:
> >Neat plugin. I have two comments:
> >
> >I wouldn't store $TOTALS or the total column in the database, as both
> >can easily be calculated when retrieving the data.
> >
> >There is a race condition, especially for $TOTALS. First you check for
> >existence, then you try and do an insert or an update based on that. If
> >more than one SA thread is running, you can end up with duplicated
> >entries. I think MySQL has a merge command that would be more
> >appropriate.
> >
> >Also, it appears that this would work on PostgreSQL with minimal
> >modifications. If anyone's interested in that I'd be happy to help.
> >
> >On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
> >
> >>Morning All,
> >>
> >> I have released a real-time stats collection plugin 
> >>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. 
> >>The plugin collects data into a MySQL DB based on the date and username 
> >>that SpamAssassin was invoked with.  This creates a single row for each 
> >>user, each day. Also, a row for the total number of messages for that 
> >>day is created.  Each entry contains the date, user name, domain name, 
> >>number of hams, number of spams and total number of messages processed 
> >>for that user (or for the system).  This is an example of what you would 
> >>see in your table -
> >>
> >>
> >>day 		username 	domain 		spam 	ham 	total
> >>2005-12-26 	$TOTALS 	$TOTALS 	106 	229 	335
> >>2005-12-26 	user1 		domain.com 	11 	0 	11
> >>2005-12-26 	user2 		domain.com 	25 	15 	40
> >>2005-12-26 	user3	 	domain.com 	27 	3 	30
> >>2005-12-26 	user4	 	domain.com 	0 	24 	24
> >>2005-12-26 	user5 		domain.com 	1 	0 	1
> >>2005-12-26 	user6 		domain.com 	0 	1 	1
> >>etc...
> >>
> >>A new entry for each user is created each day, so there is historical 
> >>information for as long as you want to make it.  As I said before this 
> >>is collected in real-time, so when a new message is processed by 
> >>SpamAssassin the numbers are automatically increased for an existing 
> >>entry or a new entry is made.  This elevates the need to continually 
> >>have scripts parse log files.
> >>
> >>As always I cannot guarantee that this will work with your 
> >>installation/configuration of SpamAssassin or that it will not make your 
> >>system explode, but it has worked for me in a production environment for 
> >>quiet some time.  Also, this can only be used with SpamAssassin 3.1+. 
> >>The code and more details about installing/configuring it can be seen on 
> >>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
> >>
> >>Let me know if you run into problems or have suggestions for 
> >>modifications.
> >>
> >>- James
> >>
> >
> >
> 

-- 
Jim C. Nasby, Database Architect                decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Real-Time Stats Plugin Released

Posted by James Keating <ja...@okeating.net>.
Indeed. My thinking behind storing both system totals ($TOTALS) and user 
totals in the database was for easy error checking.  Each day you can 
quickly/easily run through the database and look for potential errors 
(possibly resulting from the race condition that you mentioned). 
However, I have not seen any miscalculations yet.

Regarding the race condition, I would be more than happy to fix it, and 
any input would be wonderful on that.  I will poke around and see what I 
can come up with on my own.

- James

Jim C. Nasby wrote:
> Neat plugin. I have two comments:
> 
> I wouldn't store $TOTALS or the total column in the database, as both
> can easily be calculated when retrieving the data.
> 
> There is a race condition, especially for $TOTALS. First you check for
> existence, then you try and do an insert or an update based on that. If
> more than one SA thread is running, you can end up with duplicated
> entries. I think MySQL has a merge command that would be more
> appropriate.
> 
> Also, it appears that this would work on PostgreSQL with minimal
> modifications. If anyone's interested in that I'd be happy to help.
> 
> On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
> 
>>Morning All,
>>
>>  I have released a real-time stats collection plugin 
>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. 
>>The plugin collects data into a MySQL DB based on the date and username 
>>that SpamAssassin was invoked with.  This creates a single row for each 
>>user, each day. Also, a row for the total number of messages for that 
>>day is created.  Each entry contains the date, user name, domain name, 
>>number of hams, number of spams and total number of messages processed 
>>for that user (or for the system).  This is an example of what you would 
>>see in your table -
>>
>>
>>day 		username 	domain 		spam 	ham 	total
>>2005-12-26 	$TOTALS 	$TOTALS 	106 	229 	335
>>2005-12-26 	user1 		domain.com 	11 	0 	11
>>2005-12-26 	user2 		domain.com 	25 	15 	40
>>2005-12-26 	user3	 	domain.com 	27 	3 	30
>>2005-12-26 	user4	 	domain.com 	0 	24 	24
>>2005-12-26 	user5 		domain.com 	1 	0 	1
>>2005-12-26 	user6 		domain.com 	0 	1 	1
>>etc...
>>
>>A new entry for each user is created each day, so there is historical 
>>information for as long as you want to make it.  As I said before this 
>>is collected in real-time, so when a new message is processed by 
>>SpamAssassin the numbers are automatically increased for an existing 
>>entry or a new entry is made.  This elevates the need to continually 
>>have scripts parse log files.
>>
>> As always I cannot guarantee that this will work with your 
>>installation/configuration of SpamAssassin or that it will not make your 
>>system explode, but it has worked for me in a production environment for 
>>quiet some time.  Also, this can only be used with SpamAssassin 3.1+. 
>>The code and more details about installing/configuring it can be seen on 
>>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
>>
>>Let me know if you run into problems or have suggestions for modifications.
>>
>>- James
>>
> 
> 

Re: Real-Time Stats Plugin Released

Posted by "Jim C. Nasby" <de...@decibel.org>.
Neat plugin. I have two comments:

I wouldn't store $TOTALS or the total column in the database, as both
can easily be calculated when retrieving the data.

There is a race condition, especially for $TOTALS. First you check for
existence, then you try and do an insert or an update based on that. If
more than one SA thread is running, you can end up with duplicated
entries. I think MySQL has a merge command that would be more
appropriate.

Also, it appears that this would work on PostgreSQL with minimal
modifications. If anyone's interested in that I'd be happy to help.

On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
> Morning All,
> 
>   I have released a real-time stats collection plugin 
> (http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin. 
> The plugin collects data into a MySQL DB based on the date and username 
> that SpamAssassin was invoked with.  This creates a single row for each 
> user, each day. Also, a row for the total number of messages for that 
> day is created.  Each entry contains the date, user name, domain name, 
> number of hams, number of spams and total number of messages processed 
> for that user (or for the system).  This is an example of what you would 
> see in your table -
> 
> 
> day 		username 	domain 		spam 	ham 	total
> 2005-12-26 	$TOTALS 	$TOTALS 	106 	229 	335
> 2005-12-26 	user1 		domain.com 	11 	0 	11
> 2005-12-26 	user2 		domain.com 	25 	15 	40
> 2005-12-26 	user3	 	domain.com 	27 	3 	30
> 2005-12-26 	user4	 	domain.com 	0 	24 	24
> 2005-12-26 	user5 		domain.com 	1 	0 	1
> 2005-12-26 	user6 		domain.com 	0 	1 	1
> etc...
> 
> A new entry for each user is created each day, so there is historical 
> information for as long as you want to make it.  As I said before this 
> is collected in real-time, so when a new message is processed by 
> SpamAssassin the numbers are automatically increased for an existing 
> entry or a new entry is made.  This elevates the need to continually 
> have scripts parse log files.
> 
>  As always I cannot guarantee that this will work with your 
> installation/configuration of SpamAssassin or that it will not make your 
> system explode, but it has worked for me in a production environment for 
> quiet some time.  Also, this can only be used with SpamAssassin 3.1+. 
> The code and more details about installing/configuring it can be seen on 
> the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
> 
> Let me know if you run into problems or have suggestions for modifications.
> 
> - James
> 

-- 
Jim C. Nasby, Database Architect                decibel@decibel.org 
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Real-Time Stats Plugin Released

Posted by James Keating <ja...@okeating.net>.
A new version of the plugin has been released thanks to Andrew Donkin 
who provided a variety of updates for the code.  The updates include the 
following changes -

- Removed the exists sub as it is not necessary when using an INSERT 
with ON DUPLICATE KEY UPDATE.

- Removed the TOTALS insert and the totals column. These are not 
necessary and can be calculated via select statements.

- Changed the table schema to remove the totals field

- Numerous other changes

The code is much much smaller now and faster.  It is backwards 
compatible with the old table schema, but a new table schema (without 
the totals column) is available.

Code can be found at the SpamAssassin Wiki -
http://wiki.apache.org/spamassassin/StatsPlugin

Or at my bl0g -
http://www.okeating.net/blosxom.cgi/2006/01/08#statsplugin-update-010706

Thanks Andrew for your great input and code!

- James


Andrew Donkin wrote:
> James Keating <ja...@okeating.net> writes:
> 
> 
>>Morning All,
>>
>>   I have released a real-time stats collection plugin
>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
> 
> 
> Actually James, now that you have that atomic insert/update in there,
> you can get rid of $user_exists, $total_exists, the last parameter to
> execute_stats(), and get_current_entry().
> 
> I am running with that version and it is working well, at about one
> update per second.
>