You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by James Keating <ja...@okeating.net> on 2005/12/26 17:03:45 UTC
Real-Time Stats Plugin Released
Morning All,
I have released a real-time stats collection plugin
(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
The plugin collects data into a MySQL DB based on the date and username
that SpamAssassin was invoked with. This creates a single row for each
user, each day. Also, a row for the total number of messages for that
day is created. Each entry contains the date, user name, domain name,
number of hams, number of spams and total number of messages processed
for that user (or for the system). This is an example of what you would
see in your table -
day username domain spam ham total
2005-12-26 $TOTALS $TOTALS 106 229 335
2005-12-26 user1 domain.com 11 0 11
2005-12-26 user2 domain.com 25 15 40
2005-12-26 user3 domain.com 27 3 30
2005-12-26 user4 domain.com 0 24 24
2005-12-26 user5 domain.com 1 0 1
2005-12-26 user6 domain.com 0 1 1
etc...
A new entry for each user is created each day, so there is historical
information for as long as you want to make it. As I said before this
is collected in real-time, so when a new message is processed by
SpamAssassin the numbers are automatically increased for an existing
entry or a new entry is made. This elevates the need to continually
have scripts parse log files.
As always I cannot guarantee that this will work with your
installation/configuration of SpamAssassin or that it will not make your
system explode, but it has worked for me in a production environment for
quiet some time. Also, this can only be used with SpamAssassin 3.1+.
The code and more details about installing/configuring it can be seen on
the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
Let me know if you run into problems or have suggestions for modifications.
- James
Re: Real-Time Stats Plugin Released
Posted by Jim Knuth <jk...@jkart.de>.
Heute (28.12.2005/19:45 Uhr) schrieb Gary V (mr88talent@hotmail.com),
>> > [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
>>
>>do you know, which (maybe) problems by change of MySQL 4.1
>>becomes? Or not? I mean database errors and so on. I`ve heard of
>>password and charset mistakes.
>>--
>>Viele Grueße, Kind regards,
>> Jim Knuth
> Sorry Jim, I have not tried an upgrade. As you know my friend, I am new at
> MySQL also.
mmh, yes I know. It was only a try. ;)
> Gary V
--
Viele Gruesse, Kind regards,
Jim Knuth
jk@jkart.de
ICQ #277289867 - VoIP: +49 (0) 322 212 044 67
Key ID: 0x1F78066F
----------
Zufalls-Zitat
----------
Eskimos haben Hunderte von Worten für "Eis", aber keines
für "Hallo".
----------
Der Text hat nichts mit dem Empfaenger der Mail zu tun
----------
Virus free. Checked by NOD32 Version 1.1343 Build 6524 28.12.2005
Re: Real-Time Stats Plugin Released
Posted by Gary V <mr...@hotmail.com>.
> > [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
>
>do you know, which (maybe) problems by change of MySQL 4.1
>becomes? Or not? I mean database errors and so on. I`ve heard of
>password and charset mistakes.
>--
>Viele Grüße, Kind regards,
> Jim Knuth
Sorry Jim, I have not tried an upgrade. As you know my friend, I am new at
MySQL also.
Gary V
_________________________________________________________________
Is your PC infected? Get a FREE online computer virus scan from McAfeeŽ
Security. http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963
Re: Real-Time Stats Plugin Released
Posted by Jim Knuth <jk...@jkart.de>.
Heute (28.12.2005/05:41 Uhr) schrieb Gary V (mr88talent@hotmail.com),
>>Well the simplest fix is the one that I did not implement in the first
>>place, using "ON DUPLICATE KEY". However, I did not implement that because
>>of its only being in version 4.1 of MySQL and I still use Debian stable for
>>most production machines, which runs 4.0.x.
>>
>>Anyway, I will poke at it some more.
>>
>>- James
> [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
do you know, which (maybe) problems by change of MySQL 4.1
becomes? Or not? I mean database errors and so on. I`ve heard of
password and charset mistakes.
> Package mysql-server
> stable (misc): mysql database server binaries
> 4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.0.24-10: amd64
> Package mysql-server-4.1
> stable (misc): mysql database server binaries
> 4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.1.11a-4: amd64
> Gary V
--
Viele Grüße, Kind regards,
Jim Knuth
jk@jkart.de
ICQ #277289867 - VoIP: +49 (0) 322 212 044 67
Key ID: 0x1F78066F
----------
Zufalls-Zitat
----------
Wenn ein Mann nicht bereit ist, für seine Überzeugungen
Risiken einzugehen, dann taugen entweder seiner
Überzeugungen oder er selbstnichts. (Ezra Pand)
----------
Der Text hat nichts mit dem Empfänger der Mail zu tun
----------
Virus free. Checked by NOD32 Version 1.1341 Build 6516 27.12.2005
Re: Real-Time Stats Plugin Released - UPDATE RELEASED
Posted by Duane Hill <d....@yournetplus.com>.
On Monday, January 2, 2006 at 6:50:15 PM, jamesk@okeating.net confabulated:
> You did have the latest version Duane.
> I guess MySQL 4.1 doesn't care if you have the "WHERE" clause after it;
> as I have been using the updated version for sometime and have not
> received any errors. But MySQL 5.0, apparently, doesn't like that
> syntax. I removed the WHERE clause and tested it with MySQL 4.1 and it
> still appears to function properly.
> I have updated the SpamAssassin Wiki and the code on my site.
> Thanks for catching that Duane.
Not a problem. Glad I can be of some assistance on this list finally.
SA was completely new to me when I joined the list and usually I sit
back in the woodwork and let things kind of soak in for a bit before
attempting to aid in something. I usually like to make sure I have
most, if not all, the ground work done in a solution to something
rather than having to go back and forth a few times. That was hard at
first on this one because MySQL is still relatively new to me.
Anyhow, the plugin is working like a charm at this time. Thanks!
> Duane Hill wrote:
>> I am running MySQL 5.0 and WAS receiving an error for the SQL
>> statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
>> investigation and looking into the MySQL docs I have fixed the issue.
>>
>> I dropped the entire WHERE from the statement. According to MySQL
>> docs, there is no WHERE in the statement. If a record key exists, that
>> record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.
>>
>> Or, maybe I downloaded the wrong version of the plugin.
>>
>> On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:
>>
>>
>>>I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
>>>MySQL. This feature insures that if the insert statement is executed
>>>and another insert has already occurred for the changing day, an update
>>>will take place instead. Hopefully, this will alleviate the largest
>>>possibility of a race condition. However, this is a feature of MySQL
>>>that is only available inside of MySQL 4.1 and greater. If you are NOT
>>>running at least MySQL 4.1 you can get the old code from my website
>>>listed below.
>>
>>
>>>Where to get the changes -
>>
>>
>>>http://wiki.apache.org/spamassassin/StatsPlugin
>>>or
>>>http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
>>
>>
>>>Let me know if you see any other issues or problems.
>>>Thanks for everyone's input, more is always appreciated.
>>
>>
>>>Thanks,
>>> James
>>
>>
>>>Jim C. Nasby wrote:
>>>
>>>>On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>>>>
>>>>
>>>>>James Keating wrote:
>>>>>
>>>>>
>>>>>
>>>>>>Well the simplest fix is the one that I did not implement in the first
>>>>>>place, using "ON DUPLICATE KEY". However, I did not implement that
>>>>>>because of its only being in version 4.1 of MySQL and I still use
>>>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>>>
>>>>>>Anyway, I will poke at it some more.
>>>>>>
>>>>>
>>>>>Read on for a portable fix called optimistic locking.
>>>>>
>>>>>
>>>>>Pseudocode follows. I assume you are doing something like this:
>>>>
>>>>
>>>>You assume wrong. :) The code is doing
>>>>
>>>>UPDATE ... SET field = field + 1 ...
>>>>
>>>>which is almost always a much better idea than doing the increment in
>>>>code.
>>>>
>>>>The race condition is in checking to see if the row already exists, and
>>>>then using that information to decide if you should do an insert or an
>>>>update.
>>>>
>>>>The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>>>>is how you would do this in a function in PostgreSQL, though there's no
>>>>reason I can think of why you couldn't do that in perl instead (though
>>>>it would perform much slower).
--
"This message is made of 100% recycled electrons."
Re: Real-Time Stats Plugin Released - UPDATE RELEASED
Posted by James Keating <ja...@okeating.net>.
You did have the latest version Duane.
I guess MySQL 4.1 doesn't care if you have the "WHERE" clause after it;
as I have been using the updated version for sometime and have not
received any errors. But MySQL 5.0, apparently, doesn't like that
syntax. I removed the WHERE clause and tested it with MySQL 4.1 and it
still appears to function properly.
I have updated the SpamAssassin Wiki and the code on my site.
Thanks for catching that Duane.
- James
Duane Hill wrote:
> I am running MySQL 5.0 and WAS receiving an error for the SQL
> statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
> investigation and looking into the MySQL docs I have fixed the issue.
>
> I dropped the entire WHERE from the statement. According to MySQL
> docs, there is no WHERE in the statement. If a record key exists, that
> record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.
>
> Or, maybe I downloaded the wrong version of the plugin.
>
> On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:
>
>
>>I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
>>MySQL. This feature insures that if the insert statement is executed
>>and another insert has already occurred for the changing day, an update
>>will take place instead. Hopefully, this will alleviate the largest
>>possibility of a race condition. However, this is a feature of MySQL
>>that is only available inside of MySQL 4.1 and greater. If you are NOT
>>running at least MySQL 4.1 you can get the old code from my website
>>listed below.
>
>
>>Where to get the changes -
>
>
>>http://wiki.apache.org/spamassassin/StatsPlugin
>>or
>>http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
>
>
>>Let me know if you see any other issues or problems.
>>Thanks for everyone's input, more is always appreciated.
>
>
>>Thanks,
>> James
>
>
>>Jim C. Nasby wrote:
>>
>>>On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>>>
>>>
>>>>James Keating wrote:
>>>>
>>>>
>>>>
>>>>>Well the simplest fix is the one that I did not implement in the first
>>>>>place, using "ON DUPLICATE KEY". However, I did not implement that
>>>>>because of its only being in version 4.1 of MySQL and I still use
>>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>>
>>>>>Anyway, I will poke at it some more.
>>>>>
>>>>
>>>>Read on for a portable fix called optimistic locking.
>>>>
>>>>
>>>>Pseudocode follows. I assume you are doing something like this:
>>>
>>>
>>>You assume wrong. :) The code is doing
>>>
>>>UPDATE ... SET field = field + 1 ...
>>>
>>>which is almost always a much better idea than doing the increment in
>>>code.
>>>
>>>The race condition is in checking to see if the row already exists, and
>>>then using that information to decide if you should do an insert or an
>>>update.
>>>
>>>The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>>>is how you would do this in a function in PostgreSQL, though there's no
>>>reason I can think of why you couldn't do that in perl instead (though
>>>it would perform much slower).
>
>
> --
>
> "This message is made of 100% recycled electrons."
>
Re: Real-Time Stats Plugin Released - UPDATE RELEASED
Posted by Duane Hill <d....@yournetplus.com>.
I am running MySQL 5.0 and WAS receiving an error for the SQL
statement that had the 'ON DUPLICATE KEY UPDATE'. After doing a little
investigation and looking into the MySQL docs I have fixed the issue.
I dropped the entire WHERE from the statement. According to MySQL
docs, there is no WHERE in the statement. If a record key exists, that
record is used in the 'ON DUPLICATE KEY UPDATE' part of the statement.
Or, maybe I downloaded the wrong version of the plugin.
On Saturday, December 31, 2005 at 5:55:31 AM, jamesk@okeating.net confabulated:
> I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
> MySQL. This feature insures that if the insert statement is executed
> and another insert has already occurred for the changing day, an update
> will take place instead. Hopefully, this will alleviate the largest
> possibility of a race condition. However, this is a feature of MySQL
> that is only available inside of MySQL 4.1 and greater. If you are NOT
> running at least MySQL 4.1 you can get the old code from my website
> listed below.
> Where to get the changes -
> http://wiki.apache.org/spamassassin/StatsPlugin
> or
> http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
> Let me know if you see any other issues or problems.
> Thanks for everyone's input, more is always appreciated.
> Thanks,
> James
> Jim C. Nasby wrote:
>> On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>>
>>>James Keating wrote:
>>>
>>>
>>>>Well the simplest fix is the one that I did not implement in the first
>>>>place, using "ON DUPLICATE KEY". However, I did not implement that
>>>>because of its only being in version 4.1 of MySQL and I still use
>>>>Debian stable for most production machines, which runs 4.0.x.
>>>>
>>>>Anyway, I will poke at it some more.
>>>>
>>>
>>>Read on for a portable fix called optimistic locking.
>>>
>>>
>>>Pseudocode follows. I assume you are doing something like this:
>>
>>
>> You assume wrong. :) The code is doing
>>
>> UPDATE ... SET field = field + 1 ...
>>
>> which is almost always a much better idea than doing the increment in
>> code.
>>
>> The race condition is in checking to see if the row already exists, and
>> then using that information to decide if you should do an insert or an
>> update.
>>
>> The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
>> is how you would do this in a function in PostgreSQL, though there's no
>> reason I can think of why you couldn't do that in perl instead (though
>> it would perform much slower).
--
"This message is made of 100% recycled electrons."
Re: Real-Time Stats Plugin Released - UPDATE RELEASED
Posted by James Keating <ja...@okeating.net>.
I have updated the INSERT code to use the "ON DUPLICATE KEY" feature of
MySQL. This feature insures that if the insert statement is executed
and another insert has already occurred for the changing day, an update
will take place instead. Hopefully, this will alleviate the largest
possibility of a race condition. However, this is a feature of MySQL
that is only available inside of MySQL 4.1 and greater. If you are NOT
running at least MySQL 4.1 you can get the old code from my website
listed below.
Where to get the changes -
http://wiki.apache.org/spamassassin/StatsPlugin
or
http://www.okeating.net/blosxom.cgi/2005/12/31#statsplugin-update
Let me know if you see any other issues or problems.
Thanks for everyone's input, more is always appreciated.
Thanks,
James
Jim C. Nasby wrote:
> On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
>
>>James Keating wrote:
>>
>>
>>>Well the simplest fix is the one that I did not implement in the first
>>>place, using "ON DUPLICATE KEY". However, I did not implement that
>>>because of its only being in version 4.1 of MySQL and I still use
>>>Debian stable for most production machines, which runs 4.0.x.
>>>
>>>Anyway, I will poke at it some more.
>>>
>>
>>Read on for a portable fix called optimistic locking.
>>
>>
>>Pseudocode follows. I assume you are doing something like this:
>
>
> You assume wrong. :) The code is doing
>
> UPDATE ... SET field = field + 1 ...
>
> which is almost always a much better idea than doing the increment in
> code.
>
> The race condition is in checking to see if the row already exists, and
> then using that information to decide if you should do an insert or an
> update.
>
> The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
> is how you would do this in a function in PostgreSQL, though there's no
> reason I can think of why you couldn't do that in perl instead (though
> it would perform much slower).
Re: Real-Time Stats Plugin Released
Posted by "Jim C. Nasby" <de...@decibel.org>.
On Wed, Dec 28, 2005 at 02:19:51AM -0600, Chris Thielen wrote:
> James Keating wrote:
>
> >Well the simplest fix is the one that I did not implement in the first
> >place, using "ON DUPLICATE KEY". However, I did not implement that
> >because of its only being in version 4.1 of MySQL and I still use
> >Debian stable for most production machines, which runs 4.0.x.
> >
> >Anyway, I will poke at it some more.
> >
> Read on for a portable fix called optimistic locking.
>
>
> Pseudocode follows. I assume you are doing something like this:
You assume wrong. :) The code is doing
UPDATE ... SET field = field + 1 ...
which is almost always a much better idea than doing the increment in
code.
The race condition is in checking to see if the row already exists, and
then using that information to decide if you should do an insert or an
update.
The example at the very bottom of http://lnk.nu/postgresql.org/5sl.html
is how you would do this in a function in PostgreSQL, though there's no
reason I can think of why you couldn't do that in perl instead (though
it would perform much slower).
--
Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: Real-Time Stats Plugin Released
Posted by Chris Thielen <cm...@someone.dhs.org>.
James Keating wrote:
> Well the simplest fix is the one that I did not implement in the first
> place, using "ON DUPLICATE KEY". However, I did not implement that
> because of its only being in version 4.1 of MySQL and I still use
> Debian stable for most production machines, which runs 4.0.x.
>
> Anyway, I will poke at it some more.
>
Read on for a portable fix called optimistic locking.
Pseudocode follows. I assume you are doing something like this:
select ham from table where user = $user;
$ham++;
update table set ham = $ham where user = $user;
You can instead do something like this:
while (!success) {
select ham from table where user = $user;
$newham = $currentham + 1;
update table set ham = $newham where user = $user and ham = $currentham;
success = (getRowsUpdated() > 0);
}
What this does is update the row only if the value of table.ham has not
changed. The updated row count is checked. If it is greater than 0
(your row was indeed updated) then you set the success flag and
continue. If the updated row count is 0 you know the data has changed
unexpectedly and must redo the entire read/modify/write cycle.
Usually this approach is done by adding a separate version column that
is tested and updated, but in this case the ham or spam counters can be
substituted. It is called optimistic locking because you assume
(optimistically) that your update will usually succeed; eg: that nobody
else has updated the data without you knowing. It's not really locking,
but rather a concurrent update detection mechanism which the application
then must handle programatically.
Re: Real-Time Stats Plugin Released
Posted by James Keating <ja...@okeating.net>.
Gary V wrote:
>> Well the simplest fix is the one that I did not implement in the first
>> place, using "ON DUPLICATE KEY". However, I did not implement that
>> because of its only being in version 4.1 of MySQL and I still use
>> Debian stable for most production machines, which runs 4.0.x.
>>
>> Anyway, I will poke at it some more.
>>
>> - James
>
>
> [OT] FYI, both 4.0 and 4.1 are available in Debian stable.
>
> Package mysql-server
> stable (misc): mysql database server binaries
> 4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.0.24-10: amd64
>
> Package mysql-server-4.1
> stable (misc): mysql database server binaries
> 4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
> s390 sparc
> 4.1.11a-4: amd64
>
> Gary V
>
> _________________________________________________________________
> Express yourself instantly with MSN Messenger! Download today - it's
> FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
>
Thanks for pointing that out Gary. I was actually unaware of that. I
will probably change the code to use the "ON DUPLICATE KEY" because that
appears to be the best and only way I see right now.
- James
Re: Real-Time Stats Plugin Released
Posted by Gary V <mr...@hotmail.com>.
>Well the simplest fix is the one that I did not implement in the first
>place, using "ON DUPLICATE KEY". However, I did not implement that because
>of its only being in version 4.1 of MySQL and I still use Debian stable for
>most production machines, which runs 4.0.x.
>
>Anyway, I will poke at it some more.
>
>- James
[OT] FYI, both 4.0 and 4.1 are available in Debian stable.
Package mysql-server
stable (misc): mysql database server binaries
4.0.24-10sarge1: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
s390 sparc
4.0.24-10: amd64
Package mysql-server-4.1
stable (misc): mysql database server binaries
4.1.11a-4sarge2: alpha amd64 arm hppa i386 ia64 m68k mips mipsel powerpc
s390 sparc
4.1.11a-4: amd64
Gary V
_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today - it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
Re: Real-Time Stats Plugin Released
Posted by James Keating <ja...@okeating.net>.
Well the simplest fix is the one that I did not implement in the first
place, using "ON DUPLICATE KEY". However, I did not implement that
because of its only being in version 4.1 of MySQL and I still use Debian
stable for most production machines, which runs 4.0.x.
Anyway, I will poke at it some more.
- James
Jim C. Nasby wrote:
> On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:
>
>>Indeed. My thinking behind storing both system totals ($TOTALS) and user
>>totals in the database was for easy error checking. Each day you can
>>quickly/easily run through the database and look for potential errors
>>(possibly resulting from the race condition that you mentioned).
>>However, I have not seen any miscalculations yet.
>
>
> Well, there won't be any errors with an ACID database (ie: not MyISAM)
> and if the race condition is handled properly.
>
>
>>Regarding the race condition, I would be more than happy to fix it, and
>>any input would be wonderful on that. I will poke around and see what I
>>can come up with on my own.
>
>
> I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
> person.
>
>
>>Jim C. Nasby wrote:
>>
>>>Neat plugin. I have two comments:
>>>
>>>I wouldn't store $TOTALS or the total column in the database, as both
>>>can easily be calculated when retrieving the data.
>>>
>>>There is a race condition, especially for $TOTALS. First you check for
>>>existence, then you try and do an insert or an update based on that. If
>>>more than one SA thread is running, you can end up with duplicated
>>>entries. I think MySQL has a merge command that would be more
>>>appropriate.
>>>
>>>Also, it appears that this would work on PostgreSQL with minimal
>>>modifications. If anyone's interested in that I'd be happy to help.
>>>
>>>On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
>>>
>>>
>>>>Morning All,
>>>>
>>>>I have released a real-time stats collection plugin
>>>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
>>>>The plugin collects data into a MySQL DB based on the date and username
>>>>that SpamAssassin was invoked with. This creates a single row for each
>>>>user, each day. Also, a row for the total number of messages for that
>>>>day is created. Each entry contains the date, user name, domain name,
>>>>number of hams, number of spams and total number of messages processed
>>>>for that user (or for the system). This is an example of what you would
>>>>see in your table -
>>>>
>>>>
>>>>day username domain spam ham total
>>>>2005-12-26 $TOTALS $TOTALS 106 229 335
>>>>2005-12-26 user1 domain.com 11 0 11
>>>>2005-12-26 user2 domain.com 25 15 40
>>>>2005-12-26 user3 domain.com 27 3 30
>>>>2005-12-26 user4 domain.com 0 24 24
>>>>2005-12-26 user5 domain.com 1 0 1
>>>>2005-12-26 user6 domain.com 0 1 1
>>>>etc...
>>>>
>>>>A new entry for each user is created each day, so there is historical
>>>>information for as long as you want to make it. As I said before this
>>>>is collected in real-time, so when a new message is processed by
>>>>SpamAssassin the numbers are automatically increased for an existing
>>>>entry or a new entry is made. This elevates the need to continually
>>>>have scripts parse log files.
>>>>
>>>>As always I cannot guarantee that this will work with your
>>>>installation/configuration of SpamAssassin or that it will not make your
>>>>system explode, but it has worked for me in a production environment for
>>>>quiet some time. Also, this can only be used with SpamAssassin 3.1+.
>>>>The code and more details about installing/configuring it can be seen on
>>>>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
>>>>
>>>>Let me know if you run into problems or have suggestions for
>>>>modifications.
>>>>
>>>>- James
>>>>
>>>
>>>
>
Re: Real-Time Stats Plugin Released
Posted by "Jim C. Nasby" <de...@decibel.org>.
On Tue, Dec 27, 2005 at 09:33:11PM -0500, James Keating wrote:
> Indeed. My thinking behind storing both system totals ($TOTALS) and user
> totals in the database was for easy error checking. Each day you can
> quickly/easily run through the database and look for potential errors
> (possibly resulting from the race condition that you mentioned).
> However, I have not seen any miscalculations yet.
Well, there won't be any errors with an ACID database (ie: not MyISAM)
and if the race condition is handled properly.
> Regarding the race condition, I would be more than happy to fix it, and
> any input would be wonderful on that. I will poke around and see what I
> can come up with on my own.
I could give you a fix if it was PostgreSQL, but I'm not really a MySQL
person.
> Jim C. Nasby wrote:
> >Neat plugin. I have two comments:
> >
> >I wouldn't store $TOTALS or the total column in the database, as both
> >can easily be calculated when retrieving the data.
> >
> >There is a race condition, especially for $TOTALS. First you check for
> >existence, then you try and do an insert or an update based on that. If
> >more than one SA thread is running, you can end up with duplicated
> >entries. I think MySQL has a merge command that would be more
> >appropriate.
> >
> >Also, it appears that this would work on PostgreSQL with minimal
> >modifications. If anyone's interested in that I'd be happy to help.
> >
> >On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
> >
> >>Morning All,
> >>
> >> I have released a real-time stats collection plugin
> >>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
> >>The plugin collects data into a MySQL DB based on the date and username
> >>that SpamAssassin was invoked with. This creates a single row for each
> >>user, each day. Also, a row for the total number of messages for that
> >>day is created. Each entry contains the date, user name, domain name,
> >>number of hams, number of spams and total number of messages processed
> >>for that user (or for the system). This is an example of what you would
> >>see in your table -
> >>
> >>
> >>day username domain spam ham total
> >>2005-12-26 $TOTALS $TOTALS 106 229 335
> >>2005-12-26 user1 domain.com 11 0 11
> >>2005-12-26 user2 domain.com 25 15 40
> >>2005-12-26 user3 domain.com 27 3 30
> >>2005-12-26 user4 domain.com 0 24 24
> >>2005-12-26 user5 domain.com 1 0 1
> >>2005-12-26 user6 domain.com 0 1 1
> >>etc...
> >>
> >>A new entry for each user is created each day, so there is historical
> >>information for as long as you want to make it. As I said before this
> >>is collected in real-time, so when a new message is processed by
> >>SpamAssassin the numbers are automatically increased for an existing
> >>entry or a new entry is made. This elevates the need to continually
> >>have scripts parse log files.
> >>
> >>As always I cannot guarantee that this will work with your
> >>installation/configuration of SpamAssassin or that it will not make your
> >>system explode, but it has worked for me in a production environment for
> >>quiet some time. Also, this can only be used with SpamAssassin 3.1+.
> >>The code and more details about installing/configuring it can be seen on
> >>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
> >>
> >>Let me know if you run into problems or have suggestions for
> >>modifications.
> >>
> >>- James
> >>
> >
> >
>
--
Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: Real-Time Stats Plugin Released
Posted by James Keating <ja...@okeating.net>.
Indeed. My thinking behind storing both system totals ($TOTALS) and user
totals in the database was for easy error checking. Each day you can
quickly/easily run through the database and look for potential errors
(possibly resulting from the race condition that you mentioned).
However, I have not seen any miscalculations yet.
Regarding the race condition, I would be more than happy to fix it, and
any input would be wonderful on that. I will poke around and see what I
can come up with on my own.
- James
Jim C. Nasby wrote:
> Neat plugin. I have two comments:
>
> I wouldn't store $TOTALS or the total column in the database, as both
> can easily be calculated when retrieving the data.
>
> There is a race condition, especially for $TOTALS. First you check for
> existence, then you try and do an insert or an update based on that. If
> more than one SA thread is running, you can end up with duplicated
> entries. I think MySQL has a merge command that would be more
> appropriate.
>
> Also, it appears that this would work on PostgreSQL with minimal
> modifications. If anyone's interested in that I'd be happy to help.
>
> On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
>
>>Morning All,
>>
>> I have released a real-time stats collection plugin
>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
>>The plugin collects data into a MySQL DB based on the date and username
>>that SpamAssassin was invoked with. This creates a single row for each
>>user, each day. Also, a row for the total number of messages for that
>>day is created. Each entry contains the date, user name, domain name,
>>number of hams, number of spams and total number of messages processed
>>for that user (or for the system). This is an example of what you would
>>see in your table -
>>
>>
>>day username domain spam ham total
>>2005-12-26 $TOTALS $TOTALS 106 229 335
>>2005-12-26 user1 domain.com 11 0 11
>>2005-12-26 user2 domain.com 25 15 40
>>2005-12-26 user3 domain.com 27 3 30
>>2005-12-26 user4 domain.com 0 24 24
>>2005-12-26 user5 domain.com 1 0 1
>>2005-12-26 user6 domain.com 0 1 1
>>etc...
>>
>>A new entry for each user is created each day, so there is historical
>>information for as long as you want to make it. As I said before this
>>is collected in real-time, so when a new message is processed by
>>SpamAssassin the numbers are automatically increased for an existing
>>entry or a new entry is made. This elevates the need to continually
>>have scripts parse log files.
>>
>> As always I cannot guarantee that this will work with your
>>installation/configuration of SpamAssassin or that it will not make your
>>system explode, but it has worked for me in a production environment for
>>quiet some time. Also, this can only be used with SpamAssassin 3.1+.
>>The code and more details about installing/configuring it can be seen on
>>the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
>>
>>Let me know if you run into problems or have suggestions for modifications.
>>
>>- James
>>
>
>
Re: Real-Time Stats Plugin Released
Posted by "Jim C. Nasby" <de...@decibel.org>.
Neat plugin. I have two comments:
I wouldn't store $TOTALS or the total column in the database, as both
can easily be calculated when retrieving the data.
There is a race condition, especially for $TOTALS. First you check for
existence, then you try and do an insert or an update based on that. If
more than one SA thread is running, you can end up with duplicated
entries. I think MySQL has a merge command that would be more
appropriate.
Also, it appears that this would work on PostgreSQL with minimal
modifications. If anyone's interested in that I'd be happy to help.
On Mon, Dec 26, 2005 at 11:03:45AM -0500, James Keating wrote:
> Morning All,
>
> I have released a real-time stats collection plugin
> (http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
> The plugin collects data into a MySQL DB based on the date and username
> that SpamAssassin was invoked with. This creates a single row for each
> user, each day. Also, a row for the total number of messages for that
> day is created. Each entry contains the date, user name, domain name,
> number of hams, number of spams and total number of messages processed
> for that user (or for the system). This is an example of what you would
> see in your table -
>
>
> day username domain spam ham total
> 2005-12-26 $TOTALS $TOTALS 106 229 335
> 2005-12-26 user1 domain.com 11 0 11
> 2005-12-26 user2 domain.com 25 15 40
> 2005-12-26 user3 domain.com 27 3 30
> 2005-12-26 user4 domain.com 0 24 24
> 2005-12-26 user5 domain.com 1 0 1
> 2005-12-26 user6 domain.com 0 1 1
> etc...
>
> A new entry for each user is created each day, so there is historical
> information for as long as you want to make it. As I said before this
> is collected in real-time, so when a new message is processed by
> SpamAssassin the numbers are automatically increased for an existing
> entry or a new entry is made. This elevates the need to continually
> have scripts parse log files.
>
> As always I cannot guarantee that this will work with your
> installation/configuration of SpamAssassin or that it will not make your
> system explode, but it has worked for me in a production environment for
> quiet some time. Also, this can only be used with SpamAssassin 3.1+.
> The code and more details about installing/configuring it can be seen on
> the SpamAssassin wiki - http://wiki.apache.org/spamassassin/StatsPlugin
>
> Let me know if you run into problems or have suggestions for modifications.
>
> - James
>
--
Jim C. Nasby, Database Architect decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
Re: Real-Time Stats Plugin Released
Posted by James Keating <ja...@okeating.net>.
A new version of the plugin has been released thanks to Andrew Donkin
who provided a variety of updates for the code. The updates include the
following changes -
- Removed the exists sub as it is not necessary when using an INSERT
with ON DUPLICATE KEY UPDATE.
- Removed the TOTALS insert and the totals column. These are not
necessary and can be calculated via select statements.
- Changed the table schema to remove the totals field
- Numerous other changes
The code is much much smaller now and faster. It is backwards
compatible with the old table schema, but a new table schema (without
the totals column) is available.
Code can be found at the SpamAssassin Wiki -
http://wiki.apache.org/spamassassin/StatsPlugin
Or at my bl0g -
http://www.okeating.net/blosxom.cgi/2006/01/08#statsplugin-update-010706
Thanks Andrew for your great input and code!
- James
Andrew Donkin wrote:
> James Keating <ja...@okeating.net> writes:
>
>
>>Morning All,
>>
>> I have released a real-time stats collection plugin
>>(http://wiki.apache.org/spamassassin/StatsPlugin)for SpamAssassin.
>
>
> Actually James, now that you have that atomic insert/update in there,
> you can get rid of $user_exists, $total_exists, the last parameter to
> execute_stats(), and get_current_entry().
>
> I am running with that version and it is working well, at about one
> update per second.
>