Posted to dev@spamassassin.apache.org by Cedric Lejeune <ce...@arcelor.com> on 2006/11/13 11:58:19 UTC

BayesStore/SQL.pm proposed update.

Hi list!

First, this is my first submission to a project, so please excuse me if 
I do something that does not follow the rules.

I'm in charge of spamassassin (SA) and after an update from 3.0 to 
3.1, users started to complain about SA's effectiveness (too many 
false positives). After some investigation, the problem seems to come 
from the bayesian filter. I use a global bayesian database (using 
bayes_sql_override_username) but with customers coming from all over the 
world, it seems finer-grained filtering would be better. But it 
seems this is currently impossible. I use an SQL bayesian database setup 
and it would have been great if it offered the same kind of feature 
as user_scores_sql_custom_query. What I wanted to do is make the bayesian 
filter look up the database first for the current user, then for the 
routing domain to which the user belongs, then fall back to the global 
database. For instance:

toto@foo.bar -> *@foo.bar -> *
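
A minimal sketch of that lookup order, in Perl. The helper name
bayes_username_candidates is hypothetical and not part of SpamAssassin;
it only shows how the three candidates could be derived from one address.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): list the Bayes usernames
# to try, from most specific to least specific.
sub bayes_username_candidates {
    my ($username) = @_;
    my @candidates;

    # Split on the first '@' only; $domain stays undef for a bare login name.
    my ($mailbox, $domain) = split /\@/, $username, 2;

    push @candidates, $username;                      # the user's own database
    push @candidates, "*\@$domain"
        if defined $domain && length $domain;         # the routing domain
    push @candidates, '*';                            # the global database

    return @candidates;
}

print join(' -> ', bayes_username_candidates('toto@foo.bar')), "\n";
```

A username without a domain part simply skips the middle step and goes
from the user straight to the global database.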

This way, it is possible for users to have their own personal bayes 
database, and if they do not want to create one or do not have one 
already, they can still benefit from others' databases. Grouping users 
per routing domain is like grouping users per center of interest: if one 
declares a mail as spam, there is little chance that others consider 
it as ham.

This feature was not implemented yet, so I have started writing it 
myself. I warn you that I had not written a single line of perl before 
and "my" code is made of doc quotes and cut & paste. I've added the 
configuration option bayes_sql_custom_query to the SA 
configuration file. It acts the same way bayes_sql_override_username does.

Please let me know if this feature could be useful to others and/or if 
it requires some rewriting.

Please find the diffs attached. They apply against SA 3.1.4 because 
it is the SA version shipped with Debian testing at this time.

Best regards =)

cedric.

Re: BayesStore/SQL.pm proposed update.

Posted by Cedric Lejeune <ce...@arcelor.com>.
Hi Michael,

As explained before, the problem is that I am a complete perl noob. I have never written a single line of perl before. I understand 
what you want, but I do not have a single idea of how to do this, sorry. I changed SQL.pm and added a configuration option 
because it seemed to be the simplest solution to me, and I have tried not to alter the way SQL.pm worked before. That is, I try 
to keep this processing order:

bayes_sql_override_username -> bayes_sql_custom_query -> current user running spamassassin -> default
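
As a rough, stand-alone sketch of that order (the function name and the
argument keys are made up for illustration, they are not SpamAssassin
internals). As in the patch, an empty custom query result falls through to
the default, not to the current user.

```perl
use strict;
use warnings;

# Hypothetical function mirroring the processing order above.
# The keys of %args are illustrative only:
#   override     - value of bayes_sql_override_username, if set
#   custom_query - code ref that runs the custom query and returns a username
#   current_user - the user running spamassassin
sub resolve_bayes_username {
    my (%args) = @_;

    # 1. An explicit override always wins.
    return $args{override} if $args{override};

    my $username;
    if ($args{custom_query}) {
        # 2. Otherwise ask the custom query, if one is configured.
        $username = $args{custom_query}->($args{current_user});
    }
    else {
        # 3. Otherwise use the current user.
        $username = $args{current_user};
    }

    # 4. Fall back to the default when nothing usable was found.
    return $username || 'GLOBALBAYES';
}

print resolve_bayes_username(current_user => 'toto@foo.bar'), "\n";
print resolve_bayes_username(
    current_user => 'toto@foo.bar',
    custom_query => sub { '*@foo.bar' },
), "\n";
```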

I would really like to do more, but I do not have enough time to "learn" perl. I had a need and I have tried to solve it. I just 
wanted to contribute to the spamassassin project by providing the resulting patches.

By the way, you can use "my" code if you want. The Spamassassin team and contributors should be credited for this piece of code, as I 
only did cut & paste and some writing based on the perl docs.

If I find some free time, I may try to do what you ask, the subclass and the like, but I cannot guarantee anything.

Thanks everyone,

cedric.

Michael Parker wrote:
> Hi Cedric,
> 
> The BayesStore API is designed in such a way that implementing a
> separate store for this sort of thing would be super easy.  I suggest
> you subclass SQL and make the changes and call it something new.
> 
> I'm probably -1 on changing the top level, fairly generic, SQL.pm to do
> what you're asking, but would have to examine the change a little more
> before I made a final determination.
> 
> Michael
> 
> 
> Cedric Lejeune wrote:
>> ------------------------------------------------------------------------
>>
>> --- SpamAssassin/Conf.pm	2006-08-12 18:08:44.000000000 +0200
>> +++ Conf.pm	2006-11-13 10:37:16.000000000 +0100
>> @@ -2330,6 +2330,52 @@
>>      type => $CONF_TYPE_STRING
>>    });
>>  
>> +#### Start of modification (last modified on 20061109).
>> +=item bayes_sql_custom_query query 
>> +
>> +This option gives you the ability to create a custom SQL query to
>> +retrieve username.  In order to work correctly your query should
>> +return only one value, the desired username. In addition, there
>> +are several "variables" that you can use as part of your query,
>> +these variables will be substituted for the current values right
>> +before the query is run.  The current allowed variables are:
>> +
>> +=over 2
>> +
>> +=item _USERNAME_
>> +
>> +The current user's username.
>> +
>> +=item _DOMAIN_
>> +
>> +The portion after the @ as derived from the current user's username, this
>> +value may be null.
>> +
>> +=back
>> +
>> +The query must be one continuous line in order to parse correctly.
>> +
>> +Here is an example query, please note that it is broken up for easy
>> +reading, in your config it should be one continuous line.
>> +
>> +=over 1
>> +
>> +=item Current default query:
>> +
>> +C<SELECT username FROM bayes_vars WHERE username = '*' OR Username = CONCAT('*@',_DOMAIN_) OR Username = _USERNAME_ ORDER BY username ASC>
>> +
>> +=back
>> +
>> +=cut
>> +
>> +  push (@cmds, {
>> +    setting => 'bayes_sql_custom_query',
>> +    is_admin => 1,
>> +    type => $CONF_TYPE_STRING
>> +  });
>> +
>> +#### End of modification.
>> +
>>  =item bayes_sql_username_authorized ( 0 | 1 )  (default: 0)
>>  
>>  Whether to call the services_authorized_for_username plugin hook in BayesSQL.
>>
>>
>> ------------------------------------------------------------------------
>>
>> --- SpamAssassin/BayesStore/SQL.pm	2005-08-11 09:00:37.000000000 +0200
>> +++ SQL.pm.bayesstore	2006-11-13 11:15:43.000000000 +0100
>> @@ -85,15 +85,70 @@
>>    if ($self->{bayes}->{conf}->{bayes_sql_override_username}) {
>>      $self->{_username} = $self->{bayes}->{conf}->{bayes_sql_override_username};
>>    }
>> +#### Start of modification (last modified on 20061113).
>>    else {
>> -    $self->{_username} = $self->{bayes}->{main}->{username};
>> +	if ($self->{bayes}->{conf}->{bayes_sql_custom_query}) {
>>  
>> -    # Need to make sure that a username is set, so just in case there is
>> -    # no username set in main, set one here.
>> -    unless ($self->{_username}) {
>> -      $self->{_username} = "GLOBALBAYES";
>> -    }
>> +                # Connect to database.
>> +                return 0 unless ($self->_connect_db());
>> +
>> +                # Retrieve current username and play with it.
>> +                my $username = $self->{bayes}->{main}->{username};
>> +                my ($mailbox, $domain) = split('@', $username);
>> +
>> +                my $quoted_username = $self->{_dbh}->quote($username);
>> +                my $quoted_domain = $self->{_dbh}->quote($domain);
>> +
>> +                my $custom_query = $self->{bayes}->{conf}->{bayes_sql_custom_query};
>> +                $custom_query =~ s/_USERNAME_/$quoted_username/g;
>> +                $custom_query =~ s/_DOMAIN_/$quoted_domain/g;
>> +
>> +                dbg("bayes: new: quoted_username = ".$quoted_username);
>> +                dbg("bayes: new: quoted_domain = ".$quoted_domain);
>> +                dbg("bayes: new: custom_query = ".$custom_query);
>> +
>> +                # Prepare query.
>> +                my $sth = $self->{_dbh}->prepare($custom_query);
>> +                unless (defined($sth)) {
>> +                        dbg("bayes: new: SQL error: ".$self->{_dbh}->errstr());
>> +                        return 0;
>> +                }
>> +
>> +                # Execute query.
>> +                my $rc = $sth->execute();
>> +                unless ($rc) {
>> +                        dbg("bayes: new: SQL error: ".$self->{_dbh}->errstr());
>> +                        return 0;
>> +                }
>> +
>> +		# Retrieve _username.
>> +                my $ary_ref = $sth->fetchall_arrayref();
>> +                $self->{_username} = $ary_ref->[-1]->[-1];
>> +
>> +                dbg("bayes: new: _username = ".$self->{_username});
>> +
>> +                # Tell database server to free buffer allocated to query.
>> +                $sth->finish();
>> +
>> +                # Close database connection.
>> +                $self->{_dbh}->disconnect();
>> +
>> +                # Set _dbh to initial state.
>> +                $self->{_dbh} = undef;
>> +
>> +  	}
>> +#### End of modification.
>> +	else {
>> +		$self->{_username} = $self->{bayes}->{main}->{username};
>> +	}
>> +  }
>> +	
>> +  # Need to make sure that a username is set, so just in case there is
>> +  # no username set in main, set one here.
>> +  unless ($self->{_username}) {
>> +    $self->{_username} = "GLOBALBAYES";
>>    }
>> +
>>    dbg("bayes: using username: ".$self->{_username});
>>  
>>    return $self;
> 
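
The placeholder substitution done in the quoted SQL.pm patch can be tried
outside SpamAssassin with a small sketch like the one below. It rests on
two assumptions. First, sql_quote is only a stand-in for DBI's
$dbh->quote(), approximating its default behaviour, so real drivers may
quote differently. Second, expand_bayes_query is a hypothetical helper.
Unlike the patch, it substitutes both placeholders in a single pass, so a
username that happens to contain the text _DOMAIN_ is not expanded a
second time.

```perl
use strict;
use warnings;

# Stand-in for DBI's $dbh->quote(): wraps a value in single quotes and
# doubles embedded quotes; undef becomes the SQL keyword NULL.
sub sql_quote {
    my ($value) = @_;
    return 'NULL' unless defined $value;
    (my $escaped = $value) =~ s/'/''/g;
    return "'$escaped'";
}

# Hypothetical helper: expand _USERNAME_ and _DOMAIN_ in a query template.
sub expand_bayes_query {
    my ($template, $username) = @_;

    # Everything after the first '@'; undef when there is no domain part.
    my (undef, $domain) = split /\@/, $username, 2;

    my %value_for = (
        _USERNAME_ => sql_quote($username),
        _DOMAIN_   => sql_quote($domain),
    );

    # Single pass, so text inside a substituted value is never re-expanded.
    (my $query = $template) =~ s/(_USERNAME_|_DOMAIN_)/$value_for{$1}/g;
    return $query;
}

my $template = q{SELECT username FROM bayes_vars WHERE username = '*' }
             . q{OR username = CONCAT('*@', _DOMAIN_) OR username = _USERNAME_ }
             . q{ORDER BY username ASC};

print expand_bayes_query($template, 'toto@foo.bar'), "\n";
```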

Re: BayesStore/SQL.pm proposed update.

Posted by Michael Parker <pa...@pobox.com>.
Hi Cedric,

The BayesStore API is designed in such a way that implementing a
separate store for this sort of thing would be super easy.  I suggest
you subclass SQL and make the changes and call it something new.
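
A minimal sketch of the subclassing idea, assuming nothing about the real
class beyond the _username field visible in the patch. Both package names
are hypothetical. The parent here is a stand-in so the sketch runs without
SpamAssassin installed; a real store would inherit from
Mail::SpamAssassin::BayesStore::SQL and be selected with the
bayes_store_module setting.

```perl
use strict;
use warnings;

# Stand-in parent so the sketch runs on its own. In a real store this role
# would be played by Mail::SpamAssassin::BayesStore::SQL.
package Sketch::BayesStore::SQL;

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    $self->{_username} = $args{username} || 'GLOBALBAYES';
    return $self;
}

# Hypothetical subclass: the parent is left untouched and only the
# username selection is changed.
package Sketch::BayesStore::DomainSQL;

our @ISA = ('Sketch::BayesStore::SQL');

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);

    # Share one Bayes database per domain: toto@foo.bar becomes *@foo.bar.
    if ($self->{_username} =~ /\@(.+)\z/) {
        $self->{_username} = "*\@$1";
    }
    return $self;
}

package main;

my $store = Sketch::BayesStore::DomainSQL->new(username => 'toto@foo.bar');
print $store->{_username}, "\n";
```

Keeping the change in a subclass leaves the generic SQL.pm as it is, and
sites that do not want the per-domain behaviour are not affected.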

I'm probably -1 on changing the top level, fairly generic, SQL.pm to do
what you're asking, but would have to examine the change a little more
before I made a final determination.

Michael

