Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/04/29 16:12:30 UTC

[Bug 3331] New: Bayes option to keep original token as db data (not key).

http://bugzilla.spamassassin.org/show_bug.cgi?id=3331

           Summary: Bayes option to keep original token as db data (not
                    key).
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Learner
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: koppel@ece.lsu.edu


With r10394 (bug 3225) the Bayes database no longer retains the
original token; instead it uses a hashed version as the key.  As noted
elsewhere, there are many good reasons to do this, such as compactness,
efficiency, and privacy.  However, the original token can no longer be
recovered from the database, so those tuning Bayesian classification
can no longer get token statistics such as the hammiest or spammiest
tokens.

For that reason we might implement Sidney Markowitz's suggestion [bug
2266 comment 14] that the original tokens optionally be retained.
Michael Parker pointed out the maintenance difficulties of having to
support a database which could be keyed either on a hash or the full
token [bug 2266 comment 15].  To avoid that, the unhashed token could
be stored as data while the hashed token is still used as the key.  If
the original tokens are not being kept, a NULL or zero-length string
would be stored in that field; otherwise the original token would be.
This way there is just one database format.
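
As a rough illustration of the proposed layout (not the actual
SpamAssassin code or schema), the sketch below keys every record on a
hash and carries the original token only as an optional data field.
The names TokenRecord, BayesStore, token_key and keep_original are
hypothetical, and the key derivation (last 5 bytes of a SHA-1 digest)
is an assumption made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TokenRecord:
    spam_count: int = 0
    ham_count: int = 0
    original: bytes = b""  # zero-length when original tokens are not kept


def token_key(token: bytes) -> bytes:
    # Hypothetical key derivation: last 5 bytes of the SHA-1 digest.
    return hashlib.sha1(token).digest()[-5:]


class BayesStore:
    """In-memory stand-in for the DBM/SQL store; one format for both modes."""

    def __init__(self, keep_original: bool = False):
        self.keep_original = keep_original
        self.db = {}  # maps hashed key (bytes) -> TokenRecord

    def learn(self, token: bytes, is_spam: bool) -> None:
        rec = self.db.setdefault(token_key(token), TokenRecord())
        if is_spam:
            rec.spam_count += 1
        else:
            rec.ham_count += 1
        if self.keep_original:
            rec.original = token  # stored as data, never used as the key

    def spammiest(self, n: int = 10):
        """Return up to n (label, spam ratio) pairs, highest ratio first."""
        rows = []
        for key, rec in self.db.items():
            total = rec.spam_count + rec.ham_count
            # Fall back to the hex key when no original token was stored.
            if rec.original:
                label = rec.original.decode("utf-8", "replace")
            else:
                label = key.hex()
            rows.append((label, rec.spam_count / total))
        rows.sort(key=lambda r: (-r[1], r[0]))
        return rows[:n]
```

With keep_original enabled, spammiest() reports readable tokens; with
it disabled, the same records exist but only the hashed keys can be
shown, which is the situation described above.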

I'm assuming that a string field holding only NULLs will have little
performance impact in the dbfile and SQL implementations, so that
users not keeping original tokens will be unaffected.

If there are no comments on this after a few days I'll cobble together
a patch for the DBM code.


