Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/05/26 12:18:18 UTC

[Bug 4915] New: RFE: Distributed mass-check

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915

           Summary: RFE: Distributed mass-check
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: spamc/spamd
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: jm@jmason.org
OtherBugsDependingOnThis: 4560


This was a suggested idea for the Google Summer of Code 2006;
I'm adding it to the bugzilla for future use, and in case anyone feels
like implementing it.

Subject ID: spamassassin-distributed-mass-check
Keywords: corpora, perl
Description: mass-check currently makes use of a single system to process a
number of messages. However, in larger organizations, or for people with
multiple machines, it would be nice if multiple machines could all process a
single mass-check run, preferably without needing to share the same filesystem,
paths, etc.  It would also be useful if we ended up with a single large corpus
(see the spamassassin-corpus project above), so that multiple people could run
the messages through over the Internet.
Possible Mentors: Theo Van Dinter (felicity -at- apache.org)



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4915] RFE: Distributed mass-check

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From felicity@apache.org  2006-09-08 17:49 -------
Ok, this has generally been implemented in 3.2/trunk.  I think it still needs
some work around the edges, but it's good enough for me to close the ticket. :)




[Bug 4915] RFE: Distributed mass-check

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915





------- Additional Comments From felicity@apache.org  2006-08-24 18:29 -------
(In reply to comment #3)
> There are several issues I haven't figured out how to deal with, so I'm leaving
[...]
> with bayes.  A way to abort if not all the messages were processed?  etc.

There's also the issue of ordered runs (ie: not using -n), which aren't
guaranteed in the current code and are going to be much more difficult in a
client/server model.

"Happily", ordered runs typically only happen when using bayes, which as
mentioned before doesn't work in this model, so ...




[Bug 4915] RFE: Distributed mass-check

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dev@spamassassin.apache.org
         AssignedTo|dev@spamassassin.apache.org |felicity@apache.org
   Target Milestone|Undefined                   |3.2.0




------- Additional Comments From felicity@apache.org  2006-08-24 03:26 -------
I'm working on implementing this in a branch in my spare time.  It's really
still floating around in my head, though I think I know what I'm looking for in
an implementation.  If/when I get the chance to write it out, I'll put it here
in the ticket. :)




[Bug 4915] RFE: Distributed mass-check

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915





------- Additional Comments From jm@jmason.org  2006-08-24 10:42 -------
interesting!

how are you planning to distribute the workload and scanned messages?  ssh? a
grid-based system?

If you want to get complex, distributed work-queues are nice, and I'd be very
happy to add that support to IPC::DirQueue ;)





[Bug 4915] RFE: Distributed mass-check

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4915





------- Additional Comments From felicity@apache.org  2006-08-24 16:05 -------
(In reply to comment #2)
> how are you planning to distribute the workload and scanned messages?  ssh? a
> grid-based system?

At the moment, I was planning on HTTP.  In much the same way that our current
"-j #" method works, a mass-check client connects to the mass-check server and
makes a request (give me at most X messages); the server reads in from disk and
sends out a tar/gz file of messages in file format.  The client then runs over
them in a normal mass-check mode and gathers all the results, then connects back
to the server to give the results and request more work.  Somewhere along the
line, I was going to have the client dynamically adjust the "max" number based
on the amount of time needed to process a message (ie: the client wants to
connect to the server roughly once a minute).  I also need to have support in
the server to track which messages were handed out, and then hand them out
again if some time limit passes without seeing the result.
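
For illustration only, here is a minimal sketch of the two bookkeeping pieces
described above: the server tracking which messages were handed out and
re-issuing them after a time limit, and the client adjusting its "max" so that
it connects back roughly once a minute.  It is written in Python rather than
Perl and is not the mass-check implementation; the names LeaseQueue,
next_batch_size, lease_seconds and target_seconds are all hypothetical, and the
HTTP transport and tar/gz packaging are left out entirely.

```python
import time
from collections import deque


class LeaseQueue:
    """Server-side bookkeeping (hypothetical): pending, handed out, done."""

    def __init__(self, message_ids, lease_seconds=300.0, clock=time.monotonic):
        ids = list(message_ids)          # assumed to be unique identifiers
        self._known = set(ids)
        self._pending = deque(ids)       # not yet handed out, or reclaimed
        self._leased = {}                # message id -> lease expiry time
        self._results = {}               # message id -> result line
        self._lease_seconds = lease_seconds
        self._clock = clock

    def _reclaim_expired(self):
        # Put back any message whose result did not arrive in time.
        now = self._clock()
        expired = [m for m, t in self._leased.items() if t <= now]
        for m in expired:
            del self._leased[m]
            self._pending.append(m)

    def request(self, max_messages):
        """Hand out at most max_messages ids and start their leases."""
        self._reclaim_expired()
        deadline = self._clock() + self._lease_seconds
        batch = []
        while self._pending and len(batch) < max_messages:
            m = self._pending.popleft()
            if m in self._results:
                continue                 # a late result already covered it
            self._leased[m] = deadline
            batch.append(m)
        return batch

    def submit(self, results):
        """Record results from a client; the first result per message wins."""
        for m, result in results.items():
            if m not in self._known:
                continue                 # ignore ids we never handed out
            self._leased.pop(m, None)
            self._results.setdefault(m, result)

    def finished(self):
        return len(self._results) == len(self._known)


def next_batch_size(current, elapsed_seconds, target_seconds=60.0,
                    smallest=1, largest=1000):
    """Client side: scale the next request so it takes about target_seconds."""
    if current <= 0 or elapsed_seconds <= 0:
        return smallest
    scaled = int(current * target_seconds / elapsed_seconds)
    return max(smallest, min(largest, scaled))
```

A client loop under these assumptions would call request() with its current
"max", time how long the batch took, feed that into next_batch_size() for the
next request, and pass the gathered results to submit().  A message that times
out may end up processed twice, which is why submit() keeps only the first
result it sees.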

There are several issues I haven't figured out how to deal with, so I'm leaving
them for now -- make sure that all of the clients run the same version w/ the
same (as appropriate) modules, conf files, plugins, etc.  How to let this work
with bayes.  A way to abort if not all the messages were processed?  etc.

This feels like reinventing the wheel, btw, but I don't know of a module/etc that
does what I want here.

> If you want to get complex, distributed work-queues are nice, and I'd be very
> happy to add that support to IPC::DirQueue ;)

I haven't looked too much at IPC::DQ, but based on what I've read I don't think
it fits in with this completely, though feel free to correct me. :)  I was
planning on non-long-lived connections, ability to communicate through a proxy, etc.


