Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/02/12 01:51:46 UTC

Large-scale use of Bayes with SpamAssassin

Hi all --

(as threatened, I've moved this discussion onto SpamAssassin-dev and am
reposting.)

Kelsey wrote asking about tweaks for large-scale use (*very* large-scale, like
tens of thousands of users) of SA Bayes...

> I just wanted to let you know that we're still working on the Bayes
> hacks.  What we've discovered - and it's really no surprise - is that Bayes
> has truly insane I/O requirements.  I don't think most admins/users really
> notice this since they aren't processing high enough volumes of mail to max
> out typical hardware.  For us, at ~1000msg/minute over ~30k users, this
> becomes an amazing load, quite capable of wiping out every IO subsystem
> we've thrown at it thus far.  And certainly, none of them have given any
> room for growth.  We're going to try one more architecture and IO subsystem
> before we give up on per-user Bayes as impractical at scale - at least
> as it is implemented today with DB_File.  SQL is probably out of the
> question - I can't imagine how big a SQL box would be needed to power Bayes
> for us.
> 
> Do you know anyone else that's tried to do bayes in large scale
> installations?

Justin Mason:
> Worth bearing in mind that we could speed things up a *lot* by turning off
> token access-time updates, auto-learning, the use of hapaxes, and token
> degeneration, and by reducing the range of things we turn into tokens.

Kelsey Cummings:
> Disabling the journal syncing helps; we haven't gone deeper into the code
> to disable or tune other bits for performance - neither Nathan nor I is
> comfortable enough with what's going on in there to know how to do this.
> What we see is that a message gets broken into about 250 tokens on
> average, which results in about 250 seeks/reads on the database in the
> worst case.  Typically, these are 4k reads for DB_File.  In our case, this
> translates to about 16MB/s of highly random read IO, or ~4k read ops/second
> over something like 150 gigs of data.
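
Those figures hang together, incidentally -- here's the back-of-the-envelope
arithmetic as a runnable Perl snippet, using only the numbers Kelsey quotes
above:

  # ~1000 msg/min, ~250 token lookups per message, 4k pages in DB_File
  my $msgs_per_sec  = 1000 / 60;
  my $reads_per_sec = $msgs_per_sec * 250;        # roughly 4k read ops/s
  my $bytes_per_sec = $reads_per_sec * 4096;      # roughly 16 MB/s of random reads
  printf "%.0f reads/s, %.1f MB/s\n",
      $reads_per_sec, $bytes_per_sec / (1024 * 1024);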

Justin:
> Kelsey -- what about using a farm of SQL boxes, and picking which one to
> use based on username?  It does sound like a nasty problem. :(

Kelsey:
> We've thought about that but we are working on the assumption right now
> that the complexity brought on by the management of N SQL boxes is
> something we'd like to avoid.  Right now, we're hoping that we can hack the
> client to hash the username and direct it to one of N spamd boxes, each
> with its own local RAID for storing its users' data.  Increasing N is a
> bit of a drag, but doable.  We lose redundancy too, but we're willing to
> sacrifice that to get per-user Bayes up for our customers.
> Of course, you're right.  SQL is an option - I don't have a firm grasp on
> how many SQL boxes it would take and/or how big they'd have to be.
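
For what it's worth, the client-side hashing Kelsey describes is only a few
lines of Perl.  A minimal sketch -- the hostnames are placeholders, and any
stable hash would do; Digest::MD5 is just what's handy in core Perl:

  use strict;
  use warnings;
  use Digest::MD5 qw(md5);

  # Pick a spamd host for a user by hashing the username.  The hash has to
  # be stable across all client machines, or users will bounce between
  # backends and lose their per-user databases.
  sub spamd_host_for {
      my ($username, @hosts) = @_;
      my $slot = unpack("N", md5(lc $username)) % scalar(@hosts);
      return $hosts[$slot];
  }

  my @spamd_hosts = qw(spamd1.example.com spamd2.example.com spamd3.example.com);
  print spamd_host_for('someuser@example.com', @spamd_hosts), "\n";

The "increasing N is a bit of a drag" part is that adding a box changes N
and reshuffles most users onto different backends; a consistent-hashing
scheme would reduce that churn, at the cost of more client logic.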

Craig:
> Maybe replace DB_File with something where one has better control of 
> the physical grouping of the data, and then make it so that the 
> most-frequently used tokens are stored physically near each other, 
> thereby minimizing the disk hits.  Also, request all 250 things at once:
> even if you're requesting multiple blocks to be read from disk over the
> course of the 250 items, you're allowing the disk subsystem to optimize
> those seeks/reads instead of forcing them to be spread over a longer time
> period in the order we're currently doing them.  Alternatively, if the
> IO's really the bottleneck here, trade some CPU for IO by compressing the
> data on disk, so you need to read fewer blocks to get all 250 tokens' data.
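
To make the batching half of Craig's suggestion concrete: even without
replacing DB_File, sorting the whole batch of token keys before looking them
up gives the lower layers a chance at some locality.  A sketch, under the
assumption that the backing file is (or could be switched to) a DB_BTREE,
where key order corresponds roughly to on-disk order:

  use strict;
  use warnings;
  use DB_File;
  use Fcntl qw(O_RDONLY);

  # Fetch a batch of tokens in sorted (key) order from a BTree-backed file,
  # so 250 scattered lookups become something closer to a short sweep
  # across adjacent leaf pages.
  sub get_token_batch {
      my ($db_path, @tokens) = @_;
      my %db;
      tie %db, 'DB_File', $db_path, O_RDONLY, 0644, $DB_BTREE
          or die "can't open $db_path: $!";
      my %counts;
      for my $tok (sort @tokens) {
          $counts{$tok} = $db{$tok} if exists $db{$tok};
      }
      untie %db;
      return \%counts;
  }

With a hash-backed file this buys very little, since key order doesn't map
to anything physical; that's where Craig's "better control of the physical
grouping" point comes in.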

Dan:
> I think we should think more about a custom DB module.  Maybe a lock daemon
> that would allow simultaneous reads and writes.

In some discussion with Dan on Monday, we talked about the idea of writing a
network-based Bayes daemon to take care of this.  It's a possibility,
especially if that daemon can be made to reorder accesses to increase I/O
request locality, and if requests can be load-balanced across a farm of
servers to split up the load.
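
To make that slightly more concrete, here's a rough sketch of the wire side
of such a daemon -- a single-process, line-based server that takes a whole
batch of tokens per request and answers with per-token counts.  Everything
here (the port, the protocol verbs, the in-memory stand-in for the real
store) is made up for illustration; the point is that the daemon sees the
whole batch at once and can reorder lookups for locality before touching
disk:

  use strict;
  use warnings;
  use IO::Socket::INET;

  my %store;   # in-memory stand-in for the real per-user token database

  sub lookup_counts {
      my ($user, $tok) = @_;
      my $rec = $store{$user}{$tok} || [0, 0];
      return @$rec;                        # (spam count, ham count)
  }

  my $listener = IO::Socket::INET->new(
      LocalPort => 7830,                   # arbitrary port
      Listen    => 10,
      ReuseAddr => 1,
  ) or die "listen: $!";

  # Protocol: client sends "GET <user> <tok1> <tok2> ...\n"; server replies
  # with one "<tok> <spam> <ham>" line per token, then "END".
  while (my $client = $listener->accept) {
      my $line = <$client>;
      if (defined $line) {
          my ($cmd, $user, @tokens) = split ' ', $line;
          if (defined $cmd and $cmd eq 'GET' and defined $user) {
              for my $tok (sort @tokens) {           # sort for locality
                  my ($spam, $ham) = lookup_counts($user, $tok);
                  print $client "$tok $spam $ham\n";
              }
          }
          print $client "END\n";
      }
      close $client;
  }

A real version would also have to handle learning and journal writes, and
would want forking or an event loop, but the request shape is the
interesting part: one round trip per message instead of ~250 random reads
issued from the scanning host.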

Anyway, what put me in mind of reposting this for more discussion was this
document from a talk at ETCon by Nelson Minar, at
http://craphound.com/googleetcon04.txt .  Here's the relevant text...


  How a search works:

  Query comes into custom httpd, Google Web Server ("gwis")

  Sent in parallel to several places:

  * Index server, "every page with the word 'apple' in it" -- a
  cluster that manages "shards" or "partitions" (everything
  starting with the letter "a") and then load-balancing
  replications for each. Have to calculate intersections for
  multiple-term queries

  * Doc server, copies of webpages -- whence page-snippets are
  served in results. Sharded and replicated for scalability and
  redundancy

  * Misc servers: QuickLinks, spell-checkers, Ad server (first two
  are small servers, ad server is humongous)

  --

  Relevance:

  Google examines > 100 factors to ensure accurate results

  Link text

  Font size

  Proximity

  Anchor text -- allows for matches to pages that don't even
  contain keywords

  Google has an adversarial relationship with people who want to
  get higher results because Google is committed to presenting
  unbiased results. [Ed: what's bias? Nelson: sometimes it's
  obvious] [Ed: what if it's not obvious?]

  Does Google punish people who spam results? No, we try to keep
  good stuff at the top.


  --

  PageRank

  Examine graph structure of the Web: a page with a lot of inbound
  links is probably relevant, especially if the inbound linkers each
  have lots of inbound links in turn.

  This is a matrix calculation with 30MM nodes with 10 edges each
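
An aside on that last note: the "matrix calculation" is power iteration --
the link matrix applied repeatedly until the rank vector settles.  Here's a
toy Perl version over a made-up four-page graph; the 0.85 damping factor is
the usual textbook value, not a number from the talk:

  use strict;
  use warnings;

  # outbound links of a made-up four-page web
  my %links = (
      A => [qw(B C)],
      B => [qw(C)],
      C => [qw(A)],
      D => [qw(C)],
  );
  my @pages = sort keys %links;
  my $d = 0.85;                        # the usual textbook damping factor
  my %rank;
  $rank{$_} = 1 / @pages for @pages;   # start with a uniform rank vector

  for (1 .. 50) {                      # power iteration
      my %next;
      $next{$_} = (1 - $d) / @pages for @pages;
      for my $p (@pages) {
          my @out   = @{ $links{$p} };
          my $share = $rank{$p} / @out;
          $next{$_} += $d * $share for @out;
      }
      %rank = %next;
  }
  printf "%s: %.3f\n", $_, $rank{$_} for @pages;

Scaling that to 30MM nodes is the hard part, of course; the toy only shows
the shape of the computation.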

  --

  Onebox:

  There's one place you type stuff, we guess what you want and put
  the results in the onebox.

  --

  Our index has 3.3B pages -- for some definition of page. We try
  to give supplemental results with deeper crawls.

  Burtonator: A search for "the" returns 5.4B results

  --

  Google Ads:

  Advertising goal: connect people who are visiting webpages to ads

  We do a subtle thing to calculate whose ad gets displayed -- if
  your clickthrough rate is higher, we give you more prominent
  placing on the page.

  Ads that don't rank don't get shown (you're not getting clicks,
  we're not making money, it annoys the users, forget it)

  --

  Ad targeting

  We understand pages based on keywords, word freq, font size,
  anchor text, linguistic processing; works on dynamic content.

  --

  Hardware: PCs are unreliable, cheap and fast. Use software to
  make it reliable.

  --

  Google FileSystem:

  Fault-tolerant mass storage: 300TB on 1000 machines

  Design decisions: Machine failure is common, 1GB+ files, high
  bandwidth favored over low latency, files are appended, not edited;
  API and apps co-developed with filesystem

  Implementation: Master directs requests to chunkservers, which
  manage 64MB chunks; the master manages the chunk metadata

  eof
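
And one last scale check on the GFS figures in there, again just arithmetic
on the quoted numbers (no replication factor assumed):

  # 300TB in 64MB chunks across 1000 machines
  my $total_bytes = 300 * 2**40;
  my $chunk_bytes = 64 * 2**20;
  printf "%.1f million chunks, %.0f GB per machine\n",
      $total_bytes / $chunk_bytes / 1e6,       # ~4.9 million chunks
      $total_bytes / 1000 / 2**30;             # ~307 GB per machine

A few million chunk records is the kind of metadata volume that a single
master can plausibly keep track of, which is what makes that split between
master and chunkservers workable.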