Posted to java-user@lucene.apache.org by Giovanni Mascia <gm...@gmail.com> on 2008/08/27 01:56:25 UTC

system design for big numbers

I've been wandering for a while through this list and other Lucene
resources on the web, trying to figure out the possible outline of a
search solution that could fit my case. But as a Lucene newbie I
decided to ask for your help.

Now, this is the scenario. I am building a webmail application that
lets users aggregate their email accounts. Every user can configure
up to 6 mailboxes on his account on the website. From day 1 the site
lets the new user start downloading emails from his accounts.
Email messages and attachments are permanently stored (I am not
describing that here; just assume that the storage side of the
problem is solved).

Now the task is to let users run full-text searches over their
stored emails (forget the attachments).
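
To make the task concrete, here is roughly what I imagine indexing a
single message looks like with the current 2.x API (the directory
path, field names and helper method are just my assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    // "indexPath" is a placeholder; "create" should be true only the
    // first time the index is built.
    void indexMessage(String indexPath, String msgId, String subject,
                      String body, boolean create) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory(indexPath),
                new StandardAnalyzer(),
                create);
        try {
            // One Lucene Document per message; field names are invented.
            Document doc = new Document();
            doc.add(new Field("msgId", msgId,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("subject", subject,
                    Field.Store.YES, Field.Index.TOKENIZED));
            doc.add(new Field("body", body,
                    Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        } finally {
            writer.close();
        }
    }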

Say that an average user receives and downloads 30 emails per day
per account. That makes 180 mails/day, 5,400 mails/month, and 32,400
mails over the first 6 months.
Say that on day 1 we start with 100,000 registered users and that
this user base does not change over the 6-month timeframe (OK, it's
not realistic; it's just to establish a sense of scale).

We will face 100,000 * 32,400 = 3,240,000,000 mails; at, say, 20KB
per message, that's 64,800,000,000KB, which makes some 60TB (!!).

The Lucene index would be some 20TB (let's say it takes about 30% of
the space needed for the actual content, as stated elsewhere on this
list).

OK, let's finally assume that my forecast is unrealistic or just
foolish; let's say that we will only have to handle a Lucene index of
3 or 4 TB.

I was thinking that the best and simplest approach would be to
create one index per user: every user searches only his own mail by
querying only his own index. Many not-too-big indexes could easily
be moved and spread across additional disks or machines.
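
In code the idea seems trivially simple; something like this is what
I have in mind (paths and field names are placeholders again):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // Every search touches only the one index owned by the user,
    // e.g. under /indexes/<userId>.
    void searchUserMail(String userId, String userInput) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/indexes/" + userId);
        try {
            Query query = new QueryParser("body",
                    new StandardAnalyzer()).parse(userInput);
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("msgId"));
            }
        } finally {
            searcher.close();
        }
    }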

But I've read that even 100 indexes could be too many (not least
because of OS limits on the number of concurrently open files, of
course).

Maybe my scenario is not a common one, but where would you start
when outlining a solution that in 6 months could grow into a 3-4 TB
index?

Browsing the web, it seems that scaling or distributing Lucene
indexes is quite a task. I have in mind too many different options
for a possible architecture, and I don't know where to start.
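
For example, I've seen that the API ships a MultiSearcher (plus
ParallelMultiSearcher and RemoteSearchable) that can present several
indexes as one, so I assume something like the sketch below is one
of the options, though I can't tell whether it's the right starting
point (the shard paths are invented):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    // Two index shards living on different disks, queried as one.
    MultiSearcher openShardedSearcher() throws Exception {
        Searchable[] shards = new Searchable[] {
                new IndexSearcher("/disk1/index-shard-0"),
                new IndexSearcher("/disk2/index-shard-1"),
        };
        return new MultiSearcher(shards);
    }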

As you may understand, I'm not a search specialist. I assume that if
my project gets off to a good start I will get some help or, better,
some money to hire someone. The task is to "resist" until that time
without creating, in the first 3-6 months, an unmanageable,
unscalable situation.

Thanks for sharing with me your points of view.

And I should add that the search feature is absolutely needed from
the start; building it in a second phase is out of the question.

Giovanni
