Posted to java-user@lucene.apache.org by Lawrence <lu...@savant-is.com> on 2006/03/11 08:07:29 UTC

100,000 indexes and what to do

Hi all,



I was reading one of the postings on concurrency and reread section 9.1 of Lucene in Action, which led me to this question. I have 100,000 customers, and I want to let each of them search their own documents and sometimes include company documents in that search.

1.	100,000 customers with 10-20 small documents each.
2.	5,000 company documents: specifications, papers, research, etc.
3.	Customers can search their own documents and company documents.

P1: Do I provide an index for each customer and let them search across multiple indexes, including the company documents, when they need to?

OR

P2: Do I provide one large index for all my 100,000 customers, adding a customer ID field so that searches can be constrained and customers won’t/can’t search other customers’ documents, and then categorize company documents so customers can do multiple index searches into company documents?

After writing this out I realize that P2 is probably the wiser choice, less complicated, but I would like to hear from other Luceners.

Lucene in Action is one of the best-written books in my library of ~300 CS books. It ranks in completeness and clarity up there with works by David Geary, Martin Fowler, and other Hatcher greats like Java Development with Ant.

Thanks Otis and Erik.

Regards, Lawrence

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: 100,000 indexes and what to do

Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 11 March 2006 08:07, Lawrence wrote:
> Hi all,
> 
> 
> 
> I was reading one of the posting on concurrency and I reread section 9.1 in Lucene in Action which lead me to this question. I have 100,000 customers and I want to provide them with personal searching for their documents and sometimes to include company documents in that search.
> 
> 1.	100,000 customers with 10-20 small document each.
> 2.	Company 5,000 documents, specification, papers, research, etc.
> 3.	Customers can search their own documents and company document.
> 
> P1: Do I provide an index for each customer and allow them multiple index searching, into company document when they need it?
> 
> OR
> 
> P2: Do I provide one large index for all my 100,000 customers, adding a field for customer ID so searching can be constrained, so they won’t/can’t search across other customer’s documents, and then categorize company documents so customers can do multiple index searches into company documents?
> 
> After writing this out I realize that P2 is probably the wiser choice, less complicated, but I would like to hear from other Luceners.

In case you have many customers searching at the same time, compact filters
can help reduce memory requirements:
http://issues.apache.org/jira/browse/LUCENE-328
A BitSet filter uses one bit per indexed document, and a compact filter uses 
one or three bytes per indexed document passing the filter.
When there are 100 different customers searching in their own docs at the
same time, assuming there are 100,000 * 20 docs in the customer index:
- BitSet filters will use 100 * (100,000 * 20) / 8 bytes,
- compact filters will use roughly 100 * 20 * 2  bytes.
The ratio between these is roughly 100,000 / 16 or about 6000.
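
The estimate above can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and the 2 bytes per passing document is the rough middle of the "one or three bytes" figure quoted above, not a measured value.

```java
// Rough memory estimate for 100 concurrent per-customer filters.
// All names here are hypothetical; the numbers come from the message above.
class FilterMemoryEstimate {

    // A BitSet filter needs one bit per document in the whole index.
    static long bitSetBytes(long concurrentFilters, long totalDocs) {
        return concurrentFilters * totalDocs / 8;
    }

    // A compact filter needs a few bytes per document that passes the filter.
    static long compactBytes(long concurrentFilters, long docsPerCustomer, long bytesPerDoc) {
        return concurrentFilters * docsPerCustomer * bytesPerDoc;
    }

    public static void main(String[] args) {
        long totalDocs = 100_000L * 20;            // 100,000 customers, 20 docs each
        long bitSet = bitSetBytes(100, totalDocs); // 25,000,000 bytes
        long compact = compactBytes(100, 20, 2);   // 4,000 bytes
        System.out.println("BitSet filters:  " + bitSet + " bytes");
        System.out.println("Compact filters: " + compact + " bytes");
        System.out.println("Ratio:           " + (bitSet / compact)); // 6250
    }
}
```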

Since the company docs will not need to be filtered, you can put these in a
separate index, and write your own MultiSearcher that filters only on the
customer index.
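
The idea of filtering only the customer index can be sketched as a toy model in plain Java. This deliberately does not use Lucene's API: the Doc class, the customerId field and the substring match are all stand-ins invented for illustration, and a real implementation would sit on top of Lucene's searchers and filters instead.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "two indexes, filter only the customer one".
// Nothing here is Lucene API; every name is hypothetical.
class TwoIndexSearchSketch {

    static final class Doc {
        final String customerId; // null for company documents
        final String text;

        Doc(String customerId, String text) {
            this.customerId = customerId;
            this.text = text;
        }
    }

    static List<String> search(List<Doc> customerIndex, List<Doc> companyIndex,
                               String customerId, String term, boolean includeCompany) {
        List<String> hits = new ArrayList<>();
        // Customer index: a hit must match the term AND belong to the caller.
        for (Doc d : customerIndex) {
            if (d.customerId.equals(customerId) && d.text.contains(term)) {
                hits.add(d.text);
            }
        }
        // Company index: shared by everyone, so no per-customer filter is applied.
        if (includeCompany) {
            for (Doc d : companyIndex) {
                if (d.text.contains(term)) {
                    hits.add(d.text);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> customers = new ArrayList<>();
        customers.add(new Doc("c1", "invoice for lucene training"));
        customers.add(new Doc("c2", "lucene support contract"));
        List<Doc> company = new ArrayList<>();
        company.add(new Doc(null, "lucene specification"));

        System.out.println(search(customers, company, "c1", "lucene", true));
    }
}
```

Customer c1 never sees the document owned by c2, while the company document is returned without any filtering cost.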

Regards,
Paul Elschot



Re: 100,000 indexes and what to do

Posted by Chris Lu <ch...@gmail.com>.
I think it's best to have one small index for each customer and one
large index for the company's documents.

Merging customers' content into the main index costs a lot of
resources and slows the system down, and it isn't actually necessary.
If indexing is done by a batch job, there will be a delay between the
time content is updated and the time the index is refreshed. This may
be acceptable in some cases, but users usually want to search their
own content right away.

With a small individual index per customer, indexing 10~20 small
documents takes almost no time, so customers can search their content
right after it is updated.

Chris Lu
-------------------------------
Full-Text Search on Any Databases
http://www.dbsight.net




Re: 100,000 indexes and what to do

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Lawrence,

Thanks for the LIA compliments.
In addition to what Paul and Chris already mentioned, keep open files in mind (also covered in LIA).  If you have 100K separate indices, that means a lot of open file descriptors.  One common index doesn't have this problem.  Separate indices are still possible, but you have to be smart about keeping track of used and unused indices and diligent about managing and freeing up resources.
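
One way to keep track of used and unused indices is a small least-recently-used cache that closes a handle when it is evicted. The sketch below is plain Java and assumes only that each open index is represented by something Closeable; the class name, the opener function and the limit are hypothetical, and nothing here is Lucene API.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// LRU cache of open per-customer index handles (hypothetical sketch).
// At most maxOpen handles stay open; the least recently used one is closed on eviction.
class OpenIndexCache<H extends Closeable> {
    private final int maxOpen;
    private final Function<String, H> opener;
    private final LinkedHashMap<String, H> open;

    OpenIndexCache(int maxOpen, Function<String, H> opener) {
        this.maxOpen = maxOpen;
        this.opener = opener;
        // accessOrder = true makes iteration order least-recently-used first.
        this.open = new LinkedHashMap<String, H>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, H> eldest) {
                if (size() > OpenIndexCache.this.maxOpen) {
                    closeQuietly(eldest.getValue()); // frees the file descriptors
                    return true;
                }
                return false;
            }
        };
    }

    // Returns the open handle for a customer, opening it on first use.
    synchronized H get(String customerId) {
        H handle = open.get(customerId);
        if (handle == null) {
            handle = opener.apply(customerId);
            open.put(customerId, handle);
        }
        return handle;
    }

    synchronized int openCount() {
        return open.size();
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException e) {
            // A real system would log this; the sketch ignores it.
        }
    }

    public static void main(String[] args) {
        OpenIndexCache<Closeable> cache = new OpenIndexCache<Closeable>(
                2, id -> () -> System.out.println("closing index for " + id));
        cache.get("customer-1");
        cache.get("customer-2");
        cache.get("customer-1"); // customer-1 is now the most recently used
        cache.get("customer-3"); // evicts and closes customer-2
        System.out.println("open handles: " + cache.openCount());
    }
}
```

The sketch closes an evicted handle immediately, so a real version would also need to make sure no search is still running against it, for example by reference counting.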

Otis

