Posted to java-user@lucene.apache.org by ariel goldberg <ar...@yahoo.com> on 2007/02/26 16:25:51 UTC

One index per user or one index per day?

Greetings,

I'm creating an application that requires the indexing of millions of
documents on behalf of a large group of users, and was hoping to get an
opinion on whether I should use one index per user or one index per day.

My application will have to handle the following:

- the indexing of about 1 million 5K documents per day, with each document
  containing about 5 fields

- expiration of documents, since after a while, my hard drive would run out
  of room

- queries that consist of boolean expressions (e.g., the body field contains
  "a" AND "b", and the title field contains "c"), as well as ranges (e.g.,
  the document needs to have been indexed between 2/25/07 10:00 am and
  2/28/07 9:00 pm)

- permissions; in other words, user A might be able to search on documents X
  and Y, but user B might be able to search on documents Y and Z

- up to 1,000 users

So, I was considering the following:

1) Using one index per user

This would entail creating and using up to 1,000 indices. Document Y in the
example above would have to be duplicated. Expiration would be performed via
IndexWriter.deleteDocuments. The advantage here is that querying should be
reasonably quick, because each index would only contain tens of thousands of
documents, instead of millions. The disadvantages: I'm concerned about the
"too many open files" error, and I'm also concerned about the performance of
deleteDocuments.
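For a rough sense of scale, the figures in this post can be turned into a
back-of-the-envelope estimate. The sketch below is only arithmetic, not Lucene
code. It assumes that "5K" means about 5 KB per document, borrows the 10-day
retention window that the post gives only as an example, and assumes documents
are spread evenly across users with no duplication. The class and method names
are made up for illustration.

```java
// Back-of-the-envelope sizing; every input is an assumption taken from the post.
final class IndexSizing {
    // Total live documents across all users for a given retention window.
    static long liveDocs(long docsPerDay, int retentionDays) {
        return docsPerDay * retentionDays;
    }

    // Raw (unindexed) text volume in bytes; the real index size will differ.
    static long rawBytes(long docs, long bytesPerDoc) {
        return docs * bytesPerDoc;
    }

    // Average documents per user index, assuming an even split and no sharing.
    static long docsPerUserIndex(long liveDocs, int users) {
        return liveDocs / users;
    }

    public static void main(String[] args) {
        long live = liveDocs(1_000_000L, 10);          // 10,000,000 documents
        long raw = rawBytes(live, 5L * 1024);          // 51,200,000,000 bytes
        long perUser = docsPerUserIndex(live, 1_000);  // 10,000 documents
        System.out.println(live + " live docs, " + raw + " raw bytes, "
                + perUser + " docs per user index");
    }
}
```

Under these assumptions each per-user index averages roughly 10,000
documents; any document duplicated for several users pushes that number up.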

2) Using one index per day

Each day, I create a new index. Again, document Y in the example above would
have to be duplicated (is there any way around this?). The advantage here is
that expiring documents means simply deleting the index corresponding to a
particular day. The disadvantage is query performance, since the queries,
which are already very complex, would have to be performed using
MultiSearcher (if expiration is after 10 days, that's 10 indices to search
across).
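The expiration step in this scheme can be sketched without any Lucene calls:
if each day's index lives in its own directory named by date, deciding what to
search and what to delete is a date calculation. The directory naming
(`index-YYYY-MM-DD`), the class, and the method names below are assumptions
for illustration only. Actually deleting the expired directories and opening
searchers over the live ones is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

final class DailyIndexes {
    // Hypothetical naming convention: one directory per day, e.g. index-2007-02-26.
    static String dirName(LocalDate day) {
        return "index-" + day; // LocalDate.toString() is ISO-8601 (yyyy-MM-dd)
    }

    // Directories to search: today plus the previous (retentionDays - 1) days,
    // newest first.
    static List<String> liveDirs(LocalDate today, int retentionDays) {
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < retentionDays; i++) {
            dirs.add(dirName(today.minusDays(i)));
        }
        return dirs;
    }

    // True if the index for 'day' has fallen out of the retention window.
    static boolean isExpired(LocalDate day, LocalDate today, int retentionDays) {
        LocalDate oldestLiveDay = today.minusDays(retentionDays - 1L);
        return day.isBefore(oldestLiveDay);
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2007, 2, 26);
        System.out.println(liveDirs(today, 10));
        System.out.println(isExpired(LocalDate.of(2007, 2, 16), today, 10));
    }
}
```

If documents always go into the index for the day they were indexed, the same
date calculation could also narrow a date-range query to the days that overlap
the range, so not every query would have to touch all 10 indices.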

Tough to know for sure which option is better without testing, but does
anyone have a gut reaction? Any advice would be greatly appreciated!

Thanks,

Ariel

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: One index per user or one index per day?

Posted by Shane <lu...@my-family.us>.
If you can categorize the documents based on user permissions, that is 
the route I would go. 

For example, users 1, 2, and 3 are allowed to search documents a and b.
In addition, user 1 can search documents c and d, while users 2 and 3
can search documents e and f.  I would create 3 indexes: one for docs a
and b, one for docs c and d, and finally one for docs e and f.  Then,
using your method of choice, you can restrict documents based on the
user's permissions.

I realize scaling may cause an issue, but this route would allow you to 
normalize your data and reduce duplication in the system.
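To make the grouping concrete, here is a minimal sketch of the bookkeeping in
plain Java, with no Lucene calls. It assumes each document comes with the set
of users allowed to search it; documents with identical sets share one index,
and each user searches only the indexes whose set includes them. The class,
the method names, and the use of the user set itself as the index key are all
hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class PermissionGroups {
    // One index per distinct set of allowed users.
    // Key: sorted set of user ids; value: documents that belong in that index.
    static Map<Set<Integer>, List<String>> groupByAcl(Map<String, Set<Integer>> docAcls) {
        Map<Set<Integer>, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : docAcls.entrySet()) {
            Set<Integer> key = new TreeSet<>(e.getValue()); // defensive, sorted copy
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }

    // The indexes (identified by their user set) that a given user may search.
    static List<Set<Integer>> indexesFor(int user, Map<Set<Integer>, List<String>> groups) {
        List<Set<Integer>> result = new ArrayList<>();
        for (Set<Integer> acl : groups.keySet()) {
            if (acl.contains(user)) {
                result.add(acl);
            }
        }
        return result;
    }

    private static Set<Integer> users(Integer... ids) {
        return new TreeSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        // The example from this message: docs a-f and users 1-3.
        Map<String, Set<Integer>> acls = new TreeMap<>();
        acls.put("a", users(1, 2, 3));
        acls.put("b", users(1, 2, 3));
        acls.put("c", users(1));
        acls.put("d", users(1));
        acls.put("e", users(2, 3));
        acls.put("f", users(2, 3));

        Map<Set<Integer>, List<String>> groups = groupByAcl(acls);
        System.out.println(groups);
        System.out.println(indexesFor(1, groups));
    }
}
```

With the example above this yields three groups, and each user searches two
of them. The number of distinct user sets can grow quickly as permissions
diverge, which is where the scaling concern comes in.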

Shane

