Posted to java-user@lucene.apache.org by Ganesh - yahoo <em...@yahoo.co.in> on 2008/08/05 14:36:36 UTC

Per user data store

Hello all,

Documents corresponding to multiple users are to be indexed. Each user is going to search only his own documents. Only the Administrator can search all users' data.

Is it good to have one database for each User or to have only one database for all Users? Which will be better? 

My opinion is to have one database for all users, with a 'Username' field. The data would be filtered on this field before the search results are served to the user. In this approach, is it better to make Username part of a boolean query or to use a TermFilter?

One more technical question: the Username field will contain the same user names many times over. Will space be consumed for this field in every document / record, or will the data be tokenized and only a pointer to the document stored?

Regards
Ganesh   

Re: Per user data store

Posted by "Karsten F." <ka...@fiz-technik.de>.
Hi,

I want to agree with the advice of using only one index.

And I want to add two reasons:
1. Sorting and caching work with the Lucene document numbers.
In the case of Lucene, "warming up" means that a lot of int arrays and bitsets
are stored in main memory.
If you use a different MultiReader for each user, all of that caching is also
kept separately for each one.

2. You should think about what happens when you get new users:
Most likely you will get a user "with the same permissions as XY".
In that case you don't want to copy an index file or insert a new value into
an existing document field.
But you can easily copy the filter of an existing user.
(Which also means that I suggest not using a field "userids with
read-permission". It is better to decouple userids and index.)
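
The second point can be sketched in plain Java, without the Lucene API. This is only a toy model under stated assumptions: the class PermissionFilters, its method names, and the document numbers are hypothetical, and a java.util.BitSet stands in for whatever per-user filter the application keeps. Giving a new user "the same permissions as XY" then becomes a copy of one bit set, with no change to any indexed document:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy model (not Lucene API): per-user filters kept outside the index.
// Bit i set means "document number i is readable by this user".
class PermissionFilters {
    private final Map<String, BitSet> filters = new HashMap<>();

    // Register a user with an explicit set of readable document numbers.
    void define(String user, int... docNumbers) {
        BitSet bits = new BitSet();
        for (int doc : docNumbers) {
            bits.set(doc);
        }
        filters.put(user, bits);
    }

    // New user "with the same permissions as" an existing one:
    // copy the filter and leave every indexed document untouched.
    void copyPermissions(String existingUser, String newUser) {
        BitSet source = filters.get(existingUser);
        if (source == null) {
            throw new IllegalArgumentException("unknown user: " + existingUser);
        }
        filters.put(newUser, (BitSet) source.clone());
    }

    // True if the user's filter has the bit for this document number set.
    boolean canRead(String user, int docNumber) {
        BitSet bits = filters.get(user);
        return bits != null && bits.get(docNumber);
    }
}
```

In a real application such a bit set would be tied to the reader it was built against, since Lucene document numbers can change when segments are merged.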

But these reasons hold only for my assumptions about the number of users, the
ratio of deleting to adding documents, and how long documents stay valid.

So I again agree with Erick that you should tell us more about your use case.

Best regards

  Karsten 


Erick Erickson wrote:
> 
> I'd start out with one index, if for no other reason
> than keeping track of one index for each user would
> be a royal pain in the neck. You haven't told us
> how many users or documents you expect,
> so that's just a guess. There's one answer perhaps
> if you wind up with a 10M index, another if it's 10T.....
> 
> Filtering on the username is a fine idea, although
> I'd start by just ANDing the username into
> the query. Then measure your response
> time. Note that the first time you open a reader, the
> response will be slow, so measure queries 2-n
> instead.
> 
> I don't know the guts of Lucene, but my indexes do NOT
> grow linearly with the data. After a very few docs, adding,
> say, 1M of data does not cause the data to grow by 1M (or
> even close to that) for fields that are NOT stored. I've
> learned to just trust that the very bright people who work
> on Lucene have "done the right thing" <G>...
> 
> Best
> Erick
> 

-- 
View this message in context: http://www.nabble.com/Per-user-data-store-tp18830202p18846581.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Per user data store

Posted by Erick Erickson <er...@gmail.com>.
I'd start out with one index, if for no other reason
than keeping track of one index for each user would
be a royal pain in the neck. You haven't told us
how many users or documents you expect,
so that's just a guess. There's one answer perhaps
if you wind up with a 10M index, another if it's 10T.....

Filtering on the username is a fine idea, although
I'd start by just ANDing the username into
the query. Then measure your response
time. Note that the first time you open a reader, the
response will be slow, so measure queries 2-n
instead.
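
To make the "ANDing in" idea concrete, here is a minimal sketch in plain Java rather than the Lucene API. It assumes a toy model in which each query clause yields a sorted array of document numbers; the class name AndQuerySketch and the sample numbers are made up for illustration. A required username clause then keeps only the documents present in both lists:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene API): a posting list is a sorted array of document numbers.
class AndQuerySketch {
    // Returns the document numbers that appear in both sorted posting lists.
    static List<Integer> and(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] contentMatches = {1, 4, 5, 9};  // documents matching the search terms
        int[] ownedByUser = {2, 4, 9, 11};    // documents whose Username is this user
        System.out.println(and(contentMatches, ownedByUser)); // prints [4, 9]
    }
}
```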

I don't know the guts of Lucene, but my indexes do NOT
grow linearly with the data. After a very few docs, adding,
say, 1M of data does not cause the data to grow by 1M (or
even close to that) for fields that are NOT stored. I've
learned to just trust that the very bright people who work
on Lucene have "done the right thing" <G>...
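
As a rough illustration of why a repeated value in an un-stored field need not cost its full size per document, consider a toy inverted index. This is a sketch under stated assumptions, not Lucene's actual data structures: the class UsernamePostings and its methods are hypothetical. Each distinct user name is recorded once as a term, and every further document only adds one document number to that term's list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index for one un-stored field (not Lucene API).
class UsernamePostings {
    // term -> ascending list of document numbers containing that term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDoc = 0;

    // Index one document's Username value and return its document number.
    int addDocument(String username) {
        int doc = nextDoc++;
        postings.computeIfAbsent(username, k -> new ArrayList<>()).add(doc);
        return doc;
    }

    // Number of distinct user names held in the term dictionary.
    int distinctTerms() {
        return postings.size();
    }

    // Document numbers recorded for one user name (empty if unknown).
    List<Integer> docsFor(String username) {
        return postings.getOrDefault(username, new ArrayList<>());
    }
}
```

If the field were also stored, its value would additionally be kept with each document, which is the case the paragraph above excludes.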

Best
Erick

On Tue, Aug 5, 2008 at 8:36 AM, Ganesh - yahoo <em...@yahoo.co.in>wrote:

> Hello all,
>
> Documents corresponding to multiple users are to be indexed. Each user is
> going to search only his own documents. Only the Administrator can search all
> users' data.
>
> Is it good to have one database for each User or to have only one database
> for all Users? Which will be better?
>
> My opinion is to have one database for all users, with a 'Username'
> field. The data would be filtered on this field before the search
> results are served to the user. In this approach, is it better to make
> Username part of a boolean query or to use a TermFilter?
>
> One more technical question: the Username field will contain the same
> user names many times over. Will space be consumed for this field in every
> document / record, or will the data be tokenized and only a pointer to the
> document stored?
>
> Regards
> Ganesh

Re: Per user data store

Posted by Antony Bowesman <ad...@teamware.com>.
Ganesh - yahoo wrote:
> Hello all,
> 
> Documents corresponding to multiple users are to be indexed. Each user is
> going to search only his own documents. Only the Administrator can search all
> users' data.
> 
> Is it good to have one database for each User or to have only one database
> for all Users? Which will be better?

I created a hybrid approach that supported 1..n databases based on a hash of the
user's Id.  This was to allow for the situation where a single database
would not scale - at the time there was no good information about Lucene's
performance with large data sets.
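
A minimal sketch of choosing one of n databases from a user Id follows. The message does not say which hash function was used, so String.hashCode() is an assumption here, and the names IndexSelector and indexFor are hypothetical:

```java
// Sketch only: pick one of n index directories from a user Id.
class IndexSelector {
    // Returns a value in [0, indexCount) that is stable for a given user Id.
    static int indexFor(String userId, int indexCount) {
        if (indexCount < 1) {
            throw new IllegalArgumentException("indexCount must be at least 1");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(userId.hashCode(), indexCount);
    }

    public static void main(String[] args) {
        // The same user Id always maps to the same index.
        System.out.println("index-" + indexFor("user-42", 4));
    }
}
```

With indexCount set to 1 every user lands in the same database, which corresponds to the single-database setup described next.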

In practice, we are now using a single database with data for all users.  There 
is an 'ownerId' field with the unique user Id in every document.

 > My opinion is to have one database for all users, with a 'Username'
 > field. The data would be filtered on this field before the search
 > results are served to the user. In this approach, is it better to make
 > Username part of a boolean query or to use a TermFilter?

The ownerId is used as a cached filter rather than always added to the query, so 
that only that user's documents influence the score.  If it is part of the 
query, the complete document set for other users will influence the hits for 
this user.
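
The caching side of this can be sketched in plain Java. It is a toy model, not the Lucene filter API and not the code described above: the class OwnerFilterCache, the builds counter, and the sample owner Ids are assumptions made for illustration. The filter for an ownerId is computed once, kept, and intersected with the hits of each later query:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy model (not Lucene API): one cached owner filter per user, reused across searches.
class OwnerFilterCache {
    private final String[] ownerIdByDoc;  // ownerId of each document, indexed by document number
    private final Map<String, BitSet> cache = new HashMap<>();
    int builds = 0;                       // how many times a filter had to be computed

    OwnerFilterCache(String... ownerIdByDoc) {
        this.ownerIdByDoc = ownerIdByDoc;
    }

    // Returns the cached filter for this owner, computing it on first use only.
    BitSet filterFor(String ownerId) {
        return cache.computeIfAbsent(ownerId, id -> {
            builds++;
            BitSet bits = new BitSet(ownerIdByDoc.length);
            for (int doc = 0; doc < ownerIdByDoc.length; doc++) {
                if (id.equals(ownerIdByDoc[doc])) {
                    bits.set(doc);
                }
            }
            return bits;
        });
    }

    // Restricts the hits of a query to the documents owned by this user.
    BitSet search(BitSet queryHits, String ownerId) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterFor(ownerId));
        return result;
    }
}
```

The query itself carries no ownerId clause in this model; the restriction is applied separately from matching.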

Antony


