Posted to java-user@lucene.apache.org by jafarim <ja...@gmail.com> on 2007/03/18 20:41:07 UTC

Storing whole documents in the index

Hello
I have been using Lucene for a while and, as most people seemingly do, I
used to save only some important fields of a document in the index. But
recently I thought: why not store the whole document's bytes as an untokenized
field in the index in order to ease the retrieval process? For example,
serialize the PDF file into a byte[] and then save the bytes as a field in
the index (some gzip and base64 encoding may be needed as glue logic). Then
I can delete the original file from the system. Is there any reason against
this idea? Can Lucene bear this large volume of input streamed data?
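
The "glue logic" mentioned above could look roughly like the sketch below. It is only an illustration using the JDK's own gzip and Base64 classes; the class name StoredDocumentCodec and its methods are made up for this example, and how the resulting String is added to the index as a stored field is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Lucene: turns raw file bytes into a
// String that could be kept in a stored, untokenized field, and back.
final class StoredDocumentCodec {

    /** Compress the raw bytes with gzip, then Base64-encode the result. */
    static String encode(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Reverse of encode: Base64-decode, then gunzip back to the original bytes. */
    static byte[] decode(String stored) {
        byte[] compressed = Base64.getDecoder().decode(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Note that Base64 adds roughly a third to the size of the compressed bytes, so whether this pays off depends on how well the files compress.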

Re: Design Problem: Searching large set of protected documents

Posted by Daniel Rosher <ro...@googlemail.com>.
Hi Jonathon,

Since the number of users in your application is small, perhaps you could
pre-generate a filter per user and apply it to the search. However,
this won't scale well if the number of users grows.

Another idea might be to have several filters, each of which details a
particular type of access, rather than one filter per user. Then, using
ChainedFilter, chain these filters together at query time, depending on
the user.
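
To make the second idea concrete, here is a small sketch of combining per-access-type bit sets at query time. It uses plain java.util.BitSet rather than Lucene classes, so AccessFilterChain and its Op enum are invented names for illustration only; which access types exist and how they combine for a given user is an assumption.

```java
import java.util.BitSet;

// Hypothetical stand-in for chaining filters: each BitSet has one bit per
// document, set when that access type allows the document.
final class AccessFilterChain {

    enum Op { OR, AND, ANDNOT }

    private final BitSet bits;

    private AccessFilterChain(BitSet start) {
        // Work on a copy so the shared per-access-type filter is never modified.
        this.bits = (BitSet) start.clone();
    }

    static AccessFilterChain startWith(BitSet first) {
        return new AccessFilterChain(first);
    }

    /** Apply the next access-type filter with the given logical operation. */
    AccessFilterChain then(Op op, BitSet next) {
        switch (op) {
            case OR:     bits.or(next);     break;
            case AND:    bits.and(next);    break;
            case ANDNOT: bits.andNot(next); break;
        }
        return this;
    }

    /** The combined set of documents the user may see. */
    BitSet result() {
        return (BitSet) bits.clone();
    }
}
```

For a user who may see public and finance documents but not embargoed ones, the chain would be startWith(publicDocs).then(Op.OR, financeDocs).then(Op.ANDNOT, embargoedDocs). Only one filter per access type has to be stored, however many users there are.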

Regards,
Dan

On 4/3/07, Jonathan O'Connor <jo...@xcom.de> wrote:
>
> Hi,
> I have a database of a million documents and about 100 users. The
> documents can have an access control list, and there is a complex,
> recursive algorithm to say if a particular user can see a particular
> document.
>
> My problem is that my search algorithm is to first do a standard lucene
> search for matching documents, and then check security on each one found,
> just returning the allowed documents. However, if I do this, and the
> lucene returns 100000 docs, but the user can only see 10 of these, then
> obviously the search is going to take an awful long time.
>
> Has anyone come across this problem before, and if so what approach did
> you take? I guess I could precalculate the permissions for every
> user-document pair, but that's a lot of storage, and a lot of
> precalculation!
>
> I await the list's accumulated wisdom with eagerness and interest.
> Thanks,
> Jonathan O'Connor
> XCOM Dublin
>
>
>
> *** XCOM AG Legal Disclaimer ***
>
> Diese E-Mail einschliesslich ihrer Anhaenge ist vertraulich und ist allein
> für den Gebrauch durch den vorgesehenen Empfaenger bestimmt. Dritten ist
> das Lesen, Verteilen oder Weiterleiten dieser E-Mail untersagt. Wir
> bitten,
> eine fehlgeleitete E-Mail unverzueglich vollstaendig zu loeschen und uns
> eine Nachricht zukommen zu lassen.
>
> This email may contain material that is confidential and for the sole use
> of the intended recipient. Any review, distribution by others or
> forwarding
> without express permission is strictly prohibited. If you are not the
> intended recipient, please contact the sender and delete all copies.
>
> Hauptsitz: Bahnstrasse 37, D-47877 Willich, USt-IdNr.: DE 812 885 664
> Kommunikation: Telefon +49 2154 9209-70, Telefax +49 2154 9209-900,
> www.xcom.de
> Handelsregister: Amtsgericht Krefeld, HRB 10340
> Vorstand: Matthias Albrecht, Renate Becker-Grope, Marco Marty, Dr. Rainer
> Fuchs
> Vorsitzender des Aufsichtsrates: Stephan Steuer

Re: Design Problem: Searching large set of protected documents

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 04 April 2007 01:32, Erick Erickson wrote:
> I thought you could simply add a ConstantScoreQuery (whose
> constructor takes a Filter) to a BooleanQuery. It seems that doing
> this at the very top level with a MUST would do the trick.....

I have not tried this myself, but indeed this must (should) work.

Regards,
Paul Elschot

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Design Problem: Searching large set of protected documents

Posted by Erick Erickson <er...@gmail.com>.
I thought you could simply add a ConstantScoreQuery (whose
constructor takes a Filter) to a BooleanQuery. It seems that doing
this at the very top level with a MUST would do the trick.....

Erick

On 4/3/07, Paul Elschot <pa...@xs4all.nl> wrote:
>
> On Tuesday 03 April 2007 17:44, Erick Erickson wrote:
> > ...
> > Then simply add the users filter to a BooleanQuery (MUST)
> > that you use when you search.
> >
>
> Adding a Filter to a BooleanQuery is not (yet) possible.
> For the moment one needs to use the Searcher methods that
> take a filter and a query.
>
> Regards,
> Paul Elschot

Re: Design Problem: Searching large set of protected documents

Posted by Paul Elschot <pa...@xs4all.nl>.
On Tuesday 03 April 2007 17:44, Erick Erickson wrote:
> ...
> Then simply add the users filter to a BooleanQuery (MUST)
> that you use when you search.
> 

Adding a Filter to a BooleanQuery is not (yet) possible.
For the moment one needs to use the Searcher methods that
take a filter and a query.

Regards,
Paul Elschot



Re: Design Problem: Searching large set of protected documents

Posted by Grant Ingersoll <gr...@gmail.com>.
I seem to recall this type of question coming up from time to time  
over the years, but don't have any specific pointers, so you may find  
it useful to dig into the archives.

-Grant

On Apr 3, 2007, at 12:00 PM, Jonathan O'Connor wrote:

> Erick,
> thanks for the tips. I guess I'll have to dust off my copy of LIA  
> and get cracking.
>
> BTW, in our system a user is just a group with a single fixed  
> member of itself.
>
> Ciao,
> Jonathan O'Connor
> XCOM Dublin

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



Re: Design Problem: Searching large set of protected documents

Posted by Jonathan O'Connor <jo...@xcom.de>.
Erick,
thanks for the tips. I guess I'll have to dust off my copy of LIA and get
cracking.

BTW, in our system a user is just a group with a single fixed member of
itself.

Ciao,
Jonathan O'Connor
XCOM Dublin

Re: Design Problem: Searching large set of protected documents

Posted by Erick Erickson <er...@gmail.com>.
Storage isn't too much of a problem: about 12.5 MB in total (1,000,000
bits = 125,000 bytes per user, times 100 users), since a Lucene Filter is
just a BitSet, one bit per document (plus some trivial overhead).

Computational costs... only you know.....

But, is every user allowed individual permissions or are users part of
groups that have permissions? Filters have logical operators (well,
actually it's the bitset that has operators), so you may be able to
save a bunch of time by calculating group permissions
instead and using the relevant AND/OR/NOT operators. Assuming
your permissions are additive, this becomes something like...

Calculate a filter for each group
for each user
   for each group the user is in, OR in the relevant group filter
   OR in the user's individual permissions
   store the filter away.
endfor
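
The loop above could be sketched in Java roughly as follows. This is only an illustration with plain java.util.BitSet; the class UserFilterBuilder, its method names, and the three input maps are assumptions made for the example, and it presumes additive permissions as stated above.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: precomputes one bit set per user by OR-ing group bit sets.
final class UserFilterBuilder {

    /** Bytes needed for one bit per document, rounded up to whole bytes. */
    static long bytesPerFilter(int docCount) {
        return (docCount + 7L) / 8L;
    }

    /**
     * groupFilters:     group name -> documents the group may see
     * groupsOfUser:     user name  -> groups the user belongs to
     * individualGrants: user name  -> documents granted to the user directly
     */
    static Map<String, BitSet> buildUserFilters(
            Map<String, BitSet> groupFilters,
            Map<String, List<String>> groupsOfUser,
            Map<String, BitSet> individualGrants) {
        Map<String, BitSet> userFilters = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : groupsOfUser.entrySet()) {
            BitSet allowed = new BitSet();
            // OR in the filter of every group the user is in.
            for (String group : entry.getValue()) {
                BitSet groupBits = groupFilters.get(group);
                if (groupBits != null) {
                    allowed.or(groupBits);
                }
            }
            // OR in the user's individual permissions, if any.
            BitSet own = individualGrants.get(entry.getKey());
            if (own != null) {
                allowed.or(own);
            }
            userFilters.put(entry.getKey(), allowed);
        }
        return userFilters;
    }
}
```

With a million documents, bytesPerFilter(1000000) is 125,000 bytes, so 100 stored user filters come to about 12.5 MB.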

Then simply add the user's filter to a BooleanQuery (MUST)
that you use when you search.

Don't know if this helps much or not, but it's an idea <G>.

Best
Erick

On 4/3/07, Jonathan O'Connor <jo...@xcom.de> wrote:
>
> Michael,
> as usual, it's never so easy! Some users can see almost all documents, and
> some other users can see very few.
>
> I did find an interesting document that describes the problem (but offers
> no solutions :-() http://www.ideaeng.com/pub/entsrch/v3n4/article01.html.
>
> This article talks about early and late binding of security information.
> Early binding is faster, but harder to implement. And of course, I
> implemented the easier one.
>
> I'm going to see what the computational and storage cost will be if I
> precalculate this info.
> Ciao,
> Jonathan O'Connor
> XCOM Dublin

Re: Design Problem: Searching large set of protected documents

Posted by Jonathan O'Connor <jo...@xcom.de>.
Michael,
as usual, it's never so easy! Some users can see almost all documents, and
some other users can see very few.

I did find an interesting document that describes the problem (but offers
no solutions :-() http://www.ideaeng.com/pub/entsrch/v3n4/article01.html.

This article talks about early and late binding of security information.
Early binding is faster, but harder to implement. And of course, I
implemented the easier one.

I'm going to see what the computational and storage cost will be if I
precalculate this info.
Ciao,
Jonathan O'Connor
XCOM Dublin

"Michael D. Curtin" <mi...@curtin.com> wrote on 03/04/2007 15:28:

My knee-jerk reaction is to suggest a simpler document security model,
but I'm guessing that that option isn't available to you.

In your example the security attributes of a document are far more
discriminating than the query terms.  If that relationship is indicative
of most of your users and most of the documents, the users and documents
aren't updated much, and you have a lot of searching to do,
precalculation (results into an additional document field) seems the way
to go.  It might even turn out that, if you start from a presumption of
calculating every user--document security attribute, you come up with an
algorithm that is much more efficient than a one-off,
can-this-user-see-this-document type of algorithm.

Precalculation isn't necessarily a bad thing.  Often, it's quite
beneficial -- for example, the indexing process itself is a pretty
substantial precalculation step!

If this seems unwieldy or impractical for some reason, perhaps you could
post more attributes of your situation, such as user and data update and
addition frequency, query attributes and frequency, and so on.

--MDC


Re: Design Problem: Searching large set of protected documents

Posted by "Michael D. Curtin" <mi...@curtin.com>.
Jonathan O'Connor wrote:

> I have a database of a million documents and about 100 users. The documents
> can have an access control list, and there is a complex, recursive
> algorithm to say if a particular user can see a particular document.
> 
> My problem is that my search algorithm first does a standard Lucene
> search for matching documents, and then checks security on each one found,
> returning only the allowed documents. However, if I do this, and Lucene
> returns 100000 docs, but the user can only see 10 of these, then obviously
> the search is going to take an awfully long time.
> 
> Has anyone come across this problem before, and if so what approach did you
> take? I guess I could precalculate the permissions for every user-document
> pair, but that's a lot of storage, and a lot of precalculation!

My knee-jerk reaction is to suggest a simpler document security model, 
but I'm guessing that that option isn't available to you.

In your example the security attributes of a document are far more
discriminating than the query terms.  If that holds for most of your
users and most of your documents, the users and documents aren't updated
much, and you have a lot of searching to do, then precalculation (with
the results stored in an additional document field) seems the way to go.
It might even turn out that, if you start from the presumption of
calculating every user-document security attribute, you come up with an
algorithm that is much more efficient than a one-off,
can-this-user-see-this-document type of algorithm.

Precalculation isn't necessarily a bad thing.  Often, it's quite 
beneficial -- for example, the indexing process itself is a pretty 
substantial precalculation step!
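
As a rough illustration of the precalculation idea, here is a minimal
plain-Java sketch. It is an assumption-laden stand-in, not Lucene API and
not code from this thread: the class PrecomputedAcl, its methods, and the
toy canSee rule are all hypothetical, and it keeps one bit set per user
instead of an extra document field. The point is only that the expensive
check runs once up front and the query-time step becomes a bit lookup. At
one bit per user-document pair, 100 users and a million documents come to
about 12.5 MB.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch: none of these names come from Lucene or from the thread.
class PrecomputedAcl {

    // Build one bit per document id; a set bit means the user may see that document.
    // The (possibly slow, recursive) security check runs here, once, ahead of any search.
    static BitSet precompute(int user, int docCount, BiPredicate<Integer, Integer> canSee) {
        BitSet visible = new BitSet(docCount);
        for (int doc = 0; doc < docCount; doc++) {
            if (canSee.test(user, doc)) {
                visible.set(doc);
            }
        }
        return visible;
    }

    // Keep only the hits whose bit is set; no security check runs at query time.
    static List<Integer> filterHits(int[] hits, BitSet visible) {
        List<Integer> allowed = new ArrayList<>();
        for (int doc : hits) {
            if (visible.get(doc)) {
                allowed.add(doc);
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Toy stand-in for the real ACL algorithm: user u may see docs divisible by u + 2.
        BiPredicate<Integer, Integer> canSee = (u, d) -> d % (u + 2) == 0;
        BitSet visible = precompute(1, 20, canSee);
        int[] hits = {1, 3, 6, 7, 9, 12, 14};          // pretend these came back from a search
        System.out.println(filterHits(hits, visible)); // prints [3, 6, 9, 12]
    }
}
```

The bit sets would have to be rebuilt, or patched, whenever documents or
permissions change, which is why update frequency matters for this choice.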

If this seems unwieldy or impractical for some reason, perhaps you could 
post more attributes of your situation, such as user and data update and 
addition frequency, query attributes and frequency, and so on.

--MDC



Design Problem: Searching large set of protected documents

Posted by Jonathan O'Connor <jo...@xcom.de>.
Hi,
I have a database of a million documents and about 100 users. The documents
can have an access control list, and there is a complex, recursive
algorithm to say if a particular user can see a particular document.

My problem is that my search algorithm first does a standard Lucene
search for matching documents, and then checks security on each one found,
returning only the allowed documents. However, if I do this, and Lucene
returns 100000 docs, but the user can only see 10 of these, then obviously
the search is going to take an awfully long time.

Has anyone come across this problem before, and if so what approach did you
take? I guess I could precalculate the permissions for every user-document
pair, but that's a lot of storage, and a lot of precalculation!

I await the list's accumulated wisdom with eagerness and interest.
Thanks,
Jonathan O'Connor
XCOM Dublin




Re: Storing whole documents in the index

Posted by Karel Tejnora <ka...@tejnora.cz>.
Storing documents (especially large ones) outside the index is better than
storing them in it. Every segment merge or optimize will copy that data.
Storing them in the index is possible, but it requires 1-4x more space, and,
depending on the read/write speed of the fs, merge and optimize take longer.

Karel

On Sun, 2007-03-18 at 23:11 +0330, jafarim wrote:
> Hello
> I have been using Lucene for a while and, as most people seemingly do, I
> used to save only some important fields of a document in the index. But
> recently I thought: why not store the whole document's bytes as an
> untokenized field in the index in order to ease the retrieval process? For
> example, serialize the PDF file into a byte[] and then save the bytes as a
> field in the index (some gzip and base64 encoding may be needed as glue
> logic). Then I can delete the original file from the system. Is there any
> reason against this idea? Can Lucene handle this large volume of streamed
> input data?
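
For reference, the gzip and base64 "glue logic" mentioned in the question
could look roughly like the sketch below. It is only an illustration under
stated assumptions: the class name StoredBlobCodec and its methods are
hypothetical, nothing here touches Lucene, and java.util.Base64 requires
Java 8 or later. Note that base64 makes the compressed bytes about a third
larger again, which adds to the space cost of keeping them in the index.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: turns raw document bytes into a string that could be
// kept as a stored field, and back again.
class StoredBlobCodec {

    // gzip the raw bytes, then base64-encode the compressed bytes.
    static String encode(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverse of encode: base64-decode, then gunzip.
    static byte[] decode(String stored) {
        byte[] compressed = Base64.getDecoder().decode(stored);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "pretend these are the bytes of a PDF".getBytes(StandardCharsets.UTF_8);
        String stored = encode(original);
        byte[] restored = decode(stored);
        System.out.println(Arrays.equals(original, restored)); // prints true
    }
}
```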

