Posted to dev@lucene.apache.org by Steve Molloy <sm...@opentext.com> on 2013/02/04 15:20:40 UTC

RE: Post-sort filtering

BTW, I've logged SOLR-4397 for this and submitted a first patch (based on the 4.1 tag, which is what we use). It still needs logic to respect timeAllowed, and I would like a better way of handling missing results than going back and restarting with a larger request, but it works for now, so I guess it's a start.
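
As a rough illustration of the timeAllowed point (this is not code from the SOLR-4397 patch), a post-sort filter loop could carry a millisecond budget and return a partial page once the budget runs out. The class, the Page type, the canSee predicate and the injected clock are all hypothetical names made up for this sketch; the only thing borrowed from Solr is that timeAllowed is expressed in milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

/** Hypothetical sketch: post-sort ACL filtering that stops when a time budget is spent. */
class TimeBoundedPostFilter {

    /** Filtered page plus a flag saying whether the time budget cut filtering short. */
    static final class Page {
        final List<String> ids;
        final boolean partial;
        Page(List<String> ids, boolean partial) { this.ids = ids; this.partial = partial; }
    }

    /**
     * Walks already-sorted IDs, keeping only those the requester may see.
     * Stops when the page is full or when timeAllowedMs has elapsed.
     * Unchecked documents are never returned, so a timeout yields fewer results, not unsafe ones.
     */
    static Page filter(List<String> sortedIds, Predicate<String> canSee, int rows,
                       long timeAllowedMs, LongSupplier clockMs) {
        long deadline = clockMs.getAsLong() + timeAllowedMs;
        List<String> kept = new ArrayList<>();
        for (String id : sortedIds) {
            if (kept.size() == rows) break;               // page is full, stop filtering
            if (clockMs.getAsLong() >= deadline) {
                return new Page(kept, true);              // budget spent, report a partial page
            }
            if (canSee.test(id)) kept.add(id);            // the expensive external check
        }
        return new Page(kept, false);
    }

    public static void main(String[] args) {
        Page page = filter(List.of("d1", "d2", "d3"), id -> !id.equals("d2"), 2, 500,
                           System::currentTimeMillis);
        System.out.println(page.ids + " partial=" + page.partial);
    }
}
```

Passing the clock in as a parameter is only there so the timeout path can be exercised deterministically; a real component would more likely read the budget from the request.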

Steve Molloy		                  steve.molloy@opentext.com
Software Architect  |  Information Discovery & Analytics R&D               
OpenText                      

-----Original Message-----
From: Steve Molloy [mailto:smolloy@opentext.com] 
Sent: January-24-13 1:16 PM
To: dev@lucene.apache.org
Subject: RE: Post-sort filtering

I was actually looking for an extension point to plug into, which I wasn't able to find by looking at the code. And yes, I'm willing to have the counts be off; the important thing is that results don't contain the wrong documents. I'd like to avoid oversampling and re-requesting because of the bandwidth and overall resource usage this implies. I'm currently trying out a "PostSortFilter" approach that I'll share if it seems interesting enough.

Steve Molloy
Software Architect  |  Information Discovery & Analytics R&D
Website:
www.opentext.com

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: January-24-13 1:11 PM
To: dev@lucene.apache.org
Subject: Re: Post-sort filtering

This has some problems. First, your facet, group, num hits, etc. counts will be off for that user. Second, you can't sort without having all of the documents, so unless you're willing to have your counts be off, you really have to pay the price of post-filtering everything.

If you can live with the counts being off, consider just having the application do a couple of round-trips. Get the docs (oversample, say just get the IDs for the top 100 docs) _without_ any kind of ACL filtering. Then send those IDs back to the server with the ACL filtering applied. If you don't get enough to fill up a response, get the next page of 100, and so on.
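
A minimal client-side sketch of the round-trip idea described above, assuming a hypothetical RankedSearch interface that returns sorted document IDs for a window and a hypothetical canSee predicate standing in for the ACL check. Neither is a Solr API; the in-memory list in main only simulates the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical client-side sketch of oversample-then-filter round trips. */
class OversampleAclPaging {

    /** Stand-in for a search call returning sorted document IDs for a window (not a Solr API). */
    interface RankedSearch {
        List<String> fetchIds(int start, int rows);
    }

    /**
     * Fetches batches of unfiltered IDs in rank order and applies the ACL check to each,
     * until enough visible documents are found or the matches run out.
     */
    static List<String> firstVisible(RankedSearch search, Predicate<String> canSee,
                                     int wanted, int batchSize) {
        List<String> visible = new ArrayList<>();
        int start = 0;
        while (visible.size() < wanted) {
            List<String> batch = search.fetchIds(start, batchSize);
            if (batch.isEmpty()) break;                   // no more matches
            for (String id : batch) {
                if (canSee.test(id)) {
                    visible.add(id);
                    if (visible.size() == wanted) break;  // response is full
                }
            }
            start += batch.size();                        // next round trip starts after this batch
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d1", "d2", "d3", "d4", "d5", "d6", "d7");
        RankedSearch search = (start, rows) -> ranked.subList(
            Math.min(start, ranked.size()), Math.min(start + rows, ranked.size()));
        Predicate<String> canSee = id -> !id.equals("d2") && !id.equals("d3");
        System.out.println(firstVisible(search, canSee, 3, 2)); // prints [d1, d4, d5]
    }
}
```

The trade-off is the one raised in this thread: hit, facet and group counts still reflect the unfiltered result set, and each extra batch costs another round trip.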

Finally, the users' list is a better place for this kind of question; this list is for discussing development of the code.

Best
Erick

On Wed, Jan 23, 2013 at 9:05 AM, Steve Molloy <sm...@opentext.com> wrote:
> Hi,
>
>     I'm looking for a way to apply filtering that unfortunately
> implies a high cost because it needs to access external resources (for
> security). I looked at (and tried) the PostFilter technique, which
> offers some advantages, but still implies a lot of matches in a lot of
> cases. What I'd like to be able to do is to filter out ids until I
> have enough to fill the response, then stop filtering (and accept
> everything). The idea is that the total count is not as important;
> the major thing is that results should not contain documents the
> requester should not see. So, the post filter almost does the trick,
> except it comes before sorting, so the first X documents are not the
> same ones that the post filter is getting.
>
> Is there a way to filter out documents after they have been scored and 
> sorted?
>
> Thanks,
> Steve
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

RE: Post-sort filtering

Posted by Steve Molloy <sm...@opentext.com>.

I understand all that (and I do want to avoid revisiting the same documents; this is just a first working version). I also know about Manifold CF, and more generally about storing security information in the index. But in some cases, this is not enough. When access to restricted content can lead to huge legal issues, companies want to make sure that there is zero latency between a permission change and its effect on access to information. So we want to have a security net after results are gathered.

And we do want to avoid putting that logic in an external component (which would definitely not be the UI anyhow) so that we can reduce the amount of information going back and forth on the wire. Anyhow, I guess you won't be voting for that one, but still, I'm open to all suggestions for improvement. :)

Steve Molloy
Software Architect  |  Information Discovery & Analytics R&D

From: Mikhail Khludnev [mailto:mkhludnev@griddynamics.com]
Sent: February-04-13 1:11 PM
To: dev@lucene.apache.org
Subject: Re: Post-sort filtering

Steve,
this question pops up from time to time, but the answer is usually no.
This approach is inefficient, and is usually proposed as a hack or workaround made in the UI (front-end app).
The current patch ruins facets, and it filters the same top docs again and again (i.e. you don't exclude the documents from step one in the following steps). Every step costs O(n log p), whereas Lucene supports deep scrolling, which makes this much more efficient.
AFAIK the common way is to use Manifold CF to index the security filter inside Solr.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

Re: Post-sort filtering

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Steve,
this question pops up from time to time, but the answer is usually no.
This approach is inefficient, and is usually proposed as a hack or workaround
made in the UI (front-end app).
The current patch ruins facets, and it filters the same top docs again and
again (i.e. you don't exclude the documents from step one in the following
steps). Every step costs O(n log p), whereas Lucene supports deep scrolling,
which makes this much more efficient.
AFAIK the common way is to use Manifold CF to index the security filter
inside Solr.
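
To make the re-filtering concern concrete, here is a small sketch (not the SOLR-4397 patch) of a filter that remembers where it stopped, so that asking for more results never re-checks a document examined in an earlier step. The class name, the in-memory sorted list and the canSee predicate are assumptions for illustration; in Lucene itself, IndexSearcher.searchAfter plays a comparable role by continuing a search after a given ScoreDoc.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: post-sort filtering that resumes from a cursor instead of restarting. */
class ResumablePostSortFilter {
    private final List<String> sortedIds;     // stand-in for the sorted result stream
    private final Predicate<String> canSee;   // stand-in for the external ACL check
    private int cursor = 0;                   // index of the next unexamined document
    private int checks = 0;                   // number of ACL checks performed so far

    ResumablePostSortFilter(List<String> sortedIds, Predicate<String> canSee) {
        this.sortedIds = sortedIds;
        this.canSee = canSee;
    }

    /** Returns up to rows visible IDs, continuing from where the previous call stopped. */
    List<String> nextPage(int rows) {
        List<String> page = new ArrayList<>();
        while (page.size() < rows && cursor < sortedIds.size()) {
            String id = sortedIds.get(cursor++);
            checks++;                          // each document is checked at most once overall
            if (canSee.test(id)) page.add(id);
        }
        return page;
    }

    int checksPerformed() { return checks; }

    public static void main(String[] args) {
        ResumablePostSortFilter filter = new ResumablePostSortFilter(
            List.of("d1", "d2", "d3", "d4", "d5", "d6"),
            id -> !id.equals("d2") && !id.equals("d5"));
        System.out.println(filter.nextPage(2));        // prints [d1, d3]
        System.out.println(filter.nextPage(2));        // prints [d4, d6]
        System.out.println(filter.checksPerformed());  // prints 6
    }
}
```

This only addresses the repeated-work objection; the facet counts are still computed over the unfiltered set.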


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>