Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/10/15 17:49:10 UTC

filter query from external list of Solr unique IDs

At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids.

Some use cases I can think of are:
1)      personal sub-collections (in our case a user can create a small subset of our 6.5 million doc collection and then run filter queries against it)
2)      tagging documents
3)      access control lists
4)      anything that needs complex relational joins
5)      a sort of alternative to incremental field updating (i.e. update in an external database or kv store)
6)      Grant's clustering cluster points and similar apps.

Grant pointed to SOLR-1715, but when I looked on JIRA, there doesn't seem to be any work on it yet.

Hoss  mentioned a couple of ideas:
            1) sub-classing query parser
        2) Having the app query a database and somehow passing something to Solr or lucene for the filter query

Can Hoss or someone else point me to more detailed information on what might be involved in the two ideas listed above?

Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids needed to implement this or is that a separate issue?


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search





RE: filter query from external list of Solr unique IDs

Posted by Jonathan Rochkind <ro...@jhu.edu>.
> You could even
>generalize the hell out of it so the SQL itself could be specified at
>request time...

>  q=solr&fq={!sql}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC

I think that's missing an argument saying which field in the Solr index has to fall within the values generated by the SQL. Or maybe it's meant to assume the identifier field, but it would be an interesting generalization to allow any field. 

 q=solr&fq={!sql field=id}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC

And then, thinking further: in addition to an external SQL source, how about generalizing this to an alternate 'sub' query on the Solr index itself?  

q=solr&fq={!join_query field=id on_stored_field=id}foo:bar AND something_else

['subquery' is already taken as a defType, for purposes not entirely suitable here... I think? ]

Or even a different core! 

q=solr&fq={!join_query core=different_one field=id on_stored_field=id}foo:bar AND something_else

If that could be done reasonably efficiently, it would actually solve a whole BUNCH of problem cases that come up now and then. (And yes, people would have to be cautioned not to immediately reach for this type of solution just because they are used to thinking in terms of an rdbms; but for some problems it really would allow things that just aren't easily possible otherwise.)

Re: filter query from external list of Solr unique IDs

Posted by Chris Hostetter <ho...@fucit.org>.
: Hoss  mentioned a couple of ideas:
:             1) sub-classing query parser
:         2) Having the app query a database and somehow passing something 
: to Solr or lucene for the filter query

The approach i was referring to is something one of my coworkers did a 
while back (if he's still lurking on the list, maybe he'll speak up).

He implemented a custom "SqlFilterQuery" class that was constructed from a 
JDBC URL and a SQL statement.  The SqlFilterQuery rewrote to itself (so it 
was a primitive query class) and returned a Scorer that would:

1) execute the SQL query (which should return a sorted list of uniqueKey 
field values) and retrieve a JDBC iterator (cursor?) over the results
2) fetch a TermEnum from Lucene for the uniqueKey field
3) use the JDBC iterator to skip ahead on the TermEnum, and for each 
uniqueKey found get the underlying Lucene docid and record it in a DocSet

As i recall, my coworker was using this in a custom RequestHandler, where 
he was then forcibly putting that DocSet in the filterCache so that it 
would be there on future requests, and it would be regenerated by 
autoWarming (the advantage of implementing this logic using the Query 
interface) but it could also be done with a custom cache if you don't want 
these to contend for space in the filterCache.
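
[A rough sketch of what steps 2-3 might look like -- not the coworker's actual code: it uses the current TermsEnum/PostingsEnum API rather than the older TermEnum mentioned above, and the class name, method signature, and "id" field are invented for illustration.]

  import java.io.IOException;
  import java.util.List;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.MultiTerms;
  import org.apache.lucene.index.PostingsEnum;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.search.DocIdSetIterator;
  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.FixedBitSet;

  public class IdListFilter {
    /** Collect the Lucene docids for a sorted list of uniqueKey values. */
    public static FixedBitSet docsForIds(IndexReader reader, String idField,
                                         List<BytesRef> sortedIds) throws IOException {
      FixedBitSet bits = new FixedBitSet(reader.maxDoc());
      Terms terms = MultiTerms.getTerms(reader, idField);
      if (terms == null) return bits;           // field absent from the index
      TermsEnum te = terms.iterator();
      PostingsEnum pe = null;
      for (BytesRef id : sortedIds) {           // ids arrive sorted, per step 1
        if (te.seekExact(id)) {
          pe = te.postings(pe, PostingsEnum.NONE);
          int doc;
          while ((doc = pe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            bits.set(doc);                      // record the docid in the "DocSet"
          }
        }
      }
      return bits;
    }
  }

[A bit set like this is what would back the DocSet that the RequestHandler then pushes into the filterCache, as described above.]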

My point about the query parser was that instead of needing to use a custom 
RequestHandler (or even a custom SearchComponent) to generate this DocSet 
for filtering, you could probably do it using a QParserPlugin -- that way 
you could use a regular "fq" param to generate the filter.  You could even 
generalize the hell out of it so the SQL itself could be specified at 
request time...

  q=solr&fq={!sql}SELECT ID FROM USER_MAP WHERE USER=1234 ORDER BY ID ASC
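
[Purely as a sketch of that wiring, not a tested implementation: the plugin name "sql", the "field" and "jdbc" local params, the default JDBC URL, and the use of a set-of-terms query to hold the ids are all assumptions, not anything from the thread.]

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.SQLException;
  import java.sql.Statement;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermInSetQuery;
  import org.apache.lucene.util.BytesRef;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class SqlFilterQParserPlugin extends QParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          String field = localParams.get("field", "id");     // uniqueKey field
          String jdbcUrl = localParams.get("jdbc",
              "jdbc:postgresql://localhost/usermaps");        // illustrative default
          List<BytesRef> ids = new ArrayList<>();
          // qstr is everything after the {!sql ...} local params, i.e. the SQL itself
          try (Connection con = DriverManager.getConnection(jdbcUrl);
               Statement st = con.createStatement();
               ResultSet rs = st.executeQuery(qstr)) {
            while (rs.next()) {
              ids.add(new BytesRef(rs.getString(1)));
            }
          } catch (SQLException e) {
            throw new SyntaxError("SQL lookup failed: " + e.getMessage());
          }
          // One primitive query over the whole id list; used as an fq, its DocSet
          // lands in the filterCache just like the hand-built version above.
          return new TermInSetQuery(field, ids);
        }
      };
    }
  }

[Registered in solrconfig.xml with something like <queryParser name="sql" class="com.example.SqlFilterQParserPlugin"/>, the fq above would then work as written, with the SQL supplied at request time.]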



-Hoss

RE: filter query from external list of Solr unique IDs

Posted by samabhiK <qe...@gmail.com>.
Does anything already exist in Solr 4.3 to meet this use case?




RE: filter query from external list of Solr unique IDs

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Or an alternate, simpler solution that requires the client to do more work, but gets around both the inefficiency of a whole lot of OR clauses and the limit on the number of clauses: provide a qparser that accepts an "in" query. 

&fq={!in field=id}100,150,201,304,etc.

It could, it seems from Hoss et al's suggestions, be processed much more efficiently than id:100 OR id:150 etc., and also without running up against the limit on the number of clauses. It could still result in a very large HTTP request to Solr, though, which may or may not be a problem. 

(At first I was wondering whether it's a problem that ALL of these suggested solutions take up a spot in the filter cache, since you're using fq -- but I think that's actually a benefit rather than a problem: in more cases than not it will actually be convenient for the result to be in the filter cache, since it very well may end up being used multiple times.)
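
[For reference: this is essentially the syntax that later shipped in Solr as the "terms" query parser, e.g.

  fq={!terms f=id}100,150,201,304

which builds a single set-of-terms filter instead of a big boolean OR; availability and exact parameter names depend on the Solr version.]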

RE: filter query from external list of Solr unique IDs

Posted by Jonathan Rochkind <ro...@jhu.edu>.
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. 

If you do this, I'd definitely be interested in using it too. I have the same sort of use cases as you. 

It would be interesting to figure out whether such a component could be used for "join-like behavior" too, as you included in your original use cases. I'm not entirely sure what that would look like, but I have some problems which, when I try to solve them with Solr, run up against the lack of ability to do that (and there are lots of questions on the listserv asking "how do I do a join in Solr" -- often the questioners can and should solve their problem without needing a join, but sometimes there really isn't a good solution without it). When I try to spec out workarounds in my head, it often comes down to the need to do what you're describing: efficiently run a query in Solr limited by a known list of Solr ids -- or in my case, sometimes a known (but lengthy) "OR"d list of facet values for an fq limit. Really it's the same pattern; it doesn't matter whether the field is the id field or something else.

I'm not really sure what the API for a component like this, meant to support 'join'-type behavior, would look like, but it would be interesting to think about. Maybe it needs to be able to generate the values against an alternate Solr core, with a specified query against that core, in addition to being able to generate them with a specified query from a JDBC or kv-store lookup? Or in some cases even the same Solr core -- do a query against the same core, take one stored field from the results, and use it to filter the result set of a subsequent query. 
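
[For reference: the shape described here -- take a field from the results of one query, possibly against another core, and use it to restrict a second query -- is roughly what Solr's join query parser later provided, e.g.

  q=solr&fq={!join from=member_id to=id fromIndex=collections}collection_id:1234

with the field and core names invented for illustration; exact parameters and cross-core support depend on the Solr version.]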

Re: filter query from external list of Solr unique IDs

Posted by eks dev <ek...@yahoo.co.uk>.
if your index is read-only in production, can you add a mapping from
unique_id to Lucene docId in your kv store and build the filters externally?
That would make the uniqueKey obsolete in your production index, as you would
work at the Lucene doc id level.

That way, you move the problem offline to the update/optimize phase. The ugly
part is a lot of updates on your kv-store...

I am not really familiar with Solr, but working directly with Lucene this is
doable -- even having a parallel index that has the unique ID as a stored
field, and another index with the indexed fields on the update master, and
then having only the index with the indexed fields in production.





On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom <tb...@umich.edu>wrote:

> Hi Jonathan,
>
> The advantages of the obvious approach you outline are that it is simple,
> it fits in to the existing Solr model, it doesn't require any customization
> or modification to Solr/Lucene java code.  Unfortunately, it does not scale
> well.  We originally tried just what you suggest for our implementation of
> Collection Builder.  For a user's personal collection we had a table that
> maps the collection id to the unique Solr ids.
> Then when they wanted to search their collection, we just took their search
> and added a filter query with the fq=(id:1 OR id:2 OR....).   I seem to
> remember running in to a limit on the number of OR clauses allowed. Even if
> you can set that limit larger, there are a  number of efficiency issues.
>
> We ended up constructing a separate Solr index where we have a multi-valued
> collection number field. Unfortunately, until incremental field updating
> gets implemented, this means that every time someone adds a document to a
> collection, the entire document (including 700KB of OCR) needs to be
> re-indexed just to update the collection number field. This approach has
> allowed us to scale up to a total of something under 100,000 documents, but
> we don't think we can scale it much beyond that for various reasons.
>
> I was actually thinking of some kind of custom Lucene/Solr component that
> would for example take a query parameter such as &lookitUp=123 and the
> component might do a JDBC query against a database or kv store and return
> results in some form that would be efficient for Solr/Lucene to process. (Of
> course this assumes that a JDBC query would be more efficient than just
> sending a long list of ids to Solr).  The other part of the equation is
> mapping the unique Solr ids to internal Lucene ids in order to implement a
> filter query.   I was wondering if something like the unique id to Lucene id
> mapper in zoie might be useful or if that is too specific to zoie. So this
> may be totally off-base, since I haven't looked at the zoie code at all yet.
>
> In our particular use case, we might be able to build some kind of
> in-memory map after we optimize an index and before we mount it in
> production. In our workflow, we update the index and optimize it before we
> release it and once it is released to production there is no
> indexing/merging taking place on the production index (so the internal
> Lucene ids don't change.)
>
> Tom
>
>
>
> -----Original Message-----
> From: Jonathan Rochkind [mailto:rochkind@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
>
> Definitely interested in this.
>
> The naive obvious approach would be just putting all the ID's in the query.
> Like fq=(id:1 OR id:2 OR....).  Or making it another clause in the 'q'.
>
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> ________________________________________
>

RE: filter query from external list of Solr unique IDs

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Hi Jonathan,

The advantages of the obvious approach you outline are that it is simple, it fits in to the existing Solr model, it doesn't require any customization or modification to Solr/Lucene java code.  Unfortunately, it does not scale well.  We originally tried just what you suggest for our implementation of Collection Builder.  For a user's personal collection we had a table that maps the collection id to the unique Solr ids.
Then when they wanted to search their collection, we just took their search and added a filter query with the fq=(id:1 OR id:2 OR....).   I seem to remember running in to a limit on the number of OR clauses allowed. Even if you can set that limit larger, there are a  number of efficiency issues.  

We ended up constructing a separate Solr index where we have a multi-valued collection number field. Unfortunately, until incremental field updating gets implemented, this means that every time someone adds a document to a collection, the entire document (including 700KB of OCR) needs to be re-indexed just to update the collection number field. This approach has allowed us to scale up to a total of something under 100,000 documents, but we don't think we can scale it much beyond that for various reasons.

I was actually thinking of some kind of custom Lucene/Solr component that would for example take a query parameter such as &lookitUp=123 and the component might do a JDBC query against a database or kv store and return results in some form that would be efficient for Solr/Lucene to process. (Of course this assumes that a JDBC query would be more efficient than just sending a long list of ids to Solr).  The other part of the equation is mapping the unique Solr ids to internal Lucene ids in order to implement a filter query.   I was wondering if something like the unique id to Lucene id mapper in zoie might be useful or if that is too specific to zoie. So this may be totally off-base, since I haven't looked at the zoie code at all yet.

In our particular use case, we might be able to build some kind of in-memory map after we optimize an index and before we mount it in production. In our workflow, we update the index and optimize it before we release it and once it is released to production there is no indexing/merging taking place on the production index (so the internal Lucene ids don't change.)  

Tom



-----Original Message-----
From: Jonathan Rochkind [mailto:rochkind@jhu.edu] 
Sent: Friday, October 15, 2010 1:07 PM
To: solr-user@lucene.apache.org
Subject: RE: filter query from external list of Solr unique IDs

Definitely interested in this. 

The naive obvious approach would be just putting all the ID's in the query. Like fq=(id:1 OR id:2 OR....).  Or making it another clause in the 'q'.  

Can you outline what's wrong with this approach, to make it more clear what's needed in a solution?
________________________________________

RE: filter query from external list of Solr unique IDs

Posted by Demian Katz <de...@villanova.edu>.
The main problem I've encountered with the "lots of OR clauses" approach is that you eventually hit the limit on Boolean clauses and the whole query fails.  You can keep raising the limit through the Solr configuration, but there's still a ceiling eventually.
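
[The relevant setting is maxBooleanClauses in solrconfig.xml, e.g. <maxBooleanClauses>4096</maxBooleanClauses>; each id still becomes its own BooleanClause, so parsing cost and memory keep growing with the size of the list.]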

- Demian

> -----Original Message-----
> From: Jonathan Rochkind [mailto:rochkind@jhu.edu]
> Sent: Friday, October 15, 2010 1:07 PM
> To: solr-user@lucene.apache.org
> Subject: RE: filter query from external list of Solr unique IDs
> 
> Definitely interested in this.
> 
> The naive obvious approach would be just putting all the ID's in the
> query. Like fq=(id:1 OR id:2 OR....).  Or making it another clause in
> the 'q'.
> 
> Can you outline what's wrong with this approach, to make it more clear
> what's needed in a solution?
> ________________________________________
> From: Burton-West, Tom [tburtonw@umich.edu]
> Sent: Friday, October 15, 2010 11:49 AM
> To: solr-user@lucene.apache.org
> Subject: filter query from external list of Solr unique IDs
> 
> At the Lucene Revolution conference I asked about efficiently building
> a filter query from an external list of Solr unique ids.
> 
> Some use cases I can think of are:
> 1)      personal sub-collections (in our case a user can create a small
> subset of our 6.5 million doc collection and then run filter queries
> against it)
> 2)      tagging documents
> 3)      access control lists
> 4)      anything that needs complex relational joins
> 5)      a sort of alternative to incremental field updating (i.e.
> update in an external database or kv store)
> 6)      Grant's clustering cluster points and similar apps.
> 
> Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't
> seem to be any work on it yet.
> 
> Hoss  mentioned a couple of ideas:
>             1) sub-classing query parser
>         2) Having the app query a database and somehow passing
> something to Solr or lucene for the filter query
> 
> Can Hoss or someone else point me to more detailed information on what
> might be involved in the two ideas listed above?
> 
> Is somehow keeping an up-to-date map of unique Solr ids to internal
> Lucene ids needed to implement this or is that a separate issue?
> 
> 
> Tom Burton-West
> http://www.hathitrust.org/blogs/large-scale-search
> 
> 
> 


RE: filter query from external list of Solr unique IDs

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Definitely interested in this. 

The naive obvious approach would be just putting all the ID's in the query. Like fq=(id:1 OR id:2 OR....).  Or making it another clause in the 'q'.  

Can you outline what's wrong with this approach, to make it more clear what's needed in a solution?
________________________________________
From: Burton-West, Tom [tburtonw@umich.edu]
Sent: Friday, October 15, 2010 11:49 AM
To: solr-user@lucene.apache.org
Subject: filter query from external list of Solr unique IDs

At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids.

Some use cases I can think of are:
1)      personal sub-collections (in our case a user can create a small subset of our 6.5 million doc collection and then run filter queries against it)
2)      tagging documents
3)      access control lists
4)      anything that needs complex relational joins
5)      a sort of alternative to incremental field updating (i.e. update in an external database or kv store)
6)      Grant's clustering cluster points and similar apps.

Grant pointed to SOLR 1715, but when I looked on JIRA, there doesn't seem to be any work on it yet.

Hoss  mentioned a couple of ideas:
            1) sub-classing query parser
        2) Having the app query a database and somehow passing something to Solr or lucene for the filter query

Can Hoss or someone else point me to more detailed information on what might be involved in the two ideas listed above?

Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids needed to implement this or is that a separate issue?


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search





RE: filter query from external list of Solr unique IDs

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Yonik,

Is this something you might have time to throw together, or an outline of what needs to be thrown together?
Is this something that should be asked on the developers' list or discussed in SOLR-1715, or does it make the most sense to keep the discussion in this thread?

Tom

-----Original Message-----
From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
Sent: Friday, October 15, 2010 1:19 PM
To: solr-user@lucene.apache.org
Subject: Re: filter query from external list of Solr unique IDs

On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom <tb...@umich.edu> wrote:
> At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids.
Yeah, I've thought about a special query parser and query to deal with
this (relatively) efficiently, both from a query perspective and a
memory perspective.

Should be pretty quick to throw together:
- comma separated list of terms (unique ids are a special case of this)
- in the query, store as a single byte array for efficiency
- sort the ids if they aren't already sorted
- do lookups with a term enumerator and skip weighting or anything
else like that
- configurable caching... may, or may not want to cache this big query

That's only part of the stuff you mention, but seems like it would be
useful to a number of people.

-Yonik
http://www.lucidimagination.com

Re: filter query from external list of Solr unique IDs

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom <tb...@umich.edu> wrote:
> At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids.

Yeah, I've thought about a special query parser and query to deal with
this (relatively) efficiently, both from a query perspective and a
memory perspective.

Should be pretty quick to throw together:
- comma separated list of terms (unique ids are a special case of this)
- in the query, store as a single byte array for efficiency
- sort the ids if they aren't already sorted
- do lookups with a term enumerator and skip weighting or anything
else like that
- configurable caching... may, or may not want to cache this big query

That's only part of the stuff you mention, but seems like it would be
useful to a number of people.
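
[For reference: later Lucene versions provide essentially this as TermInSetQuery -- the terms are sorted and prefix-coded into a single byte array and matched by TermsEnum seeks, with constant score -- and Solr exposes it via the {!terms} query parser; per-request cache control is available through the cache local param, e.g.

  fq={!terms f=id cache=false}100,150,201,304

Exact class and parameter names depend on the version.]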

-Yonik
http://www.lucidimagination.com