Posted to solr-user@lucene.apache.org by Jeff Schmidt <ja...@535consulting.com> on 2011/12/02 05:47:20 UTC

Possible to facet across two indices, or document types in single index?

Hello:

I'm trying to relate two different types of documents.  Currently I have 'node' documents that reside in one index (core), and 'product mapping' documents that are in another index.  The product mapping index is used to map tenant products to nodes. The nodes are canonical content that gets updated every quarter, whereas the product mappings can change at any time.

I put them in two indexes because (1) canonical content changes rarely, and I don't want product mapping changes to affect it (commits, re-opening searchers, etc.), and (2) I would like to support multiple tenants mapping products to the same canonical content without duplicating it (a few GB).

This arrangement has worked well thus far, but only in the sense that for each node result returned, I can query the product mapping index to determine the products mapped to the node.  I combine this information within my application and return it to the client.  This works okay in that there are only 5-20 results returned per page (start, rows).  But now I'm being asked to facet the product categories (multi-valued field within a product mapping document) along with other facets defined in the canonical content.

Can this be done with Solr 3.5.0?  I've been looking into sub-queries, function queries etc.  Also, I've seen various postings indicating that one needs to denormalize more.  I don't want to add product information as fields to the canonical content. Not only does that defeat my objective (1) above, but Solr does not support incremental updates of document fields.

So, one approach is to issue my query to the canonical index and get all of the document IDs (could be 1000s), and then issue a filter query to the product mapping index with all of these IDs and have Solr facet the product categories.  Is that efficient?  I suppose I could use HTTP POST (via SolrJ) to convey that payload of IDs?  I could then take the facet results of that query and combine them with the canonical index results and return them to the client.
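
A minimal sketch of the second request in this approach, written in Python for brevity rather than SolrJ. It only builds the form-encoded body that would be sent via HTTP POST to the product mapping core. The field name `productCategory` is a hypothetical placeholder, and `conceptId` is assumed to be the field in the product mapping documents that refers to a node ID. A filter with thousands of OR clauses may exceed the `maxBooleanClauses` setting in solrconfig.xml (1024 by default), so that limit may need raising.

```python
from urllib.parse import urlencode


def build_facet_request(node_ids, id_field="conceptId",
                        facet_field="productCategory"):
    """Return a form-encoded body that facets one core over a set of IDs.

    id_field and facet_field are assumptions for illustration only.
    """
    if not node_ids:
        raise ValueError("need at least one ID")
    # Quote each ID so characters such as ':' are not read as query syntax;
    # inside a quoted phrase only backslash and double quote need escaping.
    quoted = " OR ".join(
        '"%s"' % i.replace("\\", "\\\\").replace('"', '\\"')
        for i in node_ids
    )
    params = [
        ("q", "*:*"),                        # match everything ...
        ("fq", "%s:(%s)" % (id_field, quoted)),  # ... restricted to the IDs
        ("rows", "0"),                       # only the facet counts matter
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),
    ]
    return urlencode(params)
```

Sending the parameters in a POST body rather than in the URL avoids any URL length limit between the application and Solr.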

That may be do-able, but then let's say the user clicks on a product category facet value to narrow the node results to only those mapped to category XYZ. This will not affect the query issued against the canonical content index.  Instead, I think I'd have to go through the canonical results and eliminate the nodes that are not associated with product category XYZ.  Then, if the current page of results is inadequate (rows=10, but 3 nodes were eliminated), I'd have to go back to the canonical index to get more rows, perhaps eliminate some again, get more, etc.  That sounds unappealing and low performing.
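
The refill loop described above can be sketched as follows, to show why it is costly. `fetch` and `keep` are hypothetical callables: `fetch(start, rows)` stands in for a query against the canonical index, and `keep(doc)` stands in for the check that a node is mapped to the selected category. This is only an illustration of the loop, not a recommended design.

```python
def fill_page(fetch, keep, rows):
    """Collect `rows` surviving documents, re-querying until the page is full.

    fetch(start, rows) -> list of docs from the canonical index (hypothetical)
    keep(doc)          -> True if the doc survives the category filter
    """
    page, start = [], 0
    while len(page) < rows:
        batch = fetch(start, rows)
        if not batch:
            break  # the canonical results are exhausted
        page.extend(doc for doc in batch if keep(doc))
        start += len(batch)
    return page[:rows]
```

Each extra round trip fetches another full batch, and this only fills the first page: serving page N would mean remembering how far into the canonical results the previous pages reached.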

Is there a Solr way to do this?  My Packt "Apache Solr 3 Enterprise Search Server" book (page 34) states regarding separate indices:

	"If you do develop separate schemas and if you need to search across your indices in one search then you must perform a distributed search, described in the last chapter. A distributed search is usually a feature employed for a large corpus but it applies here too."

But in the chapter it goes on to talk about dealing with sharding, replication etc. to support a large corpus, not necessarily tying together two different indexes.

Is it possible to accomplish my goal in a less ugly way than I outlined above?  Since we only have a single tenant to worry about, I could use a combined index at least for a few months (separate fields per document type, IDs are unique among them all) if that makes a difference.

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068

Re: Possible to facet across two indices, or document types in single index?

Posted by Chris Hostetter <ho...@fucit.org>.
: Chris, you replied:
: 
: > : But there is a workaround:
: > : 1) Do a normal query without facets (you only need to request doc ids
: > : at this point)
: > : 2) Collect all the IDs of the documents returned
: > : 3) Do a second query for all fields and facets, adding a filter to
: > : restrict result to those IDs collected in step 2.

FYI: that was actually Erick's suggestion, i just pointed out that #3 
wasn't necessary if you *only* care about the docs on page #1 ... but 
given that in your situation you really need data from two different 
collections, it's a much different problem.

: When the initial search query comes in, I can do 1-3 above as you 
: describe.  I have fewer than 200K documents in the index. Given the 
: generalness of the search terms, let's say I get 7500 document IDs back 
: per 1 and 2.  It sounds like I need to create a filter query which 
: includes all 7500 IDs, and issue the 2nd query (in my case to another 
: core) and have it facet on the additional field(s) I'm interested in.  
: I don't need to return results from this, just get the facet 
: values/counts.

so far so good -- what you are describing is exactly what Join does (or 
specifically: what it was designed to do except for the annoying bug in how 
it parses the query) except that you are choosing to ignore the "results" 
and only look at the facet counts.

: Step 4 for me is to search the first index again, to obtain the 
: requested number of rows of results, return the appropriate fields, and 
: calculate facets for that content.  I can then merge the facet results 
: of both indexes, and the client is none the wiser.

here's where you've lost me...

how are you going to "merge" the facet counts from the two cores?  you 
could just lump them all in together (fieldA1 and fieldA2 from coreA, 
in a map with fieldB1 and fieldB2 from coreB) but they are counting 
completely different things from completely different cores -- if your main 
result set is from coreA, but you also show these facet counts based on 
the "join" against coreB, the constraint counts for values from fieldB2 
aren't going to mean much relative to the results you return.

I mean: consider a concrete example of having a "books" core and an 
"authors" core - where every book has a field identifying the author by id.

if a user searches for authors who live in oregon, and then you get that 
list of 98 authors, and "join" them against the books core and facet on 
"genre" you can return some data like this...

  Genre:
   Biography: 1023
   Romance: 854
   Mystery: 674
   ...

...but that doesn't really tell you anything about the "author" documents 
you are returning does it?  you know that some subset of those 98 authors 
wrote a total of 854 romance novels, but is that actually useful in 
some way?  I suspect what you really want is to know the number of 
*authors* who have written books in each of those genres -- and nothing 
you've described so far will get you that.  (once again, we're back to the 
issue of denormalizing)
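
The difference between the two kinds of counts can be sketched with a few in-memory records. The data and the field names `author_id` and `genre` are made up for illustration; nothing here calls Solr.

```python
from collections import Counter

# Hypothetical "books" documents, each pointing at its author by id.
books = [
    {"author_id": "a1", "genre": "Romance"},
    {"author_id": "a1", "genre": "Romance"},
    {"author_id": "a2", "genre": "Romance"},
    {"author_id": "a2", "genre": "Mystery"},
]


def book_counts(books):
    """What faceting the books core on genre yields: one count per book."""
    return Counter(b["genre"] for b in books)


def author_counts(books):
    """What the author result list needs: one count per distinct author."""
    pairs = {(b["author_id"], b["genre"]) for b in books}
    return Counter(genre for _, genre in pairs)
```

With this data `book_counts` reports 3 for Romance while `author_counts` reports 2, because author a1 wrote two romance novels but is still only one author in the result list.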

Setting aside that issue for a moment...

: A couple questions though (aren't there always? :))  Is this very 
: efficient?  Beyond building the string of 7500 IDs within my app, can 
: Solr swallow that okay?  I'm using SolrJ, javabin format, so hopefully 
: there is not a URL length issue (between my app and Solr)?  I'm guessing 
: javabin uses HTTP POST.

"efficient" is vague... it can be done, but there's a lot of data going 
over the wire.  it would probably be more efficient to do it server side in 
a custom request handler (similar to how Join works)

: What is a reasonable way for the facets derived from the 2nd index to be 
: used for narrowing like those in the main content index? That is, 
: pinning down facet values from the second index is not going to affect 
: the results (document IDs) from searching the first index.  Perhaps that 

Now we're back to the problem i mentioned before, except you're 
describing it at the moment when a person attempts to filter on a facet 
constraint -- but as i've pointed out, you already have to deal with this 
just to generate the list of facet constraints and their counts.


-Hoss

Re: Possible to facet across two indices, or document types in single index?

Posted by Jeff Schmidt <ja...@535consulting.com>.
Hi again:

I figured it'd be bad form to hijack Kashif's thread, so I'll just leverage some of its content here. :)

Chris, you replied:

> : But there is a workaround:
> : 1) Do a normal query without facets (you only need to request doc ids
> : at this point)
> : 2) Collect all the IDs of the documents returned
> : 3) Do a second query for all fields and facets, adding a filter to
> : restrict result to those IDs collected in step 2.
> 
> an easier solution, if you really just want the counts based on the data 
> from the page the user is looking at, is to count up the values in your UI 
> from the stored fields you get back.
> 
> (This is the type of thing that falls into the general category of "stuff 
> the client can do just as easily as Solr" so there isn't really any reason
> to consider implementing it in/with Solr.)


Moving on from join, my alternative solution is to do something like you described.  I am a server-side/API guy, and my application stands in between the UI/client and Solr.  I currently offer a number of facets, based on a single document type, and I want to add my new (cross-document) facets so as not to impose implementation details on the API client.  That is, they should receive values/counts and be able to specify values for narrowing down results, page facet values etc.  That is the ideal situation, anyway.

When the initial search query comes in, I can do 1-3 above as you describe.  I have fewer than 200K documents in the index. Given the generalness of the search terms, let's say I get 7500 document IDs back per 1 and 2.  It sounds like I need to create a filter query which includes all 7500 IDs, and issue the 2nd query (in my case to another core) and have it facet on the additional field(s) I'm interested in.  I don't need to return results from this, just get the facet values/counts.

Step 4 for me is to search the first index again, to obtain the requested number of rows of results, return the appropriate fields, and calculate facets for that content.  I can then merge the facet results of both indexes, and the client is none the wiser.

A couple questions though (aren't there always? :))  Is this very efficient?  Beyond building the string of 7500 IDs within my app, can Solr swallow that okay?  I'm using SolrJ, javabin format, so hopefully there is not a URL length issue (between my app and Solr)?  I'm guessing javabin uses HTTP POST.

What is a reasonable way for the facets derived from the 2nd index to be used for narrowing like those in the main content index? That is, pinning down facet values from the second index is not going to affect the results (document IDs) from searching the first index.  Perhaps that can be resolved by performing steps 1-3 as before, but also retrieving the related ID values (a field in the second index that refers to the document ID in the first index) from the second index, and doing a set intersection with the document IDs of the first index (step 2). Then I modify my step 4 to filter on the document ID intersection. This would cause documents in the first index to drop out due to narrowing a facet in the second index.
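
The intersection step described above can be sketched as follows. The field names `conceptId` (the reference back to the first index) and `productCategory` are assumptions for illustration, and the mapping documents are plain dictionaries standing in for results from the second index.

```python
def narrowed_ids(first_index_ids, mapping_docs, category,
                 ref_field="conceptId", cat_field="productCategory"):
    """Keep only the first-index IDs that a matching mapping doc refers to.

    ref_field and cat_field are hypothetical field names.
    """
    # IDs referenced by mapping documents carrying the selected category.
    referenced = {
        d[ref_field]
        for d in mapping_docs
        if category in d.get(cat_field, [])
    }
    # Intersect, preserving the relevance order of the first query.
    return [i for i in first_index_ids if i in referenced]
```

The surviving IDs would then be used as the filter for step 4, so the cost still grows with the number of IDs passed between the two queries.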

So, from a performance or strategy perspective, is it a bit crazy to make a go of this?

Thanks,

Jeff

On Dec 11, 2011, at 12:17 PM, Jeff Schmidt wrote:

> Thanks Chris.  I was just going to sit down and see if I could get join to do what I want within a single index.  I'm glad I checked my email first. :)
> 
> However, I need to see how else to solve the problem, and it looks like the most appropriate line of reasoning is in your response to Kashif Kahn's 05-Dec-11 email thread with the subject "Facet on a field with rows=n".  I'll reply to that and officially abandon my pursuit of join.
> 
> Cheers,
> 
> Jeff
> 

Re: Possible to facet across two indices, or document types in single index?

Posted by Jeff Schmidt <ja...@535consulting.com>.
Thanks Chris.  I was just going to sit down and see if I could get join to do what I want within a single index.  I'm glad I checked my email first. :)

However, I need to see how else to solve the problem, and it looks like the most appropriate line of reasoning is in your response to Kashif Kahn's 05-Dec-11 email thread with the subject "Facet on a field with rows=n".  I'll reply to that and officially abandon my pursuit of join.

Cheers,

Jeff

On Dec 9, 2011, at 1:34 PM, Chris Hostetter wrote:

> 
> : What you said about faceting is the key.  I want to use my existing 
> : edismax configuration to create the scored document result set of type 
> : Y.  I don't want to affect their scores, but for each document ID, I 
> : want to join it with another type of document (X), which has a field which 
> : contains a document ID of one of Y. There will be zero or more of these 
> : per Y doc ID.  The X document then has a multi-valued field I would like 
> : to facet. I don't need scores for the joined X documents.
> 
> If i'm following you correctly, then what you are asking about just isn't 
> possible with "join".
> 
> the crux of the issue is that you have a particular type of document you 
> want *returned* to the users as the results of a search, sorted by score.  
> that set of documents is what you "join to".  All faceting, stats, 
> highlighting, etc... are based on that final set of documents -- the 
> documents you "join from" can only contribute in the query by identifying 
> the set -- none of their field values or properties "survive the join" so 
> to speak.
> 
> : This does not sound possible according to the end of your final 
> : paragraph.  Is that because two cores are involved?  Despite the join 
> 
> no .. it has nothing to do with the multiple core part of the problem -- 
> it's just how join works.  it identifies an (unordered) set of "joined to" 
> documents based on the matching "joined from" documents.
> 
> 
> -Hoss

Re: Possible to facet across two indices, or document types in single index?

Posted by Chris Hostetter <ho...@fucit.org>.
: What you said about faceting is the key.  I want to use my existing 
: edismax configuration to create the scored document result set of type 
: Y.  I don't want to affect their scores, but for each document ID, I 
: want to join it with another type of document (X), which has a field which 
: contains a document ID of one of Y. There will be zero or more of these 
: per Y doc ID.  The X document then has a multi-valued field I would like 
: to facet. I don't need scores for the joined X documents.

If i'm following you correctly, then what you are asking about just isn't 
possible with "join".

the crux of the issue is that you have a particular type of document you 
want *returned* to the users as the results of a search, sorted by score.  
that set of documents is what you "join to".  All faceting, stats, 
highlighting, etc... are based on that final set of documents -- the 
documents you "join from" can only contribute in the query by identifying 
the set -- none of their field values or properties "survive the join" so 
to speak.

: This does not sound possible according to the end of your final 
: paragraph.  Is that because two cores are involved?  Despite the join 

no .. it has nothing to do with the multiple core part of the problem -- 
it's just how join works.  it identifies an (unordered) set of "joined to" 
documents based on the matching "joined from" documents.


-Hoss

Re: Possible to facet across two indices, or document types in single index?

Posted by Jeff Schmidt <ja...@535consulting.com>.
Hi Chris:

Thanks a lot for your response. This is the kind of information I'm looking for. 

What you said about faceting is the key.  I want to use my existing edismax configuration to create the scored document result set of type Y.  I don't want to affect their scores, but for each document ID, I want to join it with another type of document (X), which has a field which contains a document ID of one of Y. There will be zero or more of these per Y doc ID.  The X document then has a multi-valued field I would like to facet. I don't need scores for the joined X documents.

This does not sound possible according to the end of your final paragraph.  Is that because two cores are involved?  Despite the join syntax, I don't see any way to specify a facet.field parameter which indicates both a core and a field name. What if a single core is used (no fromIndex), containing both the X and Y type documents?  They will all have unique IDs, and could I specify facet.field=y_abc&facet.field=x_abc?

I know Solr is all about denormalization, but I don't want to have to add frequently changing fields (X) to my very infrequently updated canonical content (Y) just to be able to facet on them.

Thanks!

Jeff

On Dec 5, 2011, at 5:11 PM, Chris Hostetter wrote:

> Jeff,
> 
> I'm not entirely understanding everything you've been asking about (in 
> terms of what your ultimate goal is) but as far as the JoinQParser 
> specifically...
> 
> 
> : http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type:node&q={!join+from=conceptId+to=id+fromIndex=partner-tmo}brca1&debugQuery=true&rows=5&fl=id,n_type,n_name
> 	...
> :         <str name="parsedquery_toString">{!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca</str>
> 	...
> : It looks like despite qt=partner-tmo, the edismax based search handler is 
> : being bypassed for the default search handler, and is querying against 
> : the n_text field, which is the defaultSearchField for the ing-content 
> : core.  But, I don't want to use the default handler, but rather my 
> : configured edismax handler, and any specified filter queries, to 
> : determine the document set in the ing-content core, and then join with 
> : the partner-tmo core.  [Yes, the edismax handler in the ing-content core 
> : and the second core are both named partner-tmo].
> 
> ...i *think* what you are getting bitten by here is SOLR-2824 - a bug in 
> the JoinQParser relating to how it parses the query that it should be 
> executing against the "fromIndex".  At the moment it is *parsed* according 
> to the configs of the index you are querying against, and then that query 
> is *executed* against the SolrCore identified by the "fromIndex" param ... 
> i'm not sure if knowing that will help you work around this bug until it 
> gets fixed, but it might help if you can tweak your configs/request to 
> make the query "make sense" in your "ing-content" collection.
> 
> In general though, i'm not certain that what you are trying to do will be 
> solvable with Join, based on some of your earlier comments -- the main 
> thing to remember is that {!join} is just a QParser that only matches some 
> document Y if Y's "to" field "joins up" against some other document X's 
> "from" field and document X matches the query the {!join} wraps.  It 
> doesn't give you any of the scores from the joined X documents, or cause 
> any of the fields from X to be useable when faceting on Y.  (Just putting 
> all that out there so you know in case those are deal breakers that are 
> going to force you to re-think your approach)
> 
> 
> -Hoss

Re: Possible to facet across two indices, or document types in single index?

Posted by Chris Hostetter <ho...@fucit.org>.
Jeff,

I'm not entirely understanding everything you've been asking about (in 
terms of what your ultimate goal is) but as far as the JoinQParser 
specifically...


: http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type:node&q={!join+from=conceptId+to=id+fromIndex=partner-tmo}brca1&debugQuery=true&rows=5&fl=id,n_type,n_name
	...
:         <str name="parsedquery_toString">{!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca</str>
	...
: It looks like despite qt=partner-tmo, the edismax based search handler is 
: being bypassed for the default search handler, and is querying against 
: the n_text field, which is the defaultSearchField for the ing-content 
: core.  But, I don't want to use the default handler, but rather my 
: configured edismax handler, and any specified filter queries, to 
: determine the document set in the ing-content core, and then join with 
: the partner-tmo core.  [Yes, the edismax handler in the ing-content core 
: and the second core are both named partner-tmo].

...i *think* what you are getting bitten by here is SOLR-2824 - a bug in 
the JoinQParser relating to how it parses the query that it should be 
executing against the "fromIndex".  At the moment it is *parsed* according 
to the configs of the index you are querying against, and then that query 
is *executed* against the SolrCore identified by the "fromIndex" param ... 
i'm not sure if knowing that will help you work around this bug until it 
gets fixed, but it might help if you can tweak your configs/request to 
make the query "make sense" in your "ing-content" collection.

In general though, i'm not certain that what you are trying to do will be 
solvable with Join, based on some of your earlier comments -- the main 
thing to remember is that {!join} is just a QParser that only matches some 
document Y if Y's "to" field "joins up" against some other document X's 
"from" field and document X matches the query the {!join} wraps.  It 
doesn't give you any of the scores from the joined X documents, or cause 
any of the fields from X to be useable when faceting on Y.  (Just putting 
all that out there so you know in case those are deal breakers that are 
going to force you to re-think your approach)
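
The join semantics described in the paragraph above can be mimicked with plain dictionaries. This is only a conceptual sketch, not how JoinQParser is implemented; the function name `join_to`, the sample documents, and the `matches` predicate are made up for illustration, and single-valued fields are assumed.

```python
def join_to(x_docs, y_docs, from_field, to_field, matches):
    """Return the Y docs whose to_field equals the from_field of some
    X doc that satisfies `matches` (the query the join wraps)."""
    # Only the join keys of matching X docs are carried across.
    keys = {
        x[from_field]
        for x in x_docs
        if from_field in x and matches(x)
    }
    # The result is purely a set of Y docs: no X scores, no X fields.
    return [y for y in y_docs if y.get(to_field) in keys]


# Hypothetical product mapping (X) and node (Y) documents.
x_docs = [
    {"id": "m1", "conceptId": "n1", "text": "brca1"},
    {"id": "m2", "conceptId": "n2", "text": "tp53"},
]
y_docs = [{"id": "n1"}, {"id": "n2"}, {"id": "n3"}]

nodes = join_to(x_docs, y_docs, "conceptId", "id",
                lambda x: x["text"] == "brca1")
```

Because only the keys cross over, the returned node documents carry nothing from the mapping documents, which is why fields from X cannot be faceted on the Y result set.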


-Hoss

Re: Possible to facet across two indices, or document types in single index?

Posted by Jeff Schmidt <ja...@535consulting.com>.
Well, the JoinQParserPlugin is definitely there.  Turning on debug reveals why I get zero results.  Given the URL:

http://localhost:8091/solr/ing-content/select/?qt=partner-tmo&fq=type:node&q={!join+from=conceptId+to=id+fromIndex=partner-tmo}brca1&debugQuery=true&rows=5&fl=id,n_type,n_name

I get:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="debugQuery">true</str>
            <str name="fl">id,n_type,n_name</str>
            <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
            <str name="qt">partner-tmo</str>
            <str name="fq">type:node</str>
            <str name="rows">5</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
    <lst name="debug">
        <str name="rawquerystring">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
        <str name="querystring">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
        <str name="parsedquery">JoinQuery({!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca)</str>
        <str name="parsedquery_toString">{!join from=conceptId to=id fromIndex=partner-tmo}n_text:brca</str>
        <lst name="explain"/>
        <str name="QParser"/>
        <arr name="filter_queries">
            <str>type:node</str>
        </arr>
        <arr name="parsed_filter_queries">
            <str>type:node</str>
        </arr>
        ...
    </lst>
</response>

It looks like despite qt=partner-tmo, the edismax based search handler is being bypassed for the default search handler, and is querying against the n_text field, which is the defaultSearchField for the ing-content core.  But, I don't want to use the default handler, but rather my configured edismax handler, and any specified filter queries, to determine the document set in the ing-content core, and then join with the partner-tmo core.  [Yes, the edismax handler in the ing-content core and the second core are both named partner-tmo].

Can the JoinQParserPlugin work in conjunction with edismax?

Thanks,

Jeff

On Dec 4, 2011, at 4:12 PM, Jeff Schmidt wrote:

> Hello again:
> 
> I'm looking at the newer join functionality (http://wiki.apache.org/solr/Join) to see if that will help me out.  While there are signs it can go cross index/core (https://issues.apache.org/jira/browse/SOLR-2272), I doubt I can specify facet.field params for fields in a couple of different indexes.  But perhaps with a single combined index it might work.
> 
> Anyway, the above Jira item indicates status: resolved, resolution: fixed, and Fix version/s: 4.0.  I've been working with 3.5.0, so I checked out 4.0 from svn today:
> 
> [imac:svn/dev/trunk] jas% svn info
> Path: .
> URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
> Repository Root: http://svn.apache.org/repos/asf
> Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
> Revision: 1210126
> ...
> Last Changed Rev: 1210116
> Last Changed Date: 2011-12-04 07:35:46 -0700 (Sun, 04 Dec 2011)
> 
> When I issue a join query, it looks like the local params syntax is being ignored and treated as part of the search terms?  I get zero results, whereas without the join I get 979.
> 
> <response>
>    <lst name="responseHeader">
>        <int name="status">0</int>
>        <int name="QTime">1</int>
>        <lst name="params">
>            <str name="fl">id,n_type,n_name</str>
>            <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
>            <str name="qt">partner-tmo</str>
>            <str name="fq">type:node</str>
>            <str name="rows">5</str>
>        </lst>
>    </lst>
>    <result name="response" numFound="0" start="0"/>
> </response>
> 
> I've not fully explored this yet, and I'm not all that familiar with the Solr codebase, but is this functionality in 4.x trunk or not? I can see there is the package org.apache.lucene.search.join. Is this the implementation of SOLR-2272?
> 
> I can see the commit was made earlier this year, and then it was reverted and things went off the rails. I don't want to open any old wounds, but does the join exist?  If not, I'll know not to pursue it any further. If so, is there some solrconfig.xml configuration needed to enable it?  I don't see it in the examples.
> 
> Thanks,
> 
> Jeff
> 
> On Dec 1, 2011, at 9:47 PM, Jeff Schmidt wrote:
> 
>> Hello:
>> 
>> I'm trying to relate together two different types of documents.  Currently I have 'node' documents that reside in one index (core), and 'product mapping' documents that are in another index.  The product mapping index is used to map tenant products to nodes. The nodes are canonical content that gets updated every quarter, where as the product mappings can change at any time.
>> 
>> I put them in two indexes because (1) canonical content changes rarely, and I don't want product mapping changes to affect it (commit, re-open searchers etc.), and I would like to support multiple tenants mapping products to the same canonical content to avoid duplication (a few GB).
>> 
>> This arrange has worked well thus far, but only in the sense that for each node result returned, I can query the product mapping index to determine the products mapped to the node.  I combine this information within my application and return it to the client.  This works okay in that there are only 5-20 results returned per page (start, rows).  But now I'm being asked to facet the product catagories (multi-valued field within a product mapping document) along with other facets defined in the canonical content.
>> 
>> Can this be done with Solr 3.5.0?  I've been looking into sub-queries, function queries etc.  Also, I've seen various postings indicating that one needs to denormalize more.  I don't want to add product information as fields to the canonical content. Not only does that defeat my objective (1) above, but Solr does not support incremental updates of document fields.
>> 
>> So, one approach is to issue by query to the canonical index and get all of the document IDs (could be 1000s), and then issue a filter query to the product mapping index with all of these IDs and have Solr facet the product categories.  Is that efficient?  I suppose I could use HTTP POST (via SolrJ) to convey that payload of IDs?  I could then take the facet results of that query and combine them with the canonical index results and return them to the client.
>> 
>> That may be do-able, but then let's say the user clicks on a product category facet value to narrow the node results to only those mapped to category XYZ. This will not affect the query issued against the canonical content index.  Instead, I think I'd have to go through the canonical results and eliminate the nodes that are not associated with product category XYZ.  Then, if the current page of results is inadequate (rows=10, but 3 nodes were eliminated), I'd have to go back to the canonical index to get more rows, eliminate some some again perhaps, get more etc.  That sounds unappealing and low performing.
>> 
>> Is there a Solr way to do this?  My Packt "Apache Solr 3 Enterprise Search Server" book (page 34) states regarding separate indices:
>> 
>> 	"If you do develop separate schemas and if you need to search across your indices in one search then you must perform a distributed search, described in the last chapter. A distributed search is usually a feature employed for a large corpus but it applies here too."
>> 
>> But in the chapter it goes on to talk about dealing with sharding, replication etc. to support a large corpus, not necessarily tying together two different indexes.
>> 
>> Is it possible to accomplish my goal in a less ugly way than I outlined above?  Since we only have a single tenant to worry about, I could use a combined index at least for a few months (separate fields per document type, IDs are unique among then all) if that makes a difference.
>> 
>> Thanks!
>> 
>> Jeff
>> --
>> Jeff Schmidt
>> 535 Consulting
>> jas@535consulting.com
>> http://www.535consulting.com
>> (650) 423-1068



--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068

Re: Possible to facet across two indices, or document types in single index?

Posted by Jeff Schmidt <ja...@535consulting.com>.
Hello again:

I'm looking at the newer join functionality (http://wiki.apache.org/solr/Join) to see if that will help me out.  While there are signs it can go cross index/core (https://issues.apache.org/jira/browse/SOLR-2272), I doubt I can specify facet.field params for fields in a couple of different indexes.  But perhaps with a single combined index it might work.
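
To make the combined-index idea concrete, here is a rough sketch of the request I have in mind for the "narrow nodes to category XYZ" case, with the join used as a filter query. It only builds the URL; it does not call Solr. The join syntax follows the wiki page above and the from/to fields are the ones I use further down, but the field name p_category and the base URL are placeholders I made up for illustration, and I have not verified that this behaves as hoped.

```python
from urllib.parse import urlencode, parse_qs

def build_join_filter_params(user_query, category):
    """Build request params that keep only nodes mapped to one category.

    Assumes a single combined index where product mapping documents
    carry conceptId (pointing at a node id) and a category field.
    'p_category' is a hypothetical field name.
    """
    # Quote the category value; no escaping of embedded quotes is attempted.
    join_fq = '{!join from=conceptId to=id}p_category:"%s"' % category
    return [
        ("q", user_query),
        ("fq", "type:node"),   # restrict results to node documents
        ("fq", join_fq),       # nodes referenced by matching mappings
        ("fl", "id,n_type,n_name"),
        ("rows", "5"),
    ]

def build_url(base_url, params):
    """URL-encode the params so the braces and spaces survive transport."""
    return base_url + "?" + urlencode(params)

# Placeholder host and handler path.
url = build_url("http://localhost:8983/solr/select",
                build_join_filter_params("brca1", "XYZ"))
print(url)
```

This covers filtering only. Whether product category facet counts can be computed over the joined node results is exactly the part I am unsure about.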

Anyway, the above Jira item indicates status: resolved, resolution: fixed, and Fix version/s: 4.0.  I've been working with 3.5.0, so I checked out 4.0 from svn today:

[imac:svn/dev/trunk] jas% svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/dev/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 1210126
...
Last Changed Rev: 1210116
Last Changed Date: 2011-12-04 07:35:46 -0700 (Sun, 04 Dec 2011)

When I issue a join query, it looks like the local params syntax is being ignored and treated as part of the search terms.  I get zero results, whereas without the join I get 979.

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="fl">id,n_type,n_name</str>
            <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
            <str name="qt">partner-tmo</str>
            <str name="fq">type:node</str>
            <str name="rows">5</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
</response>
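
To compare runs with and without the join prefix, a small script can pull the echoed params and numFound out of a response like the one above. This is only a sketch: the XML is pasted in from the response above, and the function name summarize is my own.

```python
import xml.etree.ElementTree as ET

# Response copied from the request shown above.
RESPONSE_XML = """<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="fl">id,n_type,n_name</str>
            <str name="q">{!join from=conceptId to=id fromIndex=partner-tmo}brca1</str>
            <str name="qt">partner-tmo</str>
            <str name="fq">type:node</str>
            <str name="rows">5</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
</response>"""

def summarize(xml_text):
    """Return (echoed params dict, numFound) from a Solr XML response.

    Only single-valued <str> params are collected, which is all this
    response contains.
    """
    root = ET.fromstring(xml_text)
    params = {}
    for el in root.findall("./lst[@name='responseHeader']/lst[@name='params']/str"):
        params[el.get("name")] = el.text
    result = root.find("./result[@name='response']")
    return params, int(result.get("numFound"))

params, num_found = summarize(RESPONSE_XML)
print(params["q"], num_found)
```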

I've not fully explored this yet, and I'm not all that familiar with the Solr codebase, but is this functionality in 4.x trunk or not? I can see there is the package org.apache.lucene.search.join. Is this the implementation of SOLR-2272?

I can see the commit was made earlier this year, and then it was reverted and things went off the rails. I don't want to open any old wounds, but does the join exist?  If not, I'll know not to pursue it any further. If so, is there some solrconfig.xml configuration needed to enable it?  I don't see it in the examples.

Thanks,

Jeff

On Dec 1, 2011, at 9:47 PM, Jeff Schmidt wrote:

> Hello:
> 
> I'm trying to relate two different types of documents.  Currently I have 'node' documents that reside in one index (core), and 'product mapping' documents that are in another index.  The product mapping index is used to map tenant products to nodes. The nodes are canonical content that gets updated every quarter, whereas the product mappings can change at any time.
> 
> I put them in two indexes because (1) canonical content changes rarely, and I don't want product mapping changes to affect it (commit, re-open searchers etc.), and (2) I would like to support multiple tenants mapping products to the same canonical content to avoid duplication (a few GB).
> 
> This arrangement has worked well thus far, but only in the sense that for each node result returned, I can query the product mapping index to determine the products mapped to the node.  I combine this information within my application and return it to the client.  This works okay in that there are only 5-20 results returned per page (start, rows).  But now I'm being asked to facet the product categories (multi-valued field within a product mapping document) along with other facets defined in the canonical content.
> 
> Can this be done with Solr 3.5.0?  I've been looking into sub-queries, function queries etc.  Also, I've seen various postings indicating that one needs to denormalize more.  I don't want to add product information as fields to the canonical content. Not only does that defeat my objective (1) above, but Solr does not support incremental updates of document fields.
> 
> So, one approach is to issue my query to the canonical index and get all of the document IDs (could be 1000s), and then issue a filter query to the product mapping index with all of these IDs and have Solr facet the product categories.  Is that efficient?  I suppose I could use HTTP POST (via SolrJ) to convey that payload of IDs?  I could then take the facet results of that query and combine them with the canonical index results and return them to the client.
> 
> That may be doable, but then let's say the user clicks on a product category facet value to narrow the node results to only those mapped to category XYZ. This will not affect the query issued against the canonical content index.  Instead, I think I'd have to go through the canonical results and eliminate the nodes that are not associated with product category XYZ.  Then, if the current page of results is inadequate (rows=10, but 3 nodes were eliminated), I'd have to go back to the canonical index to get more rows, perhaps eliminate some again, get more, etc.  That sounds unappealing and slow.
> 
> Is there a Solr way to do this?  My Packt "Apache Solr 3 Enterprise Search Server" book (page 34) states regarding separate indices:
> 
> 	"If you do develop separate schemas and if you need to search across your indices in one search then you must perform a distributed search, described in the last chapter. A distributed search is usually a feature employed for a large corpus but it applies here too."
> 
> But in the chapter it goes on to talk about dealing with sharding, replication etc. to support a large corpus, not necessarily tying together two different indexes.
> 
> Is it possible to accomplish my goal in a less ugly way than I outlined above?  Since we only have a single tenant to worry about, I could use a combined index at least for a few months (separate fields per document type, IDs are unique among them all) if that makes a difference.
> 
> Thanks!
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> jas@535consulting.com
> http://www.535consulting.com
> (650) 423-1068



--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068