Posted to solr-user@lucene.apache.org by Jeff Schmidt <ja...@535consulting.com> on 2012/07/26 01:10:49 UTC

Solr 4.0 cross-core join limitations or a misunderstanding?

Hello:

I'm trying to figure out whether there is some limitation on cross-core joins, or whether I'm just misunderstanding something.  This has been working fine with a small number of documents in the from index, but I'm no longer getting the expected results now that the example here has 41K from-index documents with which to filter the results of the main index.  On the other hand, I do have a case where things work with 80K docs in the from index that match the criteria...

My scenario is that we have canonical content, to which tenants map their product information. The canonical content is in one core, while each tenant has their own core for defining their mappings and other stuff.  In the tenant index, a product ID is mapped to a canonical node, whose ID is the document ID.  For example, a product mapping is defined as:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="rows">1000</str>
            <str name="fl">id,conceptId,productId,parentProductId</str>
            <str name="q">parentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D</str>
            <str name="fq">conceptId:ING\:3uly</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <str name="id">ING:3uly|285676_A10E5AC6-306C-4B71-BE03-62ACA0C4D34D|ING:3uly</str>
            <str name="conceptId">ING:3uly</str>
            <str name="parentProductId">A10E5AC6-306C-4B71-BE03-62ACA0C4D34D</str>
            <str name="productId">285676_A10E5AC6-306C-4B71-BE03-62ACA0C4D34D</str>
        </doc>
    </result>
</response>

Some schema.xml for this product index:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="conceptId" type="string" indexed="true" stored="true" required="true"/>
<field name="productId" type="string" indexed="true" stored="true" required="true"/>
<field name="parentProductId" type="string" indexed="true" stored="true"/>

In the canonical core, the document ID is defined the same way:

<field name="id" type="string" indexed="true" stored="true" required="true" />

I concatenate the node ID (ING:3uly, which is the document ID in the canonical index) and the product ID (285676_A10E5AC6-306C-4B71-BE03-62ACA0C4D34D) to create a unique document ID in the product index. However, there are hierarchies defined in the canonical (biological) content, so if the canonical node is a member of another node (group or complex), then I create additional documents to accommodate this.  Thus, the ID also includes the ID of the hierarchical node, if any; otherwise the same origin node is used.
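
To make the ID scheme concrete, here is a minimal Python sketch of how such an ID could be assembled. The function name and the field order (node ID, product ID, hierarchical node ID) are assumptions for illustration; only the "|" separator and the sample values come from the response shown earlier.

```python
def make_product_doc_id(node_id, product_id, hierarchy_node_id=None):
    """Build a product-index document ID (hypothetical helper).

    Assumed layout: <node ID>|<product ID>|<hierarchical node ID>.
    When there is no hierarchical node, the origin node ID is repeated.
    """
    # Fall back to the origin node when no group/complex node applies.
    last_part = hierarchy_node_id if hierarchy_node_id is not None else node_id
    return "|".join([node_id, product_id, last_part])


doc_id = make_product_doc_id(
    "ING:3uly", "285676_A10E5AC6-306C-4B71-BE03-62ACA0C4D34D"
)
print(doc_id)
```

With no hierarchical node given, this reproduces the id value of the sample product mapping document.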

Products can have a parent product ID to group the products as being related to one another. Only one level is supported, and the parent product ID is optional.

Okay, back to the join query issue. :)  The goal is to search the canonical index and return only documents that have one or more products mapped to them with a designated parent product ID. Given the response above, you can see the field conceptId refers to the ID of the document in the canonical index, and that parentProductId is defined.
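
As I understand the join, it should behave roughly like the in-memory Python sketch below: collect the conceptId values of the product documents that match the parent product ID, then keep only the canonical documents whose id is in that set. This is only a model of the expected semantics, not Solr's implementation; the function name and the second canonical document (ING:other) are made up for illustration.

```python
def cross_core_join(from_docs, to_docs, from_field, to_field, matches):
    """Model of {!join from=... to=... fromIndex=...}<query> on plain dicts."""
    # Step 1: values of the "from" field on from-index docs matching the query.
    join_keys = {
        doc[from_field]
        for doc in from_docs
        if from_field in doc and matches(doc)
    }
    # Step 2: keep main-index docs whose "to" field is one of those values.
    return [doc for doc in to_docs if doc.get(to_field) in join_keys]


product_docs = [
    {
        "conceptId": "ING:3uly",
        "parentProductId": "A10E5AC6-306C-4B71-BE03-62ACA0C4D34D",
        "productId": "285676_A10E5AC6-306C-4B71-BE03-62ACA0C4D34D",
    },
]
# "ING:other" is a hypothetical canonical document with no mapped product.
canonical_docs = [{"id": "ING:3uly"}, {"id": "ING:other"}]

joined = cross_core_join(
    product_docs,
    canonical_docs,
    from_field="conceptId",
    to_field="id",
    matches=lambda d: d.get("parentProductId")
    == "A10E5AC6-306C-4B71-BE03-62ACA0C4D34D",
)
print(joined)
```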

Now, I can search the canonical index for a specific term, and I get results:

curl 'http://localhost:8983/solr/IngenuityContent.SearchMain/select/?qt=partner-tmo&fl=id&q=znf454&rows=1'
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">0</int>
        <lst name="params">
            <str name="rows">1</str>
            <str name="fl">id</str>
            <str name="q">znf454</str>
            <str name="qt">partner-tmo</str>
        </lst>
    </lst>
    <result name="response" numFound="98" start="0">
        <doc>
            <str name="id">ING:3uly</str>
        </doc>
    </result>
</response>

Note that the first result's document ID is the same one as the conceptId defined for the product mapping earlier.  So, when I do the join query:

curl "http://localhost:8983/solr/IngenuityContent.SearchMain/select/?qt=partner-tmo&fl=id,n_name&q=znf454&fq=%7b%21join+from=conceptId+to=id+fromIndex=PartnerContent.SearchProducts%7dparentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D"
<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="fl">id,n_name</str>
            <str name="q">znf454</str>
            <str name="qt">partner-tmo</str>
            <str name="fq">{!join from=conceptId to=id fromIndex=PartnerContent.SearchProducts}parentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D</str>
        </lst>
    </lst>
    <result name="response" numFound="0" start="0"/>
</response>

Should I not get the canonical content document with ID ING:3uly as a result rather than zero documents?  In other cases, this works as expected.  Note the partner-tmo query type is edismax.
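
To rule out a hand-encoding mistake in the fq parameter, the same request URL can be generated programmatically. Below is a minimal Python sketch using only the standard library; the helper name build_join_url is hypothetical, while the core names, fields, and parameter values are the ones from the curl command above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_join_url(base_url, query, from_index, from_field, to_field,
                   from_query, **extra_params):
    """Return a select URL with a URL-encoded cross-core join filter."""
    # Local-params join syntax, exactly as it appears in the fq echo.
    fq = "{!join from=%s to=%s fromIndex=%s}%s" % (
        from_field, to_field, from_index, from_query)
    params = {"q": query, "fq": fq}
    params.update(extra_params)
    # urlencode percent-encodes the braces, "!" and ":" and turns spaces into "+".
    return base_url + "?" + urlencode(params)


url = build_join_url(
    "http://localhost:8983/solr/IngenuityContent.SearchMain/select/",
    query="znf454",
    from_index="PartnerContent.SearchProducts",
    from_field="conceptId",
    to_field="id",
    from_query="parentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D",
    qt="partner-tmo",
    fl="id,n_name",
)
print(url)

# Decoding the query string again shows the filter Solr would receive.
decoded_fq = parse_qs(urlsplit(url).query)["fq"][0]
print(decoded_fq)
```

The decoded fq should match the filter echoed in the params section of the response, so encoding does not look like the culprit here.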

Anyway, this email is long already so I don't want to go adding misc configuration information. With debugQuery=true, I can see:

        <arr name="filter_queries">
            <str>{!join from=conceptId to=id fromIndex=PartnerContent.SearchProducts}parentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D</str>
        </arr>
        <arr name="parsed_filter_queries">
            <str>JoinQuery({!join from=conceptId to=id fromIndex=PartnerContent.SearchProducts}parentProductId:A10E5AC6-306C-4B71-BE03-62ACA0C4D34D)</str>
        </arr>

That looks normal to me, but maybe it's not...

Thanks!

Jeff
--
Jeff Schmidt
535 Consulting
jas@535consulting.com
http://www.535consulting.com
(650) 423-1068