You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Jonathan Carothers <jo...@amentra.com> on 2011/10/20 16:12:14 UTC

how to handle large relational data in Solr

All,

We are attempting to convert a fairly large relational database into Solr index(es).

There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.

If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?

Is there a better way to structure the indexes?

We will have a relational database available if it makes sense to do some sort of a hybrid approach.

many thanks,
Jonathan

Re: how to handle large relational data in Solr

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Jonathan,

Not sure which version of Solr you are using, but look into Join functionality - hit #1: http://search-lucene.com/?q=join&fc_project=Solr

Otis
----

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


>________________________________
>From: Jonathan Carothers <jo...@amentra.com>
>To: "solr-user@lucene.apache.org" <so...@lucene.apache.org>
>Sent: Thursday, October 20, 2011 1:23 PM
>Subject: RE: how to handle large relational data in Solr
>
>Actually, that's the root of my concern.  It looks like it product will average ~20,000 associated accessories, still workable, but starting to look painful.  Coming back the other way, I would guess each accessory would be associated with 100 products on average.
>
>Given that there would be searchable fields in both the product and accessory data, I assume I would have to either split them  into separate indexes and merge the results, or have one document per product/accessory combo so that I don't get a mix of accessories matching the search term.  For example, if a product had two accessories, one with the description of "Blue Swing" and another with "Red Ball" and I did a search for "Red Swing" it would rank about the same as a document that actually had a "Red Swing".
>
>So it sounds like you are suggesting the external map, in which case is there a good way to merge the two searches?  Basically on search on product attributes and a second search on the attributes of related accessories?
>
>many thanks,
>Jonathan
>________________________________________
>From: Robert Stewart [bstewart.ny@gmail.com]
>Sent: Thursday, October 20, 2011 12:05 PM
>To: solr-user@lucene.apache.org
>Subject: Re: how to handle large relational data in Solr
>
>If your "documents" are products, then 100,000 documents is a pretty small index for solr.  Do you know approximately how many accessories are related to each product on average?  If # if relatively small (around 100 or less), then it should be ok to create product documents with all the related accessories as fields on the document, something like:
>
><doc>
>        <field name="id">PRODUCT_ID</field>
>        <field name="name">PRODUCT_NAME</field>
>        <field name="accessory">accessory one</field>
>        <field name="accessory">accessory two</field>
>        ....
>        <field name="accessory">accessory N</field>
></doc>
>
>
>And then you can search for products by accessory, and show accessory facets over products, etc.
>
>Even if # of accessories per product is large (1000 or more), you can still do it this way, but it may be better to store some small accessory ID as integers instead of larger names, and maybe use some external mapping to resolve names for search and display.
>
>Bob
>
>
>On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote:
>
>> Agreed, this will just be a read only view of the existing database for search purposes.  Sorry for the confusion.
>> ________________________________________
>> From: Brandon Ramirez [Brandon_Ramirez@elementk.com]
>> Sent: Thursday, October 20, 2011 10:50 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: how to handle large relational data in Solr
>>
>> I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.
>>
>>
>> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
>> Software Engineer II | Element K | www.elementk.com
>>
>>
>> -----Original Message-----
>> From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com]
>> Sent: Thursday, October 20, 2011 10:12 AM
>> To: solr-user@lucene.apache.org
>> Subject: how to handle large relational data in Solr
>>
>> All,
>>
>> We are attempting to convert a fairly large relational database into Solr index(es).
>>
>> There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.
>>
>> If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?
>>
>> Is there a better way to structure the indexes?
>>
>> We will have a relational database available if it makes sense to do some sort of a hybrid approach.
>>
>> many thanks,
>> Jonathan
>>
>
>
>
>

Re: how to handle large relational data in Solr

Posted by Erick Erickson <er...@gmail.com>.
In addition to Otis' suggestion, think about using multivalued fields
with an increment gap of,
say, 100 (assuming your accessories had less than 100 fields). Then
you can do proximity
searches with a size < 100 (e.g. "red swing"~90) would not match
across your multiple
entries....

If this is clear as mud, write back with what you've tried and maybe we can help

Best
Erick

On Thu, Oct 20, 2011 at 7:23 PM, Jonathan Carothers
<jo...@amentra.com> wrote:
> Actually, that's the root of my concern.  It looks like it product will average ~20,000 associated accessories, still workable, but starting to look painful.  Coming back the other way, I would guess each accessory would be associated with 100 products on average.
>
> Given that there would be searchable fields in both the product and accessory data, I assume I would have to either split them  into separate indexes and merge the results, or have one document per product/accessory combo so that I don't get a mix of accessories matching the search term.  For example, if a product had two accessories, one with the description of "Blue Swing" and another with "Red Ball" and I did a search for "Red Swing" it would rank about the same as a document that actually had a "Red Swing".
>
> So it sounds like you are suggesting the external map, in which case is there a good way to merge the two searches?  Basically on search on product attributes and a second search on the attributes of related accessories?
>
> many thanks,
> Jonathan
> ________________________________________
> From: Robert Stewart [bstewart.ny@gmail.com]
> Sent: Thursday, October 20, 2011 12:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to handle large relational data in Solr
>
> If your "documents" are products, then 100,000 documents is a pretty small index for solr.  Do you know approximately how many accessories are related to each product on average?  If # if relatively small (around 100 or less), then it should be ok to create product documents with all the related accessories as fields on the document, something like:
>
> <doc>
>        <field name="id">PRODUCT_ID</field>
>        <field name="name">PRODUCT_NAME</field>
>        <field name="accessory">accessory one</field>
>        <field name="accessory">accessory two</field>
>        ....
>        <field name="accessory">accessory N</field>
> </doc>
>
>
> And then you can search for products by accessory, and show accessory facets over products, etc.
>
> Even if # of accessories per product is large (1000 or more), you can still do it this way, but it may be better to store some small accessory ID as integers instead of larger names, and maybe use some external mapping to resolve names for search and display.
>
> Bob
>
>
> On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote:
>
>> Agreed, this will just be a read only view of the existing database for search purposes.  Sorry for the confusion.
>> ________________________________________
>> From: Brandon Ramirez [Brandon_Ramirez@elementk.com]
>> Sent: Thursday, October 20, 2011 10:50 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: how to handle large relational data in Solr
>>
>> I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.
>>
>>
>> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
>> Software Engineer II | Element K | www.elementk.com
>>
>>
>> -----Original Message-----
>> From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com]
>> Sent: Thursday, October 20, 2011 10:12 AM
>> To: solr-user@lucene.apache.org
>> Subject: how to handle large relational data in Solr
>>
>> All,
>>
>> We are attempting to convert a fairly large relational database into Solr index(es).
>>
>> There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.
>>
>> If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?
>>
>> Is there a better way to structure the indexes?
>>
>> We will have a relational database available if it makes sense to do some sort of a hybrid approach.
>>
>> many thanks,
>> Jonathan
>>
>
>

RE: how to handle large relational data in Solr

Posted by Jonathan Carothers <jo...@amentra.com>.
Actually, that's the root of my concern.  It looks like it product will average ~20,000 associated accessories, still workable, but starting to look painful.  Coming back the other way, I would guess each accessory would be associated with 100 products on average.

Given that there would be searchable fields in both the product and accessory data, I assume I would have to either split them  into separate indexes and merge the results, or have one document per product/accessory combo so that I don't get a mix of accessories matching the search term.  For example, if a product had two accessories, one with the description of "Blue Swing" and another with "Red Ball" and I did a search for "Red Swing" it would rank about the same as a document that actually had a "Red Swing".

So it sounds like you are suggesting the external map, in which case is there a good way to merge the two searches?  Basically on search on product attributes and a second search on the attributes of related accessories?

many thanks,
Jonathan
________________________________________
From: Robert Stewart [bstewart.ny@gmail.com]
Sent: Thursday, October 20, 2011 12:05 PM
To: solr-user@lucene.apache.org
Subject: Re: how to handle large relational data in Solr

If your "documents" are products, then 100,000 documents is a pretty small index for solr.  Do you know approximately how many accessories are related to each product on average?  If # if relatively small (around 100 or less), then it should be ok to create product documents with all the related accessories as fields on the document, something like:

<doc>
        <field name="id">PRODUCT_ID</field>
        <field name="name">PRODUCT_NAME</field>
        <field name="accessory">accessory one</field>
        <field name="accessory">accessory two</field>
        ....
        <field name="accessory">accessory N</field>
</doc>


And then you can search for products by accessory, and show accessory facets over products, etc.

Even if # of accessories per product is large (1000 or more), you can still do it this way, but it may be better to store some small accessory ID as integers instead of larger names, and maybe use some external mapping to resolve names for search and display.

Bob


On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote:

> Agreed, this will just be a read only view of the existing database for search purposes.  Sorry for the confusion.
> ________________________________________
> From: Brandon Ramirez [Brandon_Ramirez@elementk.com]
> Sent: Thursday, October 20, 2011 10:50 AM
> To: solr-user@lucene.apache.org
> Subject: RE: how to handle large relational data in Solr
>
> I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.
>
>
> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
> Software Engineer II | Element K | www.elementk.com
>
>
> -----Original Message-----
> From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com]
> Sent: Thursday, October 20, 2011 10:12 AM
> To: solr-user@lucene.apache.org
> Subject: how to handle large relational data in Solr
>
> All,
>
> We are attempting to convert a fairly large relational database into Solr index(es).
>
> There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.
>
> If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?
>
> Is there a better way to structure the indexes?
>
> We will have a relational database available if it makes sense to do some sort of a hybrid approach.
>
> many thanks,
> Jonathan
>


Re: how to handle large relational data in Solr

Posted by Robert Stewart <bs...@gmail.com>.
If your "documents" are products, then 100,000 documents is a pretty small index for solr.  Do you know approximately how many accessories are related to each product on average?  If # if relatively small (around 100 or less), then it should be ok to create product documents with all the related accessories as fields on the document, something like:

<doc>
	<field name="id">PRODUCT_ID</field>
	<field name="name">PRODUCT_NAME</field>
	<field name="accessory">accessory one</field>
	<field name="accessory">accessory two</field>
	....
	<field name="accessory">accessory N</field>
</doc>
	

And then you can search for products by accessory, and show accessory facets over products, etc. 

Even if # of accessories per product is large (1000 or more), you can still do it this way, but it may be better to store some small accessory ID as integers instead of larger names, and maybe use some external mapping to resolve names for search and display.  

Bob


On Oct 20, 2011, at 11:08 AM, Jonathan Carothers wrote:

> Agreed, this will just be a read only view of the existing database for search purposes.  Sorry for the confusion.
> ________________________________________
> From: Brandon Ramirez [Brandon_Ramirez@elementk.com]
> Sent: Thursday, October 20, 2011 10:50 AM
> To: solr-user@lucene.apache.org
> Subject: RE: how to handle large relational data in Solr
> 
> I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.
> 
> 
> Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
> Software Engineer II | Element K | www.elementk.com
> 
> 
> -----Original Message-----
> From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com]
> Sent: Thursday, October 20, 2011 10:12 AM
> To: solr-user@lucene.apache.org
> Subject: how to handle large relational data in Solr
> 
> All,
> 
> We are attempting to convert a fairly large relational database into Solr index(es).
> 
> There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.
> 
> If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?
> 
> Is there a better way to structure the indexes?
> 
> We will have a relational database available if it makes sense to do some sort of a hybrid approach.
> 
> many thanks,
> Jonathan
> 


RE: how to handle large relational data in Solr

Posted by Jonathan Carothers <jo...@amentra.com>.
Agreed, this will just be a read only view of the existing database for search purposes.  Sorry for the confusion.
________________________________________
From: Brandon Ramirez [Brandon_Ramirez@elementk.com]
Sent: Thursday, October 20, 2011 10:50 AM
To: solr-user@lucene.apache.org
Subject: RE: how to handle large relational data in Solr

I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.


Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848
Software Engineer II | Element K | www.elementk.com


-----Original Message-----
From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com]
Sent: Thursday, October 20, 2011 10:12 AM
To: solr-user@lucene.apache.org
Subject: how to handle large relational data in Solr

All,

We are attempting to convert a fairly large relational database into Solr index(es).

There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.

If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?

Is there a better way to structure the indexes?

We will have a relational database available if it makes sense to do some sort of a hybrid approach.

many thanks,
Jonathan


RE: how to handle large relational data in Solr

Posted by Brandon Ramirez <Br...@elementk.com>.
I would not recommend removing your relational database altogether.  You should treat that as your system of record.  By replacing it, you are forcing Solr to store the unmodified value for everything even when not needed.  You also lose normalization.   And if you ever need to add some data to your system that isn't search-related, you have no choice but to add it to your search index.


Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 
Software Engineer II | Element K | www.elementk.com


-----Original Message-----
From: Jonathan Carothers [mailto:jonathan.carothers@amentra.com] 
Sent: Thursday, October 20, 2011 10:12 AM
To: solr-user@lucene.apache.org
Subject: how to handle large relational data in Solr

All,

We are attempting to convert a fairly large relational database into Solr index(es).

There are ~100,000 products with ~1,000,000 accessories that can be related to any number of the products.  So if I include the search terms and the relationships in the same index, we're looking at a pretty huge index.

If we break it out into three indexes, one for the product search, one for the accessories search, and one for their relationship, is there a good way to merge the results?

Is there a better way to structure the indexes?

We will have a relational database available if it makes sense to do some sort of a hybrid approach.

many thanks,
Jonathan