Posted to solr-user@lucene.apache.org by solr-user <so...@hotmail.com> on 2009/12/02 00:27:10 UTC

question about schemas

I just started using Solr, and I am trying to figure out how to set up my
schema. I know that Solr doesn’t have JOINs, and so I am having some
difficulty figuring out how I would set up a schema for the following
fictional situation.  For example, let us say that:

-	I have 10000+ customers, each having some specific info (StoreId, Name,
Phone, Address, City, State, Zip, etc)
-	Each customer has a subset of the 100+ products I am looking to track,
each product having some specific info (ProductId, Name, Width, Height,
Depth, Weight, Density, etc)
-	I want to be able to search by the product info but have facets return the
number of customers, rather than the number of products, that meet my
criteria
-	I want to display (and sort) customers based on my product search

In a relational database, I would simply create two tables (customer and
product) and JOIN them.  I could then craft a SQL query to count the number
of distinct StoreId values in the result (something like facets).
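
A minimal sketch of that relational approach, using Python's built-in sqlite3 module. The table layout (one product row per customer/product pair), the State grouping, and the sample rows are assumptions made up for illustration; only the names StoreId, ProductId, Width and Density come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (StoreId INTEGER PRIMARY KEY, Name TEXT, State TEXT);
    -- Assumption: one product row per (customer, product) pair.
    CREATE TABLE product (StoreId INTEGER, ProductId INTEGER, Width REAL, Density REAL);
    INSERT INTO customer VALUES (1, 'Alpha', 'WA'), (2, 'Beta', 'WA'), (3, 'Gamma', 'OR');
    INSERT INTO product VALUES
        (1, 10, 50, 7), (1, 11, 50, 7),   -- customer 1 matches twice
        (2, 10, 50, 7),                   -- customer 2 matches once
        (3, 12, 30, 2);                   -- customer 3 does not match
""")

# Search by product info, but count customers (distinct StoreId) per "facet" value.
rows = conn.execute("""
    SELECT c.State,
           COUNT(DISTINCT c.StoreId) AS customers,
           COUNT(*)                  AS products
    FROM customer c
    JOIN product p ON p.StoreId = c.StoreId
    WHERE p.Width = 50 AND p.Density = 7
    GROUP BY c.State
    ORDER BY c.State
""").fetchall()

print(rows)  # [('WA', 2, 3)]: two customers, three matching product rows
```

The gap between the two counts (2 customers versus 3 product rows) is the difference between customer facets and product facets that the rest of this question is about.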

In Solr, however, there are no joins.  As far as I can tell, my options are
to:

-	create two Solr instances, one with customer info and one with product
info; I would search the product Solr instance and identify the StoreId
values returned, and then use that info to search the customer Solr instance
to get the customer info.  The problem with this is the second query could
have ten thousand ORs (one for each StoreId returned by the first query)
-	create a single Solr instance that contains a denormalized version of the
data where each doc would contain both the customer info and the product
info for a given product.  The problem with this is that my facets would
return the number of products, not the number of customers
-	create a single Solr instance that contains a denormalized version of the
data where each doc contains the customer info and info for ALL products
that the customer might have (likely done via dynamic fields). The problem
with this is that my schema would be a bit messy and that my queries could
have hundreds of ANDs and ORs (one AND for each product field, and one OR
for each product); for example, q=((Width1:50 AND Density1:7) OR (Width2:50
AND Density2:7) OR …)
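
To make the size of that last query concrete, here is a rough sketch in Python that generates it. The helper name build_slot_query and the idea of numbered "slots" (Width1, Width2, ...) are assumptions for illustration; they simply follow the field naming in the example query.

```python
def build_slot_query(criteria, num_slots):
    """Build q for a schema with numbered dynamic fields (Width1, Density1, ...).

    criteria maps a base field name to the value it must equal;
    num_slots is the number of product slots per customer document.
    """
    groups = []
    for slot in range(1, num_slots + 1):
        # One AND group per product slot, one clause per product field.
        clauses = [f"{field}{slot}:{value}" for field, value in criteria.items()]
        groups.append("(" + " AND ".join(clauses) + ")")
    # The groups are ORed together: any slot may satisfy the search.
    return "(" + " OR ".join(groups) + ")"


q = build_slot_query({"Width": 50, "Density": 7}, num_slots=3)
print(q)
# ((Width1:50 AND Density1:7) OR (Width2:50 AND Density2:7) OR (Width3:50 AND Density3:7))
```

The number of clauses is num_slots times the number of criteria, so 100 product slots and two criteria already give 200 clauses.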

Does anyone have any advice on this?  Are there other schemas that might
work?  Hopefully the example makes sense.

-- 
View this message in context: http://old.nabble.com/question-about-schemas-tp26600956p26600956.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: question about schemas

Posted by Lance Norskog <go...@gmail.com>.
I don't know. The common way to do this in Solr is the full
denormalization technique, but that blows up in this case. This is not
an easy problem space to implement in Solr. Data warehousing & star
schema techniques may be more appropriate.



-- 
Lance Norskog
goksron@gmail.com

Re: question about schemas

Posted by solr-user <so...@hotmail.com>.

Lance Norskog-2 wrote:
> 
> You can make a separate facet field which contains a range of "buckets":
> 10, 20, 50, or 100 means that the field has a value 0-10, 11-20, 21-50, or
> 51-100. You could use a separate filter query with values for these
> buckets. Filter queries are very fast in Solr 1.4 and this would limit
> your range query execution to documents which match the buckets.
> 

Lance, I am afraid that I do not see how to use this suggestion.

Which of the three (four?) suggested schemas would I be using?  How would
these range facets prevent the potential issues I found, such as getting
product facets instead of customer facets, or having very large numbers of
ANDs and ORs, and so forth?


Re: question about schemas

Posted by solr-user <so...@hotmail.com>.

Lance Norskog-2 wrote:
> 
> But, in general, this is a "shopping cart" database and Solr/Lucene may
> not be the best fit for this problem.
> 

True, every tool has strengths and weaknesses. Given how powerful Solr
appears to be, I would be surprised if I was not able to handle this use
case.


Lance Norskog-2 wrote:
> 
> You can make a separate facet field which contains a range of "buckets":
> 10, 20, 50, or 100 means that the field has a value 0-10, 11-20, 21-50, or
> 51-100. You could use a separate filter query with values for these
> buckets. Filter queries are very fast in Solr 1.4 and this would limit
> your range query execution to documents which match the buckets.
> 

Thank you for this suggestion.  I will look into this.



Re: question about schemas

Posted by Lance Norskog <go...@gmail.com>.
You can make a separate facet field which contains a range of
"buckets": 10, 20, 50, or 100 means that the field has a value 0-10,
11-20, 21-50, or 51-100. You could use a separate filter query with
values for these buckets. Filter queries are very fast in Solr 1.4 and
this would limit your range query execution to documents which match
the buckets.
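
A minimal sketch of the bucket idea in Python. The bucket boundaries are the ones listed above; the field name Width_bucket, the helper names, and the assumption that values are integers from 0 to 100 are made up for illustration.

```python
import bisect

# Bucket labels from the text above: 10 -> 0-10, 20 -> 11-20, 50 -> 21-50, 100 -> 51-100.
BUCKET_UPPER_BOUNDS = [10, 20, 50, 100]


def bucket_for(value):
    """Return the bucket label to index alongside a value (index-time step)."""
    i = bisect.bisect_left(BUCKET_UPPER_BOUNDS, value)
    if value < 0 or i == len(BUCKET_UPPER_BOUNDS):
        raise ValueError(f"{value} is outside the 0-100 range")
    return BUCKET_UPPER_BOUNDS[i]


def buckets_overlapping(low, high):
    """Return the labels of every bucket that can hold a value in [low, high]."""
    first, last = bucket_for(low), bucket_for(high)
    return [b for b in BUCKET_UPPER_BOUNDS if first <= b <= last]


def range_params(field, low, high):
    """Build a range query plus a filter query on the (hypothetical) bucket field."""
    labels = " OR ".join(str(b) for b in buckets_overlapping(low, high))
    return {
        "q": f"{field}:[{low} TO {high}]",
        "fq": f"{field}_bucket:({labels})",  # assumed field name: <field>_bucket
    }


params = range_params("Width", 15, 40)
print(params)
# {'q': 'Width:[15 TO 40]', 'fq': 'Width_bucket:(20 OR 50)'}
```

The fq narrows the candidate documents to the two buckets that can contain a match, and the range query then only has to be evaluated against those documents.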

But, in general, this is a "shopping cart" database and Solr/Lucene
may not be the best fit for this problem.

If you want to do numerical analysis on your shopping carts, check out
KNIME: www.knime.org . It's wonderful.




-- 
Lance Norskog
goksron@gmail.com

RE: question about schemas

Posted by gdeconto <ge...@topproducer.com>.
I don't believe there is any way to link values in one multivalue field to
values in other multivalue fields.

Re "where each doc contains the customer info and info for ALL products that
the  customer might have (likely done via dynamicfields)":

One thing you might want to consider is that this solution might lead to
performance issues if you need to do range queries such as q=((Width1:[50 TO
*] AND Density1:[7 TO *]) OR (Width2:[50 TO *] AND Density2:[7 TO *]) OR …)

I had a similar problem a while back, and basically had similar options.  In
my tests, this particular option became slower as I increased the number of
"products" (and so the number of unique values for each "product" field).

If you come up with a solution, let me know.

Another option might be to encode the "product" information (i.e. using
a field delimiter, something like CSV) and then store it in a multivalue
field for each customer.  I don't know how you would search that data though
(maybe by having a unique delimiter for each field?)
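
A rough sketch of the encoding half of that idea, in Python. The field list, the "products" field name, and the helper names are assumptions for illustration, and the sketch only shows packing and unpacking the values on the client; it does not answer how Solr itself would search inside them.

```python
import csv
import io

PRODUCT_FIELDS = ["ProductId", "Width", "Density"]  # assumed subset of the product fields


def encode_product(product):
    """Pack one product into a single CSV string for a multivalue field."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="").writerow([product[f] for f in PRODUCT_FIELDS])
    return buf.getvalue()


def decode_product(encoded):
    """Unpack a CSV string back into a dict (all values come back as strings)."""
    values = next(csv.reader([encoded]))
    return dict(zip(PRODUCT_FIELDS, values))


products = [
    {"ProductId": 10, "Width": 50, "Density": 7},
    {"ProductId": 11, "Width": 30, "Density": 2},
]

# One document per customer; each product stays intact as one multivalue entry.
customer_doc = {"StoreId": 1, "products": [encode_product(p) for p in products]}
print(customer_doc)
# {'StoreId': 1, 'products': ['10,50,7', '11,30,2']}
```

Because each product travels as one value, its Width and Density cannot get separated from each other, which is the property the separate multivalue fields lack.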


RE: question about schemas

Posted by solr-user <so...@hotmail.com>.
cbennett wrote:
> 
> Solr supports multi value fields so you could store one document per
> customer and have multi value fields for the product information.
> 
> Colin.
Quoted from: 
http://old.nabble.com/question-about-schemas-tp26600956p26608618.html

Thanks Colin.  From the online docs, there doesn't seem to be a way to
directly map a value in one multivalue field to the corresponding value in
another multivalue field (i.e. the first value in myMultiValueProductId
wouldn't necessarily match the first value in myMultiValueDensity or in
myMultiValueWeight).  Is there a technique to do this?
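
To illustrate the concern, here is a small Python model of the problem. It is not Solr code: it assumes a query matches a multivalue field if any one of its values matches, with no link between the values of different fields, and the sample document is made up.

```python
def matches_multivalued(doc, width, density):
    """Model of Width:50 AND Density:7 on independent multivalue fields.

    Each field is tested on its own, so the two values may come from
    different products.
    """
    return width in doc["Width"] and density in doc["Density"]


def matches_same_product(doc, width, density):
    """What is actually wanted: both values on the same product."""
    return (width, density) in zip(doc["Width"], doc["Density"])


# Product 10 has Width 50 / Density 2; product 11 has Width 30 / Density 7.
doc = {"StoreId": 1, "ProductId": [10, 11], "Width": [50, 30], "Density": [2, 7]}

false_positive = matches_multivalued(doc, 50, 7)
true_match = matches_same_product(doc, 50, 7)
print(false_positive, true_match)  # True False
```

The customer is returned even though none of its products has both Width 50 and Density 7.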


RE: question about schemas

Posted by cb...@job.com.
Solr supports multi value fields so you could store one document per customer and have multi value fields for the product information.

Colin.






Re: question about schemas (and SOLR-1131?)

Posted by solr-user <so...@hotmail.com>.

wojtekpia wrote:
> 
> Could this be solved with a multi-valued custom field type (including a
> custom comparator)? The OP's situation deals with multi-valuing products
> for each customer. If products contain strictly numeric fields then it
> seems like a custom field implementation (or extension of BinaryField?)
> *should* be easy - only the comparator part needs work. I'm not clear on
> how the existing query parsers would handle this though, so there's
> probably some work there too. 
> SOLR-1131 (https://issues.apache.org/jira/browse/SOLR-1131) seems like a
> more general solution that supports analysis that numeric fields don't
> need.
> 

Thank you for your suggestion.

It was my hope that I had simply not understood how to properly define the
schema in Solr, or that I had not understood how to use the existing Solr
functionality.

I will look further into the suggestions that I have received so far;
however, I have concerns that my Solr project cannot proceed with the
technology present.  Lance may be correct in his assertion that I am using
the incorrect tool for the job.


Re: question about schemas (and SOLR-1131?)

Posted by wojtekpia <wo...@hotmail.com>.
Could this be solved with a multi-valued custom field type (including a
custom comparator)? The OP's situation deals with multi-valuing products for
each customer. If products contain strictly numeric fields then it seems
like a custom field implementation (or extension of BinaryField?) *should*
be easy - only the comparator part needs work. I'm not clear on how the
existing query parsers would handle this though, so there's probably some
work there too. SOLR-1131 seems like a more general solution that supports
analysis that numeric fields don't need.


gdeconto wrote:
> 
> I saw an interesting thread in the solr-dev forum about multiple fields
> per fieldtype (https://issues.apache.org/jira/browse/SOLR-1131)
> 
> from the sounds of it, it might be of interest and/or use in these types
> of problems;  for your example, you might be able to define a fieldtype
> that houses the product data.
> 
> note that I only skimmed the thread. hopefully, I'll get some time to
> look at it more closely
> 



Re: question about schemas (and SOLR-1131?)

Posted by gdeconto <ge...@topproducer.com>.
I saw an interesting thread in the solr-dev forum about multiple fields per
fieldtype (https://issues.apache.org/jira/browse/SOLR-1131)

From the sounds of it, it might be of interest and/or use in these types of
problems; for your example, you might be able to define a fieldtype that
houses the product data.

Note that I only skimmed the thread. Hopefully, I'll get some time to
look at it more closely.