Posted to solr-user@lucene.apache.org by Kelly Taylor <wi...@hotmail.com> on 2010/01/11 23:27:34 UTC

Encountering a roadblock with my Solr schema design...use dedupe?

I am in the process of building a Solr search solution for my application and
have run into a roadblock with the schema design. Trying to match criteria
in one multi-valued field with corresponding criteria in another
multi-valued field. Any advice would be greatly appreciated.

BACKGROUND:
My RDBMS data model is such that for every one of my "Product" entities,
there are one-to-many "SKU" entities available for purchase. Each SKU entity
can have its own price, as well as one-to-many options, etc. The web
frontend displays available "Product" entities on both directory and detail
pages.

In order to take advantage of Solr's facet count, paging, and sorting
functionality, I decided to base the Solr schema on "Product" documents; so
none of my documents currently contain duplicate "Product" data, and all
"SKU" related data is denormalized as necessary, but into multi-valued
fields. For example, I have a document with an "id" field set to
"Product:7", a "docType" field set to "Product", and multi-valued
"SKU"-related fields such as "sku_color" {Red | Green | Blue},
"sku_size" {Small | Medium | Large}, and "sku_price" {10.00 | 10.00 | 7.99}.

I hit the roadblock when I tried to answer the question, "Which products are
available that contain skus with color Green, size M, and a price of $9.99
or less?"...and have now begun the switch to "SKU" level indexing. This
also gives me what I need for faceted browsing/navigation, and search
refinement...leading the user to "Product" entities having purchasable "SKU"
entities. But this also means I now have documents which are mostly
duplicates for each "Product," so all facet counts, paging, and sorting are
then inaccurate; it appears I need to do this myself, with multiple Solr
requests.
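
The roadblock described above can be reproduced outside Solr with a small Python sketch. It is only an illustration: the three SKU rows are an assumption (the example values above paired up by position), and the two helper functions are hypothetical stand-ins for a query against product-level and SKU-level documents.

```python
# Hypothetical SKU rows for "Product:7"; pairing the example values by
# position is an assumption made for this illustration.
skus = [
    {"color": "Red", "size": "Small", "price": 10.00},
    {"color": "Green", "size": "Medium", "price": 10.00},
    {"color": "Blue", "size": "Large", "price": 7.99},
]

# Product-level document: every SKU attribute is flattened into its own
# multi-valued field, so the link between color, size and price is lost.
product_doc = {
    "id": "Product:7",
    "sku_color": [s["color"] for s in skus],
    "sku_size": [s["size"] for s in skus],
    "sku_price": [s["price"] for s in skus],
}


def product_doc_matches(doc, color, size, max_price):
    """Each clause is checked against its own field, independently."""
    return (
        color in doc["sku_color"]
        and size in doc["sku_size"]
        and any(price <= max_price for price in doc["sku_price"])
    )


def matching_skus(rows, color, size, max_price):
    """All clauses must hold for one and the same SKU."""
    return [
        s for s in rows
        if s["color"] == color and s["size"] == size and s["price"] <= max_price
    ]


# "Green, Medium, $9.99 or less"
product_level_hit = product_doc_matches(product_doc, "Green", "Medium", 9.99)
sku_level_hits = matching_skus(skus, "Green", "Medium", 9.99)

print(product_level_hit)  # the product-level document matches
print(sku_level_hits)     # but no single SKU satisfies all three criteria
```

The product-level document matches because Green, Medium, and a price under 9.99 each occur somewhere in it, even though they belong to different SKUs. SKU-level documents avoid the false positive, at the cost of one document per SKU.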

Is this really the best approach; and if so, should I use the Solr
Deduplication update processor when indexing and querying?

Thanks in advance,
Kelly
--
View this message in context: http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27118977.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Amit Nithian <an...@gmail.com>.

Hi all,

I am the author of the article referenced in this thread and after reading
it again, I can understand where there might have been confusion and my
apologies on that. I have edited the article to indicate that a
deduplication component is in the works and referenced SOLR-236. The article
can still be found at
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics

My only question after reading this thread is what does a user purchase? A
product identified by a SKU? If that's the case then certainly indexing by
SKU is the way to go and then using the field collapse (the query time
deduplication) should work.

Also keep in mind that in my example, I was talking about the *exact* same
product located in different locations which could yield a bad user
experience if they were all shown on the same search result page. In your
case, each SKU is a unique (purchasable) product so collapsing by product id
is nice but would not doing so degrade the user experience? If I searched
for a green shirt and got S,M,L (all product ID 3) is that bad?

Hope that helps some
Amit

On Sat, Jan 16, 2010 at 3:43 PM, David MARTIN <dm...@gmail.com> wrote:

> I'm really interested in reading the answer to this thread as my problem is
> rather the same. Maybe my main difference is the huge SKU number per
> product
> I may have.
>
>
> David

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by David MARTIN <dm...@gmail.com>.

I'm really interested in reading the answer to this thread as my problem is
rather the same. Maybe my main difference is the huge SKU number per product
I may have.


David

On Thu, Jan 14, 2010 at 2:35 AM, Kelly Taylor <wi...@hotmail.com> wrote:

>
> Hoss,
>
> Would you suggest using dedup for my use case; and if so, do you know of a
> working example I can reference?
>
> I don't have an issue using the patched version of Solr, but I'd much
> rather
> use the GA version.
>
> -Kelly

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Kelly Taylor <wi...@hotmail.com>.

Hoss,

Would you suggest using dedup for my use case; and if so, do you know of a
working example I can reference?

I don't have an issue using the patched version of Solr, but I'd much rather
use the GA version.

-Kelly



hossman wrote:
> 
> 
> : Dedupe is completely the wrong word. Deduping is something else
> : entirely - it is about trying not to index the same document twice.
> 
> Dedup can also certainly be used with field collapsing -- that was one of 
> the initial use cases identified for the SignatureUpdateProcessorFactory 
> ... you can compute an 'expensive' signature when adding a document, index 
> it, and then FieldCollapse on that signature field.
> 
> This gives you "query time deduplication" based on a value computed when
> indexing. The canonical example is multiple URLs referencing the "same"
> content but with slightly different boilerplate markup. You can use a
> Signature class that recognizes the boilerplate and computes an identical
> signature value for each URL whose content is "the same" but still index
> all of the URLs and their content as distinct documents ... so use cases
> where people only want "distinct" URLs work using field collapse, but by
> default all matching documents can still be returned, and searches on text
> in the boilerplate markup also still work.
> 
> 
> -Hoss
> 
> 
> 


Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Chris Hostetter <ho...@fucit.org>.

: Dedupe is completely the wrong word. Deduping is something else
: entirely - it is about trying not to index the same document twice.

Dedup can also certainly be used with field collapsing -- that was one of 
the initial use cases identified for the SignatureUpdateProcessorFactory 
... you can compute an 'expensive' signature when adding a document, index 
it, and then FieldCollapse on that signature field.

This gives you "query time deduplication" based on a value computed when
indexing. The canonical example is multiple URLs referencing the "same"
content but with slightly different boilerplate markup. You can use a
Signature class that recognizes the boilerplate and computes an identical
signature value for each URL whose content is "the same" but still index
all of the URLs and their content as distinct documents ... so use cases
where people only want "distinct" URLs work using field collapse, but by
default all matching documents can still be returned, and searches on text
in the boilerplate markup also still work.
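
The idea of computing a signature at index time and collapsing on it at query time can be sketched in plain Python. This is not the SignatureUpdateProcessorFactory or the field collapse implementation: the boilerplate rule (drop <nav> and <footer> blocks), the example URLs, and the function names are all assumptions made for illustration.

```python
import hashlib
import re


def content_signature(html):
    """Hash the content after removing what we assume is boilerplate."""
    # Hypothetical rule: <nav> and <footer> blocks are boilerplate.
    body = re.sub(r"<(nav|footer)>.*?</\1>", "", html, flags=re.S)
    normalized = " ".join(body.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


docs = [
    {"url": "http://a.example/page",
     "content": "<nav>Site A</nav><p>Same article text</p>"},
    {"url": "http://b.example/page",
     "content": "<nav>Site B menu</nav><p>Same article text</p>"},
    {"url": "http://c.example/other",
     "content": "<nav>Site C</nav><p>Different text</p>"},
]

# "Index time": every document is kept, and each gets a signature field.
for doc in docs:
    doc["signature"] = content_signature(doc["content"])


def collapse(results, field):
    """'Query time': keep only the first result for each value of field."""
    seen = set()
    collapsed = []
    for doc in results:
        if doc[field] not in seen:
            seen.add(doc[field])
            collapsed.append(doc)
    return collapsed


collapsed = collapse(docs, "signature")
print([doc["url"] for doc in collapsed])

# Without collapsing, all three documents are still individually searchable,
# including on text that appears only in the boilerplate ("menu").
boilerplate_hits = [doc["url"] for doc in docs if "menu" in doc["content"]]
print(boilerplate_hits)
```

Because nothing is dropped at index time, the same index serves both kinds of request: collapsed results when only distinct content is wanted, and the full result set otherwise.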


-Hoss

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Lance Norskog <go...@gmail.com>.

Field Collapsing is what you want - this is a classic problem with
retail store product indexing and everyone uses field collapsing.
(That is, everyone who is willing to apply the patch on their own
code.)

Dedupe is completely the wrong word. Deduping is something else
entirely - it is about trying not to index the same document twice.

On Tue, Jan 12, 2010 at 11:30 AM, Kelly Taylor <wi...@hotmail.com> wrote:
>
> David,
>
> Thanks, and yes, I decided to travel that path last night (applying SOLR-236
> patch) and plan to have some results by the end of the day; I'll post a
> summary.
>
> I read about field collapsing in your book last night. The book is an
> excellent resource by the way (shameless commendation plug!), and it made me
> laugh to find out that my use case is crazy!
>
> Regarding dedupe, I'm not sure either.  The component is mentioned in an
> article by Amit Nithianandan
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics).
> I had concluded from the section entitled, "Comparing the Solr Approach with
> the RDBMS," that the dedupe component was somehow used as a "field
> collapsing" alternative (in my mind anyway) but I couldn't find a real-world
> example.
>
> Amit says, "...I might create an index with multiple documents or records
> for the same exact wiper blade, each document having different location data
> (lat/long, address, etc.) to represent an individual store. Solr has a
> de-duplication component to help show unique documents in case that
> particular wiper blade is available in multiple stores near me..."
>
> In my case, I was attempting to equate Amit's "wiper blade" with my
> "product" entity, and his "individual store" my "SKU" entity.
>
> Thanks again.
>
> -Kelly



-- 
Lance Norskog
goksron@gmail.com

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Kelly Taylor <wi...@hotmail.com>.

David,

Thanks, and yes, I decided to travel that path last night (applying SOLR-236
patch) and plan to have some results by the end of the day; I'll post a
summary.

I read about field collapsing in your book last night. The book is an
excellent resource by the way (shameless commendation plug!), and it made me
laugh to find out that my use case is crazy!

Regarding dedupe, I'm not sure either.  The component is mentioned in an
article by Amit Nithianandan 
(http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics). 
I had concluded from the section entitled, "Comparing the Solr Approach with
the RDBMS," that the dedupe component was somehow used as a "field
collapsing" alternative (in my mind anyway) but I couldn't find a real-world
example.

Amit says, "...I might create an index with multiple documents or records
for the same exact wiper blade, each document having different location data
(lat/long, address, etc.) to represent an individual store. Solr has a
de-duplication component to help show unique documents in case that
particular wiper blade is available in multiple stores near me..."

In my case, I was attempting to equate Amit's "wiper blade" with my
"product" entity, and his "individual store" with my "SKU" entity.

Thanks again.

-Kelly

David Smiley @MITRE.org wrote:
> 
> Kelly,
> This is a good question you have posed and illustrates a challenge with
> Solr's limited schema.  I don't see how the dedup will help.  I would
> continue with the SKU based approach and use this patch:
> https://issues.apache.org/jira/browse/SOLR-236
> You'll collapse on the product id.  My book, p.192, highlights this
> component as it existed when I wrote it but it has been updated since
> then.
> 
> A recent separate question by you on this list suggests you're going down
> this path.  I would grab the attached SOLR-236.patch file and attempt to
> apply it to the 1.4 source.
> 
> ~ David Smiley
> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
> 


Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by "Smiley, David W." <ds...@mitre.org>.

Kelly,
This is a good question you have posed and illustrates a challenge with Solr's limited schema.  I don't see how the dedup will help.  I would continue with the SKU based approach and use this patch:
https://issues.apache.org/jira/browse/SOLR-236
You'll collapse on the product id.  My book, p.192, highlights this component as it existed when I wrote it but it has been updated since then.
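
As a rough illustration of what collapsing on the product id does with SKU-level documents, here is a Python sketch. It does not call Solr or the SOLR-236 patch. The SKUs for product 1 are taken from Kelly's example elsewhere in this thread; product 2, the field names, and the helper functions are hypothetical.

```python
# One document per SKU. Product 1 comes from Kelly's example in this thread;
# product 2 is a hypothetical addition so that one query has a match.
sku_docs = [
    {"sku_id": 7, "product_id": 1, "color": "green", "size": "S", "price": 10.99},
    {"sku_id": 9, "product_id": 1, "color": "green", "size": "L", "price": 10.99},
    {"sku_id": 10, "product_id": 1, "color": "blue", "size": "S", "price": 9.99},
    {"sku_id": 11, "product_id": 1, "color": "blue", "size": "M", "price": 10.99},
    {"sku_id": 12, "product_id": 1, "color": "blue", "size": "L", "price": 10.99},
    {"sku_id": 20, "product_id": 2, "color": "green", "size": "M", "price": 8.50},
    {"sku_id": 21, "product_id": 2, "color": "green", "size": "L", "price": 9.25},
]


def search(docs, color=None, size=None, max_price=None):
    """Return the SKU documents that satisfy every supplied criterion."""
    hits = []
    for doc in docs:
        if color is not None and doc["color"] != color:
            continue
        if size is not None and doc["size"] != size:
            continue
        if max_price is not None and doc["price"] > max_price:
            continue
        hits.append(doc)
    return hits


def collapse_on_product(hits):
    """Group matching SKUs by product id: one result entry per product."""
    groups = {}
    for doc in hits:
        groups.setdefault(doc["product_id"], []).append(doc["sku_id"])
    return groups


# "Green, size M, $9.99 or less": only product 2 has such a SKU.
green_m = collapse_on_product(search(sku_docs, "green", "M", 9.99))
print(green_m)

# "Blue": three SKUs match, but they collapse into a single product result.
blue = collapse_on_product(search(sku_docs, color="blue"))
print(blue)
```

Counting the collapsed groups rather than the raw SKU hits is what keeps paging and result totals at the product level.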

A recent separate question by you on this list suggests you're going down this path.  I would grab the attached SOLR-236.patch file and attempt to apply it to the 1.4 source.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Jan 11, 2010, at 5:27 PM, Kelly Taylor wrote:

> 
> I am in the process of building a Solr search solution for my application and
> have run into a roadblock with the schema design.  Trying to match criteria
> in one multi-valued field with corresponding criteria in another
> multi-valued field.  Any advice would be greatly appreciated.
> 
> BACKGROUND:
> My RDBMS data model is such that for every one of my "Product" entities,
> there are one-to-many "SKU" entities available for purchase. Each SKU entity
> can have its own price, as well as one-to-many options, etc.  The web
> frontend displays available "Product" entities on both directory and detail
> pages.
> 
> In order to take advantage of Solr's facet count, paging, and sorting
> functionality, I decided to base the Solr schema on "Product" documents; so
> none of my documents currently contain duplicate "Product" data, and all
> "SKU" related data is denormalized as necessary, but into multi-valued
> fields.  For example, I have a document with an "id" field set to
> "Product:7," a "docType" field is set to "Product" as well as multi-valued
> "SKU" related fields and data like, "sku_color" {Red | Green | Blue},
> "sku_size" {Small | Medium | Large}, "sku_price" {10.00 | 10.00 | 7.99}
> 
> I hit the roadblock when I tried to answer the question, "Which products are
> available that contain skus with color Green, size M, and a price of $9.99
> or less?"...and have now begun the switch to "SKU" level indexing.  This
> also gives me what I need for faceted browsing/navigation, and search
> refinement...leading the user to "Product" entities having purchasable "SKU"
> entities.  But this also means I now have documents which are mostly
> duplicates for each "Product," and all, facet counts, paging and sorting is
> then inaccurate;  so it appears I need do this myself, with multiple Solr
> requests.
> 
> Is this really the best approach; and if so, should I use the Solr
> Deduplication update processor when indexing and querying?
> 
> Thanks in advance,
> Kelly
> -- 
> View this message in context: http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27118977.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Chantal Ackermann <ch...@btelligent.de>.

Hi Kelly,

"...the criteria for this hypothetical search involves multi-valued fields,
where the index of one matching criteria needs to correspond to the same
value in another multi-valued field in the same index. You can't do that..."

Just my two cents:
By storing values in two different multi-valued fields you cannot
store their relation to each other. If you want to have that in the
index as well, you need another field with the pairs (or triples or
whatever), like a map but stored as a list of patterned strings, e.g.
"size1:price1", "size2:price2", etc.
You need that, of course, for every possible combination that the user can
order (in your use case). Whenever delivery of a certain combination
changes, you'll have to update the specific documents to reflect that.
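
A minimal Python sketch of the patterned-string idea, assuming a hypothetical extra field named "color_size" and using the three SKUs Kelly lists for product 3 elsewhere in this thread:

```python
# SKUs of product 3, as listed by Kelly elsewhere in this thread.
skus = [
    {"sku": 9, "color": "green", "size": "L", "price": 10.99},
    {"sku": 10, "color": "blue", "size": "S", "price": 9.99},
    {"sku": 11, "color": "blue", "size": "M", "price": 10.99},
]


def combo_tokens(rows):
    """One 'color:size' token per combination that can actually be ordered."""
    return sorted({f"{row['color']}:{row['size']}" for row in rows})


product_doc = {
    "id": 3,
    # Separate multi-valued fields, still useful for faceting.
    "color": sorted({row["color"] for row in skus}),
    "size": sorted({row["size"] for row in skus}),
    # Hypothetical extra field that preserves the color/size relation.
    "color_size": combo_tokens(skus),
}

# Matching the two fields independently gives a false positive ...
independent_match = "green" in product_doc["color"] and "M" in product_doc["size"]
# ... while the combined token only matches a combination that exists.
combined_match = "green:M" in product_doc["color_size"]

print(product_doc["color_size"])
print(independent_match, combined_match)
```

Exact-match tokens suit discrete options such as color and size. A range criterion such as "price of $9.99 or less" does not fit this pattern directly, which is one reason the SKU-level documents discussed in this thread may be the simpler route.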

You'll still need the other fields for faceting, I suppose. (My
experience is that you often need different fields for faceting than
for searching.)

The index is flat. If you think that storing all combinations (and that
multiple times for all documents in your current schema) is too much,
then maybe you should store them in an extra index (core), and store only
id references? But I'm not sure about that. I would try to store
everything in one flat schema (one or multiple cores), unless you really
run into unsolvable (!) hardware/performance issues.

Cheers,
Chantal

Kelly Taylor wrote:
> Hi Markus,
> 
> Thanks again. I wish this were simple boolean algebra. This is something I
> have already tried. So either I am missing the boat completely, or have
> failed to communicate it clearly. I didn't want to confuse the issue further
> but maybe the following excerpts will help...
> 
> Excerpt from  "Solr 1.4 Enterprise Search Server" by David Smiley & Eric
> Pugh...
> 
> "...the criteria for this hypothetical search involves multi-valued fields,
> where the index of one matching criteria needs to correspond to the same
> value in another multi-valued field in the same index. You can't do that..."
> 
> And this excerpt is from "Solr and RDBMS: The basics of designing your
> application for the best of both" by Amit Nithianandan...
> 
> "...If I wanted to allow my users to search for wiper blades available in a
> store nearby, I might create an index with multiple documents or records for
> the same exact wiper blade, each document having different location data
> (lat/long, address, etc.) to represent an individual store. Solr has a
> de-duplication component to help show unique documents in case that
> particular wiper blade is available in multiple stores near me..."
> 
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics
> 
> Remember, with my original schema definition I have multi-valued fields, and
> when the "product" document is built, these fields do contain an array of
> values retrieved from each of the related skus. Skus are children of my
> products.
> 
> Using your example data, which t-shirt sku is available for purchase as a
> child of t-shirt product with id 3? Is it really the green, M, or have we
> found a product document related to both a green t-shirt and a Medium
> t-shirt of some other color, which will thereby leave the user with nothing
> to purchase?
> 
> sku = 9 [color=green, size=L, price=10.99], product id = 3
> sku = 10 [color=blue, size=S, price=9.99], product id = 3
> sku = 11 [color=blue, size=M, price=10.99], product id = 3
> 
>>> id = 1
>>> color = [green, blue]
>>> size = [M, S]
>>> price = 6
>>>
>>> id = 2
>>> color = [red, blue]
>>> size = [L, S]
>>> price = 12
>>>
>>> id = 3
>>> color = [green, red, blue]
>>> size = [L, S, M]
>>> price = 5
> 
> If this is still unclear, I'll post a new question based on findings from
> this conversation. Thanks for all of your help.
> 
> -Kelly
> 
> 
> Markus Jelsma - Buyways B.V. wrote:
>> Hello Kelly,
>>
>>
>> Simple boolean algebra, you tell Solr you want color = green AND size = M
>> so it will only return green t-shirts in size M. If you, however, turn the
>> AND in a OR it will return all t-shirts that are green OR in size M, thus
>> you can then get M sized shirts in the blue color or green shirts in size
>> XXL.
>>
>> I suggest you'd just give it a try and perhaps come back later to find
>> some improvements for your query. It would also be a good idea - if i may
>> say so - to read the links provided in the earlier message.
>>
>> Hope you will find what you're looking for :)
>>
>>
>> Cheers,
>>
>> Kelly Taylor wrote:
>>> Hi Markus,
>>>
>>> Thanks for your reply.
>>>
>>> Using the current schema and query like you suggest, how can I identify
>>> the unique combination of options and price for a given SKU?   I don't
>>> want the user to arrive at a product which doesn't completely satisfy
>>> their search request.  For example, with the "color:Green", "size:M",
>>> and "price:[0 to 9.99]" search refinements applied,  no products should
>>> be displayed which only have "size:M" in "color:Blue"
>>>
>>> The actual data in the database for a product to display on the frontend
>>> could be as follows:
>>>
>>> product id = 1
>>> product name = T-shirt
>>>
>>> related skus...
>>> -- sku id = 7 [color=green, size=S, price=10.99]
>>> -- sku id = 9 [color=green, size=L, price=10.99]
>>> -- sku id = 10 [color=blue, size=S, price=9.99]
>>> -- sku id = 11 [color=blue, size=M, price=10.99]
>>> -- sku id = 12 [color=blue, size=L, price=10.99]
>>>
>>> Regards,
>>> Kelly
>>>
>>>
>>> Markus Jelsma - Buyways B.V. wrote:
>>>> Hello Kelly,
>>>>
>>>>
>>>> I am not entirely sure if i understand your problem correctly. But i
>>>> believe your first approach is the right one.
>>>>
>>>> Your question: "Which products are available that contain skus with
>>>> color Green, size M, and a price of $9.99 or less?" can be easily
>>>> answered using a schema like yours.
>>>>
>>>> id = 1
>>>> color = [green, blue]
>>>> size = [M, S]
>>>> price = 6
>>>>
>>>> id = 2
>>>> color = [red, blue]
>>>> size = [L, S]
>>>> price = 12
>>>>
>>>> id = 3
>>>> color = [green, red, blue]
>>>> size = [L, S, M]
>>>> price = 5
>>>>
>>>> Using the data above you can answer your question using a basic Solr
>>>> query [1] like the following: q=color:green AND price:[0 TO 9.99] AND
>>>> size:M
>>>>
>>>> Of course, you would make this a function query [2] but this, if i
>>>> understood your question well enough, answers it.
>>>>
>>>> [1] http://wiki.apache.org/solr/SolrQuerySyntax
>>>> [2] http://wiki.apache.org/solr/FunctionQuery
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27120031.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>>
> 
> --
> View this message in context: http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27120734.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Markus Jelsma <ma...@buyways.nl>.

Hello,


I now believe that I really did misunderstand the problem and,
unfortunately, I don't believe I can be of much assistance, as I have not
had to solve a similar problem.


Cheers,

-  
Markus Jelsma          Buyways B.V.            
Technisch Architect    Friesestraatweg 215c    
http://www.buyways.nl  9743 AD Groningen       


Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


On Mon, 2010-01-11 at 16:56 -0800, Kelly Taylor wrote:

> Hi Markus,
> 
> Thanks again. I wish this were simple boolean algebra. This is something I
> have already tried. So either I am missing the boat completely, or have
> failed to communicate it clearly. I didn't want to confuse the issue further
> but maybe the following excerpts will help...
> 
> Excerpt from  "Solr 1.4 Enterprise Search Server" by David Smiley & Eric
> Pugh...
> 
> "...the criteria for this hypothetical search involves multi-valued fields,
> where the index of one matching criteria needs to correspond to the same
> value in another multi-valued field in the same index. You can't do that..."
> 
> And this excerpt is from "Solr and RDBMS: The basics of designing your
> application for the best of both" by Amit Nithianandan...
> 
> "...If I wanted to allow my users to search for wiper blades available in a
> store nearby, I might create an index with multiple documents or records for
> the same exact wiper blade, each document having different location data
> (lat/long, address, etc.) to represent an individual store. Solr has a
> de-duplication component to help show unique documents in case that
> particular wiper blade is available in multiple stores near me..."
> 
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics
> 
> Remember, with my original schema definition I have multi-valued fields, and
> when the "product" document is built, these fields do contain an array of
> values retrieved from each of the related skus. Skus are children of my
> products.
> 
> Using your example data, which t-shirt sku is available for purchase as a
> child of t-shirt product with id 3? Is it really the green, M, or have we
> found a product document related to both a green t-shirt and a Medium
> t-shirt of some other color, which will thereby leave the user with nothing
> to purchase?
> 
> sku = 9 [color=green, size=L, price=10.99], product id = 3
> sku = 10 [color=blue, size=S, price=9.99], product id = 3
> sku = 11 [color=blue, size=M, price=10.99], product id = 3
> 
> >> id = 1
> >> color = [green, blue]
> >> size = [M, S]
> >> price = 6
> >>
> >> id = 2
> >> color = [red, blue]
> >> size = [L, S]
> >> price = 12
> >>
> >> id = 3
> >> color = [green, red, blue]
> >> size = [L, S, M]
> >> price = 5
> 
> If this is still unclear, I'll post a new question based on findings from
> this conversation. Thanks for all of your help.
> 
> -Kelly
> 
> 
> Markus Jelsma - Buyways B.V. wrote:
> > 
> > Hello Kelly,
> > 
> > 
> > Simple boolean algebra, you tell Solr you want color = green AND size = M
> > so it will only return green t-shirts in size M. If you, however, turn the
> > AND in a OR it will return all t-shirts that are green OR in size M, thus
> > you can then get M sized shirts in the blue color or green shirts in size
> > XXL.
> > 
> > I suggest you'd just give it a try and perhaps come back later to find
> > some improvements for your query. It would also be a good idea - if i may
> > say so - to read the links provided in the earlier message.
> > 
> > Hope you will find what you're looking for :)
> > 
> > 
> > Cheers,
> > 
> > Kelly Taylor wrote:
> >>
> >> Hi Markus,
> >>
> >> Thanks for your reply.
> >>
> >> Using the current schema and query like you suggest, how can I identify
> >> the unique combination of options and price for a given SKU?   I don't
> >> want the user to arrive at a product which doesn't completely satisfy
> >> their search request.  For example, with the "color:Green", "size:M",
> >> and "price:[0 to 9.99]" search refinements applied,  no products should
> >> be displayed which only have "size:M" in "color:Blue"
> >>
> >> The actual data in the database for a product to display on the frontend
> >> could be as follows:
> >>
> >> product id = 1
> >> product name = T-shirt
> >>
> >> related skus...
> >> -- sku id = 7 [color=green, size=S, price=10.99]
> >> -- sku id = 9 [color=green, size=L, price=10.99]
> >> -- sku id = 10 [color=blue, size=S, price=9.99]
> >> -- sku id = 11 [color=blue, size=M, price=10.99]
> >> -- sku id = 12 [color=blue, size=L, price=10.99]
> >>
> >> Regards,
> >> Kelly
> >>
> >>
> >> Markus Jelsma - Buyways B.V. wrote:
> >>>
> >>> Hello Kelly,
> >>>
> >>>
> >>> I am not entirely sure if i understand your problem correctly. But i
> >>> believe your first approach is the right one.
> >>>
> >>> Your question: "Which products are available that contain skus with
> >>> color Green, size M, and a price of $9.99 or less?" can be easily
> >>> answered using a schema like yours.
> >>>
> >>> id = 1
> >>> color = [green, blue]
> >>> size = [M, S]
> >>> price = 6
> >>>
> >>> id = 2
> >>> color = [red, blue]
> >>> size = [L, S]
> >>> price = 12
> >>>
> >>> id = 3
> >>> color = [green, red, blue]
> >>> size = [L, S, M]
> >>> price = 5
> >>>
> >>> Using the data above you can answer your question using a basic Solr
> >>> query [1] like the following: q=color:green AND price:[0 TO 9.99] AND
> >>> size:M
> >>>
> >>> Of course, you would make this a function query [2] but this, if i
> >>> understood your question well enough, answers it.
> >>>
> >>> [1] http://wiki.apache.org/solr/SolrQuerySyntax
> >>> [2] http://wiki.apache.org/solr/FunctionQuery
> >>>
> >>>
> >>> Cheers,
> >>>
> >>>
> >>
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27120031.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> > 
> > 
> > 
> > 
> > 
>

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Kelly Taylor <wi...@hotmail.com>.

Hi Markus,

Thanks again. I wish this were simple boolean algebra. This is something I
have already tried. So either I am missing the boat completely, or have
failed to communicate it clearly. I didn't want to confuse the issue further
but maybe the following excerpts will help...

Excerpt from  "Solr 1.4 Enterprise Search Server" by David Smiley & Eric
Pugh...

"...the criteria for this hypothetical search involves multi-valued fields,
where the index of one matching criteria needs to correspond to the same
value in another multi-valued field in the same index. You can't do that..."

And this excerpt is from "Solr and RDBMS: The basics of designing your
application for the best of both" by Amit Nithianandan...

"...If I wanted to allow my users to search for wiper blades available in a
store nearby, I might create an index with multiple documents or records for
the same exact wiper blade, each document having different location data
(lat/long, address, etc.) to represent an individual store. Solr has a
de-duplication component to help show unique documents in case that
particular wiper blade is available in multiple stores near me..."

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Solr-and-RDBMS-design-basics

Remember, with my original schema definition I have multi-valued fields, and
when the "product" document is built, these fields do contain an array of
values retrieved from each of the related skus. Skus are children of my
products.

Using your example data, which t-shirt sku is available for purchase as a
child of t-shirt product with id 3? Is it really the green, M, or have we
found a product document related to both a green t-shirt and a Medium
t-shirt of some other color, which will thereby leave the user with nothing
to purchase?

sku = 9 [color=green, size=L, price=10.99], product id = 3
sku = 10 [color=blue, size=S, price=9.99], product id = 3
sku = 11 [color=blue, size=M, price=10.99], product id = 3

>> id = 1
>> color = [green, blue]
>> size = [M, S]
>> price = 6
>>
>> id = 2
>> color = [red, blue]
>> size = [L, S]
>> price = 12
>>
>> id = 3
>> color = [green, red, blue]
>> size = [L, S, M]
>> price = 5
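
To make the concern concrete, here is a rough Python sketch (plain Python,
not Solr) that flattens the three skus listed above into one product-level
document with multi-valued fields, then checks the criteria both ways. The
helper names flatten, matches_flattened and matches_per_sku are made up for
illustration, and the matching only approximates how independent clauses
behave against multi-valued fields.

```python
# Hypothetical illustration only: plain Python, not Solr.
# The three skus listed above for product id 3.
skus = [
    {"sku": 9, "color": "green", "size": "L", "price": 10.99},
    {"sku": 10, "color": "blue", "size": "S", "price": 9.99},
    {"sku": 11, "color": "blue", "size": "M", "price": 10.99},
]


def flatten(skus):
    """Build one product-level document with multi-valued fields."""
    return {
        "color": [s["color"] for s in skus],
        "size": [s["size"] for s in skus],
        "price": [s["price"] for s in skus],
    }


def matches_flattened(doc, color, size, max_price):
    """Each clause is checked against its own field, independently."""
    return (
        color in doc["color"]
        and size in doc["size"]
        and any(p <= max_price for p in doc["price"])
    )


def matches_per_sku(skus, color, size, max_price):
    """All three criteria must hold for the same sku."""
    return [
        s["sku"]
        for s in skus
        if s["color"] == color and s["size"] == size and s["price"] <= max_price
    ]


product_doc = flatten(skus)
flattened_hit = matches_flattened(product_doc, "green", "M", 9.99)
per_sku_hits = matches_per_sku(skus, "green", "M", 9.99)
print(flattened_hit)  # True: green, M and 9.99 each occur somewhere
print(per_sku_hits)   # []: no single sku is green, M and <= 9.99
```

The flattened document matches even though no individual sku does, which
is the situation that would leave the user with nothing to purchase.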

If this is still unclear, I'll post a new question based on findings from
this conversation. Thanks for all of your help.

-Kelly


-- 
View this message in context: http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27120734.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Markus Jelsma <ma...@buyways.nl>.

Hello Kelly,


It is simple boolean algebra: you tell Solr you want color = green AND
size = M, so it will only return green t-shirts in size M. If, however,
you turn the AND into an OR, it will return all t-shirts that are green
OR in size M, so you can then get M-sized shirts in blue or green shirts
in size XXL.

I suggest you just give it a try and perhaps come back later to find
some improvements for your query. It would also be a good idea - if I may
say so - to read the links provided in the earlier message.

Hope you will find what you're looking for :)


Cheers,


Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Kelly Taylor <wi...@hotmail.com>.

Hi Markus,

Thanks for your reply.

Using the current schema and a query like the one you suggest, how can I
identify the unique combination of options and price for a given SKU? I
don't want the user to arrive at a product which doesn't completely satisfy
their search request. For example, with the "color:Green", "size:M", and
"price:[0 TO 9.99]" search refinements applied, no products should be
displayed which only have "size:M" in "color:Blue".

The actual data in the database for a product to display on the frontend
could be as follows:

product id = 1
product name = T-shirt

related skus...
-- sku id = 7 [color=green, size=S, price=10.99]
-- sku id = 9 [color=green, size=L, price=10.99]
-- sku id = 10 [color=blue, size=S, price=9.99]
-- sku id = 11 [color=blue, size=M, price=10.99]
-- sku id = 12 [color=blue, size=L, price=10.99]
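
One way to picture what I am after is a per-SKU check followed by a
collapse back to product ids. The following is only a sketch in plain
Python standing in for SKU-level documents; the field names (sku_id,
product_id) and the helper products_with_matching_sku are assumptions for
illustration, not anything Solr provides.

```python
# Hypothetical sketch: one record per SKU, collapsed to product ids afterwards.
# Plain Python standing in for SKU-level documents; field names are assumptions.
sku_docs = [
    {"sku_id": 7, "product_id": 1, "color": "green", "size": "S", "price": 10.99},
    {"sku_id": 9, "product_id": 1, "color": "green", "size": "L", "price": 10.99},
    {"sku_id": 10, "product_id": 1, "color": "blue", "size": "S", "price": 9.99},
    {"sku_id": 11, "product_id": 1, "color": "blue", "size": "M", "price": 10.99},
    {"sku_id": 12, "product_id": 1, "color": "blue", "size": "L", "price": 10.99},
]


def products_with_matching_sku(docs, color, size, max_price):
    """Return product ids having at least one SKU that meets all criteria."""
    seen = []
    for d in docs:
        if d["color"] == color and d["size"] == size and d["price"] <= max_price:
            if d["product_id"] not in seen:  # collapse duplicate products
                seen.append(d["product_id"])
    return seen


no_match = products_with_matching_sku(sku_docs, "green", "M", 9.99)
one_match = products_with_matching_sku(sku_docs, "blue", "S", 9.99)
print(no_match)   # []
print(one_match)  # [1]
```

With this data the green/M/9.99 refinement should return no products at
all, while blue/S/9.99 should return product 1 exactly once.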

Regards,
Kelly



-- 
View this message in context: http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27120031.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Encountering a roadblock with my Solr schema design...use dedupe?

Posted by Markus Jelsma <ma...@buyways.nl>.

Hello Kelly,


I am not entirely sure I understand your problem correctly, but I
believe your first approach is the right one.

Your question: "Which products are available that contain skus with color
Green, size M, and a price of $9.99 or less?" can be easily answered using
a schema like yours.

id = 1
color = [green, blue]
size = [M, S]
price = 6

id = 2
color = [red, blue]
size = [L, S]
price = 12

id = 3
color = [green, red, blue]
size = [L, S, M]
price = 5

Using the data above, you can answer your question with a basic Solr query
[1] like the following: q=color:green AND price:[0 TO 9.99] AND size:M
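
As a rough sketch of how such a query string might be assembled and
URL-encoded from application code, here is a small Python example. The
host, port and select path in SOLR_SELECT_URL are assumptions and not taken
from this thread, and build_select_url is a made-up helper name; adjust
both to your own installation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for illustration; not taken from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(color, size, max_price):
    """Assemble the q parameter and URL-encode it."""
    # Range queries need an uppercase TO and a decimal point in the price.
    query = f"color:{color} AND price:[0 TO {max_price}] AND size:{size}"
    return SOLR_SELECT_URL + "?" + urlencode({"q": query})


url = build_select_url("green", "M", 9.99)
print(url)

# Decode the URL again to confirm the query survives encoding unchanged.
decoded_q = parse_qs(urlsplit(url).query)["q"][0]
print(decoded_q)
```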

Of course, you would make this a function query [2] but this, if I
understood your question well enough, answers it.

[1] http://wiki.apache.org/solr/SolrQuerySyntax
[2] http://wiki.apache.org/solr/FunctionQuery


Cheers,

