Posted to solr-user@lucene.apache.org by Eric Reeves <er...@eline.com> on 2009/10/06 03:19:15 UTC

Help with denormalizing issues

Hi there,

I'm evaluating Solr as a replacement for our current search server, and am trying to determine what the best strategy would be to implement our business needs.  Our problem is that we have a catalog schema with products and skus, one to many.  The most relevant content being indexed is at the product level, in the name and description fields.  However, we are interested in filtering by sku attributes, and in particular making multiple filters apply to a single sku.  For example: find a product that contains a sku that is both blue and on sale.  No approach I've tried at collapsing the sku data into the product document works for this.  If we put the data in separate fields, there's no way to apply multiple filters to the same sku, and if we concatenate all of the relevant sku data into a single multivalued field then, as I understand it, this is just indexed as one large field with extra whitespace between the individual entries, so there's still no way to enforce that an AND filter query applies to the same sku.
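To make the cross-matching failure concrete, here is a small Python sketch (a toy in-memory model, not Solr itself; the field names are made up) of why AND-ing filters over separately flattened multivalued fields produces false positives:

```python
# Toy model (not Solr) of the problem described above: flattening per-SKU
# attributes into separate multivalued product fields loses the association
# between values that came from the same SKU.

# One product with two SKUs: sku-a is blue but not on sale,
# sku-b is red and on sale.
product = {
    "id": "prod-1",
    "sku_color": ["blue", "red"],      # flattened from both SKUs
    "sku_on_sale": [False, True],
}
skus = [
    {"color": "blue", "on_sale": False},
    {"color": "red", "on_sale": True},
]

def matches_flattened(doc, color, on_sale):
    """Mimics two independently AND'ed filters on multivalued fields."""
    return color in doc["sku_color"] and on_sale in doc["sku_on_sale"]

def matches_per_sku(sku_list, color, on_sale):
    """What is actually wanted: both conditions must hold for one SKU."""
    return any(s["color"] == color and s["on_sale"] == on_sale
               for s in sku_list)

# No single SKU is both blue and on sale, yet the flattened document matches.
print(matches_flattened(product, "blue", True))   # True  (false positive)
print(matches_per_sku(skus, "blue", True))        # False (correct)
```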

One approach I was considering was to create separate indexes for products and skus, and store the product IDs in the sku documents.  Then we could apply our own filters to the initially generated list, based on unique query parameters.  I thought creating a component between query and facet would be a good place to add such a filter, but further research seems to indicate that this would break paging and sorting.  The only other thing I can think of would be to subclass QueryComponent itself, which looks rather daunting: the process() method has no hooks for this sort of thing, so it seems I would have to copy the entire existing implementation and add them myself, which looks to be a fair chunk of work and brittle to changes in the trunk code.  Ideally it would be nice to be able to handle certain fq parameters in a completely different way, perhaps using a custom query parser, but I haven't wrapped my head around how those work.  Does any of this sound remotely doable?  Any advice?
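For what it's worth, the two-index idea can be sketched in plain Python, with lists standing in for the two indexes (field names like product_id are illustrative assumptions, not anything from Solr's API):

```python
# Sketch of the two-index approach: pass 1 filters the SKU "index" with
# multiple conditions that must all hold on the same SKU, pass 2 restricts
# the product-level text query to the surviving product IDs.

def two_pass_search(product_index, sku_index, text_query, sku_filters):
    # Pass 1: SKUs matching ALL filters together yield their product IDs.
    product_ids = {
        sku["product_id"]
        for sku in sku_index
        if all(sku.get(field) == value for field, value in sku_filters.items())
    }
    # Pass 2: naive "full-text" match over products, restricted to those IDs.
    return [p for p in product_index
            if p["id"] in product_ids and text_query in p["description"]]

products = [
    {"id": "p1", "description": "soft cotton tee"},
    {"id": "p2", "description": "wool sweater"},
]
skus = [
    {"product_id": "p1", "color": "blue", "on_sale": True},
    {"product_id": "p2", "color": "blue", "on_sale": False},
]

hits = two_pass_search(products, skus, "tee",
                       {"color": "blue", "on_sale": True})
print([p["id"] for p in hits])  # ['p1']
```

Doing this inside Solr while keeping paging correct is exactly the hard part, since the pass-1 restriction would have to be applied before the paged result window is cut.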

The other suggestion we are looking at was given to us by our current search provider, which is to index the skus themselves.  It looks as if we may be able to make this work using the field collapsing patch from SOLR-236.  I have some concerns about this approach though: 1) It will make for a much larger index and longer indexing times (products can have 10 or more skus in our catalog).  2) Because the indexing will be copying the description and name from the product, it will be indexing the same content more than once, and the number of times per product will vary based on the number of skus.  I'm concerned that this may skew the scoring algorithm, in particular the inverse frequency part.  3) I'm not sure about the performance of the field collapsing patch; I've read contradictory reports on the web.
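Concern (2) can be made concrete with a back-of-the-envelope calculation using the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)); the catalog numbers below are made up purely for illustration:

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene (DefaultSimilarity-style) inverse document frequency.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Product-level index: 100 products, a term occurs in 5 of them.
product_level = idf(100, 5)

# SKU-level index for the same catalog: suppose the 5 matching products
# have 10 SKUs each while the other 95 have 2, so the term's docFreq
# grows faster than the total document count.
sku_level = idf(95 * 2 + 5 * 10, 5 * 10)

# The term now looks less rare, and by how much depends on the SKU counts
# of whichever products happen to contain it.
print(product_level > sku_level)  # True
```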

I apologize if this is a bit rambling.  If anyone has any advice for our situation it would be very helpful.

Thanks,
Eric

Re: Help with denormalizing issues

Posted by Chris Hostetter <ho...@fucit.org>.
: business needs.  Our problem is that we have a catalog schema with 
: products and skus, one to many.  The most relevant content being indexed 
: is at the product level, in the name and description fields.  However we 
: are interested in filtering by sku attributes, and in particular making 
: multiple filters apply to a single sku.  For example, find a product 

the first rule of denormalization is to construct documents based on what 
granularity you want to get back -- because from a user perspective, 
that level of granularity is what's going to make sense for 
faceting/filtering.

If you want your results to be product based, have one doc per product -- 
if you want your results to be sku based, have one doc per sku, and 
denormalize the product data redundantly into every sku.

if sometimes you want to return product data, and other times you want to 
return sku data, then create both types of documents (either in different 
indexes, or in the same index but with a doctype field that you can filter 
on)
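The "both types of documents" approach above can be sketched as a small routine (Python as pseudocode; the field names are illustrative, not from the thread):

```python
# One product doc plus one doc per SKU, each tagged with a doctype field
# so queries can filter on doctype:product or doctype:sku. Product text is
# copied redundantly into every SKU document.

def denormalize(product, skus):
    docs = [{
        "id": product["id"],
        "doctype": "product",
        "name": product["name"],
        "description": product["description"],
    }]
    for sku in skus:
        docs.append({
            "id": sku["id"],
            "doctype": "sku",
            "product_id": product["id"],
            "name": product["name"],            # redundant copy
            "description": product["description"],
            "color": sku["color"],
            "on_sale": sku["on_sale"],
        })
    return docs

docs = denormalize(
    {"id": "p1", "name": "Tee", "description": "soft cotton tee"},
    [{"id": "p1-blue", "color": "blue", "on_sale": True},
     {"id": "p1-red", "color": "red", "on_sale": False}],
)
# A filter like color:blue AND on_sale:true now applies within one SKU doc.
print([d["doctype"] for d in docs])  # ['product', 'sku', 'sku']
```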



-Hoss


Re: Help with denormalizing issues

Posted by Lance Norskog <go...@gmail.com>.
The separate skus do not become one long text string. They are indexed
as separate values in the same field, and the relevance calculation is
completely separate per value.

The performance problem with the field collapsing patch is that it
does the same thing as a facet or sorting operation: it does a sweep
through the index and builds a data structure whose size depends on
the index. Faceting is not cached directly but still works very
quickly the second time. Sorting has its own cache and is very slow (N
log N) the first time and very fast afterwards. The field collapsing
patch does not cache any of its work and is almost as slow the second
time as the first time.
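The collapse itself amounts to keeping the best-scoring document per key. A minimal Python sketch of the idea (just an illustration, not the SOLR-236 implementation, which does this inside Solr during the search sweep):

```python
# Collapse SKU-level results to one hit per product: keep only the first
# (highest-scoring) document for each product_id.

def collapse(hits, key="product_id"):
    seen = {}
    for hit in hits:                 # assumed sorted by descending score
        seen.setdefault(hit[key], hit)
    return list(seen.values())       # insertion order preserves score order

hits = [
    {"id": "p1-blue", "product_id": "p1", "score": 2.0},
    {"id": "p2-red",  "product_id": "p2", "score": 1.5},
    {"id": "p1-red",  "product_id": "p1", "score": 1.2},
]
print([h["id"] for h in collapse(hits)])  # ['p1-blue', 'p2-red']
```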

On 10/7/09, Eric Reeves <er...@eline.com> wrote:
> Hi again, I'm gonna try this again with more focus this time :D
>
> 1) Ideally what we would like to do, is plug in an additional mechanism to
> filter the initial result set, because we can't find a way to implement our
> filtering needs as filter queries against a single index.  We would want to
> do this while maintaining support for paging.  Looking through the codebase
> it looks as if this would not be possible without major surgery, due to the
> paging support being implemented deep inside private methods of
> SolrIndexSearcher.  Does this sound accurate?
>
> 2) If we pursue the other option of indexing skus and collapsing the results
> based on product id using the field collapsing patch, is there any validity
> to my concerns about indexing the same content multiple times skewing the
> scoring?
>
> 3) Does anyone have experience using the field collapsing patch, and have
> any idea how much additional overhead it incurs?
>
> Thanks,
> Eric
>


-- 
Lance Norskog
goksron@gmail.com

RE: Help with denormalizing issues

Posted by Eric Reeves <er...@eline.com>.
Hi again, I'm gonna try this again with more focus this time :D

1) Ideally what we would like to do, is plug in an additional mechanism to filter the initial result set, because we can't find a way to implement our filtering needs as filter queries against a single index.  We would want to do this while maintaining support for paging.  Looking through the codebase it looks as if this would not be possible without major surgery, due to the paging support being implemented deep inside private methods of SolrIndexSearcher.  Does this sound accurate? 

2) If we pursue the other option of indexing skus and collapsing the results based on product id using the field collapsing patch, is there any validity to my concerns about indexing the same content multiple times skewing the scoring?

3) Does anyone have experience using the field collapsing patch, and have any idea how much additional overhead it incurs?

Thanks,
Eric
