Posted to solr-user@lucene.apache.org by Steven Livingstone Pérez <we...@hotmail.com> on 2012/08/20 22:56:00 UTC

Many fields versus join

Hi folks. I read some posts in the past about this subject but nothing that definitively answers my question.
I am trying to understand the trade-off when you use a large number of fields (not sure what a quantitative value of "large" is in Solr .. say 200 fields) versus a join - and even a multi-value join.
The reason being, I have a document that has a set of core fields and then a load of metadata that is a repeating structure.
D1 F1 F2 F3 F4 F5 ..... S1a S1b S1c S2a S2b S2c ....
I'm not sure whether to create a load of fields up to SNx in a single document, or to have multiple documents, with each SNx in a separate document carrying a parent id that points to a parent document (or a multivalued metadata pointer field).
I hope that comes across reasonably well - please ask if not. Oh, if anyone knows of any quantitative studies of Solr fields/documents I'd love to see the hard stats to improve my knowledge.
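
To make the two layouts above concrete, here is a minimal sketch in Python of what each option might look like as documents sent to Solr. The field names ("id", "parent_id", "s1a", ...) and the helper functions are hypothetical, chosen only to mirror the D1 / F1.. / S1a.. notation; nothing here is a Solr API.

```python
# Hypothetical field names throughout; these helpers only build plain dicts.

def flat_document(doc_id, core, groups):
    """Option 1: one document, with one field per repeated metadata value."""
    doc = {"id": doc_id, **core}
    for n, group in enumerate(groups, start=1):
        for suffix, value in group.items():
            doc[f"s{n}{suffix}"] = value  # s1a, s1b, s1c, s2a, ...
    return doc


def parent_child_documents(doc_id, core, groups):
    """Option 2: one parent document plus one child document per group."""
    parent = {"id": doc_id, **core}
    children = [
        {"id": f"{doc_id}_s{n}", "parent_id": doc_id, **group}
        for n, group in enumerate(groups, start=1)
    ]
    return [parent] + children


core = {"f1": "alpha", "f2": "beta"}
groups = [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]

flat = flat_document("D1", core, groups)
split = parent_child_documents("D1", core, groups)
```

With N repeated groups, the flat form grows by three fields per group in a single document, while the split form keeps a fixed set of fields and grows by one document per group.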
Loving Solr.
Cheers,/Steven

RE: Many fields versus join

Posted by Steven Livingstone Pérez <we...@hotmail.com>.

Thanks again Erick.
I have read some of Yonik's posts also.
I think 1M is closer to my number (I'm more interested in using Solr to improve the quality of search over a limited doc set with lots of metadata than in quantity).
I'll make sure to stress test.
Cheers,/Steven

> Date: Tue, 21 Aug 2012 06:17:11 -0600
> Subject: Re: Many fields versus join
> From: erickerickson@gmail.com
> To: solr-user@lucene.apache.org
> 
> Steven:
> 
> Nope, I don't have any benchmarks off the top of my head.
> 
> You could probably compare this pretty quickly by using one of the
> benchmarking tools (http://wiki.apache.org/solr/BenchmarkingSolr);
> JMeter works as well, using two different schemas and
> configuring, say, an edismax request handler to search across
> all your fields.....
> 
> You could try some sort of clever indexing on multiValued fields with
> an appropriate positionIncrementGap and phrase slop. The idea
> here would be to put all the fields in one field and somehow
> keep them distinguishable (but I don't understand the domain
> well enough to suggest how).
> 
> But I think the real question is whether your corpus is big enough
> to worry about. Try the simple thing, stress test, and go from there.
> If you have a million docs, chances are you don't much care. 100M
> and it's dicier.
> 
> I have seen people like Yonik say that searching a bunch of
> separate fields is more expensive than searching a single large
> field, but whether it's enough to matter in _your_ situation only
> testing will tell....
> 
> Best
> Erick
> 

Re: Many fields versus join

Posted by Erick Erickson <er...@gmail.com>.

Steven:

Nope, I don't have any benchmarks off the top of my head.

You could probably compare this pretty quickly by using one of the
benchmarking tools (http://wiki.apache.org/solr/BenchmarkingSolr);
JMeter works as well, using two different schemas and
configuring, say, an edismax request handler to search across
all your fields.....
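
As a rough sketch of the comparison described above, the following Python builds the request parameters for an edismax query against two hypothetical schemas: one with many separate fields and one with a single combined field. The field names ("s1a" ... and "all_metadata") are assumptions for illustration; "defType", "qf", "q" and "rows" are standard Solr request parameters. The code only builds query strings and sends nothing.

```python
from urllib.parse import urlencode


def edismax_params(user_query, fields, rows=10):
    """Build request parameters for an edismax search across `fields`."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": " ".join(fields),  # qf takes a space-separated field list
        "rows": rows,
    }


# Schema A (hypothetical): many separate metadata fields.
flat_fields = [f"s{n}{x}" for n in range(1, 4) for x in "abc"]
many_fields = urlencode(edismax_params("solr", flat_fields))

# Schema B (hypothetical): everything copied into one field.
one_field = urlencode(edismax_params("solr", ["all_metadata"]))
```

The two query strings could then be fed to whichever load-testing tool is used, one per schema, so that only the field layout differs between runs.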

You could try some sort of clever indexing on multiValued fields with
an appropriate positionIncrementGap and phrase slop. The idea
here would be to put all the fields in one field and somehow
keep them distinguishable (but I don't understand the domain
well enough to suggest how).
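
The effect of positionIncrementGap and phrase slop can be illustrated with a small simulation. This is a simplified model in Python, not Lucene code: it assumes whitespace tokenization, adds the gap between consecutive values of a multiValued field, and treats an in-order two-term phrase as matching when at most `slop` positions separate the terms. The function names are made up for this sketch.

```python
def index_positions(values, gap):
    """Assign token positions across the values of one multiValued field.

    Simplified model: each token advances the position by 1, and `gap`
    extra positions are added between consecutive values.
    """
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def sloppy_match(positions, first, second, slop):
    """True if `second` follows `first` with at most `slop` positions between."""
    return any(
        0 < b - a <= slop + 1
        for a in positions.get(first, [])
        for b in positions.get(second, [])
    )


values = ["red leather", "jacket blue"]

# With no gap, "leather jacket" matches even though it spans two values.
no_gap = sloppy_match(index_positions(values, 0), "leather", "jacket", 0)

# With a gap larger than the slop, the cross-value match disappears.
with_gap = sloppy_match(index_positions(values, 100), "leather", "jacket", 10)
```

In this model, choosing a gap larger than any slop used at query time keeps phrase matches inside a single value, which is one way to keep the combined values distinguishable.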

But I think the real question is whether your corpus is big enough
to worry about. Try the simple thing, stress test, and go from there.
If you have a million docs, chances are you don't much care. 100M
and it's dicier.

I have seen people like Yonik say that searching a bunch of
separate fields is more expensive than searching a single large
field, but whether it's enough to matter in _your_ situation only
testing will tell....

Best
Erick

On Tue, Aug 21, 2012 at 3:41 AM, Steven Livingstone Pérez
<we...@hotmail.com> wrote:
> Many Thanks Erick.
> Are you aware of any real-world metrics or best practice/pattern samples that use a lot of fields?
> I'm looking to get an idea of the pros/cons as I scale.
> From what you're saying it defo looks like I'll try keeping a flat structure (which means perhaps 300 fields), but given some things I read I suspect there are things to watch out for when defining so many fields (but then, I'm not sure if 300 is a *big* number).
> thanks,steven
>

RE: Many fields versus join

Posted by Steven Livingstone Pérez <we...@hotmail.com>.

Many Thanks Erick.
Are you aware of any real-world metrics or best practice/pattern samples that use a lot of fields?
I'm looking to get an idea of the pros/cons as I scale.
From what you're saying it defo looks like I'll try keeping a flat structure (which means perhaps 300 fields), but given some things I read I suspect there are things to watch out for when defining so many fields (but then, I'm not sure if 300 is a *big* number).
thanks,steven

> Date: Mon, 20 Aug 2012 19:28:57 -0600
> Subject: Re: Many fields versus join
> From: erickerickson@gmail.com
> To: solr-user@lucene.apache.org
> 
> Join works best with a small number of unique values. Unfortunately,
> people often want to join on <uniqueKey>, which is by definition
> unique per document.
> 
> The usual advice is to first try to flatten your data as much as possible.
> There's also some ongoing work on "block joins" that you may want to
> look at the JIRA for, explicitly for parent/child relationships but I confess
> I haven't a real clue what the details are....
> 
> Best
> Erick
> 

Re: Many fields versus join

Posted by Erick Erickson <er...@gmail.com>.

Join works best with a small number of unique values. Unfortunately,
people often want to join on <uniqueKey>, which is by definition
unique per document.
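
For reference, a join in Solr is expressed with the join query parser, `{!join from=... to=...}`. The sketch below, in Python, builds such a query string and counts the distinct join values in some made-up child documents, since the number of unique values is the concern raised above. The field names ("parent_id", "id", "a") and the sample documents are hypothetical.

```python
def join_query(from_field, to_field, inner_query):
    """Build a Solr join query: match `inner_query`, then map the values of
    `from_field` on the matching documents onto `to_field`."""
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"


def distinct_join_values(docs, field):
    """Count unique values of the join field across a list of documents."""
    return len({doc[field] for doc in docs if field in doc})


# Hypothetical child documents pointing at their parents.
children = [
    {"id": "D1_s1", "parent_id": "D1", "a": 1},
    {"id": "D1_s2", "parent_id": "D1", "a": 4},
    {"id": "D2_s1", "parent_id": "D2", "a": 4},
]

# Find parents that have at least one child with a:4.
q = join_query("parent_id", "id", "a:4")
unique_parents = distinct_join_values(children, "parent_id")
```

Joining on the parent's uniqueKey, as here, means the number of distinct join values grows with the number of parent documents, which is the case described above as working less well.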

The usual advice is to first try to flatten your data as much as possible.
There's also some ongoing work on "block joins" that you may want to
look at the JIRA for, explicitly for parent/child relationships but I confess
I haven't a real clue what the details are....

Best
Erick

On Mon, Aug 20, 2012 at 2:56 PM, Steven Livingstone Pérez
<we...@hotmail.com> wrote:
> Hi folks. I read some posts in the past about this subject but nothing that definitively answers my question.
> I am trying to understand the trade-off when you use a large number of fields (not sure what a quantitative value of "large" is in Solr .. say 200 fields) versus a join - and even a multi-value join.
> The reason being, I have a document that has a set of core fields and then a load of metadata that is a repeating structure.
> D1 F1 F2 F3 F4 F5 ..... S1a S1b S1c S2a S2b S2c ....
> I'm not sure whether to create a load of fields up to SNx in a single document, or to have multiple documents, with each SNx in a separate document carrying a parent id that points to a parent document (or a multivalued metadata pointer field).
> I hope that comes across reasonably well - please ask if not. Oh, if anyone knows of any quantitative studies of Solr fields/documents I'd love to see the hard stats to improve my knowledge.
> Loving Solr.
> Cheers,/Steven