Posted to solr-user@lucene.apache.org by Phil Scadden <P....@gns.cri.nz> on 2017/12/05 00:20:56 UTC

Multiple cores versus a "source" field.

I have two different document stores that I want to index. Both are quite small (<50,000 documents, though documents can be quite large). They are quite capable of using the same schema, but you would not want to search both simultaneously. I can see two approaches to handling this case.
1/ Create a "source" field and use that to identify which store is being used. The search interface adds the appropriate " AND source=xxxx" to queries.
2/ Create a separate core for each.

If you want to use the same solr server to handle queries to both stores, which is the best approach in terms of minimizing JVM size while keeping searches reasonably fast?
Notice: This email and any attachments are confidential and may not be used, published or redistributed without the prior written consent of the Institute of Geological and Nuclear Sciences Limited (GNS Science). If received in error please destroy and immediately notify GNS Science. Do not copy or disclose the contents.

RE: Multiple cores versus a "source" field.

Posted by Phil Scadden <P....@gns.cri.nz>.

Thanks Walter. Your case does apply as both data stores do indeed cover the same kind of material, with many important terms in common. "source" + fq: coming up.

Re: Multiple cores versus a "source" field.

Posted by Walter Underwood <wu...@wunderwood.org>.

One more opinion on source field vs separate collections for multiple corpora.

Index statistics don’t really settle down until at least 100k documents. Below that, idf is pretty noisy. With Ultraseek, we used pre-calculated frequency data for collections under 10k docs.

If your corpora have similar word statistics, you might get more predictable relevance with a single collection. For example, if you have data sheets and press releases, but they are both about test instruments, then you might get some advantage from having more data points about the “text” and “title” fields.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

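A rough illustration of the point about idf and pooled statistics, sketched in Python. It assumes the BM25-style idf formula used by Lucene's default similarity; the thread does not say which similarity is in use, and the document counts and frequencies below are invented for illustration only.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """BM25-style idf: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical numbers: one term, counted in two small stores.
store_a = {"docs": 20_000, "df": 3}    # term is rare in store A
store_b = {"docs": 30_000, "df": 40}   # same term is more common in store B

idf_a = bm25_idf(store_a["docs"], store_a["df"])
idf_b = bm25_idf(store_b["docs"], store_b["df"])

# Single core with a "source" field: statistics are pooled across both stores.
idf_combined = bm25_idf(store_a["docs"] + store_b["docs"],
                        store_a["df"] + store_b["df"])

# With only 3 matching docs, one extra document moves idf noticeably.
idf_a_plus_one = bm25_idf(store_a["docs"], store_a["df"] + 1)

print(f"store A: {idf_a:.2f}  store B: {idf_b:.2f}  combined: {idf_combined:.2f}")
print(f"store A after one more matching doc: {idf_a_plus_one:.2f}")
```

With separate cores the same term is weighted differently in each store; in one core the weight is computed from the pooled counts, which is less sensitive to a handful of documents.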

RE: Multiple cores versus a "source" field.

Posted by Phil Scadden <P....@gns.cri.nz>.

Thanks Erick. I have already followed the SolrJ indexing approach very closely - I have to do a lot of manipulation at indexing time. The other blog article is very interesting as I do indeed use "year" (year of publication) and it is very frequently used to filter queries. I will have a play with that now.

Re: Multiple cores versus a "source" field.

Posted by Erick Erickson <er...@gmail.com>.

That's the unpleasant part of semi-structured documents (PDF, Word,
whatever). You never know the relationship between raw size and
indexable text.

Basically anything that you don't care to contribute to _scoring_ is
often better in an fq clause. You can also use {!cache=false} to
bypass actually using the cache if you know it's unlikely to be
reused.

Two other points:

1> you can offload the parsing to clients rather than Solr and gain
more control over the process (assuming you haven't already). Here's
a blog:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

2> One reason to not go to fq clauses (except if you use
{!cache=false}) is if you are using bare NOW in your clauses, say for
ranges. One common construct is fq=date:[NOW-1DAY TO NOW]. Here's
another blog on the subject:
https://lucidworks.com/2012/02/23/date-math-now-and-filter-queries/

Best,
Erick

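A small Python sketch of the two options for NOW-based range filters mentioned above: either bypass the filter cache with {!cache=false}, or round NOW with Solr date math so the filter string stays identical between requests and can be reused. The function name and the field name `date` are hypothetical, and the rounded window (whole days, through the end of today) is not exactly the same window as the un-rounded one.

```python
def recent_range_fq(field, span, round_to=None):
    """Build an fq string covering roughly the last `span` (e.g. "1DAY").

    round_to=None  -> bare NOW, so mark the filter as not cacheable.
    round_to="DAY" -> round NOW to the day; the string is stable all day,
                      so the cached filter can be reused.
    """
    if round_to:
        return f"{field}:[NOW/{round_to}-{span} TO NOW/{round_to}+1{round_to}]"
    return f"{{!cache=false}}{field}:[NOW-{span} TO NOW]"

uncached = recent_range_fq("date", "1DAY")
rounded = recent_range_fq("date", "1DAY", round_to="DAY")
print(uncached)
print(rounded)
```

Bare NOW resolves to the current millisecond, so each request would otherwise produce a filter that is cached once and never hit again.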

RE: Multiple cores versus a "source" field.

Posted by Phil Scadden <P....@gns.cri.nz>.

>You'll have a few economies of scale I think with a single core, but frankly I don't know if they'd be enough to measure. You say the docs are "quite large" though, are you talking books? Magazine articles? Is 20K large or are they 20M?

Technical reports. Sometimes up to 200MB PDFs, but that would include a lot of imagery. More typically 20MB. A 140MB PDF contained only 400k of text.

Thanks for the tip on fq: I will put that into code now as I have other fields used in similar fashion.

Re: Multiple cores versus a "source" field.

Posted by Erick Erickson <er...@gmail.com>.

At that scale, whatever you find administratively most convenient.
You'll have a few economies of scale I think with a single core, but
frankly I don't know if they'd be enough to measure. You say the docs
are "quite large" though, are you talking books? Magazine articles? Is
20K large or are they 20M?

One small tip: If you put them in the same core, use
fq=source:whatever rather than AND. The fq clause will be set in the
filterCache upon first use and will be faster than ANDing after that.
If you set autowarm to a reasonable value on filterCache they'll always be
"hot". "Reasonable" here is on the order of 10-20...

Best,
Erick

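A minimal sketch of the fq tip above, in Python, building the request URL by hand. The host, the core name `reports`, the source value `store_a`, and the helper name are all hypothetical; only the `q` and `fq` parameters and the `source:value` syntax come from the thread.

```python
from urllib.parse import urlencode

def build_select_url(base_url, user_query, source):
    """Return a Solr /select URL that restricts results to one store.

    The user's text goes in q (scored); the store restriction goes in fq,
    which does not affect scoring and can be served from the filterCache.
    """
    params = [
        ("q", user_query),
        ("fq", f"source:{source}"),
    ]
    return f"{base_url}/select?{urlencode(params)}"

url = build_select_url("http://localhost:8983/solr/reports",
                       "geothermal", "store_a")
print(url)
```

The alternative from the original question would append " AND source:store_a" to q itself, which mixes the restriction into the scored query instead of a reusable filter.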