
Posted to solr-user@lucene.apache.org by "Wittenberg, Lucas" <lu...@capgemini.com.INVALID> on 2019/08/26 16:01:11 UTC

SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Hello all,
Here is the situation I am facing.

I am migrating from SOLR 4 to SOLR 7. SOLR 4 is running on Tomcat 8, SOLR 7 runs with the built-in Jetty 9.
The largest core contains about 1,800,000 documents (about 3 GB).

The migration went through smoothly. But something's bothering me.

I have a PostFilter to collect only some documents according to a pre-selected list.

Here is the code for the org.apache.solr.search.DelegatingCollector:

	@Override
	protected void doSetNextReader(LeafReaderContext context) throws IOException {
		this.reader = context.reader();
		super.doSetNextReader(context);
	}

	@Override
	public void collect(int docNumber) throws IOException {
		if (null != this.reader && isValid(this.reader.document(docNumber).get("customid")))
		{
			super.collect(docNumber);
		}
	}

	private boolean isValid(String customId) {
		boolean valid = false;
		if (null != customMap) // HashMap<String, String>, contains the custom IDs to keep. Contains an average of 2k items
		{
			valid = customMap.get(customId) != null;
		}

		return valid;
	}

And here is an example of query sent to SOLR:

	/select?fq=%7B!MyPostFilter%20sessionid%3DWST0DEV-QS-5BEEB1CC28B45580F92CCCEA32727083&q=system%20upgrade

So, the problem is:
	- It runs pretty fast on SOLR 4, with an average QTime of 30.
	- But now on SOLR 7, it is awfully slow, with an average QTime of around 25000!

I am wondering what could be the source of such bad performance...

With a very simplified (or should I say transparent) collect function (see below), there is no degradation. This test was just to exclude the server/platform from the equation.

	@Override
	public void collect(int docNumber) throws IOException {
		super.collect(docNumber);
	}

My guess is that since LUCENE 7, there have been drastic changes in the way the API accesses documents, but I am not sure I have understood everything.
I got it from this post: https://stackoverflow.com/questions/48474506/how-to-get-docvalue-by-document-id-in-lucene-7

I suppose this has something to do with the issues I am facing.
But I have no idea how to upgrade/change my PostFilter and/or DelegatingCollector to get back to good performance.

If any LUCENE/SOLR experts could provide some hints or leads, it would be much appreciated.
Thanks in advance.


PS:
In the core schema:

	<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />

This field is string-type as it can be something like "100034_001".

In the solrconfig.xml:

	<queryParser name="MyPostFilter" class="solrpostfilter.MyQueryPaser"/>

I can share the full schema and solrconfig.xml files if needed but so far, there is no other particular configuration in there.

Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Toke Eskildsen <to...@kb.dk>.

Wittenberg, Lucas <lu...@capgemini.com.INVALID> wrote:
> As suggested I switched to using DocValues and SortedDocValues.
> Now QTime is down to an average of 1100, which is much, much better
> but still far from the 30 I had with SOLR 4.
> I suppose it is due to the block-oriented compression you mentioned.

I apologize for being unclear: Only stored fields are block compressed in Solr 7. Doc values for string fields are ... well, also compressed, but in much smaller blocks (prefix compression as far as I remember) and each string field separately, so they should be very fast to access.

1100 ms for Solr 7 vs. 30 ms for Solr 4 sounds like a huge difference. You don't by chance use a fully merged (aka "optimized") index? Doc values in Solr 7 can (very counter-intuitively) suffer from that for some access patterns.

Maybe you are doing something sub-optimal like calling DocValues.getSorted for each collect call? Could you share your code somewhere?
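
To illustrate that last point, here is a minimal, self-contained sketch of why per-segment setup belongs in doSetNextReader rather than in collect. SegmentValues is a hypothetical stand-in for whatever DocValues.getSorted returns; it is not a Lucene class, and its counter only models the setup cost.

```java
import java.util.List;
import java.util.Set;

public class PerSegmentSetupDemo {

    /** Hypothetical stand-in for a per-segment doc-values view; counts how often one is opened. */
    static final class SegmentValues {
        static int opened = 0;
        private final String[] customIds;

        SegmentValues(String[] customIds) {
            opened++; // models the cost of a call such as DocValues.getSorted(...)
            this.customIds = customIds;
        }

        String get(int doc) {
            return customIds[doc];
        }
    }

    /** Opens the view once per segment and reuses it for every document of that segment. */
    static int countMatchesPerSegment(List<String[]> segments, Set<String> wanted) {
        int matches = 0;
        for (String[] segment : segments) {
            SegmentValues values = new SegmentValues(segment); // doSetNextReader equivalent
            for (int doc = 0; doc < segment.length; doc++) {   // collect equivalent
                if (wanted.contains(values.get(doc))) {
                    matches++;
                }
            }
        }
        return matches;
    }

    /** Anti-pattern: opens a fresh view for every single document. */
    static int countMatchesPerDocument(List<String[]> segments, Set<String> wanted) {
        int matches = 0;
        for (String[] segment : segments) {
            for (int doc = 0; doc < segment.length; doc++) {
                SegmentValues values = new SegmentValues(segment);
                if (wanted.contains(values.get(doc))) {
                    matches++;
                }
            }
        }
        return matches;
    }
}
```

Both methods return the same count; only the number of SegmentValues instances differs (one per segment versus one per collected document). Whether this is what happens in the collector under discussion is a guess until the code is shared.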

- Toke Eskildsen

RE: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by "Wittenberg, Lucas" <lu...@capgemini.com.INVALID>.

Ok, thank you Erick and Toke.
As suggested I switched to using DocValues and SortedDocValues.
Now QTime is down to an average of 1100, which is much, much better but still far from the 30 I had with SOLR 4.
I suppose it is due to the block-oriented compression you mentioned. Not sure if it is possible to improve this even more. Is it possible/wise to disable the compression?
Anyway, really appreciate the support. Thanks.
/cheers

-----Original Message-----
From: Wittenberg, Lucas
Sent: Tuesday, 27 August 2019 11:06
To: solr-user@lucene.apache.org
Subject: RE: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Thanks for the suggestion.
But the "customid" field is already set as docValues="true" actually.
Well, I guess so as it is a type="string" which by default has docValues="true".

<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />



Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Erick Erickson <er...@gmail.com>.

Well, the question is then whether you’re getting the value from the docValues structure or the stored structure. My bet is the latter. A simple test would be to comment out the line and return some random value just to see how long it takes.

> On Aug 27, 2019, at 5:05 AM, Wittenberg, Lucas <lu...@capgemini.com.INVALID> wrote:
> 
> Thanks for the suggestion.
> But the "customid" field is already set as docValues="true" actually.
> Well, I guess so as it is a type="string" which by default has docValues="true".
> 
> <field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />

Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Erick Erickson <er...@gmail.com>.

> I don't know the precedence rules for stored vs. docValues in Solr

DocValues are used if (and only if) all the fields being returned have
docValues="true" _and_ are single-valued, or if you’ve explicitly
set useDocValuesAsStored.

Single-valued docValues are the only situation where the response is
identical between stored and docValues. MultiValued fields have the twin
issues of deduplicating input and sorting.

The theory is that if you have to decompress the data anyway, it’s
likely more efficient to get all the fields from the stored data you 
can rather than separately go out to read DV fields.


Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Toke Eskildsen <to...@kb.dk>.

On Tue, 2019-08-27 at 09:05 +0000, Wittenberg, Lucas wrote:
> But the "customid" field is already set as docValues="true" actually.
> Well, I guess so as it is a type="string" which by default has
> docValues="true".
> 
> <field name="customid" type="string" indexed="true" stored="true"
> required="true" multiValued="false" />
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> docValues="true" />

Yeah, it's a bit confusing. It is both stored and docValues and as far
as I can see, the reader.document-methods only deal with stored.

Solr masks the difference between stored & docValues for retrieval by
using SolrDocumentFetcher.decorateDocValueFields but Lucene does not do
that for you. The relevant Solr API seems to be

https://lucene.apache.org/solr/7_0_1/solr-core/org/apache/solr/search/SolrDocumentFetcher.html#doc-int-java.util.Set-

I don't know the precedence rules for stored vs. docValues in Solr, so
the safe (best performance) solution would be to implement something
like the pseudo code I wrote earlier.

- Toke Eskildsen, Royal Danish Library

RE: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by "Wittenberg, Lucas" <lu...@capgemini.com.INVALID>.

Thanks for the suggestion.
But the "customid" field is already set as docValues="true" actually.
Well, I guess so as it is a type="string" which by default has docValues="true".

<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />



Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Erick Erickson <er...@gmail.com>.

Is "customid" a docValues=true field? I suspect not, in which case
I think this is the problem (but do be warned, I don’t spend much time in Lucene code):

this.reader.document(docNumber).get("customid")

I think document(docNumber) goes out to do a disk read. If the field were docValues=true, it could be fetched from
the in-memory structures instead. I don’t know how much of that is automatic at this
level though.

Best,
Erick

> On Aug 26, 2019, at 12:01 PM, Wittenberg, Lucas <lu...@capgemini.com.INVALID> wrote:
> 
> this.reader.document(docNumber).get("customid")

Re: SOLR 7+ / Lucene 7+ and performance issues with DelegatingCollector and PostFilter

Posted by Toke Eskildsen <to...@kb.dk>.

On Mon, 2019-08-26 at 16:01 +0000, Wittenberg, Lucas wrote:
> 	@Override
> 	public void collect(int docNumber) throws IOException {
> 		if (null != this.reader && isValid(this.reader.document(docNumber).get("customid")))
> 		{
> 			super.collect(docNumber);
> 		}
> 	}
> ...
> 	- It runs pretty fast on SOLR 4, with an average QTime of 30.
> 	- But now on SOLR 7, it is awfully slow, with an average QTime of around 25000!

Lucene 4.0 did not compress stored fields by default and as far as I
remember, Solr 7 forces compression across documents in 16KB blocks. My
guess is that you see the effect of a lot of decompression (although 25
seconds still seems excessive).
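
A toy model of the effect described above, assuming (as recalled above) that stored fields are compressed in blocks spanning several documents. It uses java.util.zip as a stand-in for Lucene's stored-field codec; the class, the method names and the one-document-per-line layout are hypothetical and are only meant to show that reading one field from one document inflates the whole block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressionDemo {
    /** Total bytes inflated so far; a rough proxy for decompression work. */
    static long bytesInflated = 0;

    /** Packs several documents (one per line, no embedded newlines) into one compressed block. */
    static byte[] packBlock(String[] docs) {
        byte[] raw = String.join("\n", docs).getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Reads one document; the whole block is inflated even though a single line is needed. */
    static String readDoc(byte[] block, int docInBlock) {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        byte[] raw = out.toByteArray();
        bytesInflated += raw.length;
        return new String(raw, StandardCharsets.UTF_8).split("\n")[docInBlock];
    }
}
```

In this model, calling readDoc once per collected document makes bytesInflated grow by the full uncompressed block size each time, which is the kind of repeated work that would add up over many hits.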

There are at least two things you can try:

1) State which field you need.

Instead of 
  this.reader.document(docNumber).get("customid")
have a Set containing only customid and call
  this.reader.document(docNumber, wantedFields)

https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/index/IndexReader.html#document(int,%20java.util.Set)

I'm not sure if it will be much better with Solr 7 though, due to the
block-oriented compression.

2) Switch to using DocValues, as Erick suggests.

You will have to add something like
  SortedDocValues dv = DocValues.getSorted(context.reader(), "customid");
to your doSetNextReader method and
  dv.advanceExact(docNumber) && 
    isValid(dv.binaryValue().utf8ToString())
in your collect method.

https://lucene.apache.org/core/7_1_0/core/org/apache/lucene/index/DocValues.html#getSorted-org.apache.lucene.index.LeafReader-java.lang.String-

If you want to speed it up further, you can use BytesRefs as keys in
your customMap instead of Strings, and avoid the .utf8ToString() call.
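
A minimal sketch of that idea using only the JDK, so it runs without Lucene on the classpath: java.nio.ByteBuffer stands in for Lucene's BytesRef (both compare and hash by content), and ByteKeyedIdSet is a hypothetical name. The allowed IDs are encoded to UTF-8 once up front, so the per-document check works on raw bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

public class ByteKeyedIdSet {
    // Keys are the UTF-8 bytes of the allowed custom IDs, encoded once at construction.
    private final Set<ByteBuffer> keys = new HashSet<>();

    public ByteKeyedIdSet(Collection<String> customIds) {
        for (String id : customIds) {
            keys.add(ByteBuffer.wrap(id.getBytes(StandardCharsets.UTF_8)));
        }
    }

    /**
     * Checks membership directly on a byte slice. A small wrapper object is
     * allocated, but no UTF-8 decoding and no String creation takes place.
     */
    public boolean contains(byte[] bytes, int offset, int length) {
        return keys.contains(ByteBuffer.wrap(bytes, offset, length));
    }
}
```

With Lucene, the analogous lookup would presumably pass the BytesRef from dv.binaryValue() straight to a Set<BytesRef> whose keys were built once from the allowed IDs.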

- Toke Eskildsen, Royal Danish Library