Posted to solr-user@lucene.apache.org by Avi Steiner <as...@varonis.com> on 2019/07/03 07:36:06 UTC

tlog/commit questions

Hi

We have had several cases at customer sites (Solr 5.3.1, one search node, one shard) with huge tlog files (more than 1 GB).

Our settings:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>30000</maxTime> <!-- 30 seconds -->
    <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
  </autoCommit>

  <autoSoftCommit>
    <maxTime>1800000</maxTime> <!-- 30 minutes -->
  </autoSoftCommit>

  <updateLog>
    <str name="dir">${solr.data.dir:}</str>
  </updateLog>
</updateHandler>

I don't have enough logs, so I don't know whether a commit failed or not. I just remember there were OOM messages.

As you may know, during restart Solr tries to replay the tlog, which can take a lot of time. I tried moving the files to another location, starting Solr, and only moving the tlogs back to their original location after the core was loaded. They were cleared after a while.

So I have a few questions:

  1.  Do you have any idea what could cause commit failures?
  2.  Should we decrease the maxTime for hard commits, or change any other settings?
  3.  Is there any way to replay the tlog asynchronously (or disable replay, so that we can call it programmatically from our code in a separate thread), so that Solr loads more quickly?
  4.  Is there any improvement in this area in Solr 7.3.1?

Thanks in advance

Avi



Re: tlog/commit questions

Posted by Shawn Heisey <ap...@elyograg.org>.
On 7/3/2019 1:36 AM, Avi Steiner wrote:
> We have had several cases at customer sites (Solr 5.3.1, one search node, one shard) with huge tlog files (more than 1 GB).

With 30 seconds on the autoCommit, that should not be happening.

When a hard commit fires, the current tlog is closed and a new one 
starts.  Solr only keeps enough tlogs to meet certain minimum 
requirements.  If the tlogs never rotate, then Solr has to keep the huge 
one to meet the requirements.
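As a rough illustration (the names and count here are made up, but the 
tlog.<sequence> naming is what Solr uses), a healthy tlog directory 
cycles through several modest files:

   data/tlog/tlog.0000000000000000417
   data/tlog/tlog.0000000000000000418
   data/tlog/tlog.0000000000000000419

When rotation stalls, you instead see one file that just keeps growing.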

I have heard of one situation that causes huge tlogs even with 
autoCommit.  That is a misconfigured SolrCloud feature called Cross Data 
Center Replication (CDCR) ... but CDCR did not exist in version 5.3.1. 
It was added in 6.0.0.

Do you have a solr.log file covering a significant period of time?  At 
least several minutes while indexing occurs.

> I don't have enough logs so I don't know if commit failed or not. I just remember there were OOM messages.

What OS is Solr running on?  On most platforms other than Windows, an 
OOM will cause Solr to self-terminate.  On Windows, that wouldn't 
happen; Solr would most likely keep running.

The reason that we configured Solr to self-terminate on OOM is that 
program operation is completely unpredictable once OOM happens.  Index 
corruption is only one of the possible side effects.  It is far safer to 
terminate.

When Solr self-terminates, it will NOT automatically restart with the 
out-of-the-box setup.  You would have to create that functionality yourself.

If you have the actual OOM message ... what resource does it say was 
depleted?  It is not always heap memory.
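
For reference, heap exhaustion is only one of several OutOfMemoryError 
variants the JVM reports; the tail of the message names the depleted 
resource:

   java.lang.OutOfMemoryError: Java heap space
   java.lang.OutOfMemoryError: GC overhead limit exceeded
   java.lang.OutOfMemoryError: Metaspace
   java.lang.OutOfMemoryError: unable to create new native thread
   java.lang.OutOfMemoryError: Direct buffer memory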

Thanks,
Shawn

Re: tlog/commit questions

Posted by Erick Erickson <er...@gmail.com>.
1. Frustrating when this happens.

2. Yes. The impact depends on how aggressively you autowarm, i.e. what your autowarm counts are (see the cache sketch after this list). Whether users _notice_ or not is an open question. Since searches are served from the old searcher until the new one’s autowarming is complete, if you have some spare cycles your users by and large won’t notice. And, as always, “best practice” depends on a variety of factors. This is something of a distraction, though; I mentioned it as a _possible_ source of your OOM, which _might_ have triggered a tlog replay. It’s pretty irrelevant to why your startup took so long.

3. First, it’s not clear that the tlog was being replayed. You could have been syncing from the leader (guessing). Although you’re right, a large tlog combined with a core that isn’t coming up is suspicious. And no, you can’t do this async, and you wouldn’t want to. If the core is replaying the tlog, it means that the last hard commit didn’t happen, and the docs indexed since the last _successful_ commit wouldn’t be found. IOW, you’d think you lost documents.

4. Not really. The commit/tlog management has been pretty static.
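
For reference, autowarm counts are set per cache in solrconfig.xml. A minimal sketch, with purely illustrative sizes and counts (not a recommendation):

<query>
  <!-- autowarmCount = how many entries from the old cache get re-executed
       to warm the new searcher when a commit opens one -->
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
  <!-- the documentCache is never autowarmed; its keys are internal doc ids,
       which change from searcher to searcher -->
  <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
</query>

Higher counts mean more work (CPU/IO) at every searcher open, which is exactly the trade-off above.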

BTW, are you by any chance using CDCR? If so, that can be a source of tlog growth, although that wouldn’t impact tlog replay on startup.


Here’s the bottom line: 
- It’s very unusual for a single tlog to reach 1 GB unless the commit interval is quite long and you’re sending docs for indexing really quickly, or the docs are huge.
- tlog replay is unlikely in the absence of un-graceful shutdown. Even if replayed, it shouldn’t take longer than the hard commit interval [1]
- Have you changed any of the tlog settings? For instance numRecordsToKeep?
- Are you totally sure that the autocommit interval isn’t being overridden at startup time by defining a sysvar, i.e. “-Dsolr.autoCommit.maxTime=###”? (See the sketch after this list.)
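
For reference, both of those live in solrconfig.xml. A sketch of where to look (the numbers shown are Solr’s defaults, not recommendations; the ${...} syntax is what makes a -D sysvar able to override the value):

<updateLog>
  <str name="dir">${solr.data.dir:}</str>
  <!-- minimum number of update records retained across tlogs (default 100) -->
  <int name="numRecordsToKeep">100</int>
  <!-- maximum number of tlog files kept on disk (default 10) -->
  <int name="maxNumLogsToKeep">10</int>
</updateLog>

<autoCommit>
  <!-- if the config uses the property form, -Dsolr.autoCommit.maxTime=600000
       at startup silently overrides the default after the colon -->
  <maxTime>${solr.autoCommit.maxTime:30000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>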

So finding out why your tlog is so big is where I’d start.

[1] Note that if you’re furiously indexing docs during tlog replay, those are queued up to be indexed. If you’re firing a bazillion updates to Solr, they’ll get into the queue too.


> On Jul 3, 2019, at 10:37 PM, Avi Steiner <as...@varonis.com> wrote:
> 
> Thanks for your reply, Erick.
> 
> 1. Unfortunately, we learned of those incidents only after a long time, and the relevant log files had already rolled over, so I couldn't find commit failure messages. But since I found OOM messages in other logs, I'd guess that was the root cause.
> 2. Just to be sure I understand: on every soft commit, a new searcher is opened and all caches are warmed up again. Doesn't that impact performance (memory, I/O, CPU)? Is there a best practice?
> 3. I think I used the wrong term. We saw cases where the tlog files were huge (more than 1 GB). We tried to change some settings and restart Solr, but it took a long time to load, I guess because of these tlog files. So again, is there a way Solr can replay the tlog asynchronously without blocking core loading?
> 4. We can't currently upgrade to a version greater than 7.3.1. The question is whether something in commit/tlog management was improved by 7.3.1.
> 
> Thanks again.
> 
> 
> -----Original Message-----
> From: Erick Erickson <er...@gmail.com>
> Sent: Wednesday, July 3, 2019 6:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: tlog/commit questions
> 
> Let’s take this a piece at a time.
> 
> 
> 1. commit failures are very rare, in fact the only time I’ve seen them is when running out of disk space, OOMs, pulling the plug, etc. Look in your log files, is there any evidence of same?
> 
> 2. OOM messages. To support Real Time Get, internal structures are kept for all docs that have been indexed but no searcher has been opened to make visible. So you’re collecting up to 30 minutes of updates. This _may_ be relevant to your OOM problem. So I’d recommend dropping your soft commit interval to maybe 5 minutes.
> 
> 3. Tlogs shouldn’t replay much. They only replay when Solr quits abnormally, OOM, kill -9, pull the plug etc. When Solr is shut down gracefully, i.e. “bin/solr stop” etc, it should commit before closing and should _not_ replay anything from the tlog. Of course you should stop indexing while shutting down Solr…
> 
> 
> 4. There are lots of improvements in Solr 7x. Go to the latest Solr version (7.7.2) rather than 7.3.1. That said, TMP has been around for a long, long time. The low-level process of merging segments hasn’t been changed. One thing that _has_ changed is that TMP will now respect the max segment size (5G) when optimizing or doing an expungeDeletes. And I strongly recommend that you do neither of those unless you can demonstrate the need; just mentioning it in case you already do that.
> 7.4-: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
> 7.5+: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> 
> All in all, I’d recommend getting to the bottom of your OOM issues. Absent abnormal termination, tlog replay really shouldn’t be happening. Are you totally sure that it was TLOG replay and not a full sync from the leader?
> 
> Best,
> Erick
> 
>> On Jul 3, 2019, at 12:36 AM, Avi Steiner <as...@varonis.com> wrote:
>> 
>> Hi
>> 
>> We have had several cases at customer sites (Solr 5.3.1, one search node, one shard) with huge tlog files (more than 1 GB).
>> 
>> Our settings:
>> 
>> <updateHandler class="solr.DirectUpdateHandler2">
>>   <autoCommit>
>>     <maxDocs>10000</maxDocs>
>>     <maxTime>30000</maxTime> <!-- 30 seconds -->
>>     <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
>>   </autoCommit>
>> 
>>   <autoSoftCommit>
>>     <maxTime>1800000</maxTime> <!-- 30 minutes -->
>>   </autoSoftCommit>
>> 
>>   <updateLog>
>>     <str name="dir">${solr.data.dir:}</str>
>>   </updateLog>
>> </updateHandler>
>> 
>> I don't have enough logs, so I don't know whether a commit failed or not. I just remember there were OOM messages.
>> 
>> As you may know, during restart Solr tries to replay the tlog, which can take a lot of time. I tried moving the files to another location, starting Solr, and only moving the tlogs back to their original location after the core was loaded. They were cleared after a while.
>> 
>> So I have a few questions:
>> 
>> 1.  Do you have any idea what could cause commit failures?
>> 2.  Should we decrease the maxTime for hard commits, or change any other settings?
>> 3.  Is there any way to replay the tlog asynchronously (or disable replay, so that we can call it programmatically from our code in a separate thread), so that Solr loads more quickly?
>> 4.  Is there any improvement in this area in Solr 7.3.1?
>> 
>> Thanks in advance
>> 
>> Avi
>> 
>> 


RE: tlog/commit questions

Posted by Avi Steiner <as...@varonis.com>.
Thanks for your reply, Erick.

1. Unfortunately, we learned of those incidents only after a long time, and the relevant log files had already rolled over, so I couldn't find commit failure messages. But since I found OOM messages in other logs, I'd guess that was the root cause.
2. Just to be sure I understand: on every soft commit, a new searcher is opened and all caches are warmed up again. Doesn't that impact performance (memory, I/O, CPU)? Is there a best practice?
3. I think I used the wrong term. We saw cases where the tlog files were huge (more than 1 GB). We tried to change some settings and restart Solr, but it took a long time to load, I guess because of these tlog files. So again, is there a way Solr can replay the tlog asynchronously without blocking core loading?
4. We can't currently upgrade to a version greater than 7.3.1. The question is whether something in commit/tlog management was improved by 7.3.1.

Thanks again.


-----Original Message-----
From: Erick Erickson <er...@gmail.com>
Sent: Wednesday, July 3, 2019 6:42 PM
To: solr-user@lucene.apache.org
Subject: Re: tlog/commit questions

Let’s take this a piece at a time.


1. commit failures are very rare, in fact the only time I’ve seen them is when running out of disk space, OOMs, pulling the plug, etc. Look in your log files, is there any evidence of same?

2. OOM messages. To support Real Time Get, internal structures are kept for all docs that have been indexed but no searcher has been opened to make visible. So you’re collecting up to 30 minutes of updates. This _may_ be relevant to your OOM problem. So I’d recommend dropping your soft commit interval to maybe 5 minutes.

3. Tlogs shouldn’t replay much. They only replay when Solr quits abnormally, OOM, kill -9, pull the plug etc. When Solr is shut down gracefully, i.e. “bin/solr stop” etc, it should commit before closing and should _not_ replay anything from the tlog. Of course you should stop indexing while shutting down Solr…


4. There are lots of improvements in Solr 7x. Go to the latest Solr version (7.7.2) rather than 7.3.1. That said, TMP has been around for a long, long time. The low-level process of merging segments hasn’t been changed. One thing that _has_ changed is that TMP will now respect the max segment size (5G) when optimizing or doing an expungeDeletes. And I strongly recommend that you do neither of those unless you can demonstrate the need; just mentioning it in case you already do that.
7.4-: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
7.5+: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

All in all, I’d recommend getting to the bottom of your OOM issues. Absent abnormal termination, tlog replay really shouldn’t be happening. Are you totally sure that it was TLOG replay and not a full sync from the leader?

Best,
Erick

> On Jul 3, 2019, at 12:36 AM, Avi Steiner <as...@varonis.com> wrote:
>
> Hi
>
> We have had several cases at customer sites (Solr 5.3.1, one search node, one shard) with huge tlog files (more than 1 GB).
>
> Our settings:
>
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>10000</maxDocs>
>     <maxTime>30000</maxTime> <!-- 30 seconds -->
>     <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
>   </autoCommit>
> 
>   <autoSoftCommit>
>     <maxTime>1800000</maxTime> <!-- 30 minutes -->
>   </autoSoftCommit>
> 
>   <updateLog>
>     <str name="dir">${solr.data.dir:}</str>
>   </updateLog>
> </updateHandler>
>
> I don't have enough logs, so I don't know whether a commit failed or not. I just remember there were OOM messages.
> 
> As you may know, during restart Solr tries to replay the tlog, which can take a lot of time. I tried moving the files to another location, starting Solr, and only moving the tlogs back to their original location after the core was loaded. They were cleared after a while.
> 
> So I have a few questions:
> 
>  1.  Do you have any idea what could cause commit failures?
>  2.  Should we decrease the maxTime for hard commits, or change any other settings?
>  3.  Is there any way to replay the tlog asynchronously (or disable replay, so that we can call it programmatically from our code in a separate thread), so that Solr loads more quickly?
>  4.  Is there any improvement in this area in Solr 7.3.1?
>
> Thanks in advance
>
> Avi
>
>

Re: tlog/commit questions

Posted by Erick Erickson <er...@gmail.com>.
Let’s take this a piece at a time.


1. commit failures are very rare, in fact the only time I’ve seen them is when running out of disk space, OOMs, pulling the plug, etc. Look in your log files, is there any evidence of same?

2. OOM messages. To support Real Time Get, internal structures are kept for all docs that have been indexed but no searcher has been opened to make visible. So you’re collecting up to 30 minutes of updates. This _may_ be relevant to your OOM problem. So I’d recommend dropping your soft commit interval to maybe 5 minutes (see the sketch after the links below).

3. Tlogs shouldn’t replay much. They only replay when Solr quits abnormally, OOM, kill -9, pull the plug etc. When Solr is shut down gracefully, i.e. “bin/solr stop” etc, it should commit before closing and should _not_ replay anything from the tlog. Of course you should stop indexing while shutting down Solr…


4. There are lots of improvements in Solr 7x. Go to the latest Solr version (7.7.2) rather than 7.3.1. That said, TMP has been around for a long, long time. The low-level process of merging segments hasn’t been changed. One thing that _has_ changed is that TMP will now respect the max segment size (5G) when optimizing or doing an expungeDeletes. And I strongly recommend that you do neither of those unless you can demonstrate the need; just mentioning it in case you already do that.
7.4-: https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
7.5+: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
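
To make point 2 concrete, a minimal sketch of the adjusted intervals (only the soft commit changes; 5 minutes is the suggestion above, not a universal best practice):

<autoCommit>
  <maxTime>30000</maxTime> <!-- 30 seconds, unchanged -->
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>300000</maxTime> <!-- 5 minutes instead of 30 -->
</autoSoftCommit>

That bounds the Real Time Get structures at roughly 5 minutes of accumulated updates instead of 30.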

All in all, I’d recommend getting to the bottom of your OOM issues. Absent abnormal termination, tlog replay really shouldn’t be happening. Are you totally sure that it was TLOG replay and not a full sync from the leader?

Best,
Erick

> On Jul 3, 2019, at 12:36 AM, Avi Steiner <as...@varonis.com> wrote:
> 
> Hi
> 
> We have had several cases at customer sites (Solr 5.3.1, one search node, one shard) with huge tlog files (more than 1 GB).
> 
> Our settings:
> 
> <updateHandler class="solr.DirectUpdateHandler2">
>   <autoCommit>
>     <maxDocs>10000</maxDocs>
>     <maxTime>30000</maxTime> <!-- 30 seconds -->
>     <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
>   </autoCommit>
> 
>   <autoSoftCommit>
>     <maxTime>1800000</maxTime> <!-- 30 minutes -->
>   </autoSoftCommit>
> 
>   <updateLog>
>     <str name="dir">${solr.data.dir:}</str>
>   </updateLog>
> </updateHandler>
> 
> I don't have enough logs, so I don't know whether a commit failed or not. I just remember there were OOM messages.
> 
> As you may know, during restart Solr tries to replay the tlog, which can take a lot of time. I tried moving the files to another location, starting Solr, and only moving the tlogs back to their original location after the core was loaded. They were cleared after a while.
> 
> So I have a few questions:
> 
>  1.  Do you have any idea what could cause commit failures?
>  2.  Should we decrease the maxTime for hard commits, or change any other settings?
>  3.  Is there any way to replay the tlog asynchronously (or disable replay, so that we can call it programmatically from our code in a separate thread), so that Solr loads more quickly?
>  4.  Is there any improvement in this area in Solr 7.3.1?
> 
> Thanks in advance
> 
> Avi
> 
> 