Posted to solr-user@lucene.apache.org by Danilo Tomasoni <to...@cosbi.eu> on 2019/10/24 07:52:28 UTC

solr configuration issue

Hello all,

We have a Solr 7.3.1 instance with around 40 million documents in it.

After the initial one-shot import, we found an issue in the import
software. We fixed it and re-ran the import, which atomically updates
(with "set") the existing documents.

The import is divided into processes; each process is responsible for
updating a portion of the documents.

For every document processed, a soft commit is performed to make the
update visible to other concurrent update processes.

At the end, every process performs a hard commit.

The issue I have is that the hard commits never terminate (one has been
running for more than 3 days), and the number of segments and the size
of the Solr index grow a lot.

In the past, when the commit finished, I used to optimize the index
incrementally (from 40 segments to 39, to 38 and so on), but this
process is also very slow.


Any advice on how to speed things up?

I checked the system usage on the Solr machine, and neither I/O nor CPU
is heavily used.


Thanks

Danilo

-- 
Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomasoni@cosbi.eu
http://www.cosbi.eu
  
As for the European General Data Protection Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data, we inform you that all the data we possess are object of treatment in the respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may ask for their correction, cancellation or you may oppose to their use by written request sent by recorded delivery to The Microsoft Research – University of Trento Centre for Computational and Systems Biology Scarl, Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
Please don't print this e-mail unless you really need to


Re: solr configuration issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/24/2019 1:52 AM, Danilo Tomasoni wrote:
> For every document processed, a soft commit is performed to make the 
> update visible to other concurrent update processes.

This is not the way to do things.  Doing a commit after every document 
means that Solr will spend more time doing commits than anything else.

Documents should be indexed in batches.

https://lucidworks.com/post/really-batch-updates-solr-2/
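
As a rough illustration of batching, here is a minimal Python sketch that groups atomic "set" updates into JSON arrays, so that one request to /update carries many documents. The uniqueKey name "id", the field name "title", and the batch size of 1000 are assumptions made for the example, not details from this thread.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched(docs: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def atomic_set(doc_id: str, fields: Dict[str, object]) -> Dict:
    """Build one atomic-update document that uses the "set" modifier."""
    doc: Dict[str, object] = {"id": doc_id}  # assumes the uniqueKey is "id"
    for name, value in fields.items():
        doc[name] = {"set": value}
    return doc


def batch_payloads(updates: Iterable[Dict], size: int = 1000) -> Iterator[str]:
    """Serialize each batch as one JSON array, ready to POST to /update."""
    for chunk in batched(updates, size):
        yield json.dumps(chunk)


# Hypothetical data: 2500 documents whose "title" field is being corrected.
updates = (atomic_set(f"doc-{i}", {"title": f"fixed title {i}"}) for i in range(2500))
payloads = list(batch_payloads(updates, size=1000))
print(len(payloads))
```

Each payload would be sent as the body of a single POST with Content-Type application/json, with no commit parameter on the request, leaving commits to the server-side settings.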

> At the end, every process performs a hard commit.

Use autoCommit to do hard commits.  I would suggest NOT using maxDocs;
use only maxTime, and set it to 60000 -- one minute.  Also ensure that
openSearcher is set to false.  Commits that do not open a new searcher
are VERY fast.  These hard commits will not do anything for document
visibility; they are about data durability.

Then you can use autoSoftCommit for change visibility, and not worry
about sending commits in your indexing application.  Again, don't set
maxDocs.  Set maxTime to as long an interval as you can stand.  I would
suggest a minimum of two minutes, but make it longer if you can,
something like 5 or 10 minutes.

https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
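
For illustration only, the settings described above could be expressed as a Config API "set-property" request body. The Python sketch below just builds that JSON. The five-minute soft commit interval is one example value, and sending all three properties in a single object is an assumption; if that does not work on your version, send one property per request, or set the equivalent autoCommit and autoSoftCommit elements in solrconfig.xml instead.

```python
import json


def commit_settings(hard_ms: int, soft_ms: int) -> str:
    """Return a Config API body that sets time-based commits only (no maxDocs)."""
    body = {
        "set-property": {
            # Hard commit: flushes to disk for durability, opens no searcher.
            "updateHandler.autoCommit.maxTime": hard_ms,
            "updateHandler.autoCommit.openSearcher": False,
            # Soft commit: controls how soon changes become visible.
            "updateHandler.autoSoftCommit.maxTime": soft_ms,
        }
    }
    return json.dumps(body, indent=2)


# One minute for hard commits, five minutes for soft commits (example values).
print(commit_settings(60_000, 300_000))
```

The resulting body would be POSTed to the /config endpoint of the core or collection.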

> The issue I have is that the hard commits never terminate (one has been
> running for more than 3 days), and the number of segments and the size
> of the Solr index grow a lot.

What do you mean by "terminate" here?  I cannot figure this out from 
the context.  The only thing I'm aware of that a hard commit is going to 
terminate is the current transaction log ... the current log is closed 
and the next time a document is indexed, a new one will be created. 
Hard commits are the only thing that will close a transaction log.

> In the past, when the commit finished, I used to optimize the index
> incrementally (from 40 segments to 39, to 38 and so on), but this
> process is also very slow.

If you're going to optimize, which we generally recommend NOT doing, 
optimize in a single pass.  Optimizing with multiple passes means 
reading the index and writing the index multiple times ... and each 
forced merge will require significant system resources.  It may not 
require them all, but it is significant.
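
If an optimize is run anyway, a single pass means naming the final segment count once rather than stepping down one segment at a time. A minimal Python sketch of building such a request URL follows; the host and core name are hypothetical, and the target of 1 segment is only an example.

```python
from urllib.parse import urlencode


def optimize_url(base_url: str, max_segments: int) -> str:
    """Build an /update URL that requests one forced merge down to max_segments."""
    params = {"optimize": "true", "maxSegments": max_segments}
    return f"{base_url.rstrip('/')}/update?{urlencode(params)}"


# Hypothetical core URL; the target segment count is given once, up front.
print(optimize_url("http://localhost:8983/solr/mycore", 1))
```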

Thanks,
Shawn

Re: solr configuration issue

Posted by Paras Lehana <pa...@indiamart.com>.
Hi Danilo,

> We have a Solr 7.3.1 instance with around 40 million documents in it.


I guess you are hard committing only after a few million docs have been
indexed, right? I suggest that you not avoid hard commits entirely. Set
*autoCommit* (not autoSoftCommit) at around half a million documents
(that's from my experience with my core of 250 million documents).
Obviously, you need to find the sweet spot yourself, but you can start
with this number.

Also, play with the values of *IndexConfig*
<https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
(merge factor, segment size, maxBufferedDocs, merge policies). We, at
Auto-Suggest, also do atomic updates daily, and changing the merge factor
in particular gave us a boost of ~4x during indexing. With the current
configuration, our core atomically updates ~423 documents per second. I
also do a few core optimizations in between the full indexing.


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*


Re: solr configuration issue

Posted by Shawn Heisey <ap...@elyograg.org>.
On 10/25/2019 5:44 AM, Danilo Tomasoni wrote:
> Another question, is softCommit sufficient to ensure visibility or 
> should I call a commit to ensure a new searcher will be opened?
> 
> Does softCommit automatically open a new searcher?

There would be little point to doing a soft commit with openSearcher set 
to false.  I actually don't even know if you CAN do such a commit, but 
there would be no reason to ever do it even if you can.  If you're not 
opening a searcher, the performance characteristics say that you might 
as well use hard commit for better data durability.  Creating searchers 
is the expensive part of a commit.

So in my mind, a soft commit always opens a new searcher.  It exists as 
a commit that MIGHT perform faster than a hard commit that opens a new 
searcher.  It also might not perform faster ... there are situations in 
which it must do just as much work writing to disk as a hard commit. 
But it does go faster sometimes, so it's what we recommend for 
visibility commits.

Thanks,
Shawn

Re: solr configuration issue

Posted by Danilo Tomasoni <to...@cosbi.eu>.
Thank you all for your suggestions.

I have now changed my import strategy so that repeated updates to the
same document always fall in different "batches"; this way I only need
a single programmatic soft commit at the end of each batch.


On the configuration side, I enabled autoCommit with openSearcher=false
and maxTime=60000 (1 minute).


Hope this will do it.

Another question, is softCommit sufficient to ensure visibility or 
should I call a commit to ensure a new searcher will be opened?

Does softCommit automatically open a new searcher?


Thanks


Danilo


On 24/10/19 17:06, Erick Erickson wrote:
> "For every document processed, a soft commit is performed to make the update visible to other concurrent update processes."
>
> Please do not do this! First, Real Time Get will always return the current doc, whether you’ve opened a new reader or not. Second, this is an anti-pattern. I agree with Paras, set your defaults in solrconfig and forget about it.
>
> I’d also set the hard commits to something like 15 seconds (openSearcher=false). Or, if you can stand 15 second latency, set openSearcher=true and leave the soft commit set to -1.
>
> Opening a searcher is a heavyweight operation. doing it after _every_ document is a poor choice. If you absolutely _must_, at least batch your updates up in groups of, say, 1,000 and open a new searcher after that.
>
> Best,
> Erick
>
>> On Oct 24, 2019, at 3:52 AM, Danilo Tomasoni <to...@cosbi.eu> wrote:
>>
>> For every document processed, a soft commit is performed to make the update visible to other concurrent update processes.



Re: solr configuration issue

Posted by Erick Erickson <er...@gmail.com>.
"For every document processed, a soft commit is performed to make the update visible to other concurrent update processes."

Please do not do this! First, Real Time Get will always return the current doc, whether you’ve opened a new reader or not. Second, this is an anti-pattern. I agree with Paras, set your defaults in solrconfig and forget about it. 
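
To make the Real Time Get point concrete, here is a small Python sketch that builds a request URL for the /get handler, which reads the latest version of a document by ID without waiting for a new searcher. The host, core name, and document IDs are hypothetical, and this assumes the update log is enabled in solrconfig.xml, which Real Time Get relies on.

```python
from typing import Sequence
from urllib.parse import urlencode


def realtime_get_url(base_url: str, doc_ids: Sequence[str]) -> str:
    """Build a Real Time Get URL for one or more document IDs."""
    query = urlencode({"ids": ",".join(doc_ids)})
    return f"{base_url.rstrip('/')}/get?{query}"


# Hypothetical core and IDs; another update process could fetch the current
# state of these documents this way instead of relying on a soft commit.
print(realtime_get_url("http://localhost:8983/solr/mycore", ["doc-1", "doc-2"]))
```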

I’d also set the hard commits to something like 15 seconds (openSearcher=false). Or, if you can stand 15-second latency, set openSearcher=true and leave the soft commit set to -1.

Opening a searcher is a heavyweight operation. Doing it after _every_ document is a poor choice. If you absolutely _must_, at least batch your updates up in groups of, say, 1,000 and open a new searcher after that.

Best,
Erick

> On Oct 24, 2019, at 3:52 AM, Danilo Tomasoni <to...@cosbi.eu> wrote:
> 
> For every document processed, a soft commit is performed to make the update visible to other concurrent update processes.