Posted to solr-user@lucene.apache.org by David '-1' Schmid <gd...@gmail.com> on 2019/02/18 18:03:59 UTC

UpdateHandler batch size / search solr-user

Hello!

Another question I could not find an answer to:
is there a best-practice / recommendation for pushing several million
documents into a new index?

I'm currently splitting my documents into batches of 10,000 json-line
payloads and sending each batch to the update request handler, with
commit set to 'true' (yes, for each of the batches).
I commit every batch because that keeps 'QTime' stable around ~2100;
without committing every batch, QTime degrades ten-fold by the time
I've sent somewhere around 1,000,000 documents.
It climbs steadily from there, so after sending all 46M documents I
end up with QTime values around 40,000 if I don't commit every batch
immediately.
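[A minimal sketch of the batching approach described above, assuming a
hypothetical local Solr core named "mycore"; the URL and core name are
placeholders, not anything from the thread.]

```python
# Batch documents and POST each batch to Solr's update handler.
# The core name "mycore" and localhost URL are assumptions.
import json
from urllib import request

SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"

def batches(docs, size=10_000):
    """Yield successive slices of at most `size` documents."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def post_batch(batch, commit=False):
    """POST one batch of documents as a JSON array to the update handler."""
    url = SOLR_UPDATE_URL + ("?commit=true" if commit else "")
    req = request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Usage (commented out; requires a running Solr):
# for batch in batches(all_docs):
#     post_batch(batch, commit=True)  # per-batch hard commit, as described
```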

Since I cannot find anything in my mails, I wanted to search the
solr-user archives but, as far as I can tell: there is no such thing.
Maybe I can't see it or just glossed over it, but is there no searchable
index of solr-user? Any hints?

regards,
-1

Re: UpdateHandler batch size / search solr-user

Posted by Erick Erickson <er...@gmail.com>.
Sending batches in parallel is perfectly fine. _However_,
if you’re updating the same document, there’s no 
guarantee which would win.

Imagine you have two processes sending batches. The
order of execution depends on way too many variables.

If nothing else, if process 1 sends a document and some
time later process 2 sends the same document, the one from
process 2 would “win”. The optimistic locking scenario wouldn’t
come into the picture unless you took control of assigning the
_version_ number.
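[A hedged sketch of the parallel submission Erick describes; the `send`
callable stands in for the HTTP POST to the update handler. As noted
above, if two workers index the same id, the last write to arrive wins
unless you assign _version_ values yourself.]

```python
# Send batches from several worker threads at once. `send` is a
# placeholder for whatever function POSTs one batch to Solr.
from concurrent.futures import ThreadPoolExecutor

def send_parallel(batches, send, workers=4):
    """Apply `send` to every batch across a pool of worker threads.

    Results come back in batch order (ThreadPoolExecutor.map preserves
    input order), but the actual POSTs may arrive at Solr in any order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches))
```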

Best,
Erick

> On Feb 19, 2019, at 9:23 AM, David '-1' Schmid <gd...@gmail.com> wrote:
> 
> Hi!
> 
> On 2019-02-18T20:36:35, Erick Erickson wrote:
>> Typically, people set their autocommit (hard) settings in
>> solrconfig.xml and forget about it. I usually use a time-based trigger
>> and don’t use documents as a trigger.
> I added a timed autoCommit and it seems to work out nicely. Thank you!
> 
>> Until you do a hard commit, all the incoming documents are held in the
>> transaction log,
> Ah, yes. Somehow I did not draw the link to transactions.
> I've noticed that solr is using only one of my four CPUs for applying
> the update. With that in mind, could I submit my batches in parallel,
> or would that be worse? To be honest, I've never seen what kind of
> transaction or coherency model is used in solr.
> 
> I think it's touched on briefly in the solr-ref-guide, for applying
> updates to single document fields; but I can't say for sure whether it
> uses an optimistic strategy or whether parallel updates would produce
> more overhead through pessimistic locking.
> 
> regards,
> =1


Re: UpdateHandler batch size / search solr-user

Posted by David '-1' Schmid <gd...@gmail.com>.
Hi!

On 2019-02-18T20:36:35, Erick Erickson wrote:
> Typically, people set their autocommit (hard) settings in
> solrconfig.xml and forget about it. I usually use a time-based trigger
> and don’t use documents as a trigger.
I added a timed autoCommit and it seems to work out nicely. Thank you!

> Until you do a hard commit, all the incoming documents are held in the
> transaction log,
Ah, yes. Somehow I did not draw the link to transactions.
I've noticed that solr is using only one of my four CPUs for applying
the update. With that in mind, could I submit my batches in parallel,
or would that be worse? To be honest, I've never seen what kind of
transaction or coherency model is used in solr.

I think it's touched on briefly in the solr-ref-guide, for applying
updates to single document fields; but I can't say for sure whether it
uses an optimistic strategy or whether parallel updates would produce
more overhead through pessimistic locking.

regards,
=1

Re: UpdateHandler batch size / search solr-user

Posted by Erick Erickson <er...@gmail.com>.
Typically, people set their autocommit (hard) settings in solrconfig.xml and forget about it. I usually use a time-based trigger and don’t use documents as a trigger.

If you were waiting until the end of your batch run (all 46M docs) to issue a commit, that’s an anti-pattern. Until you do a hard commit, all the incoming documents are held in the transaction log, see: https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/. Setting the autocommit settings to, say, 15 seconds should give a flatter response time.
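[The time-based trigger Erick describes looks like this in
solrconfig.xml; 15 seconds is the example value from his reply, and
openSearcher=false keeps the hard commit from opening a new searcher.]

```xml
<!-- solrconfig.xml: hard-commit every 15 seconds instead of per batch -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>           <!-- milliseconds -->
    <openSearcher>false</openSearcher> <!-- hard commit without a new searcher -->
  </autoCommit>
</updateHandler>
```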

For searching the Solr mailing list archives, see: http://lucene.apache.org/solr/community.html#mailing-lists-irc

Best,
Erick

> On Feb 18, 2019, at 10:03 AM, David '-1' Schmid <gd...@gmail.com> wrote:
> 
> Hello!
> 
> Another question I could not find an answer to:
> is there a best-practice / recommendation for pushing several million
> documents into a new index?
> 
> I'm currently splitting my documents into batches of 10,000 json-line
> payloads and sending each batch to the update request handler, with
> commit set to 'true' (yes, for each of the batches).
> I commit every batch because that keeps 'QTime' stable around ~2100;
> without committing every batch, QTime degrades ten-fold by the time
> I've sent somewhere around 1,000,000 documents.
> It climbs steadily from there, so after sending all 46M documents I
> end up with QTime values around 40,000 if I don't commit every batch
> immediately.
> 
> Since I cannot find anything in my mails, I wanted to search the
> solr-user archives but, as far as I can tell: there is no such thing.
> Maybe I can't see it or just glossed over it, but is there no searchable
> index of solr-user? Any hints?
> 
> regards,
> -1