Posted to solr-user@lucene.apache.org by Bernd Fehling <be...@uni-bielefeld.de> on 2018/01/10 15:33:29 UTC

is ConcurrentUpdateSolrClient.Builder thread safe?

Hi list,

after some strange search results I was trying to locate the problem
and it turned out that it starts with bulk loading with SolrJ
and ConcurrentUpdateSolrClient.Builder with several threads.

I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
with respect to the docs sent to the indexer?

It feels like documents with the same doc_id are not always indexed
in the order they are sent to the indexer. Which version ends up in the index seems random.

Example:
file LR00010.xml
<doc>
  <str name="id">my_uniq_id_1234</str>
  <date name="date">2017-03-28T23:21:40Z</date>
  ...

file LR01000.xml
<doc>
  <str name="id">my_uniq_id_1234</str>
  <date name="date">2017-04-26T00:42:10Z</date>
  ...


The files are in the same subdir.
They are loaded, processed, and sent to the indexer in ascending natural order.
LR00010.xml is handled way before LR01000.xml.

But the result is that sometimes the older doc of LR00010.xml is in the index
and the newer doc from LR01000.xml is marked as deleted, and sometimes the
newer doc of LR01000.xml is in the index and the older doc from LR00010.xml
is marked as deleted.

Has anyone seen this?

I could try ConcurrentUpdateSolrClient.Builder with only one thread and
see if the problem still exists.

Regards
Bernd



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/10/2018 8:33 AM, Bernd Fehling wrote:
> after some strange search results I was trying to locate the problem
> and it turned out that it starts with bulk loading with SolrJ
> and ConcurrentUpdateSolrClient.Builder with several threads.
>
> I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
> with respect to the docs sent to the indexer?

Why would you need the Builder to be threadsafe?

The actual client object (ConcurrentUpdateSolrClient) should be 
perfectly threadsafe, but the Builder probably isn't, and I can't think 
of any reason to try and use it with multiple threads.  In a 
well-constructed program, you will use the Builder exactly once, in an 
initialization thread, and then have all the indexing threads use the 
client object that the Builder creates.

I hope you're aware that the concurrent client swallows all indexing 
errors and does not tell your program about them.
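
To see why that matters, here is a minimal stdlib-only sketch of the
pattern. It is not SolrJ code: the class FireAndForgetSender, its send()
stub and the rule that an empty document fails are all made up for the
illustration. add() hands the work to a background thread and returns at
once, so a failure never reaches the caller unless the program records it.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a client that sends updates in the background.
class FireAndForgetSender {
    // Failures seen by the background thread; the caller has to look here.
    final Queue<Throwable> errors = new ConcurrentLinkedQueue<>();
    private final ExecutorService runner = Executors.newSingleThreadExecutor();

    // Returns immediately; the caller never gets an exception from send().
    void add(String doc) {
        runner.execute(() -> {
            try {
                send(doc);
            } catch (RuntimeException e) {
                errors.add(e); // record it, because add() has already returned
            }
        });
    }

    // Made-up failure rule: an empty document is rejected.
    private void send(String doc) {
        if (doc.isEmpty()) {
            throw new IllegalArgumentException("empty document");
        }
    }

    // Waits for queued work to finish so that the error list is complete.
    void close() {
        runner.shutdown();
        try {
            if (!runner.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("runner did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Sends three documents, one of them bad, and reports how many failed.
    static int failedCount() {
        FireAndForgetSender sender = new FireAndForgetSender();
        sender.add("doc-1");
        sender.add(""); // fails in the background, add() still returns normally
        sender.add("doc-2");
        sender.close();
        return sender.errors.size();
    }

    public static void main(String[] args) {
        System.out.println("failed updates: " + failedCount());
    }
}
```

In SolrJ the comparable hook is, as far as I know, the overridable
handleError(Throwable) method of ConcurrentUpdateSolrClient; check the
Javadoc of your version before relying on it.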

Thanks,
Shawn


Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/11/2018 12:05 AM, Bernd Fehling wrote:
> This will never pass a Jepsen test and I call it _NOT_ thread safe.
> 
> I haven't looked into the code yet, to see if the queue is FIFO, otherwise
> this would be stupid.

I was not thinking about order of operations when I said that the client 
was threadsafe.  I meant that one client object can be used 
simultaneously by multiple threads without anything getting 
cross-contaminated within the program.

If you are absolutely reliant on operations happening in a precise 
order, such that a document could get indexed in one request and then 
replaced (or updated) with a later request, you should not use the 
concurrent client.  You could define it with a single thread, but if you 
do that, then the concurrent client doesn't work any faster than the 
standard client.

When a concurrent client is built, it creates the specified number of 
processing threads.  When updates are sent, they are added to an 
internal queue.  The processing threads will handle requests from the 
queue as long as the queue is not empty.
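
As a rough picture of that design, here is a toy model built only on
java.util.concurrent. It is not the SolrJ implementation: ToyQueueClient,
the STOP sentinel and the string "updates" are assumptions made for the
sketch. It shows one client object shared by several producer threads
while a fixed set of runner threads empties the internal queue.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of a queue-backed client; not the SolrJ implementation.
class ToyQueueClient {
    // Unique sentinel object, compared by identity, that tells a runner to stop.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue;
    private final Thread[] runners;
    // Everything the runners have "sent", in completion order.
    final List<String> sent = new CopyOnWriteArrayList<>();

    ToyQueueClient(int queueSize, int threadCount) {
        queue = new LinkedBlockingQueue<>(queueSize);
        runners = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            runners[i] = new Thread(this::drain, "runner-" + i);
            runners[i].start();
        }
    }

    // Each runner takes updates off the shared queue until it sees STOP.
    private void drain() {
        try {
            for (String u = queue.take(); u != STOP; u = queue.take()) {
                sent.add(u);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Safe to call from many threads; blocks while the queue is full.
    void add(String update) throws InterruptedException {
        queue.put(update);
    }

    // Queues one STOP per runner behind all pending updates, then waits.
    void close() throws InterruptedException {
        for (int i = 0; i < runners.length; i++) {
            queue.put(STOP);
        }
        for (Thread r : runners) {
            r.join();
        }
    }

    // Several producers share one client; returns how many updates were sent.
    static int sharedUse(int producers, int perProducer, int threadCount) {
        try {
            ToyQueueClient client = new ToyQueueClient(100, threadCount);
            Thread[] workers = new Thread[producers];
            for (int p = 0; p < producers; p++) {
                final int id = p;
                workers[p] = new Thread(() -> {
                    try {
                        for (int i = 0; i < perProducer; i++) {
                            client.add("p" + id + "-doc" + i);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                workers[p].start();
            }
            for (Thread w : workers) {
                w.join();
            }
            client.close();
            return client.sent.size();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("updates sent: " + sharedUse(4, 250, 3));
    }
}
```

Nothing is lost or duplicated when the client is shared, which is the
sense of "threadsafe" meant above. The order of the entries in sent is a
separate question and is not fixed once there is more than one runner.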

Those threads process the requests they have taken from the queue
simultaneously.  Although I'm sure that each thread pulls requests off
the queue in a FIFO manner, I have a scenario for you to consider.  This
scenario is not just an intellectual exercise; it is the kind of thing
that can easily happen in the wild.

Let's say that when document X is initially indexed, it is at position 
997 in a batch of 1000 documents.  Then two update requests later, the 
new version of document X is at position 2 in another batch of 1000 
documents.

If there are at least three threads in the concurrent client, those 
update requests may begin execution at nearly the same time.  In that 
situation, Solr is likely to index document X in the request added later 
before it indexes document X in the request added earlier, resulting in 
outdated information ending up in the index.
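
The scenario can be replayed in a small discrete-time model. This is a
sketch under strong assumptions, not a measurement: every document costs
one tick, all workers start at tick 0, each free worker takes the next
batch in FIFO order, and the last write to an id wins. The class
ReorderModel and the version labels "old" and "new" are invented for the
example; no real threads and no Solr are involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Discrete-time model of the scenario: no real threads, no Solr.
class ReorderModel {
    // One document write and the tick at which the model applies it.
    static final class Write {
        final String id;
        final String version;
        final long tick;
        final int batch;

        Write(String id, String version, long tick, int batch) {
            this.id = id;
            this.version = version;
            this.tick = tick;
            this.batch = batch;
        }
    }

    // Builds a batch of {id, version} pairs. Document "X" is placed at the
    // 1-based position xPos; pass 0 to build a batch without X.
    static List<String[]> batch(String tag, int size, int xPos, String xVersion) {
        List<String[]> docs = new ArrayList<>();
        for (int pos = 1; pos <= size; pos++) {
            if (pos == xPos) {
                docs.add(new String[] {"X", xVersion});
            } else {
                docs.add(new String[] {tag + "-" + pos, "only"});
            }
        }
        return docs;
    }

    // Workers take batches in FIFO order as they become free and index one
    // document per tick. The last write to an id wins.
    static Map<String, String> run(int threads, List<List<String[]>> batches) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < threads; i++) {
            freeAt.add(0L);
        }
        List<Write> writes = new ArrayList<>();
        for (int b = 0; b < batches.size(); b++) {
            long start = freeAt.poll(); // earliest free worker takes batch b
            List<String[]> docs = batches.get(b);
            for (int i = 0; i < docs.size(); i++) {
                String[] doc = docs.get(i);
                writes.add(new Write(doc[0], doc[1], start + i + 1, b));
            }
            freeAt.add(start + docs.size());
        }
        // Apply writes in time order; on a tie the earlier batch goes first.
        writes.sort(Comparator.comparingLong((Write w) -> w.tick)
                .thenComparingInt(w -> w.batch));
        Map<String, String> index = new HashMap<>();
        for (Write w : writes) {
            index.put(w.id, w.version);
        }
        return index;
    }

    // The scenario from the text: X at position 997 of update 1 and at
    // position 2 of update 3, each update holding 1000 documents.
    static String versionOfX(int threads) {
        List<List<String[]>> batches = new ArrayList<>();
        batches.add(batch("u1", 1000, 997, "old"));
        batches.add(batch("u2", 1000, 0, null));
        batches.add(batch("u3", 1000, 2, "new"));
        return run(threads, batches).get("X");
    }

    public static void main(String[] args) {
        for (int t = 1; t <= 3; t++) {
            System.out.println(t + " thread(s): X = " + versionOfX(t));
        }
    }
}
```

Under these assumptions the model leaves the old version of X in the
index with three workers and the new version with one or two.  Real
requests do not cost the same per document, so a lower thread count in
a real system is not a guarantee of correct order.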

The same thing can happen even with a non-concurrent client when it is 
used in a multi-threaded manner.

Preserving order of operations cannot be guaranteed if there are 
multiple threads.  It could be possible to add some VERY sophisticated 
synchronization capabilities, but writing code to do that would be very 
difficult, and it wouldn't be trivial to use either.
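
A narrower goal is sometimes easier to reach on the application side:
keep order per document id only, by routing every update for a given id
to the same single-threaded lane.  This is not something proposed in
this thread, just a minimal stdlib-only sketch.  The class KeyedLanes
and the in-memory index map are hypothetical stand-ins; in a real loader
each lane would have to send a synchronous request and wait for it
before sending the next one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: per-id ordering with several single-threaded lanes.
class KeyedLanes {
    private final List<ExecutorService> lanes = new ArrayList<>();
    // Stand-in for the index; a real lane would send a synchronous request.
    final Map<String, String> index = new ConcurrentHashMap<>();

    KeyedLanes(int laneCount) {
        for (int i = 0; i < laneCount; i++) {
            lanes.add(Executors.newSingleThreadExecutor());
        }
    }

    // The same id always maps to the same lane, so its updates run in
    // submission order; different ids may run in parallel.
    void submit(String id, String version) {
        int lane = Math.floorMod(id.hashCode(), lanes.size());
        lanes.get(lane).execute(() -> index.put(id, version));
    }

    // Stops accepting work and waits until every lane has drained.
    void finish() {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                if (!lane.awaitTermination(30, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("lane did not drain in time");
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Replays two versions of one id with unrelated filler in between.
    static String latestAfterReplay(int laneCount) {
        KeyedLanes router = new KeyedLanes(laneCount);
        router.submit("my_uniq_id_1234", "2017-03-28T23:21:40Z");
        for (int i = 0; i < 1000; i++) {
            router.submit("filler-" + i, "v1"); // invented filler documents
        }
        router.submit("my_uniq_id_1234", "2017-04-26T00:42:10Z");
        router.finish();
        return router.index.get("my_uniq_id_1234");
    }

    public static void main(String[] args) {
        System.out.println(latestAfterReplay(4));
    }
}
```

The sketch relies on a single thread calling submit(), so that the
submission order is well defined, and it says nothing about order
between different ids.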

Thanks,
Shawn

Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/11/2018 1:38 AM, Bernd Fehling wrote:
> To sum it up, there is no way to do bulk loading in Solr, because the
> order of operations is not preserved.
> Solr can only offer bulk loading if you really have unique data, right?

Bulk loading implies that every document is inserted exactly once and 
that there are no other operations, like updates or deletes.  If there 
are other operations, then in my mind, it's not bulk loading.
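
Under that definition, an input in which the same id occurs several
times would first have to be reduced to one document per id.  Here is a
minimal sketch of such a pre-pass, assuming that arrival order decides
which copy is newest and that the input contains no deletes.  The class
LatestPerId and the Doc holder are hypothetical, and the "other_id"
entry is invented filler.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-pass: keep only the last-arriving document per id.
class LatestPerId {
    static final class Doc {
        final String id;
        final String date;
        final String sourceFile;

        Doc(String id, String date, String sourceFile) {
            this.id = id;
            this.date = date;
            this.sourceFile = sourceFile;
        }
    }

    // A later arrival replaces an earlier one with the same id. The map
    // keeps each id at the position where it was first seen.
    static List<Doc> reduce(List<Doc> inArrivalOrder) {
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc d : inArrivalOrder) {
            latest.put(d.id, d);
        }
        return new ArrayList<>(latest.values());
    }

    // The two documents from the original question plus one invented filler.
    static List<Doc> example() {
        List<Doc> input = new ArrayList<>();
        input.add(new Doc("my_uniq_id_1234", "2017-03-28T23:21:40Z", "LR00010.xml"));
        input.add(new Doc("other_id", "2017-04-01T00:00:00Z", "LR00500.xml"));
        input.add(new Doc("my_uniq_id_1234", "2017-04-26T00:42:10Z", "LR01000.xml"));
        return reduce(input);
    }

    public static void main(String[] args) {
        for (Doc d : example()) {
            System.out.println(d.id + " from " + d.sourceFile + " (" + d.date + ")");
        }
    }
}
```

After such a pass every id occurs once, so the order in which the
remaining documents reach the index no longer matters.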

> By the way, the queue used is java.util.concurrent.BlockingQueue.
> Changing that to ArrayBlockingQueue (to force FIFO) would not really help, I guess.

Correct, the issue is that updates are processed simultaneously.  Making
absolutely sure that removal is FIFO wouldn't make any difference,
although I think that the current implementation is probably just as
FIFO as the array implementation.
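
Both stock java.util.concurrent implementations hand elements out in
insertion order, which is easy to check.  Which implementation SolrJ
actually puts behind the BlockingQueue interface is not stated in this
thread, so using LinkedBlockingQueue below is an assumption; the class
FifoCheck is made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Compares the removal order of two standard BlockingQueue implementations.
class FifoCheck {
    // Fills the queue with 0..n-1 and returns the order in which poll()
    // hands the elements back. The queue must have capacity for n elements.
    static List<Integer> drainOrder(BlockingQueue<Integer> queue, int n) {
        for (int i = 0; i < n; i++) {
            queue.add(i);
        }
        List<Integer> out = new ArrayList<>();
        Integer next;
        while ((next = queue.poll()) != null) {
            out.add(next);
        }
        return out;
    }

    static List<Integer> linkedOrder(int n) {
        return drainOrder(new LinkedBlockingQueue<Integer>(n), n);
    }

    static List<Integer> arrayOrder(int n) {
        return drainOrder(new ArrayBlockingQueue<Integer>(n), n);
    }

    public static void main(String[] args) {
        System.out.println("linked: " + linkedOrder(5));
        System.out.println("array:  " + arrayOrder(5));
    }
}
```

FIFO removal only fixes the order in which requests start.  It does not
fix the order in which they finish once several threads are working.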

> You say "If there are at least three threads in the concurrent client...", but
> two threads would work?

The thread count of three was specific to the exact scenario I 
described, where update 1 contains the initial indexing and update 3 
(two updates later) contains the new version.  If it were update 1 and 
update 7, then there would need to be a thread count of seven to see the 
problem.

> How are other users doing bulk loading with archived backups and preserving the order?
> Can't believe that I'm the only one on earth having this need.

If the backup is a log of changes rather than an info dump, then the 
only reliable way you could guarantee correct operation is to do the 
indexing with one thread.  But then indexing will be slower, possibly a 
LOT slower.

Thanks,
Shawn

Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
To sum it up, there is no way to do bulk loading in Solr, because the
order of operations is not preserved.
Solr can only offer bulk loading if you really have unique data, right?

By the way, the queue used is java.util.concurrent.BlockingQueue.
Changing that to ArrayBlockingQueue (to force FIFO) would not really help, I guess,
because the bottleneck is not reading the content from the filesystem but
analyzing and indexing.

Any other options for bulk loading?

You say "If there are at least three threads in the concurrent client...", but
two threads would work?

How are other users doing bulk loading with archived backups and preserving the order?
Can't believe that I'm the only one on earth having this need.

Regards
Bernd



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Posted by Bernd Fehling <be...@uni-bielefeld.de>.
Hi Shawn,

From your answer I see that you are obviously not using ConcurrentUpdateSolrClient.
I didn't say that I use ConcurrentUpdateSolrClient in multiple threads.
What I am saying is that ConcurrentUpdateSolrClient.Builder has a method,
"withThreadCount", that sets how many threads empty the client's queue.
This is useful for bulk loading huge data volumes or replaying a backup into the index.

As far as I can see on the indexer with infostream, there are _no_ indexing errors.

I have now tried several times with one thread and everything was fine.
The newer docs replaced the older docs (which were marked deleted) in the index.
With a "threadCount" of more than 1 for emptying the queue there are problems with
ConcurrentUpdateSolrClient.

This will never pass a Jepsen test and I call it _NOT_ thread safe.

I haven't looked into the code yet, to see if the queue is FIFO, otherwise
this would be stupid.

Regards
Bernd

