Posted to dev@ignite.apache.org by Denis Magda <dm...@apache.org> on 2020/08/18 17:43:04 UTC

Re: Operation block on Cluster recovery/rebalance.

John, thanks for filing. I back Ilya's proposal and have changed the
ticket type from improvement to bug.

-
Denis


On Tue, Aug 18, 2020 at 10:06 AM Ilya Kasnacheev <il...@gmail.com>
wrote:

> Hello!
>
> My opinion is that it should be filed as a bug and then fixed.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Sat, Aug 15, 2020 at 01:22, Denis Magda <dm...@apache.org> wrote:
>
>> @Evgenii Zhuravlev <ez...@gridgain.com>, @Ilya Kasnacheev
>> <il...@gmail.com>, any thoughts on this?
>>
>> As a dirty workaround, you can update your cache references on client
>> reconnect events. While the cluster is not yet activated, calling
>> ignite.cache(cacheName) throws an exception instead of blocking.
>> Does this work for you?
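
A minimal sketch of the reference-refresh idea above, using only the JDK so that it does not depend on any particular Ignite version. `RefreshableRef` and its methods are hypothetical names. The `Supplier` stands in for a call such as `ignite.cache(cacheName)`, and wiring `refresh()` to a client reconnect listener is an assumption that is not shown here.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical holder whose reference is re-acquired after a reconnect. */
final class RefreshableRef<T> {
    // Stand-in for a lookup such as () -> ignite.cache(cacheName) (assumption).
    private final Supplier<T> lookup;
    private final AtomicReference<T> current = new AtomicReference<>();

    RefreshableRef(Supplier<T> lookup) {
        this.lookup = lookup;
    }

    /** Re-runs the lookup; on failure, clears the reference and returns false. */
    boolean refresh() {
        try {
            current.set(lookup.get());
            return true;
        } catch (RuntimeException e) {
            // The lookup failed (for example, the cluster is still inactive).
            current.set(null);
            return false;
        }
    }

    /** Returns the current reference, or throws if none is usable. */
    T get() {
        T ref = current.get();
        if (ref == null) {
            throw new IllegalStateException("No usable reference; cluster may be inactive");
        }
        return ref;
    }
}
```

With this shape, a failed lookup turns into a fast failure in `get()` rather than an operation issued against a stale reference.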
>>
>> -
>> Denis
>>
>>
>> On Fri, Aug 14, 2020 at 3:12 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Is there any workaround? I can't have an HTTP server block on all
>>> requests.
>>>
>>> 1- I need to figure out why I lose a server node every few weeks; rebooting
>>> the nodes then causes the inactive state until they are all back...
>>>
>>> 2- Implement some kind of logic on the client side so that the HTTP part
>>> does not block...
>>>
>>> Can an IgniteCache instance be notified of disconnect events, so that I can
>>> tell the repository class to set a flag and skip the operation?
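
One way to sketch the flag idea from the question above, using only the JDK. `ClusterGate` and its method names are hypothetical. Calling `markUnavailable()` and `markAvailable()` from local listeners for Ignite's `EVT_CLIENT_NODE_DISCONNECTED` and `EVT_CLIENT_NODE_RECONNECTED` events is an assumption and is not shown. Note that, according to earlier messages in this thread, the disconnect exception only appears when all server nodes are down, so a flag driven by those events may not cover the window in which the cluster is up but not yet activated.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical gate that fails fast while the cluster is flagged unavailable. */
final class ClusterGate {
    private final AtomicBoolean available = new AtomicBoolean(true);

    /** Call from a disconnect (or "cluster inactive") notification. */
    void markUnavailable() {
        available.set(false);
    }

    /** Call once the client has reconnected and the cluster is usable again. */
    void markAvailable() {
        available.set(true);
    }

    boolean isAvailable() {
        return available.get();
    }

    /** Runs the operation, or throws immediately if the flag is down. */
    <T> T run(Supplier<T> operation) {
        if (!available.get()) {
            throw new IllegalStateException("Cluster unavailable, operation skipped");
        }
        return operation.get();
    }
}
```

A repository class could route every cache call through `run(...)` and map the `IllegalStateException` to an HTTP error response instead of waiting.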
>>>
>>>
>>> On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <dm...@apache.org>
>>> wrote:
>>>
>>>> My guess is that it's standard behavior for all operations (SQL,
>>>> key-value, compute, etc.). But I'll let the maintainers of those modules
>>>> clarify.
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Fri, Aug 14, 2020 at 1:44 PM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Denis, just to understand: is it all operations or just the query?
>>>>>
>>>>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <dm...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> John,
>>>>>>
>>>>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>>>>> agree with you that the platform should support an option that makes operations
>>>>>> fail if the cluster is deactivated. Could you propose the change by
>>>>>> starting a discussion on the dev list? You can point to this user list
>>>>>> discussion for reference. Let me know if you need help with this.
>>>>>>
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> No, I reuse the instance. The cache instance is created once at
>>>>>>> startup of the application and I pass it to my "repository" class:
>>>>>>>
>>>>>>> public abstract class AbstractIgniteRepository<K,V> implements CacheRepository<K, V> {
>>>>>>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>>>>
>>>>>>>     private Vertx vertx;
>>>>>>>     private IgniteCache<K, V> cache;
>>>>>>>
>>>>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>>>>         this.vertx = vertx;
>>>>>>>         this.cache = cache;
>>>>>>>     }
>>>>>>>
>>>>>>> ...
>>>>>>>
>>>>>>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs, final Object... args) {
>>>>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>>>>
>>>>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>>>>             promise.tryFail(new TimeoutException("Cache operation did not complete within: " + timeoutMs + " Ms.")); // THIS FIRES IF THE BLOCK DOESN'T COMPLETE IN TIME.
>>>>>>>         });
>>>>>>>
>>>>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>>>>
>>>>>>>
>>>>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) { // <--- BLOCKS HERE.
>>>>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>>>>
>>>>>>>                 while(iterator.hasNext()) {
>>>>>>>                     List<?> currentRow = iterator.next();
>>>>>>>                     JsonArray row = new JsonArray();
>>>>>>>
>>>>>>>                     currentRow.forEach(o -> row.add(o));
>>>>>>>
>>>>>>>                     rows.add(row);
>>>>>>>                 }
>>>>>>>
>>>>>>>                 code.complete(rows);
>>>>>>>             } catch(Exception ex) {
>>>>>>>                 code.fail(ex);
>>>>>>>             }
>>>>>>>         }, result -> {
>>>>>>>             if(result.succeeded()) {
>>>>>>>                 promise.tryComplete(result.result());
>>>>>>>             } else {
>>>>>>>                 promise.tryFail(result.cause());
>>>>>>>             }
>>>>>>>         });
>>>>>>>
>>>>>>>         return promise.future();
>>>>>>>     }
>>>>>>>
>>>>>>>     public <T> T cache() {
>>>>>>>         return (T) cache;
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <dm...@apache.org> wrote:
>>>>>>>
>>>>>>>> I've created a simple test and I always get the exception below
>>>>>>>> when trying to get a reference to an IgniteCache instance while
>>>>>>>> the cluster is not activated:
>>>>>>>>
>>>>>>>> *Exception in thread "main" class
>>>>>>>> org.apache.ignite.IgniteException: Can not perform the operation because
>>>>>>>> the cluster is inactive. Note, that the cluster is considered inactive by
>>>>>>>> default if Ignite Persistent Store is used to let all the nodes join the
>>>>>>>> cluster. To activate the cluster call Ignite.active(true)*
>>>>>>>>
>>>>>>>> Are you trying to get a new IgniteCache reference whenever the
>>>>>>>> client reconnects successfully to the cluster? My gut feeling is that, currently,
>>>>>>>> Ignite verifies the activation status and generates the exception above
>>>>>>>> whenever you get a reference to an IgniteCache or IgniteCompute. But
>>>>>>>> once you have those references and try to run some operations, those operations get
>>>>>>>> stuck if the cluster is not activated.
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> The cache.query() starts to block when ignite server nodes are
>>>>>>>>> being restarted and there's no baseline topology yet. The server nodes do
>>>>>>>>> not block. It's the client that blocks.
>>>>>>>>>
>>>>>>>>> The dump files are from the server nodes. The screenshot is from the
>>>>>>>>> client app, taken with the YourKit profiler; on the client side, the
>>>>>>>>> threads are marked red in YourKit.
>>>>>>>>>
>>>>>>>>> The app is simple: on each HTTP request it runs a cache SQL query on
>>>>>>>>> Ignite and, if that succeeds, does a put back to Ignite.
>>>>>>>>>
>>>>>>>>> The Client disconnected exception only happens when all server
>>>>>>>>> nodes in the cluster are down. The blockage only happens when the cluster
>>>>>>>>> is trying to establish baseline topology.
>>>>>>>>>
>>>>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <dm...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> John,
>>>>>>>>>>
>>>>>>>>>> I don't see any signs of an application-caused deadlock in the
>>>>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>>>>
>>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>>>>>>> request.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What's the IP address of the server node the client app uses to
>>>>>>>>>> join the cluster? If that's not the address of the 1st node, which has
>>>>>>>>>> already been restarted, then the client can't join the cluster and it's
>>>>>>>>>> expected that the operation fails with the ClientDisconnectedException.
>>>>>>>>>>
>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>> just block.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>>>>> third node joins the cluster and the cluster gets activated automatically?
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <
>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok Denis here they are...
>>>>>>>>>>>
>>>>>>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>>>>>>>>>> deadlocks on the client app.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <ja...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>>>>> query that blocks.
>>>>>>>>>>>>
>>>>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>>>>> does a put to the cache.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <dm...@apache.org>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> John,
>>>>>>>>>>>>>
>>>>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is
>>>>>>>>>>>>> there any chance that the operation you run on step 8 accesses several keys
>>>>>>>>>>>>> in one order while the other operations work with the same keys but in a
>>>>>>>>>>>>> different order? Deadlocks are possible when you use the Ignite Transaction
>>>>>>>>>>>>> API or simply execute bulk operations such as cache.getAll(..) or
>>>>>>>>>>>>> cache.putAll(..).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please take and attach thread dumps from all the cluster nodes
>>>>>>>>>>>>> for analysis if we need to dig deeper.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>>>>> the other k/v operations are OK.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <
>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block
>>>>>>>>>>>>>>> indefinitely...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>>>>> 2- Start client application client = true with
>>>>>>>>>>>>>>> Ignition.start()
>>>>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>>>>>>>> with ClientDisconnectedException but application still able to complete its
>>>>>>>>>>>>>>> request.
>>>>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all
>>>>>>>>>>>>>>> operations just block.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Basically, the client application is an HTTP server that, on each
>>>>>>>>>>>>>>> HTTP request, performs a cache operation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <
>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The only time I get an exception is if the cluster is
>>>>>>>>>>>>>>>> completely off; then I get ClientDisconnectedException...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <dm...@apache.org>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put)
>>>>>>>>>>>>>>>>> and compute calls fail with an exception if the cluster is deactivated. Do
>>>>>>>>>>>>>>>>> those fail on your end?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is there a way to timeout and at least have the
>>>>>>>>>>>>>>>>>>> application continue and respond with an appropriate message?
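
A minimal, JDK-only sketch of bounding how long a caller waits on a blocking call, as asked above. `BoundedCall` is a hypothetical helper, not an Ignite API, and `blockingCall` stands in for something like `cache.query(query)`. The timeout only releases the caller: the worker thread is interrupted on a best-effort basis and may stay blocked until the underlying call returns.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: bounds how long a caller waits for a blocking call. */
final class BoundedCall {
    // Daemon threads, so a stuck call cannot keep the JVM alive on shutdown.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "bounded-call");
        t.setDaemon(true);
        return t;
    });

    private BoundedCall() {
    }

    /** Returns the call's result, or throws TimeoutException after timeoutMs. */
    static <T> T call(Callable<T> blockingCall, long timeoutMs) throws Exception {
        Future<T> future = POOL.submit(blockingCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort: interrupts the worker; the call may ignore the interrupt.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Unwrap so callers see the original failure of the blocking call.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

Because the pool is unbounded, calls that never return accumulate worker threads; a bounded pool, or a fail-fast check in front of the call, would limit that.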
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi, running 2.7.0.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the
>>>>>>>>>>>>>>>>>>>> cluster, or the cluster is not yet activated with a baseline topology,
>>>>>>>>>>>>>>>>>>>> operations that are supposed to return an IgniteFuture (i.e. putAsync,
>>>>>>>>>>>>>>>>>>>> getAsync, etc.) seem to block forever. They just block until the
>>>>>>>>>>>>>>>>>>>> cluster resolves its state.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>