Posted to user@flink.apache.org by Joe Olson <jo...@outlook.com> on 2017/05/16 12:28:58 UTC

UnknownKvStateKeyGroupLocation

When running Flink in high availability mode, I've been seeing a high number of UnknownKvStateKeyGroupLocation errors from queryable state calls.


If I put a simple getKvState call in a loop that executes once per second, I sometimes get the expected results and sometimes get UnknownKvStateKeyGroupLocation thrown. This is not associated with a query timeout (i.e., a network issue).
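
For illustration, a polling loop like the one described above could be sketched as follows. This is only a sketch under assumptions: the Callable stands in for the real getKvState call, and UnknownLocationException is a hypothetical placeholder for UnknownKvStateKeyGroupLocation, so nothing here uses the actual Flink client API. It tallies outcomes by exception type, which makes it easy to separate this error from timeouts.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

public class QueryLoopSketch {

    /** Hypothetical stand-in for Flink's UnknownKvStateKeyGroupLocation. */
    static class UnknownLocationException extends Exception {}

    /**
     * Runs the query `attempts` times, pausing `pauseMillis` after each call,
     * and counts outcomes under "OK" or the exception's simple class name.
     */
    static Map<String, Integer> poll(Callable<byte[]> query, int attempts, long pauseMillis)
            throws InterruptedException {
        Map<String, Integer> tally = new TreeMap<>();
        for (int i = 0; i < attempts; i++) {
            String outcome;
            try {
                query.call();
                outcome = "OK";
            } catch (Exception e) {
                outcome = e.getClass().getSimpleName();
            }
            tally.merge(outcome, 1, Integer::sum);
            Thread.sleep(pauseMillis);
        }
        return tally;
    }

    public static void main(String[] args) throws Exception {
        // Fake query that fails on every third call, to mimic intermittent failures.
        int[] calls = {0};
        Callable<byte[]> fakeQuery = () -> {
            if (++calls[0] % 3 == 0) {
                throw new UnknownLocationException();
            }
            return new byte[] {1};
        };
        // In a real test the pause would be 1000 ms and the query a real state lookup.
        System.out.println(poll(fakeQuery, 9, 10));
    }
}
```

Logging the queried key alongside each outcome would additionally show whether the failures follow particular keys or are spread evenly over time.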


From looking at the Flink source code, the problem stems from lookup.getKvStateServerAddress returning null. I know all the task managers are registering state with the job manager, because I see the "Key value state registered for job xx under name yy" messages in the job server log.
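
To make the null lookup concrete, here is a simplified model of a per-key-group location table. It is a sketch, not Flink's code: the Location class, the keyGroupFor helper, the host name "tm-1" and the port are all hypothetical, and the key-to-key-group mapping is deliberately simpler than Flink's real assignment. Under these assumptions, a query succeeds or fails depending on whether the key group of the queried key has a registered server address, which is one way a lookup could succeed for some calls and return null for others.

```java
import java.net.InetSocketAddress;

public class KeyGroupLookupSketch {

    /** Hypothetical stand-in for the location table the client consults. */
    static class Location {
        // One slot per key group; null means "no server registered yet".
        private final InetSocketAddress[] addresses;

        Location(int numKeyGroups) {
            this.addresses = new InetSocketAddress[numKeyGroups];
        }

        /** Registers one server for the inclusive key group range [first, last]. */
        void register(int first, int last, InetSocketAddress server) {
            for (int kg = first; kg <= last; kg++) {
                addresses[kg] = server;
            }
        }

        /** Returns the server for a key group, or null if none is registered. */
        InetSocketAddress getServerAddress(int keyGroup) {
            return addresses[keyGroup];
        }
    }

    /** Simplified key-to-key-group mapping (an assumption; Flink hashes differently). */
    static int keyGroupFor(Object key, int numKeyGroups) {
        return Math.floorMod(key.hashCode(), numKeyGroups);
    }

    public static void main(String[] args) {
        int numKeyGroups = 4;
        Location location = new Location(numKeyGroups);
        // Only key groups 0 and 1 are registered; 2 and 3 are still missing.
        location.register(0, 1, InetSocketAddress.createUnresolved("tm-1", 9069));

        for (int key = 0; key < 4; key++) {
            int kg = keyGroupFor(key, numKeyGroups);
            InetSocketAddress server = location.getServerAddress(kg);
            String result = (server == null)
                    ? "unknown location"
                    : server.getHostString() + ":" + server.getPort();
            System.out.println("key " + key + " -> key group " + kg + " -> " + result);
        }
    }
}
```

If the real table behaves similarly, the "Key value state registered" messages would need to cover every key group range of the job, not just appear once per task manager.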


Is there anything else I should be looking for? I am querying state on several jobs, and the problem seems isolated to just one. I've gone over the differences between the jobs very closely, but they are all built from the same template.


What would cause lookup.getKvStateServerAddress to succeed sometimes and fail at other times?



Re: UnknownKvStateKeyGroupLocation

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Joe! This sounds odd... are there any failures (JobManager or
TaskManager) or leader elections being reported? You should see such
events in the JobManager/TaskManager logs.
