Posted to user@flink.apache.org by Dan Hill <qu...@gmail.com> on 2021/08/20 15:20:09 UTC

Re: savepoint failure

I think this was from a breaking change we made to the key calculation in
our code between version updates.  So this error makes sense.

What's the best way to get more info for debugging?  How can I configure
the logs to output more key information?
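
One idea on my side (a sketch; the class and names are illustrative): wrap our
KeySelector so it logs each key and the key group Flink computes for it,
reusing Flink's own assignment logic:

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoggingKeySelector<IN, KEY> implements KeySelector<IN, KEY> {
      private static final Logger LOG =
          LoggerFactory.getLogger(LoggingKeySelector.class);
      private final KeySelector<IN, KEY> delegate;
      private final int maxParallelism;

      public LoggingKeySelector(KeySelector<IN, KEY> delegate, int maxParallelism) {
        this.delegate = delegate;
        this.maxParallelism = maxParallelism;
      }

      @Override
      public KEY getKey(IN value) throws Exception {
        KEY key = delegate.getKey(value);
        // Same computation Flink uses to route a key to a key group.
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);
        LOG.debug("key={} hashCode={} keyGroup={}", key, key.hashCode(), keyGroup);
        return key;
      }
    }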

On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com> wrote:

> Thanks, Till!
>
> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Dan,
>>
>> From the logs I couldn't find anything suspicious. The job runs until you
>> try to draw a savepoint. When doing this Flink fails with "Key group 0 is
>> not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o having access
>> to your job or a minimal example that allows to reproduce this problem, it
>> will be super hard to figure out what's going wrong. My best guess would
>> still be that we have a non-deterministic key somewhere.
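>>
>> For illustration (types and getters made up), the dangerous pattern is a
>> key whose hashCode is not stable across JVM runs:
>>
>>     // Bad: no hashCode override means identity hashing, which differs
>>     // per JVM, so a restored or rescaled job can compute a key group
>>     // outside the range that owns the state.
>>     stream.keyBy(event -> event.getPayload());
>>
>>     // Good: Strings and longs hash deterministically.
>>     stream.keyBy(event -> event.getUserId());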
>>
>> Cheers,
>> Till
>>
>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com> wrote:
>>
>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>
>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> Here's the overview flow chart.
>>>>
>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>
>>>>
>>>>
>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> *-others*
>>>>>
>>>>> *Code*
>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>> way.  The streams are joined on a Tuple2<String, Long> key, and the
>>>>> function has some internal map state using String keys.
>>>>>
>>>>> *Flink config*
>>>>> The most relevant parts of the flink config are the following:
>>>>> state.backend.async: true
>>>>> state.backend.incremental: true
>>>>> state.backend.local-recovery: false
>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>> state.backend.rocksdb.options-factory:
>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>
>>>>> *Workflow*
>>>>> What do you mean by workflow?
>>>>>
>>>>> *Logs*
>>>>> Here's the job manager log.  The task manager log did not look useful.
>>>>>
>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Dan,
>>>>>>
>>>>>> Can you provide us with more information about your job (maybe even
>>>>>> the job code or a minimally working example), the Flink configuration, the
>>>>>> exact workflow you are doing and the corresponding logs and error messages?
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>> wouldn't create bad state.
>>>>>>>
>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I checked my code.  Our keys for streams and map state only use
>>>>>>>> either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2.
>>>>>>>>
>>>>>>>> I don't know why my current case is breaking.  Our job partitions
>>>>>>>> and parallelism settings have not changed.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>
>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>
>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink should not
>>>>>>>>>> read the fields of messages and call hashCode on them.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Till,
>>>>>>>>>>>
>>>>>>>>>>> I found my problem. It was indeed related to a mutable hashcode.
>>>>>>>>>>>
>>>>>>>>>>> I was using a protobuf message in the key selector function, and
>>>>>>>>>>> one of the protobuf fields was an enum. I checked the hashcode
>>>>>>>>>>> implementation of the generated message; it uses the int value field
>>>>>>>>>>> of the protobuf message, so I assumed it was ok and immutable.
>>>>>>>>>>>
>>>>>>>>>>> I replaced the key selector function to use Tuple[Long, Int]
>>>>>>>>>>> (since my protobuf message has only these two fields where the int
>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>
>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message fields
>>>>>>>>>>> and uses the hashcode of the fields directly since the generated protobuf
>>>>>>>>>>> enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
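>>>>>>>>>>>
>>>>>>>>>>> Roughly (a sketch; the event type and getters are illustrative):
>>>>>>>>>>> getNumber() is the stable wire value of a protobuf enum, unlike
>>>>>>>>>>> Enum.hashCode, which is identity-based.
>>>>>>>>>>>
>>>>>>>>>>>     stream.keyBy(new KeySelector<MyEvent, Tuple2<Long, Integer>>() {
>>>>>>>>>>>       @Override
>>>>>>>>>>>       public Tuple2<Long, Integer> getKey(MyEvent e) {
>>>>>>>>>>>         return Tuple2.of(e.getId(), e.getType().getNumber());
>>>>>>>>>>>       }
>>>>>>>>>>>     });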
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Rado
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>
>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details. Could you
>>>>>>>>>>>> share with us the complete logs of the problematic run? Also the job you
>>>>>>>>>>>> are running and the types of the state you are storing in RocksDB and use
>>>>>>>>>>>> as events in your job are very important. In the linked SO question, the
>>>>>>>>>>>> problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am running a Flink job that performs data enrichment. My job
>>>>>>>>>>>>> has 7 kafka consumers that receive messages for dml statements performed
>>>>>>>>>>>>> for 7 db tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is described
>>>>>>>>>>>>>    here
>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>    .
>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>
>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group 0 is
>>>>>>>>>>>>> not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>> at
>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>> at
>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>> at
>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>> at
>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found this
>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot per
>>>>>>>>>>>>> task manager, and this configuration seems to work. This leads me
>>>>>>>>>>>>> to suspect it might be some concurrency issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>
>>>>>>>>>>>>

Re: savepoint failure

Posted by Till Rohrmann <tr...@apache.org>.
Great to hear that you figured the problem out Dan and sorry for not
getting back to you sooner. We are in the midst of the release testing and
things are always getting a bit hectic during this time.

Cheers,
Till

On Tue, Aug 31, 2021 at 5:27 AM Dan Hill <qu...@gmail.com> wrote:

> We figured it out.  We had a KeyedCoProcessOperator that had some sanity
> checking using a stored watermark value.  We added it when debugging a
> different issue.
>
> This was tricky to debug because:
> 1. The issue is not hit when the partition=1.  I had to bump the number of
> partitions up to reproduce locally.
> 2. I had to wait until after a checkpoint was hit for the error to occur.
>
> Rough code:
>
>     private ValueState<Long> currentWatermark;
>
>     @Override
>     public void processElement1(StreamRecord<FlatEvent> in) throws Exception {
>       ...
>       if (currentWatermark.value() == null || rowTime > currentWatermark.value()) {
>         in.setTimestamp(rowTime);
>       } else {
>         LOGGER.error("...");
>       }
>       super.processElement1(in);
>     }
>
>     // This check was the culprit: watermarks are not keyed, so there is no
>     // well-defined current key when processWatermark fires, which makes the
>     // keyed-state update below suspect.
>     @Override
>     public void processWatermark(Watermark mark) throws Exception {
>         currentWatermark.update(mark.getTimestamp());
>         super.processWatermark(mark);
>     }
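>
> A safer variant (a sketch, assuming the watermark is only needed for this
> transient sanity check): track it in a plain field, since a watermark is
> per-subtask rather than per-key.
>
>     private long lastSeenWatermark = Long.MIN_VALUE;
>
>     @Override
>     public void processWatermark(Watermark mark) throws Exception {
>         // A plain field avoids touching keyed state outside a key context.
>         lastSeenWatermark = mark.getTimestamp();
>         super.processWatermark(mark);
>     }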
>
> On Mon, Aug 23, 2021 at 2:28 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Dan,
>>
>> from the snippets I suspect the following is happening: A TaskManager
>> dies that triggers the failover. The failover will stop the
>> CheckpointCoordinator that will abort all pending checkpoints. Since the
>> `CheckpointFailureManager` seems to be configured to treat these kinds of
>> failures as job failures, it tries to restart the job. See [1] for more
>> information. Given that the job is currently being restarted, this
>> shouldn't do anything. Hence, the job should eventually recover and
>> continue running smoothly, hopefully. For more details, I would need to see
>> the full logs.
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html#setTolerableCheckpointFailureNumber-int-
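>>
>> A minimal sketch of that setting (the value here is illustrative):
>>
>>     StreamExecutionEnvironment env =
>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>     // Tolerate up to 3 checkpoint failures before failing the job.
>>     env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);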
>>
>> Cheers,
>> Till
>>
>> On Sun, Aug 22, 2021 at 6:24 AM Dan Hill <qu...@gmail.com> wrote:
>>
>>> I audited my code.  All of the directly coded keys are either Strings,
>>> Longs or Tuple2<String, Long>.
>>>
>>> Does the FileSink use any of the incoming records as keys?  We use
>>> ProtobufDatumWriter to write Protos to Avro files.  Could this have an
>>> issue in it?
>>>
>>> Does any part of Flink's setup try to create a key from a record
>>> automatically?
>>>
>>> private static <T extends GeneratedMessageV3> AvroWriterFactory<T>
>>> getAvroWriterFactory(Class<T> avroProtoClass) {
>>>     return new AvroWriterFactory<T>((AvroBuilder<T>) out -> {
>>>         Schema schema = ProtobufData.get().getSchema(avroProtoClass);
>>>         ProtobufDatumWriter<T> pbWriter = new ProtobufDatumWriter<>(schema);
>>>         DataFileWriter<T> dataFileWriter = new DataFileWriter<>(pbWriter);
>>>         dataFileWriter.setCodec(CodecFactory.snappyCodec());
>>>         dataFileWriter.create(schema, out);
>>>         return dataFileWriter;
>>>     });
>>> }
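>>>
>>> For context, the factory is wired into the sink roughly like this (the
>>> path and proto class are illustrative; AvroWriterFactory is a
>>> BulkWriter.Factory, so it plugs into FileSink.forBulkFormat):
>>>
>>>     FileSink<MyProto> sink = FileSink
>>>         .forBulkFormat(new Path("s3a://my-bucket/flat-user-action"),
>>>                        getAvroWriterFactory(MyProto.class))
>>>         .build();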
>>>
>>>
>>> On Sat, Aug 21, 2021 at 8:52 PM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> Do people write checkpoint/savepoint recovery tests?  E.g. persist a
>>>> checkpoint from a run and verify that it can be recovered?
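>>>>
>>>> Something like Flink's MiniCluster test harness could do it (a rough
>>>> sketch; ids and paths are illustrative):
>>>>
>>>>     // Used as a JUnit rule in a real test.
>>>>     MiniClusterWithClientResource cluster = new MiniClusterWithClientResource(
>>>>         new MiniClusterResourceConfiguration.Builder()
>>>>             .setNumberTaskManagers(1)
>>>>             .setNumberSlotsPerTaskManager(2)
>>>>             .build());
>>>>     // 1. submit the job, 2. trigger a savepoint, ...
>>>>     String path = cluster.getClusterClient()
>>>>         .triggerSavepoint(jobId, "file:///tmp/savepoints").get();
>>>>     // ... 3. resubmit with SavepointRestoreSettings.forPath(path)
>>>>     // and assert on the output.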
>>>>
>>>> Also, I don't always hit this error when task managers restart.  I've
>>>> had plenty of task managers die while running and savepoints usually work.
>>>>
>>>>
>>>> On Sat, Aug 21, 2021 at 8:49 PM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> Darn, nevermind.  I was backfilling a job from the start of our time
>>>>> period (not from any checkpoint/savepoint) and, when I tried to savepoint,
>>>>> the job failed with "Checkpoint Coordinator is suspending.".  Then
>>>>> when I tried to savepoint again, I hit the key group error.
>>>>>
>>>>> 2021-08-22 03:35:42,305 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>       [] - Join UserViewRequestInsertionImpression -> (Sink Writer:
>>>>> S3 flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
>>>>> Writer: S3 flat-user-impression fixed -> Sink Committer: S3
>>>>> flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
>>>>> switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
>>>>> flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
>>>>> (dataPort=42979).
>>>>>
>>>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>>>> Connection unexpectedly closed by remote task manager '
>>>>> 10.12.100.55/10.12.100.55:43687'. This might indicate that the remote
>>>>> task manager was lost.
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
>>>>>
>>>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>       [] - Join UserViewRequestInsertionImpressionAction -> (Sink
>>>>> Writer: S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink
>>>>> Writer: S3 flat-user-action fixed -> Sink Committer: S3 flat-user-action
>>>>> fixed, Map -> Sink: Kafka flat-user-action-json) (15/56)
>>>>> (3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.
>>>>>
>>>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>       [] - Discarding the results produced by task execution
>>>>> 3077ef9c918af545a1165d0715b5e6c6.
>>>>>
>>>>> 2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>>>>>                 [] - Trying to recover from a global failure.
>>>>>
>>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
>>>>> Coordinator is suspending.
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown
>>>>> Source) ~[?:?]
>>>>>
>>>>>         at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>> ~[?:1.8.0_292]
>>>>>
>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>> ~[?:1.8.0_292]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.actor.Actor.aroundReceive(Actor.scala:517)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>         at
>>>>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Till!
>>>>>>
>>>>>> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dan, good to hear that you found the problem. What I would
>>>>>>> recommend is to set the log level in the log4j.properties file to DEBUG or
>>>>>>> TRACE (but this is quite noisy). If then the log does not contain the
>>>>>>> required information then it is likely that we don't log it and, hence,
>>>>>>> would have to be added.
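>>>>>>>
>>>>>>> For example, in the log4j.properties Flink ships (log4j2 format in
>>>>>>> recent Flink versions; the narrower logger name is illustrative):
>>>>>>>
>>>>>>>     rootLogger.level = DEBUG
>>>>>>>     # or, to avoid most of the noise, target the state code only:
>>>>>>>     logger.statedbg.name = org.apache.flink.runtime.state
>>>>>>>     logger.statedbg.level = TRACE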
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think this was from a breaking change we made to the key
>>>>>>>> calculation in our code between version updates.  So this error makes sense.
>>>>>>>>
>>>>>>>> What's the best way to get more info for debugging?  How can I
>>>>>>>> configure the logs to output more key information?
>>>>>>>>
>>>>>>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Till!
>>>>>>>>>
>>>>>>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <
>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> From the logs I couldn't find anything suspicious. The job runs
>>>>>>>>>> until you try to draw a savepoint. When doing this Flink fails with "Key
>>>>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o
>>>>>>>>>> having access to your job or a minimal example that allows to reproduce
>>>>>>>>>> this problem, it will be super hard to figure out what's going wrong. My
>>>>>>>>>> best guess would still be that we have a non-deterministic key somewhere.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Here's the overview flow chart.
>>>>>>>>>>>>
>>>>>>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> *-others*
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Code*
>>>>>>>>>>>>> I'm not sure of a good, secure way of sharing the java code.
>>>>>>>>>>>>> It depends on multiple internal repos.  The savepoint appears to be failing
>>>>>>>>>>>>> in a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>>>>>>> way.  The streams are joined on a Tuple2<String, Long> key, and the
>>>>>>>>>>>>> function has some internal map state using String keys.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Flink config*
>>>>>>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>>>>>>> state.backend.async: true
>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>> state.backend.local-recovery: false
>>>>>>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Workflow*
>>>>>>>>>>>>> What do you mean by workflow?
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Logs*
>>>>>>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>>>>>>> useful.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you provide us with more information about your job
>>>>>>>>>>>>>> (maybe even the job code or a minimally working example), the Flink
>>>>>>>>>>>>>> configuration, the exact workflow you are doing and the corresponding logs
>>>>>>>>>>>>>> and error messages?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <
>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could this be caused by mixing of configuration settings
>>>>>>>>>>>>>>> when running?  Running a job with one parallelism, stop/savepointing and
>>>>>>>>>>>>>>> then recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <
>>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I checked my code.  Our keys for streams and map state only
>>>>>>>>>>>>>>>> use either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and
>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <
>>>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying
>>>>>>>>>>>>>>>>> to savepoint.  We also use protobufs.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink
>>>>>>>>>>>>>>>>>> should not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I was using a protobuf message in the key selector
>>>>>>>>>>>>>>>>>>> function, and one of the protobuf fields was an enum. I checked the
>>>>>>>>>>>>>>>>>>> hashcode implementation of the generated message; it uses the int value
>>>>>>>>>>>>>>>>>>> field of the protobuf message, so I assumed it was ok and immutable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long,
>>>>>>>>>>>>>>>>>>> Int] (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf
>>>>>>>>>>>>>>>>>>> message fields and uses the hashcode of the fields directly since the
>>>>>>>>>>>>>>>>>>> generated protobuf enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details.
>>>>>>>>>>>>>>>>>>>> Could you share with us the complete logs of the problematic run? Also the
>>>>>>>>>>>>>>>>>>>> job you are running and the types of the state you are storing in RocksDB
>>>>>>>>>>>>>>>>>>>> and use as events in your job are very important. In the linked SO
>>>>>>>>>>>>>>>>>>>> question, the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I am running a Flink job that performs data
>>>>>>>>>>>>>>>>>>>>> enrichment. My job has 7 kafka consumers that receive messages for dml
>>>>>>>>>>>>>>>>>>>>> statements performed for 7 db tables.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state
>>>>>>>>>>>>>>>>>>>>> is bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key
>>>>>>>>>>>>>>>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task
>>>>>>>>>>>>>>>>>>>>> slot per task manager, and this configuration seems to work. This leads
>>>>>>>>>>>>>>>>>>>>> me to suspect it might be some concurrency issue.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would like to understand what is causing the
>>>>>>>>>>>>>>>>>>>>> savepoint failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Dan Hill <qu...@gmail.com>.
We figured it out.  We had a KeyedCoProcessOperator that had some sanity
checking using a stored watermark value.  We added it when debugging a
different issue.

This was tricky to debug because:
1. The issue is not hit when the partition=1.  I had to bump the number of
partitions up to reproduce locally.
2. I had to wait until after a checkpoint was hit for the error to occur.

Rough code:

    private ValueState<Long> currentWatermark;


    @Override

    public void processElement1(StreamRecord<FlatEvent> in) throws
Exception {

      ...

      if (currentWatermark.value() == null || rowTime>
currentWatermark.value()) {

        in.setTimestamp(rowTime);

      } else {

        LOGGER.error("...")

      }

      super.processElement1(in);

    }


    @Override

    public void processWatermark(Watermark mark) throws Exception {

        currentWatermark.update(mark.getTimestamp());

        super.processWatermark(mark);

    }

On Mon, Aug 23, 2021 at 2:28 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Dan,
>
> from the snippets I suspect the following is happening: A TaskManager dies
> that triggers the failover. The failover will stop the
> CheckpointCoordinator that will abort all pending checkpoints. Since the
> `CheckpointFailureManager` seems to be configured to treat these kinds of
> failures as job failures, it tries to restart the job. See [1] for more
> information. Given that the job is currently being restarted, this
> shouldn't do anything. Hence, the job should eventually recover and
> continue running smoothly, hopefully. For more details, I would need to see
> the full logs.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html#setTolerableCheckpointFailureNumber-int-
>
> Cheers,
> Till
>
> On Sun, Aug 22, 2021 at 6:24 AM Dan Hill <qu...@gmail.com> wrote:
>
>> I audited my code.  All of the directly coded keys are either Strings,
>> Longs or Tuple2<String, Long>.
>>
>> Does the FileSink use any of the incoming records as keys?  We use
>> ProtobufDatumWriter to write Protos to Avro files.  Could this have an
>> issue in it?
>>
>> Does any part of Flink's setup try to create a key from a record
>> automatically?
>>
>> private static <T extends GeneratedMessageV3> AvroWriterFactory<T>
>> getAvroWriterFactory(Class<T> avroProtoClass) {
>>
>>     return new AvroWriterFactory<T>((AvroBuilder<T>) out -> {
>>
>>         Schema schema = ProtobufData.get().getSchema(avroProtoClass);
>>
>>         ProtobufDatumWriter<T> pbWriter = new
>> ProtobufDatumWriter<>(schema);
>>
>>         DataFileWriter<T> dataFileWriter = new
>> DataFileWriter<>(pbWriter);
>>
>>         dataFileWriter.setCodec(CodecFactory.snappyCodec());
>>
>>         dataFileWriter.create(schema, out);
>>
>>         return dataFileWriter;
>>
>>     });
>>
>> }
>>
>>
>> On Sat, Aug 21, 2021 at 8:52 PM Dan Hill <qu...@gmail.com> wrote:
>>
>>> Do people write checkpoint/savepoint recovery tests?  E.g. persist a
>>> checkpoint from a run and verify that it can be recovered?
>>>
>>> Also, I don't always hit this error when task managers restart.  I've
>>> had plenty of task managers die while running and savepoints usually work.
>>>
>>>
>>> On Sat, Aug 21, 2021 at 8:49 PM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> Darn, nevermind.  I was backfilling a job from the start of our time
>>>> period (not from any checkpoint/savepoint) and, when I tried to savepoint,
>>>> the job failed with "Checkpoint Coordinator is suspending.".  Then
>>>> when I tried to savepoint again, I hit the key group error.
>>>>
>>>> 2021-08-22 03:35:42,305 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>       [] - Join UserViewRequestInsertionImpression -> (Sink Writer: S3
>>>> flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
>>>> Writer: S3 flat-user-impression fixed -> Sink Committer: S3
>>>> flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
>>>> switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
>>>> flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
>>>> (dataPort=42979).
>>>>
>>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>>> Connection unexpectedly closed by remote task manager '
>>>> 10.12.100.55/10.12.100.55:43687'. This might indicate that the remote
>>>> task manager was lost.
>>>>
>>>>         at
>>>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
>>>>
>>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>       [] - Join UserViewRequestInsertionImpressionAction -> (Sink
>>>> Writer: S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink
>>>> Writer: S3 flat-user-action fixed -> Sink Committer: S3 flat-user-action
>>>> fixed, Map -> Sink: Kafka flat-user-action-json) (15/56)
>>>> (3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.
>>>>
>>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>       [] - Discarding the results produced by task execution
>>>> 3077ef9c918af545a1165d0715b5e6c6.
>>>>
>>>> 2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>>>>                 [] - Trying to recover from a global failure.
>>>>
>>>> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
>>>> Coordinator is suspending.
>>>>
>>>>         at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown
>>>> Source) ~[?:?]
>>>>
>>>>         at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> ~[?:1.8.0_292]
>>>>
>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>> ~[?:1.8.0_292]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.actor.Actor.aroundReceive(Actor.scala:517)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>         at
>>>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>>
>>>>
>>>> On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> Thanks, Till!
>>>>>
>>>>> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Dan, good to hear that you found the problem. What I would
>>>>>> recommend is to set the log level in the log4j.properties file to DEBUG or
>>>>>> TRACE (though this is quite noisy). If the log then still does not
>>>>>> contain the required information, it is likely that we don't log it
>>>>>> and, hence, it would have to be added.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think this was from a breaking change we made to the key
>>>>>>> calculation in our code between version updates.  So this error makes sense.
>>>>>>>
>>>>>>> What's the best way to get more info for debugging?  How can I
>>>>>>> configure the logs to output more key information?
>>>>>>>
>>>>>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Till!
>>>>>>>>
>>>>>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <
>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> From the logs I couldn't find anything suspicious. The job runs
>>>>>>>>> until you try to draw a savepoint. When doing this Flink fails with "Key
>>>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o
>>>>>>>>> having access to your job or a minimal example that allows to reproduce
>>>>>>>>> this problem, it will be super hard to figure out what's going wrong. My
>>>>>>>>> best guess would still be that we have a non deterministic key somewhere.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Here's the overview flow chart.
>>>>>>>>>>>
>>>>>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> *-others*
>>>>>>>>>>>>
>>>>>>>>>>>> *Code*
>>>>>>>>>>>> I'm not sure of a good, secure way of sharing the java code.
>>>>>>>>>>>> It depends on multiple internal repos.  The savepoint appears to be failing
>>>>>>>>>>>> in a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>>>>>> way.  The streams are joined based on a Tuple2<String, Long> and has some
>>>>>>>>>>>> internal map state using String keys.
>>>>>>>>>>>>
>>>>>>>>>>>> *Flink config*
>>>>>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>>>>>> state.backend.async: true
>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>> state.backend.local-recovery: false
>>>>>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>>>>>
>>>>>>>>>>>> *Workflow*
>>>>>>>>>>>> What do you mean by workflow?
>>>>>>>>>>>>
>>>>>>>>>>>> *Logs*
>>>>>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>>>>>> useful.
>>>>>>>>>>>>
>>>>>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Can you provide us with more information about your job (maybe
>>>>>>>>>>>>> even the job code or a minimally working example), the Flink configuration,
>>>>>>>>>>>>> the exact workflow you are doing and the corresponding logs and error
>>>>>>>>>>>>> messages?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Till
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <
>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <
>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I checked my code.  Our keys for streams and map state only
>>>>>>>>>>>>>>> use either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and
>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <
>>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying
>>>>>>>>>>>>>>>> to savepoint.  We also use protobufs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink
>>>>>>>>>>>>>>>>> should not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I was using a protobuf message in the key selector
>>>>>>>>>>>>>>>>>> function, and one of the protobuf fields was an enum. I checked the
>>>>>>>>>>>>>>>>>> implementation of the hashcode of the generated message: it uses the
>>>>>>>>>>>>>>>>>> int value field of the protobuf message, so I assumed it is ok and
>>>>>>>>>>>>>>>>>> immutable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long,
>>>>>>>>>>>>>>>>>> Int] (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details.
>>>>>>>>>>>>>>>>>>> Could you share with us the complete logs of the problematic run? Also the
>>>>>>>>>>>>>>>>>>> job you are running and the types of the state you are storing in RocksDB
>>>>>>>>>>>>>>>>>>> and use as events in your job are very important. In the linked SO
>>>>>>>>>>>>>>>>>>> question, the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment.
>>>>>>>>>>>>>>>>>>>> My job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>>>>>>> performed on 7 db tables.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key
>>>>>>>>>>>>>>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot
>>>>>>>>>>>>>>>>>>>> per task manager, and this configuration seems to work. This leads me
>>>>>>>>>>>>>>>>>>>> to suspect some kind of concurrency issue.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I would like to understand what is causing the
>>>>>>>>>>>>>>>>>>>> savepoint failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Till Rohrmann <tr...@apache.org>.
Hi Dan,

from the snippets I suspect the following is happening: A TaskManager dies,
which triggers the failover. The failover stops the CheckpointCoordinator,
which aborts all pending checkpoints. Since the `CheckpointFailureManager`
seems to be configured to treat these kinds of failures as job failures, it
tries to restart the job. See [1] for more information. Given that the job
is already being restarted, this shouldn't do anything further. Hence, the
job should eventually recover and, hopefully, continue running smoothly. For
more details, I would need to see the full logs.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/environment/CheckpointConfig.html#setTolerableCheckpointFailureNumber-int-
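
As a minimal sketch, that threshold from [1] is set on the CheckpointConfig;
the checkpoint interval and the value 3 below are just example numbers, not a
recommendation:

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L); // checkpoint every 60 seconds
// Tolerate up to 3 consecutive checkpoint failures before failing the job.
// With the default of 0, a single failed checkpoint fails the job.
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);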

Cheers,
Till

On Sun, Aug 22, 2021 at 6:24 AM Dan Hill <qu...@gmail.com> wrote:

> I audited my code.  All of the directly coded keys are either Strings,
> Longs or Tuple2<String, Long>.
>
> Does the FileSink use any of the incoming records as keys?  We use
> ProtobufDatumWriter to write Protos to Avro files.  Could this have an
> issue in it?
>
> Does any part of Flink's setup try to create a key from a record
> automatically?
>
> private static <T extends GeneratedMessageV3> AvroWriterFactory<T>
> getAvroWriterFactory(Class<T> avroProtoClass) {
>
>     return new AvroWriterFactory<T>((AvroBuilder<T>) out -> {
>
>         Schema schema = ProtobufData.get().getSchema(avroProtoClass);
>
>         ProtobufDatumWriter<T> pbWriter = new
> ProtobufDatumWriter<>(schema);
>
>         DataFileWriter<T> dataFileWriter = new DataFileWriter<>(pbWriter);
>
>         dataFileWriter.setCodec(CodecFactory.snappyCodec());
>
>         dataFileWriter.create(schema, out);
>
>         return dataFileWriter;
>
>     });
>
> }
>
>
> On Sat, Aug 21, 2021 at 8:52 PM Dan Hill <qu...@gmail.com> wrote:
>
>> Do people write checkpoint/savepoint recovery tests?  E.g. persist a
>> checkpoint from a run and verify that it can be recovered?
>>
>> Also, I don't always hit this error when task managers restart.  I've had
>> plenty of task managers die while running and savepoints usually work.
>>
>>
>> On Sat, Aug 21, 2021 at 8:49 PM Dan Hill <qu...@gmail.com> wrote:
>>
>>> Darn, nevermind.  I was backfilling a job from the start of our time
>>> period (not from any checkpoint/savepoint) and, when I tried to savepoint,
>>> the job failed with "Checkpoint Coordinator is suspending."  Then when
>>> I tried to savepoint again, I hit the key group error.
>>>
>>> 2021-08-22 03:35:42,305 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>       [] - Join UserViewRequestInsertionImpression -> (Sink Writer: S3
>>> flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
>>> Writer: S3 flat-user-impression fixed -> Sink Committer: S3
>>> flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
>>> switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
>>> flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
>>> (dataPort=42979).
>>>
>>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>>> Connection unexpectedly closed by remote task manager '
>>> 10.12.100.55/10.12.100.55:43687'. This might indicate that the remote
>>> task manager was lost.
>>>
>>>         at
>>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
>>>
>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>       [] - Join UserViewRequestInsertionImpressionAction -> (Sink
>>> Writer: S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink
>>> Writer: S3 flat-user-action fixed -> Sink Committer: S3 flat-user-action
>>> fixed, Map -> Sink: Kafka flat-user-action-json) (15/56)
>>> (3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.
>>>
>>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>       [] - Discarding the results produced by task execution
>>> 3077ef9c918af545a1165d0715b5e6c6.
>>>
>>> 2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>>>                 [] - Trying to recover from a global failure.
>>>
>>> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
>>> Coordinator is suspending.
>>>
>>>         at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
>>> ~[?:?]
>>>
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> ~[?:1.8.0_292]
>>>
>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>> ~[?:1.8.0_292]
>>>
>>>         at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.actor.Actor.aroundReceive(Actor.scala:517)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>         at
>>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>>
>>>
>>> On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> Thanks, Till!
>>>>
>>>> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Dan, good to hear that you found the problem. What I would
>>>>> recommend is to set the log level in the log4j.properties file to DEBUG or
>>>>> TRACE (though this is quite noisy). If the log then still does not
>>>>> contain the required information, it is likely that we don't log it
>>>>> and, hence, it would have to be added.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think this was from a breaking change we made to the key
>>>>>> calculation in our code between version updates.  So this error makes sense.
>>>>>>
>>>>>> What's the best way to get more info for debugging?  How can I
>>>>>> configure the logs to output more key information?
>>>>>>
>>>>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Till!
>>>>>>>
>>>>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> From the logs I couldn't find anything suspicious. The job runs
>>>>>>>> until you try to draw a savepoint. When doing this Flink fails with "Key
>>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o
>>>>>>>> having access to your job or a minimal example that allows to reproduce
>>>>>>>> this problem, it will be super hard to figure out what's going wrong. My
>>>>>>>> best guess would still be that we have a non deterministic key somewhere.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>>>>
>>>>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Here's the overview flow chart.
>>>>>>>>>>
>>>>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> *-others*
>>>>>>>>>>>
>>>>>>>>>>> *Code*
>>>>>>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>>>>> way.  The streams are joined based on a Tuple2<String, Long> and has some
>>>>>>>>>>> internal map state using String keys.
>>>>>>>>>>>
>>>>>>>>>>> *Flink config*
>>>>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>>>>> state.backend.async: true
>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>> state.backend.local-recovery: false
>>>>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>>>>
>>>>>>>>>>> *Workflow*
>>>>>>>>>>> What do you mean by workflow?
>>>>>>>>>>>
>>>>>>>>>>> *Logs*
>>>>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>>>>> useful.
>>>>>>>>>>>
>>>>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> Can you provide us with more information about your job (maybe
>>>>>>>>>>>> even the job code or a minimally working example), the Flink configuration,
>>>>>>>>>>>> the exact workflow you are doing and the corresponding logs and error
>>>>>>>>>>>> messages?
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <
>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I checked my code.  Our keys for streams and map state only
>>>>>>>>>>>>>> use either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and
>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <
>>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying
>>>>>>>>>>>>>>> to savepoint.  We also use protobufs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink
>>>>>>>>>>>>>>>> should not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I was using a protobuf message in the key selector
>>>>>>>>>>>>>>>>> function, and one of the protobuf fields was an enum. I checked the
>>>>>>>>>>>>>>>>> implementation of the hashcode of the generated message: it uses the
>>>>>>>>>>>>>>>>> int value field of the protobuf message, so I assumed it is ok and
>>>>>>>>>>>>>>>>> immutable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long,
>>>>>>>>>>>>>>>>> Int] (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details.
>>>>>>>>>>>>>>>>>> Could you share with us the complete logs of the problematic run? Also the
>>>>>>>>>>>>>>>>>> job you are running and the types of the state you are storing in RocksDB
>>>>>>>>>>>>>>>>>> and use as events in your job are very important. In the linked SO
>>>>>>>>>>>>>>>>>> question, the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment.
>>>>>>>>>>>>>>>>>>> My job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>>>>>> performed on 7 db tables.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group
>>>>>>>>>>>>>>>>>>> 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot
>>>>>>>>>>>>>>>>>>> per task manager, and this configuration seems to work. This leads me
>>>>>>>>>>>>>>>>>>> to suspect some kind of concurrency issue.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Dan Hill <qu...@gmail.com>.
I audited my code.  All of the directly coded keys are either Strings,
Longs or Tuple2<String, Long>.
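
(For concreteness, a deterministic key of that shape would come from a
selector roughly like the sketch below; "UserEvent" and its getters are
made-up names for illustration, not our actual code.)

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// Given some DataStream<UserEvent> named "events" (hypothetical type):
KeyedStream<UserEvent, Tuple2<String, Long>> keyed = events.keyBy(
    new KeySelector<UserEvent, Tuple2<String, Long>>() {
        @Override
        public Tuple2<String, Long> getKey(UserEvent e) {
            // Deterministic: built only from immutable String/long fields,
            // so the key group assignment stays stable across snapshots.
            return Tuple2.of(e.getUserId(), e.getEventTimeMillis());
        }
    });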

Does the FileSink use any of the incoming records as keys?  We use
ProtobufDatumWriter to write Protos to Avro files.  Could this have an
issue in it?

Does any part of Flink's setup try to create a key from a record
automatically?

// Required imports (from protobuf-java, avro, avro-protobuf, and flink-avro):
import com.google.protobuf.GeneratedMessageV3;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.protobuf.ProtobufData;
import org.apache.avro.protobuf.ProtobufDatumWriter;
import org.apache.flink.formats.avro.AvroBuilder;
import org.apache.flink.formats.avro.AvroWriterFactory;

// Builds a writer factory that serializes protobuf messages into
// Snappy-compressed Avro files, using a schema derived from the proto class.
private static <T extends GeneratedMessageV3> AvroWriterFactory<T>
        getAvroWriterFactory(Class<T> avroProtoClass) {
    return new AvroWriterFactory<T>((AvroBuilder<T>) out -> {
        // Derive the Avro schema from the generated protobuf class.
        Schema schema = ProtobufData.get().getSchema(avroProtoClass);
        ProtobufDatumWriter<T> pbWriter = new ProtobufDatumWriter<>(schema);
        DataFileWriter<T> dataFileWriter = new DataFileWriter<>(pbWriter);
        dataFileWriter.setCodec(CodecFactory.snappyCodec());
        dataFileWriter.create(schema, out); // writes the Avro header to `out`
        return dataFileWriter;
    });
}
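
For reference, here is a sketch of how a factory like this can plug into the
new FileSink; "MyProto", the path, and "stream" below are placeholders rather
than our real job:

import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;

// AvroWriterFactory implements BulkWriter.Factory, so it can be used
// as the bulk format of a FileSink.
FileSink<MyProto> sink = FileSink
    .forBulkFormat(new Path("s3a://bucket/avro-out"),
        getAvroWriterFactory(MyProto.class))
    .build();
stream.sinkTo(sink);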


On Sat, Aug 21, 2021 at 8:52 PM Dan Hill <qu...@gmail.com> wrote:

> Do people write checkpoint/savepoint recovery tests?  E.g. persist a
> checkpoint from a run and verify that it can be recovered?
>
> Also, I don't always hit this error when task managers restart.  I've had
> plenty of task managers die while running and savepoints usually work.
>
>
> On Sat, Aug 21, 2021 at 8:49 PM Dan Hill <qu...@gmail.com> wrote:
>
>> Darn, nevermind.  I was backfilling a job from the start of our time
>> period (not from any checkpoint/savepoint) and, when I tried to savepoint,
>> the job failed with "Checkpoint Coordinator is suspending."  Then when
>> I tried to savepoint again, I hit the key group error.
>>
>> 2021-08-22 03:35:42,305 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>       [] - Join UserViewRequestInsertionImpression -> (Sink Writer: S3
>> flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
>> Writer: S3 flat-user-impression fixed -> Sink Committer: S3
>> flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
>> switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
>> flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
>> (dataPort=42979).
>>
>> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
>> Connection unexpectedly closed by remote task manager '
>> 10.12.100.55/10.12.100.55:43687'. This might indicate that the remote
>> task manager was lost.
>>
>>         at
>> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
>>
>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>       [] - Join UserViewRequestInsertionImpressionAction -> (Sink
>> Writer: S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink
>> Writer: S3 flat-user-action fixed -> Sink Committer: S3 flat-user-action
>> fixed, Map -> Sink: Kafka flat-user-action-json) (15/56)
>> (3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.
>>
>> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>       [] - Discarding the results produced by task execution
>> 3077ef9c918af545a1165d0715b5e6c6.
>>
>> 2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>>                 [] - Trying to recover from a global failure.
>>
>> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
>> Coordinator is suspending.
>>
>>         at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
>> ~[?:?]
>>
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[?:1.8.0_292]
>>
>>         at java.lang.reflect.Method.invoke(Method.java:498)
>> ~[?:1.8.0_292]
>>
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
>> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.actor.Actor.aroundReceive(Actor.scala:517)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.actor.Actor.aroundReceive$(Actor.scala:515)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>         at
>> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> [flink-dist_2.12-1.12.3.jar:1.12.3]
>>
>>
>> On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com> wrote:
>>
>>> Thanks, Till!
>>>
>>> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Dan, good to hear that you found the problem. What I would recommend
>>>> is to set the log level in the log4j.properties file to DEBUG or TRACE (but
>>>> this is quite noisy). If then the log does not contain the required
>>>> information then it is likely that we don't log it and, hence, would have
>>>> to be added.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> I think this was from a breaking change we made to the key calculation
>>>>> in our code between version updates.  So this error makes sense.
>>>>>
>>>>> What's the best way to get more info for debugging?  How can I
>>>>> configure the logs to output more key information?
>>>>>
>>>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks, Till!
>>>>>>
>>>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> From the logs I couldn't find anything suspicious. The job runs
>>>>>>> until you try to draw a savepoint. When doing this Flink fails with "Key
>>>>>>> group 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o
>>>>>>> having access to your job or a minimal example that allows to reproduce
>>>>>>> this problem, it will be super hard to figure out what's going wrong. My
>>>>>>> best guess would still be that we have a non-deterministic key somewhere.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>>>
>>>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Here's the overview flow chart.
>>>>>>>>>
>>>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> *-others*
>>>>>>>>>>
>>>>>>>>>> *Code*
>>>>>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>>>> way.  The streams are joined based on a Tuple2<String, Long>, and the
>>>>>>>>>> function has some internal map state using String keys.
>>>>>>>>>>
>>>>>>>>>> *Flink config*
>>>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>>>> state.backend.async: true
>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>> state.backend.local-recovery: false
>>>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>>>
>>>>>>>>>> *Workflow*
>>>>>>>>>> What do you mean by workflow?
>>>>>>>>>>
>>>>>>>>>> *Logs*
>>>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>>>> useful.
>>>>>>>>>>
>>>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>
>>>>>>>>>>> Can you provide us with more information about your job (maybe
>>>>>>>>>>> even the job code or a minimally working example), the Flink configuration,
>>>>>>>>>>> the exact workflow you are doing and the corresponding logs and error
>>>>>>>>>>> messages?
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <
>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I checked my code.  Our keys for streams and map state only
>>>>>>>>>>>>> use either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <
>>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink
>>>>>>>>>>>>>>> should not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I was using a protobuf message in the key selector function
>>>>>>>>>>>>>>>> and one of the protobuf fields was an enum. I checked the implementation of
>>>>>>>>>>>>>>>> the hashcode of the generated message and it is using the int value field
>>>>>>>>>>>>>>>> of the protobuf message so I assumed that it is ok and it's immutable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long,
>>>>>>>>>>>>>>>> Int] (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details.
>>>>>>>>>>>>>>>>> Could you share with us the complete logs of the problematic run? Also the
>>>>>>>>>>>>>>>>> job you are running and the types of the state you are storing in RocksDB
>>>>>>>>>>>>>>>>> and use as events in your job are very important. In the linked SO
>>>>>>>>>>>>>>>>> question, the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment.
>>>>>>>>>>>>>>>>>> My job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>>>>> performed for 7 db tables.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group
>>>>>>>>>>>>>>>>>> 0 is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot
>>>>>>>>>>>>>>>>>> per task manager and this configuration seems to work. This leads me to
>>>>>>>>>>>>>>>>>> suspect that it might be some concurrency issue.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Dan Hill <qu...@gmail.com>.
Do people write checkpoint/savepoint recovery tests?  E.g. persist a
checkpoint from a run and verify that it can be recovered?

Also, I don't always hit this error when task managers restart.  I've had
plenty of task managers die while running and savepoints usually work.
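
For concreteness, the kind of test I have in mind is a snapshot/restore
round trip with Flink's operator test harness (from the flink-streaming-java
test artifact).  A minimal sketch, where MyKeyedFunction, Event, and Output
are placeholders for the real job's types:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.streaming.api.operators.KeyedProcessOperator;
import org.apache.flink.streaming.util.KeyedOneInputStreamOperatorTestHarness;

KeyedOneInputStreamOperatorTestHarness<String, Event, Output> harness =
        new KeyedOneInputStreamOperatorTestHarness<>(
                new KeyedProcessOperator<>(new MyKeyedFunction()),
                e -> e.key,  // the same key selector as production
                Types.STRING);
harness.open();
harness.processElement(new Event("k1", 1L), 10L);

// Snapshot the operator, then bring up a fresh harness from that snapshot.
OperatorSubtaskState snapshot = harness.snapshot(1L, 20L);
harness.close();

KeyedOneInputStreamOperatorTestHarness<String, Event, Output> restored =
        new KeyedOneInputStreamOperatorTestHarness<>(
                new KeyedProcessOperator<>(new MyKeyedFunction()),
                e -> e.key,
                Types.STRING);
restored.initializeState(snapshot);
restored.open();
// ...assert on the restored state / emitted records here...
restored.close();

It won't reproduce cluster-level behavior like this key group mismatch under
rescaling, but it does catch serializer and state-layout breaks between
versions.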


On Sat, Aug 21, 2021 at 8:49 PM Dan Hill <qu...@gmail.com> wrote:

> Darn, nevermind.  I was backfilling a job from the start of our time
> period (not from any checkpoint/savepoint) and, when I tried to savepoint,
> the job failed with "Checkpoint Coordinator is suspending.".  Then when I
> tried to savepoint again, I hit the key group error.
>
> 2021-08-22 03:35:42,305 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>       [] - Join UserViewRequestInsertionImpression -> (Sink Writer: S3
> flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
> Writer: S3 flat-user-impression fixed -> Sink Committer: S3
> flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
> switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
> flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
> (dataPort=42979).
>
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
> Connection unexpectedly closed by remote task manager '
> 10.12.100.55/10.12.100.55:43687'. This might indicate that the remote
> task manager was lost.
>
>         at
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]
>
> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>       [] - Join UserViewRequestInsertionImpressionAction -> (Sink Writer:
> S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink Writer: S3
> flat-user-action fixed -> Sink Committer: S3 flat-user-action fixed, Map ->
> Sink: Kafka flat-user-action-json) (15/56)
> (3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.
>
> 2021-08-22 03:35:42,419 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>       [] - Discarding the results produced by task execution
> 3077ef9c918af545a1165d0715b5e6c6.
>
> 2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
>                 [] - Trying to recover from a global failure.
>
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
> Coordinator is suspending.
>
>         at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
> ~[?:?]
>
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:1.8.0_292]
>
>         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]
>
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
> ~[flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.actor.Actor.aroundReceive(Actor.scala:517)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.actor.Actor.aroundReceive$(Actor.scala:515)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>         at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [flink-dist_2.12-1.12.3.jar:1.12.3]
>
>
> On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com> wrote:
>
>> Thanks, Till!
>>
>> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Dan, good to hear that you found the problem. What I would recommend
>>> is to set the log level in the log4j.properties file to DEBUG or TRACE (but
>>> this is quite noisy). If then the log does not contain the required
>>> information then it is likely that we don't log it and, hence, would have
>>> to be added.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> I think this was from a breaking change we made to the key calculation
>>>> in our code between version updates.  So this error makes sense.
>>>>
>>>> What's the best way to get more info for debugging?  How can I
>>>> configure the logs to output more key information?
>>>>
>>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks, Till!
>>>>>
>>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Dan,
>>>>>>
>>>>>> From the logs I couldn't find anything suspicious. The job runs until
>>>>>> you try to draw a savepoint. When doing this Flink fails with "Key group 0
>>>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o having
>>>>>> access to your job or a minimal example that allows to reproduce this
>>>>>> problem, it will be super hard to figure out what's going wrong. My best
>>>>>> guess would still be that we have a non-deterministic key somewhere.
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>>
>>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Here's the overview flow chart.
>>>>>>>>
>>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> *-others*
>>>>>>>>>
>>>>>>>>> *Code*
>>>>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>>> way.  The streams are joined based on a Tuple2<String, Long>, and the
>>>>>>>>> function has some internal map state using String keys.
>>>>>>>>>
>>>>>>>>> *Flink config*
>>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>>> state.backend.async: true
>>>>>>>>> state.backend.incremental: true
>>>>>>>>> state.backend.local-recovery: false
>>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>>
>>>>>>>>> *Workflow*
>>>>>>>>> What do you mean by workflow?
>>>>>>>>>
>>>>>>>>> *Logs*
>>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>>> useful.
>>>>>>>>>
>>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> Can you provide us with more information about your job (maybe
>>>>>>>>>> even the job code or a minimally working example), the Flink configuration,
>>>>>>>>>> the exact workflow you are doing and the corresponding logs and error
>>>>>>>>>> messages?
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I checked my code.  Our keys for streams and map state only use
>>>>>>>>>>>> either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <
>>>>>>>>>>>> quietgolfer@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink should
>>>>>>>>>>>>>> not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I was using a protobuf message in the key selector function
>>>>>>>>>>>>>>> and one of the protobuf fields was an enum. I checked the implementation of
>>>>>>>>>>>>>>> the hashcode of the generated message and it is using the int value field
>>>>>>>>>>>>>>> of the protobuf message so I assumed that it is ok and it's immutable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long, Int]
>>>>>>>>>>>>>>> (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashcode).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details. Could
>>>>>>>>>>>>>>>> you share with us the complete logs of the problematic run? Also the job
>>>>>>>>>>>>>>>> you are running and the types of the state you are storing in RocksDB and
>>>>>>>>>>>>>>>> use as events in your job are very important. In the linked SO question,
>>>>>>>>>>>>>>>> the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment. My
>>>>>>>>>>>>>>>>> job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>>>> performed for 7 db tables.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>>    - parallelism is set to 4 and 2 task slots
>>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group 0
>>>>>>>>>>>>>>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot
>>>>>>>>>>>>>>>>> per task manager and this configuration seems to work. This leads me to
>>>>>>>>>>>>>>>>> suspect that it might be some concurrency issue.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Dan Hill <qu...@gmail.com>.
Darn, nevermind.  I was backfilling a job from the start of our time period
(not from any checkpoint/savepoint) and, when I tried to savepoint, the job
failed with "Checkpoint Coordinator is suspending.".  Then when I tried to
savepoint again, I hit the key group error.
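
For anyone trying to reproduce this: a savepoint can be triggered, and later
resumed from, with the standard CLI, roughly as follows (the job id,
savepoint name, and jar are placeholders):

./bin/flink savepoint <jobId> s3a://my-metrics-flink-state/savepoints
./bin/flink run -s s3a://my-metrics-flink-state/savepoints/savepoint-<...> my-job.jar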

2021-08-22 03:35:42,305 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph
      [] - Join UserViewRequestInsertionImpression -> (Sink Writer: S3
flat-user-impression -> Sink Committer: S3 flat-user-impression, Sink
Writer: S3 flat-user-impression fixed -> Sink Committer: S3
flat-user-impression fixed) (32/56) (167f59052298fc5d8e9c318958f3cfc4)
switched from RUNNING to FAILED on 10.12.101.133:6122-164771 @
flink-taskmanager-0.flink-taskmanager.default.svc.cluster.local
(dataPort=42979).

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException:
Connection unexpectedly closed by remote task manager '
10.12.100.55/10.12.100.55:43687'. This might indicate that the remote task
manager was lost.

        at
org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:160)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:94)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:241)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1405)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:262)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:248)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:901)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:818)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_292]

2021-08-22 03:35:42,419 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph
      [] - Join UserViewRequestInsertionImpressionAction -> (Sink Writer:
S3 flat-user-action -> Sink Committer: S3 flat-user-action, Sink Writer: S3
flat-user-action fixed -> Sink Committer: S3 flat-user-action fixed, Map ->
Sink: Kafka flat-user-action-json) (15/56)
(3077ef9c918af545a1165d0715b5e6c6) switched from CANCELING to CANCELED.

2021-08-22 03:35:42,419 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph
      [] - Discarding the results produced by task execution
3077ef9c918af545a1165d0715b5e6c6.

2021-08-22 03:35:42,422 INFO  org.apache.flink.runtime.jobmaster.JobMaster
              [] - Trying to recover from a global failure.

org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint
Coordinator is suspending.

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.stopCheckpointScheduler(CheckpointCoordinator.java:1740)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.checkpoint.CheckpointCoordinatorDeActivator.jobStatusChanges(CheckpointCoordinatorDeActivator.java:47)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.executiongraph.ExecutionGraph.notifyJobStatusChange(ExecutionGraph.java:1812)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1326)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.executiongraph.ExecutionGraph.transitionState(ExecutionGraph.java:1298)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.SchedulerBase.transitionExecutionGraphState(SchedulerBase.java:582)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.addVerticesToRestartPending(DefaultScheduler.java:291)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.restartTasksWithDelay(DefaultScheduler.java:275)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeRestartTasks(DefaultScheduler.java:258)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:234)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:224)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:215)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:666)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:89)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:449)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
~[?:?]

        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_292]

        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_292]

        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.actor.Actor.aroundReceive(Actor.scala:517)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.actor.Actor.aroundReceive$(Actor.scala:515)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.12-1.12.3.jar:1.12.3]

        at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.12-1.12.3.jar:1.12.3]
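
As an aside, the underlying fix here (and in Rado's case quoted below) is
making the key derivation deterministic. A rough sketch of a KeySelector in
that spirit, keying by a Tuple2 of stable primitives instead of a protobuf
message whose enum field has an identity-based hashCode; "Event" and its
accessors are hypothetical stand-ins:

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

public class EventKeySelector implements KeySelector<Event, Tuple2<Long, Integer>> {
    @Override
    public Tuple2<Long, Integer> getKey(Event e) {
        // Use the enum's stable wire number, not the enum object itself:
        // generated protobuf enums inherit Enum.hashCode, which is
        // identity-based and can differ across JVM runs.
        return Tuple2.of(e.getUserId(), e.getActionType().getNumber());
    }
}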

On Fri, Aug 20, 2021 at 8:43 AM Dan Hill <qu...@gmail.com> wrote:

> Thanks, Till!
>
> On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Dan, good to hear that you found the problem. What I would recommend
>> is to set the log level in the log4j.properties file to DEBUG or TRACE (but
>> this is quite noisy). If the log then still does not contain the required
>> information, it is likely that we don't log it and, hence, it would have
>> to be added.
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com> wrote:
>>
>>> I think this was from a breaking change we made to the key calculation
>>> in our code between version updates.  So this error makes sense.
>>>
>>> What's the best way to get more info for debugging?  How can I configure
>>> the logs to output more key information?
>>>
>>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> Thanks, Till!
>>>>
>>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Dan,
>>>>>
>>>>> From the logs I couldn't find anything suspicious. The job runs until
>>>>> you try to draw a savepoint. When doing this Flink fails with "Key group 0
>>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o having
>>>>> access to your job or a minimal example that allows to reproduce this
>>>>> problem, it will be super hard to figure out what's going wrong. My best
>>>>> guess would still be that we have a non deterministic key somewhere.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>>
>>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Here's the overview flow chart.
>>>>>>>
>>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> *-others*
>>>>>>>>
>>>>>>>> *Code*
>>>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>>> way.  The streams are joined based on a Tuple2<String, Long> and has some
>>>>>>>> internal map state using String keys.
>>>>>>>>
>>>>>>>> *Flink config*
>>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>>> state.backend.async: true
>>>>>>>> state.backend.incremental: true
>>>>>>>> state.backend.local-recovery: false
>>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>>
>>>>>>>> *Workflow*
>>>>>>>> What do you mean by workflow?
>>>>>>>>
>>>>>>>> *Logs*
>>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>>> useful.
>>>>>>>>
>>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <
>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Dan,
>>>>>>>>>
>>>>>>>>> Can you provide us with more information about your job (maybe
>>>>>>>>> even the job code or a minimally working example), the Flink configuration,
>>>>>>>>> the exact workflow you are doing and the corresponding logs and error
>>>>>>>>> messages?
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>>> wouldn't create bad state.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I checked my code.  Our keys for streams and map state only use
>>>>>>>>>>> either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2.
>>>>>>>>>>>
>>>>>>>>>>> I don't know why my current case is breaking.  Our job
>>>>>>>>>>> partitions and parallelism settings have not changed.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink should
>>>>>>>>>>>>> not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Till
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I was using a protobuf message in the key selector function
>>>>>>>>>>>>>> and one of the protobuf fields was an enum. I checked the implementation of
>>>>>>>>>>>>>> the hashcode of the generated message and it is using the int value field
>>>>>>>>>>>>>> of the protobuf message so I assumed that it is ok and it's immutable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long, Int]
>>>>>>>>>>>>>> (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashCode).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details. Could
>>>>>>>>>>>>>>> you share with us the complete logs of the problematic run? Also the job
>>>>>>>>>>>>>>> you are running and the types of the state you are storing in RocksDB and
>>>>>>>>>>>>>>> use as events in your job are very important. In the linked SO question,
>>>>>>>>>>>>>>> the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment. My
>>>>>>>>>>>>>>>> job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>>> performed for 7 db tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>>    - parallelism is set to 4, with 2 task slots
>>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group 0
>>>>>>>>>>>>>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot per
>>>>>>>>>>>>>>>> task manager and this configuration seems to work. This leads me to
>>>>>>>>>>>>>>>> suspect that it might be some concurrency issue.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Dan Hill <qu...@gmail.com>.
Thanks, Till!

On Fri, Aug 20, 2021 at 8:31 AM Till Rohrmann <tr...@apache.org> wrote:

> Hi Dan, good to hear that you found the problem. What I would recommend is
> to set the log level in the log4j.properties file to DEBUG or TRACE (but
> this is quite noisy). If the log then still does not contain the required
> information, it is likely that we don't log it and, hence, it would have
> to be added.
>
> Cheers,
> Till
>
> On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com> wrote:
>
>> I think this was from a breaking change we made to the key calculation in
>> our code between version updates.  So this error makes sense.
>>
>> What's the best way to get more info for debugging?  How can I configure
>> the logs to output more key information?
>>
>> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com> wrote:
>>
>>> Thanks, Till!
>>>
>>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>>> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> From the logs I couldn't find anything suspicious. The job runs until
>>>> you try to draw a savepoint. When doing this Flink fails with "Key group 0
>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o having
>>>> access to your job or a minimal example that allows to reproduce this
>>>> problem, it will be super hard to figure out what's going wrong. My best
>>>> guess would still be that we have a non deterministic key somewhere.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>>
>>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Here's the overview flow chart.
>>>>>>
>>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> *-others*
>>>>>>>
>>>>>>> *Code*
>>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>>> way.  The streams are joined based on a Tuple2<String, Long> and has some
>>>>>>> internal map state using String keys.
>>>>>>>
>>>>>>> *Flink config*
>>>>>>> The most relevant parts of the flink config are the following:
>>>>>>> state.backend.async: true
>>>>>>> state.backend.incremental: true
>>>>>>> state.backend.local-recovery: false
>>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>>> state.backend.rocksdb.options-factory:
>>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>>
>>>>>>> *Workflow*
>>>>>>> What do you mean by workflow?
>>>>>>>
>>>>>>> *Logs*
>>>>>>> Here's the job manager log.  The task manager log did not look
>>>>>>> useful.
>>>>>>>
>>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <tr...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Dan,
>>>>>>>>
>>>>>>>> Can you provide us with more information about your job (maybe even
>>>>>>>> the job code or a minimally working example), the Flink configuration, the
>>>>>>>> exact workflow you are doing and the corresponding logs and error messages?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>>> wouldn't create bad state.
>>>>>>>>>
>>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I checked my code.  Our keys for streams and map state only use
>>>>>>>>>> either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2.
>>>>>>>>>>
>>>>>>>>>> I don't know why my current case is breaking.  Our job partitions
>>>>>>>>>> and parallelism settings have not changed.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <qu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink should
>>>>>>>>>>>> not read the fields of messages and call hashCode on them.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Till
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found my problem. It was indeed related to a mutable
>>>>>>>>>>>>> hashcode.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was using a protobuf message in the key selector function
>>>>>>>>>>>>> and one of the protobuf fields was an enum. I checked the implementation of
>>>>>>>>>>>>> the hashcode of the generated message and it is using the int value field
>>>>>>>>>>>>> of the protobuf message so I assumed that it is ok and it's immutable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long, Int]
>>>>>>>>>>>>> (since my protobuf message has only these two fields where the int
>>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashCode).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details. Could
>>>>>>>>>>>>>> you share with us the complete logs of the problematic run? Also the job
>>>>>>>>>>>>>> you are running and the types of the state you are storing in RocksDB and
>>>>>>>>>>>>>> use as events in your job are very important. In the linked SO question,
>>>>>>>>>>>>>> the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>> Till
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment. My
>>>>>>>>>>>>>>> job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>>> performed for 7 db tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is
>>>>>>>>>>>>>>>    described here
>>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>>    - parallelism is set to 4, with 2 task slots
>>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group 0
>>>>>>>>>>>>>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot per
>>>>>>>>>>>>>>> task manager and this configuration seems to work. This leads me to
>>>>>>>>>>>>>>> suspect that it might be some concurrency issue.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>

Re: savepoint failure

Posted by Till Rohrmann <tr...@apache.org>.
Hi Dan, good to hear that you found the problem. What I would recommend is
to set the log level in the log4j.properties file to DEBUG or TRACE (but
this is quite noisy). If the log then still does not contain the required
information, it is likely that we don't log it and, hence, it would have
to be added.
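
For example, a minimal sketch assuming the Log4j 2 properties format that
Flink 1.12 ships by default (the logger ids like "state" are arbitrary; the
package names match the classes in the stack traces in this thread):

# Raise overall verbosity (TRACE is very noisy)
rootLogger.level = DEBUG

# Or target just the keyed-state and RocksDB snapshot code paths
logger.state.name = org.apache.flink.runtime.state
logger.state.level = TRACE
logger.rocksdb.name = org.apache.flink.contrib.streaming.state
logger.rocksdb.level = TRACE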

Cheers,
Till

On Fri, Aug 20, 2021 at 5:20 PM Dan Hill <qu...@gmail.com> wrote:

> I think this was from a breaking change we made to the key calculation in
> our code between version updates.  So this error makes sense.
>
> What's the best way to get more info for debugging?  How can I configure
> the logs to output more key information?
>
> On Fri, Jul 16, 2021 at 11:29 PM Dan Hill <qu...@gmail.com> wrote:
>
>> Thanks, Till!
>>
>> On Thu, Jul 15, 2021 at 12:52 AM Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Dan,
>>>
>>> From the logs I couldn't find anything suspicious. The job runs until
>>> you try to draw a savepoint. When doing this Flink fails with "Key group 0
>>> is not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}". W/o having
>>> access to your job or a minimal example that allows to reproduce this
>>> problem, it will be super hard to figure out what's going wrong. My best
>>> guess would still be that we have a non deterministic key somewhere.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Jul 15, 2021 at 7:26 AM Dan Hill <qu...@gmail.com> wrote:
>>>
>>>> I don't know if it matters but I'm using unaligned checkpoints.
>>>>
>>>> On Wed, Jul 14, 2021 at 8:33 PM Dan Hill <qu...@gmail.com> wrote:
>>>>
>>>>> Here's the overview flow chart.
>>>>>
>>>>> [image: Screen Shot 2021-07-14 at 8.24.33 PM.png]
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 14, 2021 at 7:10 PM Dan Hill <qu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> *-others*
>>>>>>
>>>>>> *Code*
>>>>>> I'm not sure of a good, secure way of sharing the java code.  It
>>>>>> depends on multiple internal repos.  The savepoint appears to be failing in
>>>>>> a custom KeyedCoProcessFunction that joins two keyed streams in a fuzzy
>>>>>> way.  The streams are joined based on a Tuple2<String, Long> and has some
>>>>>> internal map state using String keys.
>>>>>>
>>>>>> *Flink config*
>>>>>> The most relevant parts of the flink config are the following:
>>>>>> state.backend.async: true
>>>>>> state.backend.incremental: true
>>>>>> state.backend.local-recovery: false
>>>>>> taskmanager.state.local.root-dirs: /flink_state/local-recovery
>>>>>> state.backend.rocksdb.checkpoint.transfer.thread.num: 1
>>>>>> state.backend.rocksdb.localdir: /flink_state/rocksdb
>>>>>> state.backend.rocksdb.options-factory:
>>>>>> org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory
>>>>>> state.backend.rocksdb.predefined-options: DEFAULT
>>>>>> state.backend.rocksdb.timer-service.factory: ROCKSDB
>>>>>> state.backend.rocksdb.ttl.compaction.filter.enabled: false
>>>>>> state.checkpoints.dir: s3a://my-flink-state/checkpoints
>>>>>> state.savepoints.dir: s3a://my-metrics-flink-state/savepoints
>>>>>>
>>>>>> *Workflow*
>>>>>> What do you mean by workflow?
>>>>>>
>>>>>> *Logs*
>>>>>> Here's the job manager log.  The task manager log did not look useful.
>>>>>>
>>>>>> https://drive.google.com/file/d/1jC5-3Bm2OP0dX1GJACwHGeqxd4snFc-W/view?usp=sharing
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Jul 14, 2021 at 12:45 AM Till Rohrmann <tr...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> Can you provide us with more information about your job (maybe even
>>>>>>> the job code or a minimally working example), the Flink configuration, the
>>>>>>> exact workflow you are doing and the corresponding logs and error messages?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Tue, Jul 13, 2021 at 9:39 PM Dan Hill <qu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Could this be caused by mixing of configuration settings when
>>>>>>>> running?  Running a job with one parallelism, stop/savepointing and then
>>>>>>>> recovering with a different parallelism?  I'd assume that's fine and
>>>>>>>> wouldn't create bad state.
>>>>>>>>
>>>>>>>> On Tue, Jul 13, 2021 at 12:34 PM Dan Hill <qu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I checked my code.  Our keys for streams and map state only use
>>>>>>>>> either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2.
>>>>>>>>>
>>>>>>>>> I don't know why my current case is breaking.  Our job partitions
>>>>>>>>> and parallelism settings have not changed.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jul 13, 2021 at 12:11 PM Dan Hill <qu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hey.  I just hit a similar error in production when trying to
>>>>>>>>>> savepoint.  We also use protobufs.
>>>>>>>>>>
>>>>>>>>>> Has anyone found a better fix to this?
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann <
>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Glad to hear that you solved your problem. Afaik Flink should
>>>>>>>>>>> not read the fields of messages and call hashCode on them.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:18 PM Radoslav Smilyanov <
>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Till,
>>>>>>>>>>>>
>>>>>>>>>>>> I found my problem. It was indeed related to a mutable hashcode.
>>>>>>>>>>>>
>>>>>>>>>>>> I was using a protobuf message in the key selector function and
>>>>>>>>>>>> one of the protobuf fields was an enum. I checked the implementation of the
>>>>>>>>>>>> hashcode of the generated message and it is using the int value field of
>>>>>>>>>>>> the protobuf message so I assumed that it is ok and it's immutable.
>>>>>>>>>>>>
>>>>>>>>>>>> I replaced the key selector function to use Tuple[Long, Int]
>>>>>>>>>>>> (since my protobuf message has only these two fields where the int
>>>>>>>>>>>> parameter stands for the enum value field). After changing my code to use
>>>>>>>>>>>> the Tuple it worked.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure if Flink somehow reads the protobuf message
>>>>>>>>>>>> fields and uses the hashcode of the fields directly since the generated
>>>>>>>>>>>> protobuf enum indeed has a mutable hashcode (Enum.hashCode).
>>>>>>>>>>>>
>>>>>>>>>>>> Nevertheless it's ok with the Tuple key.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your response!
>>>>>>>>>>>>
>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>> Rado
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Oct 23, 2020 at 2:39 PM Till Rohrmann <
>>>>>>>>>>>> trohrmann@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Rado,
>>>>>>>>>>>>>
>>>>>>>>>>>>> it is hard to tell the reason w/o a bit more details. Could
>>>>>>>>>>>>> you share with us the complete logs of the problematic run? Also the job
>>>>>>>>>>>>> you are running and the types of the state you are storing in RocksDB and
>>>>>>>>>>>>> use as events in your job are very important. In the linked SO question,
>>>>>>>>>>>>> the problem was a type whose hashcode was not immutable.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Till
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Oct 21, 2020 at 6:24 PM Radoslav Smilyanov <
>>>>>>>>>>>>> radoslav.smilyanov@smule.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am running a Flink job that performs data enrichment. My
>>>>>>>>>>>>>> job has 7 kafka consumers that receive messages for dml statements
>>>>>>>>>>>>>> performed for 7 db tables.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Job setup:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Flink is run in k8s in a similar way as it is described
>>>>>>>>>>>>>>    here
>>>>>>>>>>>>>>    <https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html#job-cluster-resource-definitions>
>>>>>>>>>>>>>>    .
>>>>>>>>>>>>>>    - 1 job manager and 2 task managers
>>>>>>>>>>>>>>    - parallelism is set to 4, with 2 task slots
>>>>>>>>>>>>>>    - rocksdb as state backend
>>>>>>>>>>>>>>    - protobuf for serialization
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Whenever I try to trigger a savepoint after my state is
>>>>>>>>>>>>>> bootstrapped I get the following error for different operators:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Caused by: java.lang.IllegalArgumentException: Key group 0 is
>>>>>>>>>>>>>> not in KeyGroupRange{startKeyGroup=32, endKeyGroup=63}.
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.computeKeyGroupIndex(KeyGroupRangeOffsets.java:142)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.flink.runtime.state.KeyGroupRangeOffsets.setKeyGroupOffset(KeyGroupRangeOffsets.java:104)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeKVStateData(RocksFullSnapshotStrategy.java:319)
>>>>>>>>>>>>>> at
>>>>>>>>>>>>>> org.apache.flink.contrib.streaming.state.snapshot.RocksFullSnapshotStrategy$SnapshotAsynchronousPartCallable.writeSnapshotToOutputStream(RocksFullSnapshotStrategy.java:261)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note: key group might vary.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I found this
>>>>>>>>>>>>>> <https://stackoverflow.com/questions/49140654/flink-error-key-group-is-not-in-keygrouprange> article
>>>>>>>>>>>>>> in Stackoverflow which relates to such an exception (btw my job graph looks
>>>>>>>>>>>>>> similar to the one described in the article except that my job has more
>>>>>>>>>>>>>> joins). I double checked my hashcodes and I think that they are fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I tried to reduce the parallelism to 1 with 1 task slot per
>>>>>>>>>>>>>> task manager and this configuration seems to work. This leads me to
>>>>>>>>>>>>>> suspect that it might be some concurrency issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would like to understand what is causing the savepoint
>>>>>>>>>>>>>> failure. Do you have any suggestions what I might be missing?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>> Rado
>>>>>>>>>>>>>>
>>>>>>>>>>>>>