You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jason Gustafson (JIRA)" <ji...@apache.org> on 2017/06/12 23:59:00 UTC

[jira] [Commented] (KAFKA-5436) NullPointerException when loading producer snapshot

    [ https://issues.apache.org/jira/browse/KAFKA-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047247#comment-16047247 ] 

Jason Gustafson commented on KAFKA-5436:
----------------------------------------

I think the underlying cause is KAFKA-5435. It's a bit counter-intuitive, but this bug can also be revealed as inconsistent snapshot data. Here is the relevant entry from the server log:

{code}
[2017-06-10 20:47:34,609] INFO Truncating log output-topic-1 to offset 5776. (kafka.log.Log)
[2017-06-10 20:47:34,616] INFO Loading producer state from offset 5776 for partition output-topic-1 (kafka.log.Log)
[2017-06-10 20:47:34,618] DEBUG Writing producer snapshot for partition output-topic-1 at offset 5776 (kafka.log.ProducerStateManager)
{code}

This is the first snapshot created for this partition, so the code follows the optimization path described in KAFKA-5435, which results in a snapshot at the log end offset with whatever happens to be in the producer state cache. The state in the cache is stale because of the log truncation, so we end up writing an inconsistent snapshot.

> NullPointerException when loading producer snapshot
> ---------------------------------------------------
>
>                 Key: KAFKA-5436
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5436
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Apurva Mehta
>            Assignee: Jason Gustafson
>            Priority: Blocker
>              Labels: exactly-once
>             Fix For: 0.11.0.0
>
>
> This was seen in the server.log of a recent system test failure:
> {noformat}
> java.lang.NullPointerException
>         at kafka.log.Log.updateFirstUnstableOffset(Log.scala:689)
>         at kafka.log.Log.loadProducerState(Log.scala:466)
>         at kafka.log.Log.<init>(Log.scala:185)
>         at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
>         at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The actual snapshot file had these contents:
> {noformat}
> amehta-macbook-pro:kafka-data-logs apurva$ ~/workspace/confluent/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments -files output-topic-1/00000000000000005776.snapshot
> Dumping output-topic-1/00000000000000005776.snapshot
> producerId: 1001 producerEpoch: 0 lastSequence: 3249 lastOffset: 6026 offsetDelta: 0 lastTimestamp: 1497127651606 coordinatorEpoch: 2 currentTxnFirstOffset: None
> producerId: 0 producerEpoch: 0 lastSequence: 2749 lastOffset: 6025 offsetDelta: 249 lastTimestamp: 1497127647269 coordinatorEpoch: 1 currentTxnFirstOffset: Some(5776)
> {noformat}
> The log file ended at offset 5775:
> {noformat}
> baseOffset: 4770 lastOffset: 4770 baseSequence: -1 lastSequence: -1 producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 58134 CreateTime: 1497127646732 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 3476040058
> baseOffset: 4771 lastOffset: 5020 baseSequence: 2750 lastSequence: 2999 producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 58212 CreateTime: 1497127646719 isvalid: true size: 2997 magic: 2 compresscodec: NONE crc: 1010131725
> baseOffset: 5021 lastOffset: 5021 baseSequence: -1 lastSequence: -1 producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 61209 CreateTime: 1497127646923 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 4001115747
> baseOffset: 5022 lastOffset: 5271 baseSequence: 2000 lastSequence: 2249 producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 61287 CreateTime: 1497127646760 isvalid: true size: 2997 magic: 2 compresscodec: NONE crc: 2903697477
> baseOffset: 5272 lastOffset: 5272 baseSequence: -1 lastSequence: -1 producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 64284 CreateTime: 1497127646982 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 2599716061
> baseOffset: 5273 lastOffset: 5522 baseSequence: 3000 lastSequence: 3249 producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 64362 CreateTime: 1497127646948 isvalid: true size: 3080 magic: 2 compresscodec: NONE crc: 3176858298
> baseOffset: 5523 lastOffset: 5523 baseSequence: -1 lastSequence: -1 producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 67442 CreateTime: 1497127647152 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 3464913172
> baseOffset: 5524 lastOffset: 5773 baseSequence: 2250 lastSequence: 2499 producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 67520 CreateTime: 1497127647001 isvalid: true size: 3081 magic: 2 compresscodec: NONE crc: 712707513
> baseOffset: 5774 lastOffset: 5774 baseSequence: -1 lastSequence: -1 producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 70601 CreateTime: 1497127647248 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 1043637525
> baseOffset: 5775 lastOffset: 5775 baseSequence: -1 lastSequence: -1 producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 70679 CreateTime: 1497127647352 isvalid: true size: 78 magic: 2 compresscodec: NONE crc: 968257373
> {noformat}
> The suspicious bit is the `currentTxnFirstOffset: Some(5776)`. This indicates that snapshot didn't get truncated correctly since it retains an offset which doesn't exist in the truncated segment, causing an NPE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)