You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Flavio Junqueira (JIRA)" <ji...@apache.org> on 2016/10/04 16:05:20 UTC

[jira] [Commented] (KAFKA-3985) Transient system test failure ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol

    [ https://issues.apache.org/jira/browse/KAFKA-3985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545834#comment-15545834 ] 

Flavio Junqueira commented on KAFKA-3985:
-----------------------------------------

We have been observing in our system test runs that test cases with SSL enabled fail occasionally. I did some digging with SSL debug enabled and and here is a fairly detailed description of my findings. 

The test case that failed in this particular run is this one:

{noformat}
Module: kafkatest.tests.core.security_rolling_upgrade_test
Class:  TestSecurityRollingUpgrade
Method: test_rolling_upgrade_phase_one
Arguments:
{
  "client_protocol": "SASL_SSL"
}
{noformat}

First, I noticed that that the server (worker10) was failing to complete a connection request from the producer:

{noformat}
kafka-network-thread-1-SASL_SSL-3, WRITE: TLSv1.2 Handshake, length = 1925
kafka-network-thread-1-SASL_SSL-3, READ: TLSv1.2 Handshake, length = 193
kafka-network-thread-1-SASL_SSL-3, fatal error: 80: problem unwrapping net record
javax.net.ssl.SSLProtocolException: Handshake message sequence violation, state = 1, type = 1
{noformat}

The exception indicated an error in the unwrapping of the incoming message:

{noformat}
javax.net.ssl.SSLProtocolException: Handshake message sequence violation, state = 1, type = 1
	at sun.security.ssl.Handshaker.checkThrown(Handshaker.java:1357)
	at sun.security.ssl.SSLEngineImpl.checkTaskThrown(SSLEngineImpl.java:519)
	at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:796)
	at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:764)
	at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
	at org.apache.kafka.common.network.SslTransportLayer.handshakeUnwrap(SslTransportLayer.java:409)
	at org.apache.kafka.common.network.SslTransportLayer.handshake(SslTransportLayer.java:270)
	at org.apache.kafka.common.network.KafkaChannel.prepare(KafkaChannel.java:62)
	at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:338)
	at org.apache.kafka.common.network.Selector.poll(Selector.java:291)
	at kafka.network.Processor.poll(SocketServer.scala:476)
	at kafka.network.Processor.run(SocketServer.scala:416)
	at java.lang.Thread.run(Thread.java:745)
{noformat}

Inspecting the producer output, I found the following:

{noformat}
kafka-producer-network-thread | producer-1, fatal error: 46: General SSLEngine problem
sun.security.validator.ValidatorException: PKIX path validation failed: java.security.cert.CertPathValidatorException: 
basic constraints check failed: this is not a CA certificate
%% Invalidated:  [Session-361, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256]
kafka-producer-network-thread | producer-1, SEND TLSv1.2 ALERT:  fatal, description = certificate_unknown
{noformat}

which indicates that the certificate of the CA is incorrect, and it turns out that it is incorrect in this run. The way I observed this is the following. In the {{ServerHello}} response the producer receives from the server, we have as the second element of the certificate chain the following:

{noformat}
chain [1] = [
[
  Version: V3
  Subject: CN=SystemTestCA
  Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11

  Key:  Sun RSA public key, 2048 bits
  modulus: 2421083435994698547407383995133450358448219871045998555961244700979330643844364183298673326405969256709602016
1158944052724951660368237891673420098874800950540649610593248597413809483851739528155298853162531155745397359
0767317213764937849600679327937089735954501718065324256197703519689803978997662398590165914517169395637524460
3378903708218606621761213913048196427063541851690381274617210042171263186499322961280259599484330528665445123
8302313616171177770753250228263123742100612605694409578300489940094647906623975819080502840638336757802617437
298013918600386417877347815394687707255367254823636003381423029812181649
  public exponent: 65537
  Validity: [From: Mon Oct 03 09:35:36 UTC 2016,
               To: Sun Jan 01 09:35:36 UTC 2017]
  Issuer: CN=SystemTestCA
  SerialNumber: [    1f5f5e9e]

Certificate Extensions: 1
[1]: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: A5 66 37 91 4F D4 A7 78   DD 33 48 A8 92 0E 62 9C  .f7.O..x.3H...b.
0010: 5E A5 31 EE                                        ^.1.
]
]

]
  Algorithm: [SHA256withRSA]
  Signature:
0000: 4F BC ED FB DC B7 96 D4   30 0E 56 47 3A 17 7C 98  O.......0.VG:...
0010: C7 7A 07 0D 29 DC E7 52   B1 E3 6A B6 E8 30 A5 20  .z..)..R..j..0. 
0020: B5 19 75 41 ED 8E A5 F8   BC D7 E8 B8 B4 31 43 BA  ..uA.........1C.
0030: C2 38 05 19 05 1C 45 49   F3 D6 E3 C1 E4 40 C8 1B  .8....EI.....@..
0040: 4E E2 30 CE A8 03 BC DC   F4 70 A7 86 D8 49 AB 50  N.0......p...I.P
0050: 93 98 64 84 8F FB 0A 61   F5 42 18 C8 98 BB 85 5C  ..d....a.B.....\
0060: E9 AB 15 B1 1A E2 E8 8A   54 1E B6 C0 8A B1 9F D9  ........T.......
0070: FE 31 F3 30 BB EC A1 E8   8B 62 79 78 BD 30 85 6F  .1.0.....byx.0.o
0080: 99 43 DB 0C 8F 53 91 B8   67 01 2F D6 11 5D D1 DD  .C...S..g./..]..
0090: AF E2 A8 32 CB 49 68 28   2C 2F 56 B4 C0 49 0B A9  ...2.Ih(,/V..I..
00A0: 65 30 0D 6A 59 88 89 AD   0A A2 EA 88 7F 2B 5E 2B  e0.jY........+^+
00B0: 5C 2E 31 7E 3A 8E 7E DD   2F 8D 69 12 4C DE E5 32  \.1.:.../.i.L..2
00C0: 0E B7 27 6B 31 4C FA 03   45 6C BB 22 8B 49 E7 5E  ..'k1L..El.".I.^
00D0: 10 5C 52 72 88 F1 D0 7C   B9 E7 EA EC 67 08 2A EF  .\Rr........g.*.
00E0: 3E 32 2C 66 0E 86 5B C5   DD 68 AF B7 CD 90 F0 B7  >2,f..[..h......
00F0: D1 41 37 BE C0 2E EA D2   C1 7C 00 B0 38 08 FD 8E  .A7.........8...

]
***
{noformat}

The certificate, however, should be this one instead, which I could find in the logs corresponding to the producer loading the key entry {{kafka}} from the keystore:

{noformat}
chain [1] = [
[
  Version: V3
  Subject: CN=SystemTestCA
  Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11

  Key:  Sun RSA public key, 2048 bits
  modulus: 21764805780610680096899849598570065114366691738099560028265285678634635260326628439310540620474404121894
28837320034498591154935495719586811871677117856001504917455871043545701765773999211491735917750683751734778415465
6808368130560745100828027834306469567092980787103490844983058203701005538036639256533164454027786830984177893100
1665292571714406816927375223448787431887985822089549650745336073720648023043110616617877325717470252466628447540
7469913561906182551997803303452644603273770169060856283566998276859728657509348800308679584434046601202424040524
2779560327146084301434080054974452408698040373821098393838004417
  public exponent: 65537
  Validity: [From: Mon Oct 03 09:55:15 UTC 2016,
               To: Sun Jan 01 09:55:15 UTC 2017]
  Issuer: CN=SystemTestCA
  SerialNumber: [    313ef821]

Certificate Extensions: 1
[1]: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: 3D FA EA B1 6A E6 E3 D4   CF CB 60 8F EB F3 5B 6B  =...j.....`...[k
0010: 7D 77 ED A9                                        .w..
]
]

]
  Algorithm: [SHA256withRSA]
  Signature:
0000: 6E B0 98 B3 70 04 C8 96   63 7C 01 4B CE DF 86 E7  n...p...c..K....
0010: 13 0A 9F 72 8A 5E A5 36   00 02 1F 80 AB 8F E1 AB  ...r.^.6........
0020: A9 61 69 28 76 32 84 73   A9 45 E9 97 3A 8A 92 DF  .ai(v2.s.E..:...
0030: 6F B3 07 E7 D5 DD A6 B7   C9 CD 03 0C DF 58 B6 F4  o............X..
0040: 8B 77 18 69 66 D9 90 18   79 8E 63 AD 91 20 1B C9  .w.if...y.c.. ..
0050: D8 EB 75 4F 6D 4A CE 44   5E 39 7E 82 D2 9D E4 08  ..uOmJ.D^9......
0060: C4 ED 6E 84 17 BE BC 52   C5 61 29 E1 3A 54 58 5E  ..n....R.a).:TX^
0070: B7 1F 2F 2D C9 52 C3 3A   8B 27 14 BD 8F 86 DE 57  ../-.R.:.'.....W
0080: BC E1 F3 B7 5C CA 4B 06   BA B9 CD 6B 4C AF 53 1A  ....\.K....kL.S.
0090: 02 16 0D 1C 3E 5B 14 55   9B 53 26 49 DE E2 6E E4  ....>[.U.S&I..n.
00A0: 20 BE FB BE 64 B6 E9 C9   6B 36 1E 9D F2 24 C8 33   ...d...k6...$.3
00B0: 8A 16 C9 F1 FE 56 EA 4B   F3 11 02 7A 29 CD 0F 9F  .....V.K...z)...
00C0: 6F F1 07 EB 04 73 C0 8F   82 DE FB 34 40 85 BB 87  o....s.....4@...
00D0: 5C C9 43 40 99 EA 6C 63   84 5A 28 E5 3D 38 0A 1A  \.C@..lc.Z(.=8..
00E0: C6 9B 35 41 C3 CC 21 88   6E 76 86 DB C3 1A 1A C3  ..5A..!.nv......
00F0: C9 5F 63 0B 3F A0 10 83   E3 63 D1 47 2E AA 00 56  ._c.?....c.G...V

]
{noformat}


Interestingly, earlier in the logs, I can see that the producer is able to connect to worker2 successfully, and the certificate chain includes precisely this certificate.

Another interesting point is that date in the certificate validity field maps roughly to the start time of the test. The test case that failed started roughly at 10:55 (we use a lag of 1 hour to avoid issues with clock drift), which maps perfectly to the certificate that worked. The broken certificate starts at 9:35, and the test that ran around that time had exactly that CA certificate, this is what I got:

{noformat}
Module: kafkatest.tests.core.consumer_group_command_test
Class:  ConsumerGroupCommandTest
Method: test_describe_consumer_group
Arguments:
{
  "security_protocol": "SSL"
}


===========================================================


CA Certificate

chain [1] = [
[
  Version: V3
  Subject: CN=SystemTestCA
  Signature Algorithm: SHA256withRSA, OID = 1.2.840.113549.1.1.11

  Key:  Sun RSA public key, 2048 bits
  modulus: 2421083435994698547407383995133450358448219871045998555961244700979330643844364183298673326405969256709602016115
8944052724951660368237891673420098874800950540649610593248597413809483851739528155298853162531155745397359076731
7213764937849600679327937089735954501718065324256197703519689803978997662398590165914517169395637524460337890370
8218606621761213913048196427063541851690381274617210042171263186499322961280259599484330528665445123830231361617
1177770753250228263123742100612605694409578300489940094647906623975819080502840638336757802617437298013918600386
417877347815394687707255367254823636003381423029812181649
  public exponent: 65537
  Validity: [From: Mon Oct 03 09:35:36 UTC 2016,
               To: Sun Jan 01 09:35:36 UTC 2017]
  Issuer: CN=SystemTestCA
  SerialNumber: [    1f5f5e9e]

Certificate Extensions: 1
[1]: ObjectId: 2.5.29.14 Criticality=false
SubjectKeyIdentifier [
KeyIdentifier [
0000: A5 66 37 91 4F D4 A7 78   DD 33 48 A8 92 0E 62 9C  .f7.O..x.3H...b.
0010: 5E A5 31 EE                                        ^.1.
]
]

]
  Algorithm: [SHA256withRSA]
  Signature:
0000: 4F BC ED FB DC B7 96 D4   30 0E 56 47 3A 17 7C 98  O.......0.VG:...
0010: C7 7A 07 0D 29 DC E7 52   B1 E3 6A B6 E8 30 A5 20  .z..)..R..j..0. 
0020: B5 19 75 41 ED 8E A5 F8   BC D7 E8 B8 B4 31 43 BA  ..uA.........1C.
0030: C2 38 05 19 05 1C 45 49   F3 D6 E3 C1 E4 40 C8 1B  .8....EI.....@..
0040: 4E E2 30 CE A8 03 BC DC   F4 70 A7 86 D8 49 AB 50  N.0......p...I.P
0050: 93 98 64 84 8F FB 0A 61   F5 42 18 C8 98 BB 85 5C  ..d....a.B.....\
0060: E9 AB 15 B1 1A E2 E8 8A   54 1E B6 C0 8A B1 9F D9  ........T.......
0070: FE 31 F3 30 BB EC A1 E8   8B 62 79 78 BD 30 85 6F  .1.0.....byx.0.o
0080: 99 43 DB 0C 8F 53 91 B8   67 01 2F D6 11 5D D1 DD  .C...S..g./..]..
0090: AF E2 A8 32 CB 49 68 28   2C 2F 56 B4 C0 49 0B A9  ...2.Ih(,/V..I..
00A0: 65 30 0D 6A 59 88 89 AD   0A A2 EA 88 7F 2B 5E 2B  e0.jY........+^+
00B0: 5C 2E 31 7E 3A 8E 7E DD   2F 8D 69 12 4C DE E5 32  \.1.:.../.i.L..2
00C0: 0E B7 27 6B 31 4C FA 03   45 6C BB 22 8B 49 E7 5E  ..'k1L..El.".I.^
00D0: 10 5C 52 72 88 F1 D0 7C   B9 E7 EA EC 67 08 2A EF  .\Rr........g.*.
00E0: 3E 32 2C 66 0E 86 5B C5   DD 68 AF B7 CD 90 F0 B7  >2,f..[..h......
00F0: D1 41 37 BE C0 2E EA D2   C1 7C 00 B0 38 08 FD 8E  .A7.........8...

]
{noformat}

My conclusion is that for some reason worker10 picked an old certificate, while everybody else picked the new one. Inspecting the code, I can’t tell immediately what the problem is. There seem to be a race between deleting the old files in {{SslStores.__init__}} and the call to {{SslStores.generate_and_copy_keystore}}, which is called from {{SecurityConfig.setup_ssl}} and is called per node. Is that at all possible?

cc [~rsivaram], would you have any insight here?

> Transient system test failure ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3985
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3985
>             Project: Kafka
>          Issue Type: Test
>          Components: system tests
>    Affects Versions: 0.10.0.0
>            Reporter: Jason Gustafson
>
> Found this in the nightly build on the 0.10.0 branch. Full details here: http://testing.confluent.io/confluent-kafka-0-10-0-system-test-results/?prefix=2016-07-22--001.1469199875--apache--0.10.0--71a598a/.  
> {code}
> test_id:    2016-07-22--001.kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SSL
> status:     FAIL
> run time:   5 minutes 14.067 seconds
>     292 acked message did not make it to the Consumer. They are: 11264, 11265, 11266, 11267, 11268, 11269, 11270, 11271, 11272, 11273, 11274, 11275, 11276, 11277, 11278, 11279, 11280, 11281, 11282, 11283, ...plus 252 more. Total Acked: 11343, Total Consumed: 11054. We validated that the first 272 of these missing messages correctly made it into Kafka's data files. This suggests they were lost on their way to the consumer.
> Traceback (most recent call last):
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.0/kafka/venv/local/lib/python2.7/site-packages/ducktape/tests/runner.py", line 106, in run_all_tests
>     data = self.run_single_test()
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.0/kafka/venv/local/lib/python2.7/site-packages/ducktape/tests/runner.py", line 162, in run_single_test
>     return self.current_test_context.function(self.current_test)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.0/kafka/venv/local/lib/python2.7/site-packages/ducktape/mark/_mark.py", line 331, in wrapper
>     return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.0/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py", line 115, in test_zk_security_upgrade
>     self.run_produce_consume_validate(self.run_zk_migration)
>   File "/var/lib/jenkins/workspace/system-test-kafka-0.10.0/kafka/tests/kafkatest/tests/produce_consume_validate.py", line 79, in run_produce_consume_validate
>     raise e
> AssertionError: 292 acked message did not make it to the Consumer. They are: 11264, 11265, 11266, 11267, 11268, 11269, 11270, 11271, 11272, 11273, 11274, 11275, 11276, 11277, 11278, 11279, 11280, 11281, 11282, 11283, ...plus 252 more. Total Acked: 11343, Total Consumed: 11054. We validated that the first 272 of these missing messages correctly made it into Kafka's data files. This suggests they were lost on their way to the consumer.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)