Posted to dev@kafka.apache.org by Rajini Sivaram <ra...@gmail.com> on 2019/08/01 09:10:45 UTC

Re: Dynamic configuration of interbroker SSL certificates

Hi Michael,

Thank you for reporting this. It is possible that we missed some
client-side update in KAFKA-8336. Would it be possible for you to provide
logs from multiple brokers around the time of a broker restart when
handshake failures started occurring so that we can try to work out which
client connection is failing? Corresponding to the handshake failure entry
from SocketServer in the restarted broker, there should be a failed client
connection on another broker. If that has a stack trace, it would be useful.

Also just to confirm, are these correct:

1) All brokers are dynamically updated with new certs and this works fine
without any errors (this just indicates that config update succeeded)
2) If one broker is restarted, handshake failures appear in the logs and
these are logged from SocketServer in the restarted broker. (this could
indicate that some client-side cert was not updated dynamically)
3) Once all brokers are restarted, there are no more handshake failures
(this is the actual confirmation that certs from 1) work)

Thank you,

Rajini

On Thu, Aug 1, 2019 at 12:40 AM Michael Carter <
michael.carter@instaclustr.com> wrote:

> Hi all,
>
> I'm having an issue with dynamic configuration of interbroker SSL
> certificates (in Kafka 2.3.0) that I'm hoping someone can give me insight
> on. I've previously posted something similar on the Users email list, but I
> think it will need some help from developers experienced with how the
> networking code in Kafka works.
>
> I'm trying to use SSL two-way authentication for inter-broker
> communication, with short-lived SSL certificates, rotating them
> frequently without needing to do a broker restart. So, on each broker in my
> cluster, I periodically generate a new certificate keystore file and set
> the "listener.name.interbroker.ssl.keystore.location" broker config
> property. (I'm using inter.broker.listener.name=INTERBROKER)
>
> Setting this property works fine, and everything appears OK. Manually
> connecting to the inter-broker listener shows it's correctly serving the
> new certificate. But if I ever restart a broker after the original
> certificate has expired (the one the broker started up with, which is no
> longer configured anywhere), then communication failures between brokers
> start to occur. My logs fill up with messages like this:
>
> [2019-07-22 03:57:43,605] INFO [SocketServer brokerId=1] Failed
> authentication with 10.224.70.3 (SSL handshake failed)
> (org.apache.kafka.common.network.Selector)
>
> A little bit of extra logging injected into the code tells me that the
> failures are caused by the out of date SSL certificates being used. So it
> seems there are some network components inside Kafka still stuck on the old
> settings.
> This sounds similar to the behaviour described in KAFKA-8336 (
> https://issues.apache.org/jira/browse/KAFKA-8336), but this is marked as
> fixed in 2.3.0.
>
> I've confirmed that all the SslChannelBuilders and SslFactories appear to
> be reconfigured correctly when the dynamic setting is set. I've tried
> closing all existing KafkaChannels on a reconfiguration event, in order to
> force them to re-open with the new certificates, but the problem persists.
>
> Does anyone have any idea what other components may be hanging around,
> using the old certificates?
>
> Thanks,
> Michael
>
>
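
For reference, the per-broker dynamic update described in the message above can be issued either with the kafka-configs.sh tool or programmatically through the AdminClient (the mechanism Michael mentions using later in the thread). Below is a minimal sketch of the AdminClient route; the bootstrap address, broker id and keystore path are illustrative, and the client-side SSL properties needed to reach the listener are omitted.

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class RotateInterbrokerKeystore {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Illustrative bootstrap address; real usage would also carry the SSL
            // client properties required to connect to this listener.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "10.224.59.10:9094");

            try (AdminClient admin = AdminClient.create(props)) {
                // Target one broker's dynamic config (broker id "1" here).
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1");

                // Point the INTERBROKER listener at the freshly generated keystore file.
                // If the new keystore's passwords differ from the old one's, the matching
                // keystore/key password configs would need to be updated in the same call.
                AlterConfigOp setKeystore = new AlterConfigOp(
                    new ConfigEntry("listener.name.interbroker.ssl.keystore.location",
                                    "/etc/kafka/ssl/interbroker-keystore.jks"),
                    AlterConfigOp.OpType.SET);

                admin.incrementalAlterConfigs(
                        Collections.singletonMap(broker, Collections.singletonList(setKeystore)))
                     .all()
                     .get();
            }
        }
    }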

Re: Dynamic configuration of interbroker SSL certificates

Posted by Michael Carter <mi...@instaclustr.com>.
Hi Colin,

Thanks for the explanation. Yes, I can see the sense in that, although it was certainly a surprise for me. I’ll just have to be a bit more careful about instantiating AdminClients in the future.

Cheers,
Michael
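
To make the lesson above concrete: an AdminClient starts a background network thread as soon as it is created, so a client left running between rotations keeps contacting the brokers on its own schedule. A minimal sketch of the more careful pattern, creating the client only for the duration of the work (bootstrap address illustrative):

    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class ShortLivedAdminClient {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "10.224.59.10:9094"); // illustrative

            // try-with-resources closes the client, and with it the background
            // metadata thread, as soon as the rotation work is done, instead of
            // leaving an idle client polling the brokers with stale SSL settings.
            try (AdminClient admin = AdminClient.create(props)) {
                // ... issue incrementalAlterConfigs / describeConfigs calls here ...
            }
        }
    }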


Re: Dynamic configuration of interbroker SSL certificates

Posted by Colin McCabe <cm...@apache.org>.
Hi Michael,

The NetworkClient periodically fetches metadata so that it always knows what the cluster topology is. This also helps it to have some open connections when needed to reduce the latency of operations.

To be fair, we haven’t thought very much about optimizing this since the overhead seems to be low. The update interval can also be configured.

Best,
Colin
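
The update interval Colin mentions is the client-side metadata.max.age.ms setting, exposed on the admin client through AdminClientConfig. A minimal sketch of stretching it out for a mostly idle AdminClient (value and bootstrap address illustrative):

    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class SlowMetadataRefreshAdmin {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "10.224.59.10:9094"); // illustrative
            // Refresh cluster metadata at most every 10 minutes, reducing how
            // often an otherwise idle client dials the brokers.
            props.put(AdminClientConfig.METADATA_MAX_AGE_CONFIG, "600000");

            try (AdminClient admin = AdminClient.create(props)) {
                // ... admin operations ...
            }
        }
    }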


Re: Dynamic configuration of interbroker SSL certificates

Posted by Michael Carter <mi...@instaclustr.com>.
Hi Rajini,

I just thought I'd let you know I've sorted through my issue now. It turns out that I was doing something silly. The SSL handshake errors weren't caused by the brokers talking to themselves; they were being initiated from an AdminClient I had running in a separate process. Ironically, it was the same AdminClient I was using to rotate the certificates in the first place. I was naively assuming that because I wasn't asking the AdminClient to do anything at that exact moment, it wouldn't be contacting the brokers. Instead, it seems the AdminClient has an internal thread that periodically fetches metadata from the brokers.
Now I'm wondering: why is this continual requesting necessary? Is there some reason that work isn't deferred until the AdminClient is specifically asked to do something?

Cheers,
Michael



Re: Dynamic configuration of interbroker SSL certificates

Posted by Michael Carter <mi...@instaclustr.com>.
Hi Rajini,

Thanks for the good suggestions. I’ve investigated further and confirmed that the issue continues to occur even with hostname verification turned off.
I put in the SSL debug option and got some interesting information. I’ve attached an example of what the logs show when SSL errors start occurring.

From the debug output I can see that the internal IP addresses do form part of the alternative names on the certificate, which is good.
What it shows is that verification is failing because an out-of-date certificate (whose validity period has expired) is being used.

It is interesting that in the attached example, the broker appears to be trying to connect to itself, and one end of the connection has the new, valid certificate, while the other end has the old, out of date certificate, even though my understanding is that it should be one and the same certificate.
Does this suggest that there is some network component inside Kafka that hasn’t responded to the reconfiguration of the inter-broker keystore property?

Cheers,
Michael

Re: Dynamic configuration of interbroker SSL certificates

Posted by Rajini Sivaram <ra...@gmail.com>.
Hi Michael,

Thanks for the logs.

When the keystore is reconfigured in a stable cluster, we wouldn't expect to
see any failures even if there is an issue with the certs, since existing
connections continue to work without re-authenticating. When a broker is
restarted, there could be controller re-election, leader elections, etc.,
causing new connections on all brokers, so failures can crop up on all
brokers.

When we reconfigure the inter-broker keystore, we validate the keystore against
the truststore on each broker. Since we are seeing handshake failures
within a broker (e.g. controller connections), it suggests that the
validation performed by the actual connection is failing while our
validation during reconfiguration succeeded. The main difference between
the two is hostname verification - we disable hostname verification when
validating configs, but for actual connections this seems to be enabled in
your configs. Can you verify that your certs contain the host IP address
used in the advertised listeners list (e.g. 10.224.59.10)? In order to rule
out hostname issues, you could rerun your tests with
`listener.name.interbroker.ssl.endpoint.identification.algorithm=`,
i.e. an empty string, to disable hostname verification.

If the issue still occurs with hostname verification disabled, it will be
useful to run the brokers with
KAFKA_OPTS="-Djavax.net.debug=ssl,handshake". That would show the certs
used as well as more details about the handshake failure.

Regards,

Rajini




Re: Dynamic configuration of interbroker SSL certificates

Posted by Michael Carter <mi...@instaclustr.com>.
Hi Rajini,

Thanks for your response. I've attached some logs from a three-node cluster that I was doing this on. For the purpose of this test I'm using certificates with ten minutes' validity. The original certificate was valid from roughly 05:32 to 05:42; I rotate them every five minutes, and I stopped the third broker at 05:48 (five or so minutes after the original certificates would have expired). At this point the errors start to appear in the logs in volume. A lot of them are errors connecting to the restarted broker, but not all of them. Some of them are the first broker (which hasn't been restarted) apparently having trouble connecting to itself. I restarted the other brokers at about 05:49 but the errors continued.

To answer your questions:
1) All brokers are dynamically updated with new certs and this works fine
without any errors (this just indicates that config update succeeded)

- Yes that is correct.

2) If one broker is restarted, handshake failures appear in the logs and
these are logged from SocketServer in the restarted broker. (this could
indicate that some client-side cert was not updated dynamically)

- The errors tend to appear in all the brokers' logs. I can also get the same thing just by shutting down one of the brokers. It doesn't actually need to be restarted, since the SslAuthenticationExceptions will appear in the brokers that weren't shut down or restarted regardless. In fact, I can get a message saying "Failed authentication with <IP address> (SSL handshake failed)" when no broker is running on that IP address at all. Some of the messages indicate a failure of a broker connecting to itself (e.g. "Failed authentication with /10.224.59.10 (SSL handshake failed)" in the logs of the broker running on 10.224.59.10).

3) Once all brokers are restarted, there are no more handshake failures
(this is the actual confirmation that certs from 1) work)

- Unfortunately no, the errors persist even after this. Sometimes it will recover from this situation if I run another dynamic certificate rotation, but restarting the brokers doesn't seem to clear it. It might be tempting to think that there is something wrong with the new certificates, but they're generated in exactly the same way as the original certificates (except for the validity time period), and they seem to be served correctly after the dynamic rotation. (I'm checking with this command: openssl s_client -debug -connect 10.224.59.10:9094 2>/dev/null </dev/null | openssl x509 -noout -dates ). Also, subsequent updates with similar certificates can sometimes fix the problem.

The fact that it can persist after a restart is what really sticks out to me. Is there any state stored outside of Kafka that could be getting in the way? Does Zookeeper store anything other than the filename of the keystore?

Cheers,
Michael
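
On the ZooKeeper question above: what gets persisted for a dynamic update is the config value itself (i.e. the keystore path, not the keystore contents), and the brokers' current view of it can be read back with describeConfigs (or kafka-configs.sh --describe). A minimal sketch, with an illustrative broker id and bootstrap address:

    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.common.config.ConfigResource;

    public class DescribeBrokerDynamicConfig {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "10.224.59.10:9094"); // illustrative

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "1"); // illustrative id
                Config config = admin.describeConfigs(Collections.singleton(broker))
                                     .all().get().get(broker);
                // Print the per-listener SSL settings currently in effect; sensitive
                // values (e.g. keystore passwords) come back as null.
                config.entries().stream()
                      .filter(e -> e.name().startsWith("listener.name.interbroker.ssl."))
                      .forEach(e -> System.out.println(
                              e.name() + " = " + e.value() + " (source: " + e.source() + ")"));
            }
        }
    }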