Posted to users@kafka.apache.org by David Espinosa <es...@gmail.com> on 2017/11/22 18:46:51 UTC

GDPR appliance

Hi all,
I would like to double check with you how to apply some GDPR requirements to
my Kafka topics, in particular the "right to be forgotten", which forces us
to delete some data contained in the messages. So not deleting the message,
but editing it.
To do that, my intention is to replicate the topic and apply a
transformation over it, using a framework like Kafka Streams or Apache Storm.

Has anybody had to solve this problem?

Thanks in advance.
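
As a rough sketch of that replicate-and-transform idea, a minimal Kafka
Streams job that copies a topic into a redacted twin could look like the
following (topic names, the plain-string payload and the regex-based
redaction are illustrative assumptions, not anyone's actual setup):

    // Hypothetical sketch: copy an existing topic into a "redacted" twin,
    // blanking anything that looks like an e-mail address.
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class RedactionStream {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "gdpr-redaction");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> source = builder.stream("customer-events");
            // Naive redaction: replace e-mail-like substrings in the value.
            source.mapValues(value -> value.replaceAll("[\\w.+-]+@[\\w.-]+", "<redacted>"))
                  .to("customer-events-redacted");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

Note that this only helps the data flowing into the redacted copy; the
originals in the source topic remain until retention or compaction removes
them.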

Re: GDPR appliance

Posted by Wim Van Leuven <wi...@highestpoint.biz>.
Sounds nice!

I'm discussing with a customer how to create a fully anonymized stream for
future analytical purposes.

The remaining question is which anonymization algorithm/strategy maintains
statistical relevance while being resilient against brute force.

Thoughts?
-wim
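
One commonly cited building block for that, as a sketch only (key handling
is left out, and the hard-coded key below is purely for illustration):
pseudonymize identifiers with a keyed hash (HMAC) rather than a plain hash,
so equal inputs still map to equal tokens (joins and counts keep working),
while low-entropy values such as phone numbers can't be brute-forced without
the secret key.

    // Sketch of keyed pseudonymization; the key would come from a secrets
    // manager in practice.
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class Pseudonymizer {
        private final Mac mac;

        public Pseudonymizer(byte[] secretKey) throws Exception {
            mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
        }

        // Mac is stateful and not thread-safe, hence synchronized.
        public synchronized String tokenize(String identifier) {
            byte[] digest = mac.doFinal(identifier.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        }

        public static void main(String[] args) throws Exception {
            Pseudonymizer p = new Pseudonymizer("change-me".getBytes(StandardCharsets.UTF_8));
            System.out.println(p.tokenize("jane@example.com")); // same input -> same token
        }
    }

Whether that preserves enough statistical utility, and whether it counts as
anonymization or merely pseudonymization under GDPR, is exactly the open
question above.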


Re: GDPR appliance

Posted by Scott Reynolds <sr...@twilio.com.INVALID>.
Our legal department's interpretation is that when an account is deleted, any
data that is kept longer than K days must be deleted. We set up our
un-redacted Kafka topics to never retain data for more than K days. This
simplifies the problem greatly.

Our solution is designed to limit the ability of services to see parts of the
data they do not require to operate. It simplifies the technical requirements
(no key management, no library implementations in multiple languages, etc.),
requires little coordination with other teams (they change the topic they
read from, which is just a string), and fits cleanly within the Kafka
ecosystem, allowing teams to use new streaming technologies, older
technologies, etc. without requiring our data infrastructure team to support
them.

I am really proud of our solution because it doesn't try to boil the ocean.
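
As a concrete illustration of the K-day cap (the topic name and the 30-day
figure are assumptions; the same thing can be done with the kafka-configs
tool), retention on the un-redacted topic can be pinned down with the Java
AdminClient:

    // Sketch: cap retention.ms on the un-redacted topic so nothing outlives
    // the window the legal team allows.
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class SetRetention {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "customer-events-unredacted");
                long thirtyDaysMs = 30L * 24 * 60 * 60 * 1000;
                AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("retention.ms", Long.toString(thirtyDaysMs)),
                    AlterConfigOp.OpType.SET);
                Map<ConfigResource, Collection<AlterConfigOp>> updates =
                    Collections.singletonMap(topic, Collections.singletonList(op));
                admin.incrementalAlterConfigs(updates).all().get();
            }
        }
    }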

-- 

Scott Reynolds
Principal Engineer
MOBILE (630) 254-2474
EMAIL sreynolds@twilio.com

Re: GDPR appliance

Posted by Wim Van Leuven <wi...@highestpoint.biz>.
I think the best way to implement this is via envelope encryption: your
system manages a key encryption key (KEK) which is used to encrypt data
encryption keys (DEKs) per user/customer, which in turn are used to encrypt
the user's/customer's data.

If the user/customer walks away, simply drop the DEK. Their data becomes
undecryptable.

You do have to implement re-encryption in case KEKs or DEKs become
compromised.

If you run in the cloud, AWS and GCloud have basic Key Management Services
(KMS) to manage the KEKs, especially access to them and their versioning.

Their docs explain such a setup very well.

https://cloud.google.com/kms/docs/envelope-encryption

HTH
-wim
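
For what it's worth, a minimal in-memory sketch of that scheme
(crypto-shredding) looks roughly like this; in production the KEK would live
in a KMS, the wrapped DEKs in a real store, and a proper AEAD mode with IVs
would replace the simplified cipher settings used here for brevity:

    // Envelope-encryption sketch: per-customer DEKs, stored only wrapped
    // under a master KEK. Dropping the wrapped DEK makes the data unreadable.
    import java.util.HashMap;
    import java.util.Map;
    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class EnvelopeStore {
        private final SecretKey kek;                                     // normally held by a KMS
        private final Map<String, byte[]> wrappedDeks = new HashMap<>(); // customerId -> wrapped DEK

        public EnvelopeStore(SecretKey kek) { this.kek = kek; }

        public byte[] encryptFor(String customerId, byte[] plaintext) throws Exception {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            SecretKey dek = kg.generateKey();
            Cipher wrap = Cipher.getInstance("AESWrap");
            wrap.init(Cipher.WRAP_MODE, kek);
            wrappedDeks.put(customerId, wrap.wrap(dek));     // persist only the wrapped DEK
            Cipher c = Cipher.getInstance("AES");            // default mode, for brevity only
            c.init(Cipher.ENCRYPT_MODE, dek);
            return c.doFinal(plaintext);
        }

        public byte[] decryptFor(String customerId, byte[] ciphertext) throws Exception {
            byte[] wrapped = wrappedDeks.get(customerId);
            if (wrapped == null) throw new IllegalStateException("DEK dropped: data is unrecoverable");
            Cipher unwrap = Cipher.getInstance("AESWrap");
            unwrap.init(Cipher.UNWRAP_MODE, kek);
            SecretKey dek = (SecretKey) unwrap.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.DECRYPT_MODE, dek);
            return c.doFinal(ciphertext);
        }

        // "Right to be forgotten": drop the DEK and the ciphertext stays opaque.
        public void forget(String customerId) { wrappedDeks.remove(customerId); }

        public static void main(String[] args) throws Exception {
            KeyGenerator kg = KeyGenerator.getInstance("AES");
            kg.init(128);
            EnvelopeStore store = new EnvelopeStore(kg.generateKey()); // stand-in for a KMS-held KEK
            byte[] ct = store.encryptFor("customer-1", "jane@example.com".getBytes());
            System.out.println(new String(store.decryptFor("customer-1", ct)));
            store.forget("customer-1");                                // data is now unrecoverable
        }
    }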


Re: GDPR appliance

Posted by David Espinosa <es...@gmail.com>.
Hi Scott, and thanks for your reply.
From what you say, I guess that when you are asked to delete some user data
(that's the "right to be forgotten" in GDPR), what you are really doing is
blocking access to it. I had a similar approach, based on the idea of Greg
Young's solution of encrypting any private data and forgetting the key when
the data has to be deleted.
Sadly, our legal department, after some checks, has concluded that this
approach "blocks" the data but does not delete it, and as a consequence it
can cause us problems. If my guess about your solution is right, you could
have the same problems.

Thanks


Re: GDPR appliance

Posted by Scott Reynolds <sr...@twilio.com.INVALID>.
We are using Kafka Connect consumers that consume from the raw unredacted
topic, apply transformations, and produce to a redacted topic. Using Kafka
Connect allows us to set it all up with an HTTP request and doesn't require
additional infrastructure.

Then we wrote a KafkaPrincipal builder to authenticate each consumer as its
service name. The KafkaPrincipal builder class is specified in the
server.properties file on the brokers. To provide topic-level access control
we just configured SimpleAclAuthorizer. The net result is: some consumers can
only read the redacted topic, and very few consumers can read the unredacted
one.
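
A principal builder along those lines might look something like the sketch
below (the actual implementation isn't shown in the thread; the SASL-based
mapping and the serviceName() helper are illustrative assumptions). It would
be wired up on the brokers with principal.builder.class in server.properties,
with SimpleAclAuthorizer enabled via authorizer.class.name, after which
per-topic ACLs grant most services the redacted topic only.

    // Hypothetical KafkaPrincipalBuilder: turn the SASL authorization id
    // (e.g. "billing-service") into a service-level principal for ACLs.
    import org.apache.kafka.common.security.auth.AuthenticationContext;
    import org.apache.kafka.common.security.auth.KafkaPrincipal;
    import org.apache.kafka.common.security.auth.KafkaPrincipalBuilder;
    import org.apache.kafka.common.security.auth.SaslAuthenticationContext;

    public class ServiceNamePrincipalBuilder implements KafkaPrincipalBuilder {
        @Override
        public KafkaPrincipal build(AuthenticationContext context) {
            if (context instanceof SaslAuthenticationContext) {
                String authzId = ((SaslAuthenticationContext) context).server().getAuthorizationID();
                return new KafkaPrincipal(KafkaPrincipal.USER_TYPE, serviceName(authzId));
            }
            return KafkaPrincipal.ANONYMOUS;
        }

        // Illustrative mapping only: strip any realm/domain suffix to a bare service name.
        private String serviceName(String raw) {
            return raw.split("[@/]")[0];
        }
    }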

-- 

Scott Reynolds
Principal Engineer
MOBILE (630) 254-2474
EMAIL sreynolds@twilio.com

Re: GDPR appliance

Posted by David Espinosa <es...@gmail.com>.
Thanks a lot. I think that's the only way to ensure GDPR compliance.
In a second iteration, my thought is to anonymize instead of remove, maybe
identifying PII fields using Avro custom types.

Thanks again,
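
As a sketch of that Avro idea (the schema, the "pii" property name and the
null-out strategy are all assumptions for illustration): mark PII fields with
a custom schema property and strip them generically when producing the
anonymized copy.

    // Mark PII fields with "pii": true in the schema and null them out
    // generically when building the anonymized record.
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class AvroRedactor {
        static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null,\"pii\":true}]}";

        public static GenericRecord redact(GenericRecord in) {
            Schema schema = in.getSchema();
            GenericRecord out = new GenericData.Record(schema);
            for (Schema.Field f : schema.getFields()) {
                boolean pii = Boolean.TRUE.equals(f.getObjectProp("pii"));
                out.put(f.name(), pii ? null : in.get(f.name()));
            }
            return out;
        }

        public static void main(String[] args) {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", "42");
            rec.put("email", "someone@example.com");
            System.out.println(redact(rec)); // email comes out as null
        }
    }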


Re: GDPR appliance

Posted by Ben Stopford <be...@confluent.io>.
You should also be able to manage this with a compacted topic. If you give
each message a unique key, you'd then be able to delete or overwrite specific
records; Kafka will delete them from disk when compaction runs. If you need
to partition for ordering purposes, you'd need to use a custom partitioner
that extracts a partition key from the unique key before it does the hash.

B
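
The partitioner point in sketch form (the "entityId:uniqueSuffix" key layout
is an assumption for illustration): hash only the entity part of the key so
ordering per entity is preserved, while each record keeps a unique key that
can later be tombstoned away on the compacted topic (cleanup.policy=compact;
deletion is then a produce of a null value for that key).

    // Custom partitioner: partition on the entityId prefix of the unique key.
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class EntityIdPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            String entityId = ((String) key).split(":", 2)[0];   // assumed key layout
            byte[] bytes = entityId.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
        }

        @Override public void close() {}
        @Override public void configure(Map<String, ?> configs) {}
    }

The producer would select it with the partitioner.class config.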


Re: GDPR appliance

Posted by Wim Van Leuven <wi...@highestpoint.biz>.
Thanks, Lars, for the most interesting read!




Re: GDPR appliance

Posted by Lars Albertsson <la...@mapflat.com>.
Hi David,

You might find this presentation useful:
https://www.slideshare.net/lallea/protecting-privacy-in-practice

It explains privacy building blocks primarily in a batch processing
context, but most of the principles are applicable for stream
processing as well, e.g. splitting non-PII and PII data ("ejected
record" slide), encrypting PII data ("lost key" slide).

Regards,



Lars Albertsson
Data engineering consultant
www.mapflat.com
https://twitter.com/lalleal
+46 70 7687109
Calendar: http://www.mapflat.com/calendar

