Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/06/15 14:16:00 UTC
[GitHub] [parquet-mr] ggershinsky edited a comment on pull request #615: PARQUET-1373: Encryption key tools

ggershinsky edited a comment on pull request #615:
URL: https://github.com/apache/parquet-mr/pull/615#issuecomment-644145175


   > > ConcurrentMap has a segment synchronization for write operations, and allows for synchronization-free read operations; this makes it faster than HashMap with synchronized methods.
   > 
   > Yes, I know how `ConcurrentHashMap` works. What I wanted to say is that you are using synchronization as well. As you already use a `ConcurrentMap`, you might implement these synchronized code parts by using the methods of `ConcurrentMap`. I've put some examples that might work.
   
   Sounds good, and thank you for the examples! We've already applied your code (not pushed yet), and it did let us remove the explicit synchronization, making the cache implementation cleaner and faster.
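   
   A minimal sketch of the kind of change discussed above, assuming a cache backed by a `ConcurrentHashMap`. The names `KeyCacheSketch`, `getOrLoad`, `invalidate` and `KeyCacheDemo` are hypothetical and are not the classes in this PR; the point is only that a check-then-put inside a `synchronized` block can usually be replaced by a single atomic map call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache, for illustration only (not the PR's implementation).
final class KeyCacheSketch<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();

    // Replaces the pattern:
    //   synchronized (lock) { if (!map.containsKey(k)) map.put(k, load(k)); return map.get(k); }
    // ConcurrentHashMap.computeIfAbsent is atomic, so the loader is applied
    // at most once per absent key and no explicit lock is needed.
    V getOrLoad(K key, Function<? super K, ? extends V> loader) {
        return entries.computeIfAbsent(key, loader);
    }

    // Removal is also a single thread-safe call.
    void invalidate(K key) {
        entries.remove(key);
    }

    int size() {
        return entries.size();
    }
}

final class KeyCacheDemo {
    // Looks up the same key three times and reports how often the loader ran.
    static int loadsForRepeatedLookups() {
        KeyCacheSketch<String, byte[]> cache = new KeyCacheSketch<>();
        AtomicInteger loads = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            cache.getOrLoad("footer-key", id -> {
                loads.incrementAndGet();
                return new byte[16]; // placeholder for a fetched key
            });
        }
        return loads.get();
    }
}
```

   One caveat with this approach: the loader runs while the map holds an internal lock for that key's bin, so it should be short and must not update the same map.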
   
   > 
   > Please, check why Travis fails.
   > 
   
   Sorry, I should have mentioned that it will take a few more commits to fully address this round of comments (and fix the unit tests). Once all commits are in, I will squash them to simplify the review, and will post a comment here.
   
   > Another point of view came up about handling sensitive data in memory. Java does not clean memory after garbage collecting objects. This means that sensitive data must be manually cleaned after use, otherwise it might get compromised by another Java application in the same JVM or even by another process after the JVM exits. For the same reason, `String` objects shall never contain sensitive information, as the `char[]` behind the object might not get garbage collected after the `String` object itself gets dropped.
   > I did not find any particular bad practice in the code or any examples of the listed situations; I just wanted to highlight that we shall think about this as well.
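   
   For reference, the manual cleanup mentioned in the quote usually means holding a secret in a mutable `byte[]` or `char[]` and overwriting it once it is no longer needed. A minimal sketch follows; `SecretWipeSketch`, `useKey` and `withKey` are hypothetical names, and `useKey` is only a stand-in for a real operation on the key. The wipe is best effort, since a garbage collector that relocates objects may leave earlier copies of the array elsewhere in the heap.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class SecretWipeSketch {

    // Stand-in for any operation that needs the raw key bytes
    // (here it just sums them so the result can be checked).
    static int useKey(byte[] key) {
        int sum = 0;
        for (byte b : key) {
            sum += b;
        }
        return sum;
    }

    // Uses the key, then overwrites the caller's array with zeros,
    // even if useKey throws.
    static int withKey(byte[] keyBytes) {
        try {
            return useKey(keyBytes);
        } finally {
            Arrays.fill(keyBytes, (byte) 0);
        }
    }
}
```

   The same pattern is not available for `String`, which is immutable and cannot be overwritten in place.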
   
   Yep, keeping secret data in Java strings is a notorious problem. I think the general consensus is not to rely on GC or explicit byte wiping, but to remember that these Java processes must run in a trusted environment anyway, simply because they work with confidential information, ranging from the encryption keys to the sensitive data itself. Micro-managing memory that holds confidential information is always hard, and is basically impossible with Java. It goes beyond Parquet. One example: the KMS Client implementations send secret tokens and fetch explicit encryption keys, using a custom HTTP library. There is no guarantee this library doesn't use strings (most likely, it does). Another example: the secret tokens are passed as a Hadoop property from Spark or another framework; this is likely to be implemented with strings. Moreover, the tokens are built in an access control system, then sent to a user, then sent to a Spark driver, then sent to Spark workers (or other framework components). There is no way to control this, except to rely on HTTPS for the transport security, and on running framework drivers/workers in a trusted environment for the memory security.
   
   In other words, our threat model is simple. We don't trust the storage: encrypted Parquet files can be accessed by malicious parties, but they won't be able to read them. We do trust the framework hosts (where the JVM runs). If these are breached, the secret data can be stolen from any part of host memory or disk pages; not just the Parquet lib memory, but framework memory, HTTP libs, etc. Memory protection is a holy grail in this field, addressed by technologies like VMs, containers, hardware enclaves, etc. Parquet encryption is focused on data-in-storage protection; data-in-memory protection is covered by other technologies.

