Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2020/05/19 14:11:38 UTC

[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #776: PARQUET-1229: Parquet MR encryption

gszadovszky commented on a change in pull request #776:
URL: https://github.com/apache/parquet-mr/pull/776#discussion_r427284527



##########
File path: parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/OffsetIndex.java
##########
@@ -49,6 +49,13 @@
    * @return the index of the first row in the page
    */
   public long getFirstRowIndex(int pageIndex);
+  
+  /**
+   * @param pageIndex
+   *         the index of the page
+   * @return the original ordinal of the page in the column chunk
+   */
+  public short getPageOrdinal(int pageIndex);

Review comment:
       Why do we need it as a `short` instead of keeping it as an `int`? The parquet.thrift spec never says there cannot be more than `32767` pages, even if it is unlikely to have that many.
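   For illustration, a minimal sketch (an assumption, not the PR's code) of what keeping the ordinal as an `int` could look like:
   
   ```java
   public interface OffsetIndex {
     // ... existing methods such as getFirstRowIndex(int) ...
   
     /**
      * @param pageIndex
      *          the index of the page
      * @return the original ordinal of the page in the column chunk
      */
     int getPageOrdinal(int pageIndex);
   }
   ```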

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesCipher.java
##########
@@ -68,19 +67,32 @@
 
   public static byte[] createModuleAAD(byte[] fileAAD, ModuleType moduleType, 
       short rowGroupOrdinal, short columnOrdinal, short pageOrdinal) {

Review comment:
       Theoretically, the spec gives no hard limits on the number of row groups, columns, or pages. The de facto limit is that we use thrift lists, whose size is an i32, which means we should allow Java `int` values here.
   Also, there was a post-commit discussion in a [related PR](https://github.com/apache/parquet-format/pull/142#issuecomment-600294754). Unfortunately, parquet-format had already been released by then, so I don't know whether there is a way to properly fix this issue in the format. Either way, I would not restrict these values to a `short`.
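   To illustrate the width concern, a tiny self-contained example (not from the PR) of what silently happens when an i32-sized ordinal is narrowed to a `short`:
   
   ```java
   public class OrdinalWidthDemo {
     public static void main(String[] args) {
       // Thrift lists are sized with an i32, so an ordinal taken from a list
       // position may legitimately exceed Short.MAX_VALUE (32767).
       int pageOrdinal = 40_000;
       short truncated = (short) pageOrdinal;
       // Prints -25536: the ordinal is silently corrupted by the narrowing cast.
       System.out.println(truncated);
     }
   }
   ```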

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesGcmDecryptor.java
##########
@@ -98,16 +104,21 @@
         ((lengthBuffer[0] & 0xff));
 
     if (ciphertextLength < 1) {
-      throw new IOException("Wrong length of encrypted metadata: " + ciphertextLength);
+      throw new ParquetCryptoRuntimeException("Wrong length of encrypted metadata: " + ciphertextLength);
     }
 
     byte[] ciphertextBuffer = new byte[ciphertextLength];
     gotBytes = 0;
     // Read the encrypted structure contents
     while (gotBytes < ciphertextLength) {
-      int n = from.read(ciphertextBuffer, gotBytes, ciphertextLength - gotBytes);
+      int n;
+      try {
+        n = from.read(ciphertextBuffer, gotBytes, ciphertextLength - gotBytes);
+      } catch (IOException e) {

Review comment:
       We should let the `IOException` be thrown out to the caller.
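   For example, a hypothetical helper (a sketch of the alternative, not the PR's code) that keeps `throws IOException` on the reading method and lets stream failures surface to the caller:
   
   ```java
   import java.io.IOException;
   import java.io.InputStream;
   
   final class StreamReads {
     // Declare "throws IOException" and let stream failures propagate,
     // instead of catching them and re-throwing a runtime exception.
     static byte[] readFully(InputStream from, int length) throws IOException {
       byte[] buffer = new byte[length];
       int gotBytes = 0;
       while (gotBytes < length) {
         int n = from.read(buffer, gotBytes, length - gotBytes);
         if (n <= 0) {
           throw new IOException("Tried to read " + length
               + " bytes, but only got " + gotBytes + " bytes");
         }
         gotBytes += n;
       }
       return buffer;
     }
   }
   ```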

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesGcmDecryptor.java
##########
@@ -70,23 +69,30 @@
       if (null != AAD) cipher.updateAAD(AAD);
 
       cipher.doFinal(ciphertext, inputOffset, inputLength, plainText, outputOffset);
-    }  catch (GeneralSecurityException e) {
-      throw new IOException("Failed to decrypt", e);
+    }  catch (AEADBadTagException e) {
+      throw new TagVerificationException("GCM tag check failed", e);
+    } catch (GeneralSecurityException e) {
+      throw new ParquetCryptoRuntimeException("Failed to decrypt", e);
     }
 
     return plainText;
   }
 
   @Override
-  public byte[] decrypt(InputStream from, byte[] AAD) throws IOException {
+  public byte[] decrypt(InputStream from, byte[] AAD) {
     byte[] lengthBuffer = new byte[SIZE_LENGTH];
     int gotBytes = 0;
 
     // Read the length of encrypted Thrift structure
     while (gotBytes < SIZE_LENGTH) {
-      int n = from.read(lengthBuffer, gotBytes, SIZE_LENGTH - gotBytes);
+      int n;
+      try {
+        n = from.read(lengthBuffer, gotBytes, SIZE_LENGTH - gotBytes);
+      } catch (IOException e) {

Review comment:
       We should let the `IOException` be thrown out to the caller.

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##########
@@ -1185,70 +1275,189 @@ static long getOffset(ColumnChunk columnChunk) {
     return offset;
   }
 
+  private static void verifyFooterIntegrity(InputStream from, InternalFileDecryptor fileDecryptor, 
+      int combinedFooterLength) throws IOException {
+    
+    byte[] nonce = new byte[AesCipher.NONCE_LENGTH];
+    from.read(nonce);
+    byte[] gcmTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    from.read(gcmTag);
+    
+    AesGcmEncryptor footerSigner =  fileDecryptor.createSignedFooterEncryptor();
+    
+    byte[] footerAndSignature = ((ByteBufferInputStream) from).slice(0).array();
+    int footerSignatureLength = AesCipher.NONCE_LENGTH + AesCipher.GCM_TAG_LENGTH;
+    byte[] serializedFooter = new byte[combinedFooterLength - footerSignatureLength];
+    System.arraycopy(footerAndSignature, 0, serializedFooter, 0, serializedFooter.length);
+
+    byte[] signedFooterAAD = AesCipher.createFooterAAD(fileDecryptor.getFileAAD());
+    byte[] encryptedFooterBytes = footerSigner.encrypt(false, serializedFooter, nonce, signedFooterAAD);
+    byte[] calculatedTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    System.arraycopy(encryptedFooterBytes, encryptedFooterBytes.length - AesCipher.GCM_TAG_LENGTH, 
+        calculatedTag, 0, AesCipher.GCM_TAG_LENGTH);
+    if (!Arrays.equals(gcmTag, calculatedTag)) {
+      throw new TagVerificationException("Signature mismatch in plaintext footer");
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter) throws IOException {
+    return readParquetMetadata(from, filter, null, false, 0);
+  }
+
+  public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
+      final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, 
+      final int combinedFooterLength) throws IOException {
+    
+    final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
+    final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
+    
     FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor<FileMetaData, IOException>() {
       @Override
       public FileMetaData visit(NoFilter filter) throws IOException {
-        return readFileMetaData(from);
+        return readFileMetaData(from, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(SkipMetadataFilter filter) throws IOException {
-        return readFileMetaData(from, true);
+        return readFileMetaData(from, true, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(OffsetMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByStart(readFileMetaData(from), filter);
+        return filterFileMetaDataByStart(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
 
       @Override
       public FileMetaData visit(RangeMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByMidpoint(readFileMetaData(from), filter);
+        return filterFileMetaDataByMidpoint(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
     });
     LOG.debug("{}", fileMetaData);
-    ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData);
+    
+    if (!encryptedFooter && null != fileDecryptor) {
+      if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file

Review comment:
       Is `plaintext` the usual term for _un-encrypted_ files? I don't really like it, but it's fine if it is commonly used in that sense. (To me, plaintext files are `*.txt` files.)

##########
File path: parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java
##########
@@ -51,19 +49,17 @@
      * Parquet Modular Encryption specification.
      * @param AAD - Additional Authenticated Data for the decryption (ignored in case of CTR cipher)
      * @return plaintext - starts at offset 0 of the output value, and fills up the entire byte array.
-     * @throws IOException thrown upon any crypto problem encountered during decryption
      */
-    public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD) throws IOException;
+    public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);
 
     /**
      * Convenience decryption method that reads the length and ciphertext from the input stream.
      * 
      * @param from Input stream with length and ciphertext.
      * @param AAD - Additional Authenticated Data for the decryption (ignored in case of CTR cipher)
      * @return plaintext -  starts at offset 0 of the output, and fills up the entire byte array.
-     * @throws IOException thrown upon any crypto or IO problem encountered during decryption
      */
-    public byte[] decrypt(InputStream from, byte[] AAD) throws IOException;
+    public byte[] decrypt(InputStream from, byte[] AAD);

Review comment:
       I would keep `throws IOException` here. `InputStream` objects throw `IOException`, so the caller should be prepared to handle it.
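   As a sketch, the declaration this comment argues for could look like this (the surrounding interface is abridged for illustration):
   
   ```java
   import java.io.IOException;
   import java.io.InputStream;
   
   public interface Decryptor {
     // Array-based variant: pure in-memory crypto, no IO, so no IOException needed.
     byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);
   
     // Stream-based convenience variant: it reads from an InputStream, which can
     // fail with an IOException, so the caller should be prepared to handle it.
     byte[] decrypt(InputStream from, byte[] AAD) throws IOException;
   }
   ```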

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##########
@@ -1185,70 +1275,189 @@ static long getOffset(ColumnChunk columnChunk) {
     return offset;
   }
 
+  private static void verifyFooterIntegrity(InputStream from, InternalFileDecryptor fileDecryptor, 
+      int combinedFooterLength) throws IOException {
+    
+    byte[] nonce = new byte[AesCipher.NONCE_LENGTH];
+    from.read(nonce);
+    byte[] gcmTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    from.read(gcmTag);
+    
+    AesGcmEncryptor footerSigner =  fileDecryptor.createSignedFooterEncryptor();
+    
+    byte[] footerAndSignature = ((ByteBufferInputStream) from).slice(0).array();
+    int footerSignatureLength = AesCipher.NONCE_LENGTH + AesCipher.GCM_TAG_LENGTH;
+    byte[] serializedFooter = new byte[combinedFooterLength - footerSignatureLength];
+    System.arraycopy(footerAndSignature, 0, serializedFooter, 0, serializedFooter.length);
+
+    byte[] signedFooterAAD = AesCipher.createFooterAAD(fileDecryptor.getFileAAD());
+    byte[] encryptedFooterBytes = footerSigner.encrypt(false, serializedFooter, nonce, signedFooterAAD);
+    byte[] calculatedTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    System.arraycopy(encryptedFooterBytes, encryptedFooterBytes.length - AesCipher.GCM_TAG_LENGTH, 
+        calculatedTag, 0, AesCipher.GCM_TAG_LENGTH);
+    if (!Arrays.equals(gcmTag, calculatedTag)) {
+      throw new TagVerificationException("Signature mismatch in plaintext footer");
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter) throws IOException {
+    return readParquetMetadata(from, filter, null, false, 0);
+  }
+
+  public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
+      final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, 
+      final int combinedFooterLength) throws IOException {
+    
+    final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
+    final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
+    
     FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor<FileMetaData, IOException>() {
       @Override
       public FileMetaData visit(NoFilter filter) throws IOException {
-        return readFileMetaData(from);
+        return readFileMetaData(from, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(SkipMetadataFilter filter) throws IOException {
-        return readFileMetaData(from, true);
+        return readFileMetaData(from, true, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(OffsetMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByStart(readFileMetaData(from), filter);
+        return filterFileMetaDataByStart(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
 
       @Override
       public FileMetaData visit(RangeMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByMidpoint(readFileMetaData(from), filter);
+        return filterFileMetaDataByMidpoint(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
     });
     LOG.debug("{}", fileMetaData);
-    ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData);
+    
+    if (!encryptedFooter && null != fileDecryptor) {
+      if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file
+        fileDecryptor.setPlaintextFile();
+        // Done to detect files that were not encrypted by mistake
+        if (!fileDecryptor.plaintextFilesAllowed()) {
+          throw new IOException("Applying decryptor on plaintext file");

Review comment:
       I think `ParquetCryptoRuntimeException` would fit better here.
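   A sketch of the suggested change, reusing the exception type already introduced in the crypto package:
   
   ```java
   if (!fileDecryptor.plaintextFilesAllowed()) {
     // A decryptor applied to an unencrypted file is a crypto-configuration
     // problem, not an IO failure, so a crypto-specific exception fits better.
     throw new ParquetCryptoRuntimeException("Applying decryptor on plaintext file");
   }
   ```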

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
##########
@@ -1185,70 +1275,189 @@ static long getOffset(ColumnChunk columnChunk) {
     return offset;
   }
 
+  private static void verifyFooterIntegrity(InputStream from, InternalFileDecryptor fileDecryptor, 
+      int combinedFooterLength) throws IOException {
+    
+    byte[] nonce = new byte[AesCipher.NONCE_LENGTH];
+    from.read(nonce);
+    byte[] gcmTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    from.read(gcmTag);
+    
+    AesGcmEncryptor footerSigner =  fileDecryptor.createSignedFooterEncryptor();
+    
+    byte[] footerAndSignature = ((ByteBufferInputStream) from).slice(0).array();
+    int footerSignatureLength = AesCipher.NONCE_LENGTH + AesCipher.GCM_TAG_LENGTH;
+    byte[] serializedFooter = new byte[combinedFooterLength - footerSignatureLength];
+    System.arraycopy(footerAndSignature, 0, serializedFooter, 0, serializedFooter.length);
+
+    byte[] signedFooterAAD = AesCipher.createFooterAAD(fileDecryptor.getFileAAD());
+    byte[] encryptedFooterBytes = footerSigner.encrypt(false, serializedFooter, nonce, signedFooterAAD);
+    byte[] calculatedTag = new byte[AesCipher.GCM_TAG_LENGTH];
+    System.arraycopy(encryptedFooterBytes, encryptedFooterBytes.length - AesCipher.GCM_TAG_LENGTH, 
+        calculatedTag, 0, AesCipher.GCM_TAG_LENGTH);
+    if (!Arrays.equals(gcmTag, calculatedTag)) {
+      throw new TagVerificationException("Signature mismatch in plaintext footer");
+    }
+  }
+
   public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter) throws IOException {
+    return readParquetMetadata(from, filter, null, false, 0);
+  }
+
+  public ParquetMetadata readParquetMetadata(final InputStream from, MetadataFilter filter,
+      final InternalFileDecryptor fileDecryptor, final boolean encryptedFooter, 
+      final int combinedFooterLength) throws IOException {
+    
+    final BlockCipher.Decryptor footerDecryptor = (encryptedFooter? fileDecryptor.fetchFooterDecryptor() : null);
+    final byte[] encryptedFooterAAD = (encryptedFooter? AesCipher.createFooterAAD(fileDecryptor.getFileAAD()) : null);
+    
     FileMetaData fileMetaData = filter.accept(new MetadataFilterVisitor<FileMetaData, IOException>() {
       @Override
       public FileMetaData visit(NoFilter filter) throws IOException {
-        return readFileMetaData(from);
+        return readFileMetaData(from, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(SkipMetadataFilter filter) throws IOException {
-        return readFileMetaData(from, true);
+        return readFileMetaData(from, true, footerDecryptor, encryptedFooterAAD);
       }
 
       @Override
       public FileMetaData visit(OffsetMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByStart(readFileMetaData(from), filter);
+        return filterFileMetaDataByStart(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
 
       @Override
       public FileMetaData visit(RangeMetadataFilter filter) throws IOException {
-        return filterFileMetaDataByMidpoint(readFileMetaData(from), filter);
+        return filterFileMetaDataByMidpoint(readFileMetaData(from, footerDecryptor, encryptedFooterAAD), filter);
       }
     });
     LOG.debug("{}", fileMetaData);
-    ParquetMetadata parquetMetadata = fromParquetMetadata(fileMetaData);
+    
+    if (!encryptedFooter && null != fileDecryptor) {
+      if (!fileMetaData.isSetEncryption_algorithm()) { // Plaintext file

Review comment:
       Actually, based on the parquet-mr README:
   > Generally speaking, stick to the Sun Java Code Conventions
   
   Based on the [related section](https://www.oracle.com/java/technologies/javase/codeconventions-comments.html#286), both styles should be fine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org