Posted to dev@parquet.apache.org by nicolas paris <ni...@riseup.net> on 2022/10/28 21:44:33 UTC

Modular encryption to support arrays and nested arrays

Hello,

apparently, modular encryption does not yet support **array** types.

```scala
// Modular encryption setup: properties-driven crypto factory with the in-memory mock KMS
spark.sparkContext.hadoopConfiguration.set("parquet.crypto.factory.class", "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.kms.client.class", "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.key.list", "k1:AAECAwQFBgcICQoLDA0ODw==, k2:AAECAAECAAECAAECAAECAA==")
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.plaintext.footer", "true")
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.footer.key", "k1")
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys", "k2:rider")

val df = spark.sql("select 1 as foo, array(named_struct('foo',2, 'bar',3)) as rider, 3 as ts, uuid() as uuid")
df.write.format("parquet").mode("overwrite").save("/tmp/enc")

// The write fails with:
// Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted column [rider] not in file schema
```

also, the dotted column path does not seem to support encrypting fields inside nested
structures mixed with arrays. For example, there is no way I am aware of to target "all foo fields in rider".

```
root
 |-- foo: integer (nullable = false)
 |-- rider: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- foo: integer (nullable = false)
 |    |    |-- bar: integer (nullable = false)
 |-- ts: integer (nullable = false)
 |-- uuid: string (nullable = false)
```

so far, these two issues make arrays of confidential information impossible to encrypt, or am I missing something?

Thanks, 

Re: Modular encryption to support arrays and nested arrays

Posted by nicolas paris <ni...@riseup.net>.
Thanks for your help.
> 
> The goal is to make the exception print something like:
> *Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
> Encrypted column [rider] not in file schema column list: [foo] ,
> [rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]*
> 


this sounds good. I also had trouble applying encryption to map fields.
I finally found some pointers in the source and made it work. Also,
either key_value.value or key_value.key apparently leads to the whole
map being encrypted.

```
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys", "k2:ma.key_value.value")
val df = spark.sql("select map('foo',2, 'bar',3) as ma")
```
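
For completeness, a sketch of the write and read-back that follow (assuming the crypto factory, KMS client and key list from my first message are still set on the Hadoop configuration; the output path is only an example):

```scala
// Write the DataFrame; the whole "ma" map column ends up encrypted with k2,
// whether the column key path targets key_value.key or key_value.value.
df.write.format("parquet").mode("overwrite").save("/tmp/enc_map")

// Reading back is transparent as long as the same crypto factory and
// KMS client settings are present in the Hadoop configuration.
spark.read.parquet("/tmp/enc_map").show()
```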

> - Configuring a key for all children of a nested schema node (e.g.
> "k2:rider.*"). This had been discussed in the past, but not followed
> up.
> Is this something you'd be interested in building?

I'm afraid this is not something useful in my context for now. The columns
to be encrypted are a carefully hand-crafted list - there is no use
for * right now.

Re: Modular encryption to support arrays and nested arrays

Posted by Gidon Gershinsky <gg...@gmail.com>.
Parquet columnar encryption supports these types. Currently, it requires an
explicit full path for each column to be encrypted.
Your sample will work with
*spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys",
"k2:rider.list.element.foo,rider.list.element.bar")*

Having said that, there are a couple of things that can be improved (thank
you for running these checks!)

- the exception text is not informative enough and doesn't help much in
correcting the parameters. I've opened a Jira for this (and for updating
the parameter documentation).
The goal is to make the exception print something like:
*Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException:
Encrypted column [rider] not in file schema column list: [foo] ,
[rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]*

- Configuring a key for all children of a nested schema node (e.g.
"k2:rider.*"). This had been discussed in the past, but not followed up.
Is this something you'd be interested in building? Alternatively, I can do it,
but it will take me a while to get to.
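
If that were built, the configuration might look something like the following (hypothetical syntax, not supported today):

```scala
// Hypothetical: a single wildcard entry covering every leaf under "rider",
// i.e. rider.list.element.foo and rider.list.element.bar in the example schema.
spark.sparkContext.hadoopConfiguration.set("parquet.encryption.column.keys", "k2:rider.*")
```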


Cheers, Gidon

