Posted to github@arrow.apache.org by "tolleybot (via GitHub)" <gi...@apache.org> on 2023/03/17 19:52:31 UTC

[GitHub] [arrow] tolleybot opened a new pull request, #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

tolleybot opened a new pull request, #34616:
URL: https://github.com/apache/arrow/pull/34616

   ### Rationale for this change
   
   The purpose of this pull request is to support Parquet modular encryption in the new Dataset API. See https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit# for the supporting design document.
   
   
   ### What changes are included in this PR?
   
   This PR updates the C++ and Python code so the file writer and file reader can use customized encryption settings for each file. Previously, the Dataset API applied the same encryption properties to every saved file. On the Python side, the ParquetFormat class now accepts DatasetEncryptionConfiguration and DatasetDecryptionConfiguration structures; passing the format object to the write_dataset function lets you set unique encryption properties for each file in your dataset.
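   
   For reference, a condensed C++ sketch of the intended write flow, based on the DatasetEncryptionConfiguration/SetDatasetEncryptionConfig names used in this revision of the PR and on the dataset_encryption_test.cc hunks quoted later in this thread (an illustration, not a standalone program):
   
   ```cpp
   // Assumes dataset_encryption_config, filesystem, and scanner were created
   // beforehand, as in the test code quoted later in this thread.
   auto file_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
   auto write_options = std::static_pointer_cast<arrow::dataset::ParquetFileWriteOptions>(
       file_format->DefaultWriteOptions());
   // Attach the per-file encryption configuration to the write options.
   write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
   
   arrow::dataset::FileSystemDatasetWriteOptions fs_write_options;
   fs_write_options.file_write_options = write_options;
   fs_write_options.filesystem = filesystem;
   fs_write_options.base_dir = "";
   fs_write_options.basename_template = "part{i}.parquet";
   arrow::Status st =
       arrow::dataset::FileSystemDataset::Write(fs_write_options, scanner);
   ```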
   
   ### Are these changes tested?
   
   Yes, unit tests are included, along with a Python sample project.
   
   ### Are there any user-facing changes?
   
   Yes. As stated above, the ParquetFormat class exposes optional DatasetEncryptionConfiguration and DatasetDecryptionConfiguration parameters through setters and getters.
   MakeReaderProperties now optionally takes a filesystem object and a path.
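   
   A hedged sketch of the extended helper's shape (parameter names, order, and defaults here are assumptions for illustration, not the exact merged signature):
   
   ```cpp
   // file_parquet.cc helper, extended so that per-file decryption properties
   // can be derived from the file's path and filesystem.
   parquet::ReaderProperties MakeReaderProperties(
       const ParquetFileFormat& format,
       ParquetFragmentScanOptions* parquet_scan_options,
       const std::string& path = "",
       std::shared_ptr<arrow::fs::FileSystem> filesystem = nullptr);
   ```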
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1140705190


##########
cpp/src/parquet/properties.h:
##########
@@ -209,6 +209,22 @@ class PARQUET_EXPORT WriterProperties {
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
           page_checksum_enabled_(false) {}
+    
+    Builder(const WriterProperties& properties)

Review Comment:
   Would you mind assigning them in the ctor list?
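   
   A minimal, self-contained illustration of the suggestion (the class and member here are generic stand-ins, not the actual WriterProperties members):
   
   ```cpp
   #include <string>
   
   class Builder {
    public:
     // Suggested style: initialize the member in the constructor's
     // initializer list rather than assigning it in the constructor body.
     explicit Builder(const std::string& created_by) : created_by_(created_by) {}
   
    private:
     std::string created_by_;
   };
   ```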





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1142836269


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -31,6 +31,8 @@
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"
+#include "parquet/encryption/dataset_encryption_config.h"

Review Comment:
   Should we add macros to these header files so the compiler won't complain when encryption is not enabled (e.g. `PARQUET_REQUIRE_ENCRYPTION` is OFF)? Or at least make sure they compile without encryption enabled.
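   
   A minimal sketch of the kind of guard being suggested, assuming the include itself is what needs protecting (the macro name comes from the comment above; the rest is illustrative):
   
   ```cpp
   // Only pull in the encryption header when the feature is compiled in,
   // so that builds with PARQUET_REQUIRE_ENCRYPTION=OFF still compile.
   #ifdef PARQUET_REQUIRE_ENCRYPTION
   #include "parquet/encryption/dataset_encryption_config.h"
   #endif
   ```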





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1223054110


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########


Review Comment:
   I tested the build against the unit tests as you suggested.  What changes are critical to getting the PR merged? Thanks.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269769347


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -637,10 +717,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    dataset_decryption_config : ParquetDecryptionConfig, default None

Review Comment:
   I changed the parameter to just decryption_config to make it match what I did in WriteProperties with encryption.





[GitHub] [arrow] jorisvandenbossche commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1678633493

   It seems we have such a build in the nightlies (which I triggered above):
   
   https://github.com/apache/arrow/blob/e7ece9ae8cd84ef67e83d50bfd912638a5355838/dev/tasks/tasks.yml#L1566-L1569




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1171350981


##########
cpp/write_dataset_example.py:
##########
@@ -0,0 +1,71 @@
+import sys
+sys.path.append('/home/ubuntu/projects/tolleybot_arrow/python')

Review Comment:
   Moved that example to python/examples/dataset and removed the hard-coded path.





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1348644455


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   I think we typically put those docs just above the struct? (as you had initially)





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1752862299

   Yes, that's more or less what I was thinking as well. But then we still need a compile-time switch between those two for what gets imported in `_dataset_parquet.pyx`. And I am not sure we can then write that conditional in C to import Cython files?




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753205273

   > From a quick look, it seems trivial to write a `def` function in the optional module that does this for you. Why do you think this makes it more complex or more difficult to maintain?
   
   It will not be a `def` function but a `cdef` function (actually, several of them) since it has to handle C++ types.
   Unless a `cdef` function is looked up at runtime, this makes things more complicated than a simple Python import.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1306118536


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *
+
 
 cdef Expression _true = Expression._scalar(True)
 
 
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+cdef class ParquetEncryptionConfig(_Weakrefable):

Review Comment:
   The decryption_properties I think you are referring to is the FileDecryptionProperties class, which is a different thing from decryption_config, whose C++ type is ParquetDecryptionConfig.
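   
   For context, a condensed sketch of how the two types relate, based on the file_parquet.cc hunk quoted later in this thread (variable names are illustrative): the ParquetDecryptionConfig holds the crypto factory and KMS configuration, and the per-file FileDecryptionProperties are derived from it at scan time.
   
   ```cpp
   // Derive per-file FileDecryptionProperties from the ParquetDecryptionConfig.
   auto config = parquet_scan_options->parquet_decryption_config;
   if (config != nullptr) {
     auto file_decryption_props =
         config->crypto_factory->GetFileDecryptionProperties(
             *config->kms_connection_config, *config->decryption_config,
             path, filesystem);
     parquet_scan_options->reader_properties->file_decryption_properties(
         std::move(file_decryption_props));
   }
   ```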





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1310403498


##########
python/CMakeLists.txt:
##########
@@ -336,6 +343,8 @@ if(PYARROW_BUILD_PARQUET_ENCRYPTION)
   else()
     message(FATAL_ERROR "You must build Arrow C++ with PARQUET_REQUIRE_ENCRYPTION=ON")
   endif()
+else()
+  set(CYTHON_COMPILE_TIME_ENV "PARQUET_ENCRYPTION_ENABLED=0")
 endif()

Review Comment:
   I looked into this and found an issue. The way the other parameters are added creates a CMake list, so each parameter ends up separated by a ";", which won't work for what I need to do.
   
   Example output:
   CYTHON_FLAGS: ;-E PARQUET_ENCRYPTION_ENABLED=0;--warning-errors
   
   This is due to the way items are added:
   set(CYTHON_FLAGS "${CYTHON_FLAGS}" "--warning-errors")
   instead of
   set(CYTHON_FLAGS "${CYTHON_FLAGS} --warning-errors")
   
   For now I'll keep things as I have them unless you have another idea.





[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329693825


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
+  void Setup(

Review Comment:
   I don't see this removed, did you forget to push your changes?





[GitHub] [arrow] pitrou commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1717888175

   @tolleybot I'll take a look again, thanks.




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1347382786


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +72,28 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  auto parquet_decrypt_config = parquet_scan_options->parquet_decryption_config;
+
+  if (parquet_decrypt_config != nullptr) {
+    auto file_decryption_prop =
+        parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties(
+            *parquet_decrypt_config->kms_connection_config,
+            *parquet_decrypt_config->decryption_config, path, filesystem);
+
+    parquet_scan_options->reader_properties->file_decryption_properties(
+        std::move(file_decryption_prop));
+  }
+#else
+  if (parquet_scan_options->parquet_decryption_config != nullptr) {
+    return Status::NotImplemented("Encryption is not supported in this build.");
+  }

Review Comment:
   Ouch, my bad. We cannot return a `Status` here, we must instead throw a Parquet exception.
   (this can be seen on some CI jobs that have encryption disabled: 
   https://github.com/apache/arrow/actions/runs/6411553066/job/17408394537?pr=34616#step:5:3088 )
   
   Perhaps something like (untested):
   ```suggestion
     if (parquet_scan_options->parquet_decryption_config != nullptr) {
       parquet::ParquetException::NYI("Encryption is not supported in this build.");
     }
   ```





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340143239


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : CryptoFactory
+            Factory for creating cryptographic instances.
+        kms_connection_config : KmsConnectionConfig

Review Comment:
   ```suggestion
           crypto_factory : pyarrow.parquet.encryption.CryptoFactory
               Factory for creating cryptographic instances.
           kms_connection_config : pyarrow.parquet.encryption.KmsConnectionConfig
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -678,10 +843,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    decryption_config : ParquetDecryptionConfig, default None

Review Comment:
   ```suggestion
       decryption_config : pyarrow.dataset.ParquetDecryptionConfig, default None
   ```



##########
python/pyarrow/includes/libarrow_parquet_readwrite.pxd:
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:

Review Comment:
   I assume this is related to having the separate pxd file for encryption (see comment below https://github.com/apache/arrow/pull/34616/files#r1317199494)? 
   If we don't split those into a separate file, it might not work to do the following:
   
   ```
   from pyarrow.includes.libarrow_dataset_parquet cimport *
   
   IF PARQUET_ENCRYPTION_ENABLED:
       from pyarrow.includes.libarrow_parquet_readwrite_encryption cimport *
       from pyarrow._parquet_encryption cimport *
   ELSE:
       from pyarrow.includes.libarrow_parquet_readwrite cimport *
   ```
   
   Because otherwise, if `CParquetFileWriteOptions` was also defined in `libarrow_dataset_parquet.pxd`, importing it from `libarrow_parquet_readwrite_encryption` would duplicate the imported declaration (and I'm not sure Cython would be happy about that).



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -78,7 +239,8 @@ cdef class ParquetFileFormat(FileFormat):
         CParquetFileFormat* parquet_format
 
     def __init__(self, read_options=None,
-                 default_fragment_scan_options=None, **kwargs):
+                 default_fragment_scan_options=None,
+                 **kwargs):

Review Comment:
   Small general comment: while there is nothing wrong with this change, trying to avoid such stylistic changes (or other whitespace changes) does make it easier to review the diff.



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : CryptoFactory
+            Factory for creating cryptographic instances.
+        kms_connection_config : KmsConnectionConfig
+            Configuration for connecting to Key Management Service.
+        encryption_config : EncryptionConfiguration

Review Comment:
   ```suggestion
           encryption_config : pyarrow.parquet.encryption.EncryptionConfiguration
   ```
   
   This is long, but I think being explicit with the full name helps readers understand which objects come from the parquet module and which from the dataset module (as that still confuses me in this PR, especially with the very similar names EncryptionConfiguration vs ParquetEncryptionConfig).
   
   (and the same for the ParquetDecryptionConfig docstring)



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,65 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet::encryption {
+class CryptoFactory;
+struct KmsConnectionConfig;
+struct EncryptionConfiguration;
+struct DecryptionConfiguration;
+}  // namespace parquet::encryption
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {

Review Comment:
   Could you expand this doc comment a bit more? For example in the scan options parameter above this is explained as "configuration structure that provides encryption properties for a dataset", and in the python docs as "Configuration for Parquet Encryption".  
   I also don't directly understand what "translates parameters" means in this context (not being familiar with the encryption implementation, though)
   
   (and then when expanding the explanation here, you can add the same in the python docstring)
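   
   One possible expansion along those lines, drawing on the member descriptions quoted earlier in this thread (the wording is a suggestion, not the merged text):
   
   ```cpp
   /// \brief Configuration for Parquet modular encryption in the Dataset API.
   ///
   /// Bundles the objects the dataset writer needs to produce per-file
   /// encryption properties: a CryptoFactory that creates the cryptographic
   /// components, a KmsConnectionConfig describing how to connect to the Key
   /// Management Service, and an EncryptionConfiguration with the
   /// Parquet-level settings such as the footer key and per-column keys.
   struct ARROW_DS_EXPORT ParquetEncryptionConfig {
     std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
     std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
     std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
   };
   ```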



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   We could also always add the property, but let it raise NotImplementedError when encryption is not enabled?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1259707663


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   @wgtmac @westonpace @wjones127 
   I need to get some idea of how the flow would go if I move the DatasetEncryptionConfiguration into ParquetFragmentScanOptions and ParquetFileWriteOptions. Here is a sample of the current implementation being used from Python:
   
   https://github.com/tolleybot/arrow/blob/f3f73d86c28551f89b5aab08bfd120dc85dfca80/python/examples/dataset/write_dataset_encrypted.py#L43-L74
   
   My questions are:
   1. How would the DatasetEncryptionConfiguration be propagated to those classes? Through ParquetFileFormat?
   2. When would the propagation take place?
   3. Looking at the Python sample, how would this look with this change?
   
   Thanks for any help.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1265506199


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   @wgtmac I pushed the changes this morning. It should be ready for review.





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1266912510


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    ARROW_CHECK(crypto_factory != NULLPTR);
+    ARROW_CHECK(kms_connection_config != NULLPTR);
+    ARROW_CHECK(encryption_config != NULLPTR);
+    this->crypto_factory = std::move(crypto_factory);
+    this->kms_connection_config = std::move(kms_connection_config);
+    this->encryption_config = std::move(encryption_config);
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  DatasetDecryptionConfiguration()

Review Comment:
   Ditto for removing the constructor.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our ParquetFileFormat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;

Review Comment:
   It would be good to call the Setup() function here and below, rather than relying on the default config implicitly.
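   
   For illustration, a sketch with the Setup() signature quoted above, assuming a kms_connection_config has been created alongside the crypto factory and encryption config:
   
   ```cpp
   // Configure explicitly via Setup(), which ARROW_CHECKs for null arguments,
   // instead of assigning members and relying on defaults.
   DatasetEncryptionConfiguration dataset_encryption_config;
   dataset_encryption_config.Setup(crypto_factory, kms_connection_config,
                                   encryption_config);
   ```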



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(

Review Comment:
   Could you remove this constructor and not assign any default values to the member variables? In the Python code you can always depend on the `Setup` function. This would let C++ users use aggregate initialization like below:
   ```cpp
   DatasetEncryptionConfiguration config{.crypto_factory = xxx, .kms_connection_config = yyy};
   ``` 



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our ParquetFileFormat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<ParquetFileWriteOptions>(file_write_options);
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our ParquetFileFormat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));

Review Comment:
   ```suggestion
     ASSERT_OK(file_system->CreateDir(kBaseDir));
   ```
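   Since `kBaseDir` is defined as the empty string in this test, the two calls are equivalent today; the named constant just makes the intent explicit and keeps the call correct if `kBaseDir` ever changes. (Note that `kBaseDir` is a `std::string_view`, so the call may need an explicit `std::string(kBaseDir)` to compile.)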



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
[...]
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
   ```
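   For context, a minimal sketch of what the suggestion resolves to, assuming `arrow/util/checked_cast.h` is included:

   ```cpp
   #include "arrow/util/checked_cast.h"

   // In debug builds, checked_pointer_cast verifies the downcast with a
   // DCHECK'd dynamic_cast; in release builds it compiles down to a plain
   // static_pointer_cast. A wrong cast then fails loudly under test
   // instead of silently producing a null pointer.
   auto mock_fs = arrow::internal::checked_pointer_cast<
       ::arrow::fs::internal::MockFileSystem>(file_system);
   ```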



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
[...]
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object

Review Comment:
   Could you fix the comments to make sure they all have capitalized initials?
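   For instance, the flagged comment would become:

   ```cpp
   // Create our Parquet file format object
   auto file_format = std::make_shared<ParquetFileFormat>();
   ```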



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
[...]
+// verify if Parquet metadata can be read without decryption
+// properties when the footer is encrypted:
+TEST_F(DatasetEncryptionTest, CannotReadMetadataWithEncryptedFooter) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
[...]
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);

Review Comment:
   Can this use an exact `ASSERT_EQ`?
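   Something like the following, assuming every path produced by the dataset writer is a regular file:

   ```cpp
   // Assert the exact file type rather than merely "not NotFound":
   // the entry must both exist and be a regular file.
   ASSERT_EQ(result.type(), arrow::fs::FileType::File);
   ```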



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
[...]
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   BTW, do we really need this cast? It seems you only need `mock_fs->OpenInputFile(file_path)`, which is already available on the base `FileSystem` class.
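
   For illustration, a minimal sketch (assuming the decryption side is wired up through the scan options as elsewhere in this test) of opening one of the written files through the base `FileSystem` pointer, with no downcast:

    ```cpp
    // Open a written file through the base arrow::fs::FileSystem interface.
    // OpenInputFile() is virtual on FileSystem, so the MockFileSystem cast is
    // not needed for this. The path is one of the partition files written above.
    std::shared_ptr<arrow::fs::FileSystem> base_fs = file_system;
    ASSERT_OK_AND_ASSIGN(auto input_file,
                         base_fs->OpenInputFile("part=a/part0.parquet"));
    // input_file is a std::shared_ptr<arrow::io::RandomAccessFile> that can be
    // handed to a Parquet reader together with the decryption properties.
    ```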



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);

Review Comment:
   ```suggestion
       auto kms_client_factory =
           std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
               wrap_locally, key_list);
   ```
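
   Using `auto` keeps the declared type identical to what `make_shared` returns (a `shared_ptr` to the test KMS factory); the implicit upcast to `::parquet::encryption::KmsClientFactory` still happens at the `RegisterKmsClientFactory` call, so nothing else needs to change.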



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   It is already in the `arrow::dataset` namespace, so probably `ParquetEncryptionConfig` is enough? My rationale is that the config only applies to Parquet.
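
   A sketch of what the rename might look like, keeping the two members this test exercises (`crypto_factory` and `encryption_config`); any remaining members would carry over unchanged:

    ```cpp
    // Hypothetical rename: same layout, Parquet-specific name.
    struct ARROW_DS_EXPORT ParquetEncryptionConfig {
      std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory;
      std::shared_ptr<::parquet::encryption::EncryptionConfiguration> encryption_config;
    };
    ```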



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties

Review Comment:
   ```suggestion
     // Create dataset encryption properties
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */

Review Comment:
   ```suggestion
      // A utility function to validate our files were written out
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -57,6 +57,9 @@ struct SchemaManifest;
 namespace arrow {
 namespace dataset {
 
+struct DatasetEncryptionConfiguration;
+struct DatasetDecryptionConfiguration;

Review Comment:
   ```suggestion
   struct DatasetDecryptionConfiguration;
   struct DatasetEncryptionConfiguration;
   ```
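
   (Purely an alphabetical ordering of the two forward declarations; no behavior change.)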



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));

Review Comment:
   ```suggestion
     ASSERT_OK(file_system->CreateDir(kBaseDir));
   ```
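
   Since `kBaseDir` is defined as the empty string at the top of this file, the behavior is unchanged; the suggestion just ties the directory creation to the same constant used for `write_options.base_dir` and the `FileSelector` below.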



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
   ```
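
   Assuming the usual definition in `arrow/util/checked_cast.h`, `checked_pointer_cast` compiles down to a static cast in release builds but `DCHECK`s the corresponding `dynamic_cast` in debug builds, so a wrong cast fails loudly in tests instead of yielding a null pointer the way `std::dynamic_pointer_cast` would:

    ```cpp
    #include "arrow/util/checked_cast.h"

    using ::arrow::internal::checked_pointer_cast;  // debug-checked static cast
    ```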



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);

Review Comment:
   ```suggestion
     // Scan the dataset and check if there was an error during iteration
     ASSERT_OK(scanner_in->Scan(visitor));
   ```
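
   This also drops the intermediate `status` variable; since the `visitor` lambda returns `Status::OK()` for every batch, the assertion effectively verifies that decryption and batch iteration succeed.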



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<ParquetFileWriteOptions>(file_write_options);
   ```
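
   For readers unfamiliar with the helper: unlike `std::static_pointer_cast`, a
   checked cast verifies the downcast in debug builds. A minimal sketch of the
   idea (illustrative only, not Arrow's exact implementation; assumes `Base` is
   polymorphic):
   ```cpp
   #include <cassert>
   #include <memory>

   template <typename Derived, typename Base>
   std::shared_ptr<Derived> CheckedPointerCast(std::shared_ptr<Base> ptr) {
     // Debug builds verify the downcast via dynamic_cast; release builds
     // compile this down to a plain static_pointer_cast.
     assert(ptr == nullptr || dynamic_cast<Derived*>(ptr.get()) != nullptr);
     return std::static_pointer_cast<Derived>(std::move(ptr));
   }
   ```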



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create the dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;

Review Comment:
   BTW, why not do this directly:
   ```cpp
   auto dataset_encryption_config = std::make_shared<DatasetEncryptionConfiguration>();
   dataset_encryption_config->Setup(crypto_factory, connection_config, encryption_config);
   
   auto dataset_decryption_config = std::make_shared<DatasetDecryptionConfiguration>();
   dataset_decryption_config->Setup(crypto_factory, connection_config, decryption_config);
   
   return std::make_pair(std::move(dataset_encryption_config), std::move(dataset_decryption_config));
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create the dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Wrap it in an in-memory dataset
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption.
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each
+// file, based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create the root directory
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read back a single Parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create the root directory
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+  // ----- Read Single File -----
+
+  // Define the path to the encrypted Parquet file
+  std::string file_path = "part=a/part0.parquet";
+
+  auto crypto_factory = dataset_decryption_config->crypto_factory;
+
+  // Get the FileDecryptionProperties object using the CryptoFactory object
+  auto file_decryption_properties = crypto_factory->GetFileDecryptionProperties(
+      *dataset_decryption_config->kms_connection_config,
+      *dataset_decryption_config->decryption_config);
+
+  // Create the ReaderProperties object using the FileDecryptionProperties object
+  auto reader_properties = std::make_shared<parquet::ReaderProperties>();
+  reader_properties->file_decryption_properties(file_decryption_properties);
+
+  // Open the Parquet file using the MockFileSystem
+  std::shared_ptr<arrow::io::RandomAccessFile> input;
+  ASSERT_OK_AND_ASSIGN(input, mock_fs->OpenInputFile(file_path));
+
+  parquet::arrow::FileReaderBuilder reader_builder;
+  ASSERT_OK(reader_builder.Open(input, *reader_properties));
+
+  ASSERT_OK_AND_ASSIGN(auto arrow_reader, reader_builder.Build());
+
+  // Read entire file as a single Arrow table
+  std::shared_ptr<arrow::Table> table;
+  ASSERT_OK(arrow_reader->ReadTable(&table));
+
+  // Add assertions to check the contents of the table
+  ASSERT_EQ(table->num_rows(), 1);
+  ASSERT_EQ(table->num_columns(), 3);
+}
+
+// Verify that Parquet metadata cannot be read without decryption
+// properties when the footer is encrypted.
+TEST_F(DatasetEncryptionTest, CannotReadMetadataWithEncryptedFooter) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<ParquetFileWriteOptions>(file_write_options);
   ```




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269624502


##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a table that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    plaintext_footer=False,
+    # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },
+    encryption_algorithm="AES_GCM_V1",
+    # requires timedelta or an assertion is raised
+    cache_lifetime=timedelta(minutes=5.0),
+    data_key_length_bits=256)
+
+kms_connection_config = pe.KmsConnectionConfig(
+    custom_kms_conf={
+        FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+        COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+    }
+)
+
+decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+crypto_factory = pe.CryptoFactory(kms_factory)
+dataset_encryption_cfg = ds.ParquetEncryptionConfig(
+    crypto_factory, kms_connection_config, encryption_config)
+dataset_decryption_cfg = ds.ParquetDecryptionConfig(crypto_factory,

Review Comment:
   Ok




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253438947


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,12 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc

Review Comment:
   It looks like the add_arrow_dataset_test CMake function derives the test name from the first argument you provide, in this case dataset_encryption_test.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254756807


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_decryption_config_;
+#else
+    return NULLPTR;
+#endif

Review Comment:
   done




[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253245351


##########
cpp/src/parquet/properties.h:
##########
@@ -218,7 +218,23 @@ class PARQUET_EXPORT WriterProperties {
           data_page_version_(ParquetDataPageVersion::V1),
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
-          page_checksum_enabled_(false) {}
+          page_checksum_enabled_(false),
+          default_column_properties_() {}
+
+    explicit Builder(const WriterProperties& properties)
+        : pool_(::arrow::default_memory_pool()),

Review Comment:
   ```suggestion
           : pool_(properties.memory_pool()),
   ```



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()

Review Comment:
   ```suggestion
   /// core class, that translates the parameters of high level encryption
   struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
     DatasetEncryptionConfiguration()
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_decryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+  /// \brief A setter for DatasetDecryptionConfiguration
+  void SetDatasetDecryptionConfig(
+      std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config) {
+    dataset_decryption_config_ = std::move(dataset_decryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   ```suggestion
     std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_;
   ```



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;

Review Comment:
   ```suggestion
       this->crypto_factory = std::move(crypto_factory);
       this->kms_connection_config = std::move(kms_connection_config);
       this->encryption_config = std::move(encryption_config);
   ```
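
   For context, this is the pass-by-value-then-move idiom for sink parameters:
   since `Setup()` already takes the `shared_ptr`s by value, moving them into
   the members avoids an extra atomic refcount increment. A self-contained
   sketch of the pattern:
   ```cpp
   #include <memory>
   #include <utility>

   struct Holder {
     std::shared_ptr<int> value;
     // The parameter is a by-value sink: moving it into the member transfers
     // ownership without touching the atomic reference count.
     void Setup(std::shared_ptr<int> v) { value = std::move(v); }
   };

   int main() {
     Holder holder;
     holder.Setup(std::make_shared<int>(42));
     return 0;
   }
   ```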



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+  DatasetDecryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()),
+        decryption_config(
+            std::make_shared<parquet::encryption::DecryptionConfiguration>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->decryption_config = decryption_config;

Review Comment:
   ```suggestion
       this->crypto_factory = std::move(crypto_factory);
       this->kms_connection_config = std::move(kms_connection_config);
       this->decryption_config = std::move(decryption_config);
   ```



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {

Review Comment:
   It seems to overlap a lot with `DatasetEncryptionConfiguration`. Is there any chance to consolidate them?
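
   One hypothetical way to factor out the overlap (names are illustrative and
   not part of this PR): hoist the shared members into a common base and keep
   only the direction-specific configuration in each struct.
   ```cpp
   // Assumes the parquet encryption headers above are already included.
   struct ParquetCryptoConfigBase {
     std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
     std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   };

   struct DatasetEncryptionConfiguration : ParquetCryptoConfigBase {
     std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
   };

   struct DatasetDecryptionConfiguration : ParquetCryptoConfigBase {
     std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config;
   };
   ```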



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -16,14 +16,12 @@
 // under the License.
 
 #include "arrow/dataset/file_parquet.h"
-

Review Comment:
   It seems that the empty line is still missing.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -16,14 +16,12 @@
 // under the License.
 
 #include "arrow/dataset/file_parquet.h"
-
 #include <memory>
 #include <mutex>
 #include <unordered_map>
 #include <unordered_set>
 #include <utility>
 #include <vector>
-

Review Comment:
   ditto, please revert this change.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +66,25 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetDecryptionConfiguration> dataset_decrypt_config =
+      format.GetDatasetDecryptionConfig();
+
+  if (dataset_decrypt_config != nullptr) {
+    auto file_decryption_prop =
+        dataset_decrypt_config->crypto_factory->GetFileDecryptionProperties(
+            *dataset_decrypt_config->kms_connection_config.get(),
+            *dataset_decrypt_config->decryption_config.get(), path, filesystem);

Review Comment:
   ```suggestion
               *dataset_decrypt_config->kms_connection_config,
               *dataset_decrypt_config->decryption_config, path, filesystem);
   ```
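
   For reference, `std::shared_ptr::operator*` already returns a reference to
   the managed object, so the explicit `.get()` is redundant. A tiny
   demonstration:
   ```cpp
   #include <cassert>
   #include <memory>

   int main() {
     auto p = std::make_shared<int>(42);
     // *p and *p.get() refer to the same object.
     assert(&*p == p.get());
     return 0;
   }
   ```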



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;

Review Comment:
   ```suggestion
       return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
   ```



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+  DatasetDecryptionConfiguration()

Review Comment:
   ```suggestion
   /// core class, that translates the parameters of high level encryption
   struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
     DatasetDecryptionConfiguration()
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,7 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"

Review Comment:
   It would be better to use a forward declaration here and only include this header file in file_parquet.cc.
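
   A sketch of what that could look like, assuming the header only refers to
   these types through `std::shared_ptr` (so incomplete types suffice):
   ```cpp
   // file_parquet.h: forward declarations instead of the full include
   namespace arrow {
   namespace dataset {

   struct DatasetEncryptionConfiguration;
   struct DatasetDecryptionConfiguration;

   }  // namespace dataset
   }  // namespace arrow

   // file_parquet.cc: pull in the full definitions where they are used
   // #include "arrow/dataset/parquet_encryption_config.h"
   ```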



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_decryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+  /// \brief A setter for DatasetDecryptionConfiguration
+  void SetDatasetDecryptionConfig(
+      std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config) {
+    dataset_decryption_config_ = std::move(dataset_decryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;
+  // A configuration structure that provides per file encryption and decryption properties
+  // for a dataset
+  std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config_ = NULLPTR;

Review Comment:
   ```suggestion
     std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config_;
   ```



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;

Review Comment:
   ```suggestion
     std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
     std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
     std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -218,7 +218,23 @@ class PARQUET_EXPORT WriterProperties {
           data_page_version_(ParquetDataPageVersion::V1),
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
-          page_checksum_enabled_(false) {}
+          page_checksum_enabled_(false),
+          default_column_properties_() {}

Review Comment:
   I don't think this needs to be explicit.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}

Review Comment:
   Please remove it if it does nothing.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys, ds_num_columns,
+                         ds_footer_master_key_id, ds_footer_master_key);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",

Review Comment:
   In this case all files are encrypted, right? Do we support the following cases?
   - Some files are encrypted while others are not.
   - Files are encrypted with different encryption configs (e.g. different keys).
   
   If the answer to either is yes, we probably need a test case to cover that.
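   
   For instance, a rough sketch of what a mixed-keys test might look like, reusing the `CreateFileFormat` helper from this test file (the second column key below is made up purely for illustration):
   
   ```
   // Hypothetical: two formats whose key maps hold different column keys.
   constexpr std::string_view kOtherColumnMasterKeys[] = {"9876543210987654"};
   
   auto format_a =
       CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys,
                        ds_num_columns, ds_footer_master_key_id, ds_footer_master_key);
   auto format_b =
       CreateFileFormat(ds_column_master_key_ids, kOtherColumnMasterKeys,
                        ds_num_columns, ds_footer_master_key_id, ds_footer_master_key);
   // Write partition A with format_a and partition B with format_b, then try
   // scanning B's files through format_a and assert that decryption fails.
   ```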



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;

Review Comment:
   Should we check to make sure none of them is null?
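   
   For example, a minimal sketch of such a guard (assuming `ARROW_CHECK` from `arrow/util/logging.h` is acceptable here; a `Status`-returning `Setup` would be the softer alternative):
   
   ```
   void Setup(
       std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
       std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
       std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
     // Reject null arguments up front instead of failing later at use sites.
     ARROW_CHECK(crypto_factory != nullptr);
     ARROW_CHECK(kms_connection_config != nullptr);
     ARROW_CHECK(encryption_config != nullptr);
     this->crypto_factory = std::move(crypto_factory);
     this->kms_connection_config = std::move(kms_connection_config);
     this->encryption_config = std::move(encryption_config);
   }
   ```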



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;

Review Comment:
   ```suggestion
   constexpr std::string_view kFooterMasterKey = "0123456789012345";
   constexpr std::string_view kFooterMasterKeyId = "footer_key";
   constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
   constexpr std::string_view kColumnMasterKeyIds[] = {"col_key"};
   constexpr int kNumColumns = 1;
   ```
   
   It would be good to follow the naming conventions.
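   
   (The `k` prefix for constants follows the Google C++ style that the Arrow codebase uses, which is why the suggestion also turns `ds_num_columns` into a `constexpr` named `kNumColumns`.)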



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}

Review Comment:
   ```suggestion
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration

Review Comment:
   So not all of the parameters are required here?



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+  DatasetDecryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()),
+        decryption_config(
+            std::make_shared<parquet::encryption::DecryptionConfiguration>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->decryption_config = decryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config;

Review Comment:
   ```suggestion
     std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
     std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
     std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config;
   ```



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,12 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc

Review Comment:
   Did you mean dataset_encryption_test.cc? I don't see it compiled anywhere.
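   
   If the intent is to compile both files into the test target, a sketch of what that might look like (assuming `add_arrow_dataset_test` takes the SOURCES list verbatim, so the default `dataset_encryption_test.cc` has to be listed explicitly once SOURCES is given):
   
   ```
   add_arrow_dataset_test(dataset_encryption_test
                          SOURCES
                          dataset_encryption_test.cc
                          ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc)
   ```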



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +637,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config.get(),
+            *dataset_encrypt_config->encryption_config.get(), destination_locator.path,

Review Comment:
   ```suggestion
               *dataset_encrypt_config->kms_connection_config,
               *dataset_encrypt_config->encryption_config, destination_locator.path,
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));

Review Comment:
   You may actually call `TableFromJSON` to create a table directly from a JSON string.
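   
   For example, a sketch of the equivalent call with the same ten rows the builders produce (`TableFromJSON` lives in `arrow/testing/gtest_util.h`, which this test already includes):
   
   ```
   // Columns: a (int64), b (int64), c (int64), part (utf8).
   auto table = ::arrow::TableFromJSON(schema, {R"([
     [0, 9, 1, "a"], [1, 8, 2, "b"], [2, 7, 1, "c"], [3, 6, 2, "d"],
     [4, 5, 1, "e"], [5, 4, 2, "f"], [6, 3, 1, "g"], [7, 2, 2, "h"],
     [8, 1, 1, "i"], [9, 0, 2, "j"]
   ])"});
   ```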



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);

Review Comment:
   ```suggestion
       auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {

Review Comment:
   ```suggestion
   TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
   ```
   
   It seems that we can directly define the test function here and thus avoid what line 377 does.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +637,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config.get(),
+            *dataset_encrypt_config->encryption_config.get(), destination_locator.path,
+            destination_locator.filesystem);
+
+    auto writer_properties =
+        parquet::WriterProperties::Builder(*parquet_options->writer_properties.get())

Review Comment:
   ```suggestion
           parquet::WriterProperties::Builder(*parquet_options->writer_properties)
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys, ds_num_columns,
+                         ds_footer_master_key_id, ds_footer_master_key);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";

Review Comment:
   Please add `constexpr std::string_view kBaseDir = "";` at the beginning, since it is used repeatedly in the test.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));

Review Comment:
   ```
   auto table = ::arrow::TableFromJSON(schema, {R"([
     [0, 9, 1, "a"],
     [1, 8, 2, "b"],
     [2, 7, 1, "c"]
   ])"});
   ```
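   
   (Note that `TableFromJSON` returns a `std::shared_ptr<Table>` directly, so no `Result` unwrapping is needed; the table can then be wrapped in `InMemoryDataset` exactly as before.)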



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  /// core class, that translates the parameters of high level encryption
+  DatasetDecryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()),
+        decryption_config(
+            std::make_shared<parquet::encryption::DecryptionConfiguration>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::DecryptionConfiguration> decryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->decryption_config = decryption_config;

Review Comment:
   Should we check to make sure none of them is null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1625634022

   @wgtmac The original inspiration for the DatasetEncryptionConfiguration & DatasetDecryptionConfiguration structures came from the Google doc https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#heading=h.wdoy8yup3nk9
   I am not sure whether the two separate structures were originally defined for future expansion or for other reasons I am not aware of.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253799426


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));

Review Comment:
   Good to know.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254573237


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,70 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = std::move(crypto_factory);
+    this->kms_connection_config = std::move(kms_connection_config);
+    this->encryption_config = std::move(encryption_config);
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};

Review Comment:
   Ok, I'll check it out





[GitHub] [arrow] anjakefala commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1334461383


##########
python/pyarrow/includes/libarrow_parquet_readwrite_encryption.pxd:
##########
@@ -0,0 +1,35 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+from pyarrow._parquet_encryption cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
+    cdef cppclass CParquetFileWriteOptions \

Review Comment:
   @tolleybot Was this to avoid `from pyarrow._parquet_encryption` in `libarrow_dataset_parquet.pxd`?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1163122221


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include <string_view>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Setup our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping

Review Comment:
   changed





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1143383321


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########


Review Comment:
   Will do.





[GitHub] [arrow] wjones127 commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1476948985

   On the C++ formatting: If you install clang-tools 14 (it must be that version) and set the path to the binaries with `CLANG_TOOLS_PATH`, then when you reconfigure CMake it will create a `format` target and a `lint` target that can be used to check style. (You'll know it's configured right when the CMake configure output says `Found ClangTools: /some/path/bin/clang-format-14`.)
   
   For Python formatting, see these instructions: https://arrow.apache.org/docs/developers/python.html#coding-style




[GitHub] [arrow] wjones127 commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1182661564


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,32 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = dataset_encryption_config;
+  }
+  /// \brief A setter for DatasetDecryptionConfiguration
+  void SetDatasetDecryptionConfig(
+      std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config) {
+    dataset_decryption_config_ = dataset_decryption_config;
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = nullptr;

Review Comment:
   You'll see in other header files we always use the `NULLPTR` macro:
   ```suggestion
     std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;
   ```



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########


Review Comment:
   This file also seems to need formatting, though it's not immediately obvious what is wrong.



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,32 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = dataset_encryption_config;
+  }
+  /// \brief A setter for DatasetDecryptionConfiguration
+  void SetDatasetDecryptionConfig(
+      std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config) {
+    dataset_decryption_config_ = dataset_decryption_config;
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = nullptr;
+  // A configuration structure that provides per file encryption and decryption properties
+  // for a dataset
+  std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config_ = nullptr;

Review Comment:
   ```suggestion
     std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config_ = NULLPTR;
   ```





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1182709683


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,32 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = dataset_encryption_config;
+  }
+  /// \brief A setter for DatasetDecryptionConfiguration
+  void SetDatasetDecryptionConfig(
+      std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config) {
+    dataset_decryption_config_ = dataset_decryption_config;
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = nullptr;
+  // A configuration structure that provides per file encryption and decryption properties
+  // for a dataset
+  std::shared_ptr<DatasetDecryptionConfiguration> dataset_decryption_config_ = nullptr;

Review Comment:
   I changed it





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1251384057


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"

Review Comment:
   Got it, good to know.





[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1251354690


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"

Review Comment:
   I also checked this.
   We don't need to include `arrow/util/config.h` explicitly because it is already included implicitly (`arrow/dataset/file_base.h` -> `arrow/dataset/dataset.h` -> `arrow/util/future.h` -> `arrow/util/config.h`).





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1257284061


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   I am not that familiar with the design principles of the dataset API. Is this the best place to put the new config, or should we move it into `ParquetFragmentScanOptions` and `ParquetFileWriteOptions` respectively? @westonpace @wjones127 



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   After double-checking the design doc (https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A), I think the original design of two separate classes (i.e. EncryptionConfig and DecryptionConfig) intended those configs to be added to `ParquetFileWriteOptions` and `ParquetFragmentScanOptions` separately. In the implementations of the `MakeReaderProperties` and `MakeWriter` calls, we then need to translate these configs into the current writer/reader properties. @tolleybot 
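
   As a rough sketch of that translation step (the signatures here are assumptions based on the existing `CryptoFactory` API, not the final implementation):

   ```cpp
   // Sketch only: turn the high-level dataset decryption config into low-level
   // parquet::ReaderProperties. GetFileDecryptionProperties and the
   // file_decryption_properties setter are assumed from the existing parquet
   // encryption API and may differ from the final code.
   void ApplyDecryptionConfig(const DatasetDecryptionConfiguration& config,
                              parquet::ReaderProperties* properties) {
     auto file_decryption_properties =
         config.crypto_factory->GetFileDecryptionProperties(
             *config.kms_connection_config, *config.decryption_config);
     properties->file_decryption_properties(std::move(file_decryption_properties));
   }
   ```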





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1259707663


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   @wgtmac @westonpace @wjones127 
   I need to get some idea of how the flow would go if I move the DatasetEncryptionConfiguration into ParquetFragmentScanOptions and ParquetFileWriteOptions.  Here is a sample of the current Python implementation in use:
   
   https://github.com/tolleybot/arrow/blob/f3f73d86c28551f89b5aab08bfd120dc85dfca80/python/examples/dataset/write_dataset_encrypted.py#L43-L74
   
   My questions are:
   1.  How would the DatasetEncryptionConfiguration be propagated to those classes? Through ParquetFileFormat?
   2. When would the propagation take place?
   3.  Looking at the python sample, how would this look with this change?
   
   Thanks for any help.  
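
   (For context: the updated C++ test quoted later in this thread ends up wiring the configs through the options objects roughly as follows; this is a condensed sketch of that test code, not a settled API.)

   ```cpp
   // Condensed from the dataset_encryption_test.cc quoted below: the decryption
   // config attaches to the fragment scan options, and the encryption config to
   // the file write options.
   auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
   parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);

   auto file_format = std::make_shared<ParquetFileFormat>();
   file_format->default_fragment_scan_options = parquet_scan_options;

   auto parquet_file_write_options = std::static_pointer_cast<ParquetFileWriteOptions>(
       file_format->DefaultWriteOptions());
   parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
   ```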





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269771351


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -564,6 +635,14 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             data_page_version=self._properties["data_page_version"],
         )
 
+        cdef shared_ptr[CParquetEncryptionConfig] c_config
+        if self._properties["dataset_encryption_config"]:
+            config = self._properties["dataset_encryption_config"]

Review Comment:
   I just shortened this to encryption_config here also





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269770772


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -598,6 +677,7 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             coerce_timestamps=None,
             allow_truncated_timestamps=False,
             use_compliant_nested_type=True,
+            dataset_encryption_config=None,

Review Comment:
   I just shortened this to encryption_config. Again, let me know if this is an issue





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267131347


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each file,
+// based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create the base directory
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create the base directory
+  ASSERT_OK(file_system->CreateDir(""));

Review Comment:
   OK





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267123527


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);

Review Comment:
   OK





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267200233


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each
+// file, based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+  // ----- Read Single File -----

Review Comment:
   Ok
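
   For reference, a minimal sketch of this single-file read path through the
   Python API, assuming `crypto_factory`, `kms_connection_config`, and
   `decryption_config` are built as in the example script quoted later in this
   thread, and that the path below matches a file written by the test above:

   ```python
   import pyarrow.parquet as pq

   # Derive file decryption properties from the shared crypto factory.
   file_decryption_properties = crypto_factory.file_decryption_properties(
       kms_connection_config, decryption_config)

   # Read one encrypted file directly, outside the Dataset API.
   parquet_file = pq.ParquetFile("part=a/part0.parquet",
                                 decryption_properties=file_decryption_properties)
   table = parquet_file.read()
   ```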





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269059268


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -236,11 +252,25 @@ class ARROW_DS_EXPORT ParquetFileWriteOptions : public FileWriteOptions {
   /// \brief Parquet Arrow writer properties.
   std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties;
 
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<ParquetEncryptionConfig> GetParquetEncryptionConfig() const {
+    return parquet_encryption_config_;
+  }
+  /// \brief A setter for ParquetEncryptionConfig
+  void SetParquetEncryptionConfig(
+      std::shared_ptr<ParquetEncryptionConfig> dataset_encryption_config) {
+    parquet_encryption_config_ = std::move(dataset_encryption_config);

Review Comment:
   ```suggestion
         std::shared_ptr<ParquetEncryptionConfig> parquet_encryption_config) {
       parquet_encryption_config_ = std::move(parquet_encryption_config);
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,19 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// \brief A getter function to retrieve the dataset decryption configuration

Review Comment:
   ```suggestion
     /// \brief A getter function to retrieve the parquet decryption configuration
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -236,11 +252,25 @@ class ARROW_DS_EXPORT ParquetFileWriteOptions : public FileWriteOptions {
   /// \brief Parquet Arrow writer properties.
   std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties;
 
+  /// \brief A getter function to retrieve the dataset encryption configuration

Review Comment:
   ```suggestion
     /// \brief A getter function to retrieve the parquet encryption configuration
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -637,10 +717,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    dataset_decryption_config : ParquetDecryptionConfig, default None

Review Comment:
   ```suggestion
       parquet_decryption_config : ParquetDecryptionConfig, default None
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -564,6 +635,14 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             data_page_version=self._properties["data_page_version"],
         )
 
+        cdef shared_ptr[CParquetEncryptionConfig] c_config
+        if self._properties["dataset_encryption_config"]:
+            config = self._properties["dataset_encryption_config"]

Review Comment:
   ```suggestion
           if self._properties["parquet_encryption_config"]:
               config = self._properties["parquet_encryption_config"]
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -598,6 +677,7 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             coerce_timestamps=None,
             allow_truncated_timestamps=False,
             use_compliant_nested_type=True,
+            dataset_encryption_config=None,

Review Comment:
   ```suggestion
               parquet_encryption_config=None,
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a list of dictionaries that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    plaintext_footer=False,
+    # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },
+    encryption_algorithm="AES_GCM_V1",
+    # requires timedelta or an assertion is raised
+    cache_lifetime=timedelta(minutes=5.0),
+    data_key_length_bits=256)
+
+kms_connection_config = pe.KmsConnectionConfig(
+    custom_kms_conf={
+        FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+        COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+    }
+)
+
+decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+crypto_factory = pe.CryptoFactory(kms_factory)
+dataset_encryption_cfg = ds.ParquetEncryptionConfig(
+    crypto_factory, kms_connection_config, encryption_config)
+dataset_decryption_cfg = ds.ParquetDecryptionConfig(crypto_factory,

Review Comment:
   ```suggestion
   parquet_encryption_cfg = ds.ParquetEncryptionConfig(
       crypto_factory, kms_connection_config, encryption_config)
   parquet_decryption_cfg = ds.ParquetDecryptionConfig(crypto_factory,
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -670,6 +758,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    @property
+    def dataset_decryption_config(self):
+        return self._dataset_decryption_config
+
+    @dataset_decryption_config.setter

Review Comment:
   ```suggestion
       def parquet_decryption_config(self):
           return self._parquet_decryption_config
   
       @parquet_decryption_config.setter
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -637,10 +717,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    dataset_decryption_config : ParquetDecryptionConfig, default None
+        If not None, use the provided ParquetDecryptionConfig to decrypt the
+        Parquet file.
     """
 
     cdef:
         CParquetFragmentScanOptions* parquet_options
+        ParquetDecryptionConfig _dataset_decryption_config

Review Comment:
   ```suggestion
           ParquetDecryptionConfig _parquet_decryption_config
   ```



##########
python/pyarrow/includes/libarrow_dataset_parquet.pxd:
##########
@@ -31,6 +31,8 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             "arrow::dataset::ParquetFileWriteOptions"(CFileWriteOptions):
         shared_ptr[WriterProperties] writer_properties
         shared_ptr[ArrowWriterProperties] arrow_writer_properties
+        shared_ptr[CParquetEncryptionConfig] GetParquetEncryptionConfig()
+        void SetParquetEncryptionConfig(shared_ptr[CParquetEncryptionConfig] dataset_encryption_config)

Review Comment:
   ```suggestion
           void SetParquetEncryptionConfig(shared_ptr[CParquetEncryptionConfig] parquet_encryption_config)
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a list of dictionaries that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    plaintext_footer=False,
+    # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },
+    encryption_algorithm="AES_GCM_V1",
+    # requires timedelta or an assertion is raised
+    cache_lifetime=timedelta(minutes=5.0),
+    data_key_length_bits=256)
+
+kms_connection_config = pe.KmsConnectionConfig(
+    custom_kms_conf={
+        FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+        COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+    }
+)
+
+decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+crypto_factory = pe.CryptoFactory(kms_factory)
+dataset_encryption_cfg = ds.ParquetEncryptionConfig(
+    crypto_factory, kms_connection_config, encryption_config)
+dataset_decryption_cfg = ds.ParquetDecryptionConfig(crypto_factory,
+                                                    kms_connection_config,
+                                                    decryption_config)
+
+# set encryption config for parquet fragment scan options
+pq_scan_opts = ds.ParquetFragmentScanOptions()
+pq_scan_opts.dataset_decryption_config = dataset_decryption_cfg

Review Comment:
   ```suggestion
   pq_scan_opts.parquet_decryption_config = parquet_decryption_cfg
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,19 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<ParquetDecryptionConfig> GetParquetDecryptionConfig() const {
+    return parquet_decryption_config_;
+  }
+  /// \brief A setter for ParquetDecryptionConfig
+  void SetParquetDecryptionConfig(
+      std::shared_ptr<ParquetDecryptionConfig> dataset_decryption_config) {
+    parquet_decryption_config_ = std::move(dataset_decryption_config);

Review Comment:
   ```suggestion
         std::shared_ptr<ParquetDecryptionConfig> parquet_decryption_config) {
       parquet_decryption_config_ = std::move(parquet_decryption_config);
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""

Review Comment:
   ```suggestion
   """ A sample to demonstrate parquet dataset encryption and decryption"""
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -660,6 +745,9 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         if thrift_container_size_limit is not None:
             self.thrift_container_size_limit = thrift_container_size_limit
 
+        if dataset_decryption_config:
+            self.SetParquetDecryptionConfig(dataset_decryption_config)

Review Comment:
   ```suggestion
           if parquet_decryption_config:
               self.SetParquetDecryptionConfig(parquet_decryption_config)
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -649,7 +733,8 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
                  buffer_size=8192,
                  bint pre_buffer=False,
                  thrift_string_size_limit=None,
-                 thrift_container_size_limit=None):
+                 thrift_container_size_limit=None,
+                 dataset_decryption_config=None):

Review Comment:
   ```suggestion
                    parquet_decryption_config=None):
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a list of dictionaries that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    plaintext_footer=False,
+    # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },
+    encryption_algorithm="AES_GCM_V1",
+    # requires timedelta or an assertion is raised
+    cache_lifetime=timedelta(minutes=5.0),
+    data_key_length_bits=256)
+
+kms_connection_config = pe.KmsConnectionConfig(
+    custom_kms_conf={
+        FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+        COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+    }
+)
+
+decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+crypto_factory = pe.CryptoFactory(kms_factory)
+dataset_encryption_cfg = ds.ParquetEncryptionConfig(
+    crypto_factory, kms_connection_config, encryption_config)
+dataset_decryption_cfg = ds.ParquetDecryptionConfig(crypto_factory,
+                                                    kms_connection_config,
+                                                    decryption_config)
+
+# set encryption config for parquet fragment scan options
+pq_scan_opts = ds.ParquetFragmentScanOptions()
+pq_scan_opts.dataset_decryption_config = dataset_decryption_cfg
+pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)
+
+if os.path.exists('sample_dataset'):
+    shutil.rmtree('sample_dataset')
+
+write_options = pformat.make_write_options(
+    dataset_encryption_config=dataset_encryption_cfg)

Review Comment:
   ```suggestion
       parquet_encryption_config=parquet_encryption_cfg)
   ```



##########
python/pyarrow/includes/libarrow_dataset_parquet.pxd:
##########
@@ -62,6 +64,8 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             "arrow::dataset::ParquetFragmentScanOptions"(CFragmentScanOptions):
         shared_ptr[CReaderProperties] reader_properties
         shared_ptr[ArrowReaderProperties] arrow_reader_properties
+        shared_ptr[CParquetDecryptionConfig] GetDatasetDecryptionConfig()

Review Comment:
   ```suggestion
           shared_ptr[CParquetDecryptionConfig] GetParquetDecryptionConfig()
   ```



##########
python/pyarrow/includes/libarrow_dataset_parquet.pxd:
##########
@@ -62,6 +64,8 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             "arrow::dataset::ParquetFragmentScanOptions"(CFragmentScanOptions):
         shared_ptr[CReaderProperties] reader_properties
         shared_ptr[ArrowReaderProperties] arrow_reader_properties
+        shared_ptr[CParquetDecryptionConfig] GetDatasetDecryptionConfig()
+        void SetParquetDecryptionConfig(shared_ptr[CParquetDecryptionConfig] dataset_decryption_config)

Review Comment:
   ```suggestion
           void SetParquetDecryptionConfig(shared_ptr[CParquetDecryptionConfig] parquet_decryption_config)
   ```





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1660327479

   Hi @westonpace, just let me know if there's anything I need to do to get this PR pushed in.  Thanks!




[GitHub] [arrow] wjones127 commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1528023436

   Hmm, that's not quite right. If you squashed, you should have only one commit, but now you have 109 commits, some of which are not your own. This is probably due to merging main into your feature branch. I'll look into how to clean this up...




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1249752262


##########
cpp/src/arrow/util/config.h.cmake:
##########
@@ -59,3 +59,6 @@
 #cmakedefine ARROW_WITH_UCX
 
 #cmakedefine GRPCPP_PP_INCLUDE
+#cmakedefine PARQUET_REQUIRE_ENCRYPTION
+
+

Review Comment:
   Ok
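
   One downstream effect of this flag worth noting: when pyarrow is built
   without encryption support, `pyarrow.parquet.encryption` fails to import,
   so callers can probe availability at runtime. A minimal sketch, mirroring
   the pattern used in the Python tests later in this thread:

   ```python
   # Probe whether the installed pyarrow was built with Parquet encryption.
   encryption_unavailable = False
   try:
       import pyarrow.parquet.encryption  # noqa: F401
   except ImportError:
       encryption_unavailable = True
   ```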





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1163337983


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########


Review Comment:
   Completed





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1171343581


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   great





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342948692


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";

Review Comment:
   It's been changed





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342979734


##########
python/pyarrow/includes/libarrow_parquet_readwrite.pxd:
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:

Review Comment:
   You're correct, @jorisvandenbossche. I played around with this today. Let me know, though, if you have an idea you'd like me to try out.





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342952419


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):

Review Comment:
   Done
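
   A minimal usage sketch of the constructor under discussion, assuming
   `crypto_factory`, `kms_connection_config`, `encryption_config`, and
   `decryption_config` are created as in the example script quoted in this
   thread:

   ```python
   import pyarrow.dataset as ds

   # Bundle the crypto objects for the writer and reader sides; both
   # constructors raise ValueError if their config argument is None.
   parquet_encryption_cfg = ds.ParquetEncryptionConfig(
       crypto_factory, kms_connection_config, encryption_config)
   parquet_decryption_cfg = ds.ParquetDecryptionConfig(
       crypto_factory, kms_connection_config, decryption_config)
   ```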





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1346299229


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   Got it





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1346455963


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +72,24 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  auto parquet_decrypt_config = parquet_scan_options->parquet_decryption_config;
+
+  if (parquet_decrypt_config != nullptr) {
+    auto file_decryption_prop =
+        parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties(
+            *parquet_decrypt_config->kms_connection_config,
+            *parquet_decrypt_config->decryption_config, path, filesystem);
+
+    parquet_scan_options->reader_properties->file_decryption_properties(
+        std::move(file_decryption_prop));
+  }
+#endif

Review Comment:
   Done
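
   For context, a sketch of the Python read path that exercises this hunk:
   the decryption config rides on ParquetFragmentScanOptions, so the reader
   properties for each fragment pick up file decryption properties. This
   assumes `parquet_decryption_cfg` and the `sample_dataset` directory from
   the example script in this thread:

   ```python
   import pyarrow.dataset as ds

   # Attach the decryption config to the scan options used for every fragment.
   pq_scan_opts = ds.ParquetFragmentScanOptions(
       parquet_decryption_config=parquet_decryption_cfg)
   pformat = ds.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)

   # Reading the dataset now decrypts each file transparently.
   dataset = ds.dataset("sample_dataset", format=pformat)
   table = dataset.to_table()
   ```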





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1348779731


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   I updated the code to provide stubs that raise NotImplementedError when encryption is not enabled.
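
   Roughly the shape of those stubs, as a condensed, hypothetical sketch of
   the .pyx (not the exact code):

   ```python
   IF PARQUET_ENCRYPTION_ENABLED:
       # ... real properties wrapping the C++ config, as in the diff above ...
       pass
   ELSE:
       @property
       def parquet_decryption_config(self):
           # Encryption support was compiled out of this build.
           raise NotImplementedError(
               "pyarrow was not built with Parquet encryption support")

       @parquet_decryption_config.setter
       def parquet_decryption_config(self, config):
           raise NotImplementedError(
               "pyarrow was not built with Parquet encryption support")
   ```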





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1349249263


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   I actually have a commit, and ran into a protected branch error when trying to push it. I am going to open a PR!





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1346254087


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,180 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Core configuration class encapsulating parameters for high-level encryption
+        within the Parquet framework.
+
+        The ParquetEncryptionConfig class serves as a bridge for passing encryption-related
+        parameters to the appropriate components within the Parquet library. It maintains references
+        to objects that define the encryption strategy, Key Management Service (KMS) configuration,
+        and specific encryption configurations for Parquet data.
+
+        Parameters
+        ----------
+        crypto_factory : pyarrow.parquet.encryption.CryptoFactory
+            Shared pointer to a `CryptoFactory` object. The `CryptoFactory` is responsible for
+            creating cryptographic components, such as encryptors and decryptors.
+        kms_connection_config : pyarrow.parquet.encryption.KmsConnectionConfig
+            Shared pointer to a `KmsConnectionConfig` object. This object holds the configuration
+            parameters necessary for connecting to a Key Management Service (KMS).
+        encryption_config : pyarrow.parquet.encryption.EncryptionConfiguration
+            Shared pointer to an `EncryptionConfiguration` object. This object defines specific
+            encryption settings for Parquet data, including the keys assigned to different columns.
+
+        Raises
+        ------
+        ValueError
+            Raised if `encryption_config` is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, CryptoFactory crypto_factory, KmsConnectionConfig kms_connection_config,
+                      EncryptionConfiguration encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if crypto_factory is None:
+                raise ValueError("crypto_factory cannot be None")
+
+            if kms_connection_config is None:
+                raise ValueError("kms_connection_config cannot be None")
+
+            if encryption_config is None:
+                raise ValueError("encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = ParquetEncryptionConfig.unwrap_encryptionconfig(
+                encryption_config)
+
+            self.c_config.get().crypto_factory = ParquetEncryptionConfig.unwrap_cryptofactory(crypto_factory)
+            self.c_config.get().kms_connection_config = ParquetEncryptionConfig.unwrap_kmsconnectionconfig(
+                kms_connection_config)
+            self.c_config.get().encryption_config = c_encryption_config
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+        @staticmethod
+        cdef shared_ptr[CCryptoFactory] unwrap_cryptofactory(object crypto_factory) except *:
+            if isinstance(crypto_factory, CryptoFactory):
+                pycf = (<CryptoFactory> crypto_factory).unwrap()
+                return static_pointer_cast[CCryptoFactory, CPyCryptoFactory](pycf)
+            raise TypeError("Expected CryptoFactory, got %s" % type(crypto_factory))
+
+        @staticmethod
+        cdef shared_ptr[CKmsConnectionConfig] unwrap_kmsconnectionconfig(object kmsconnectionconfig):
+            if isinstance(kmsconnectionconfig, KmsConnectionConfig):
+                return (<KmsConnectionConfig> kmsconnectionconfig).unwrap()
+            raise TypeError("Expected KmsConnectionConfig, got %s" %
+                            type(kmsconnectionconfig))
+
+        @staticmethod
+        cdef shared_ptr[CEncryptionConfiguration] unwrap_encryptionconfig(object encryptionconfig):
+            if isinstance(encryptionconfig, EncryptionConfiguration):
+                return (<EncryptionConfiguration> encryptionconfig).unwrap()
+            raise TypeError("Expected EncryptionConfiguration, got %s" %
+                            type(encryptionconfig))
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):
+        """
+        Core configuration class encapsulating parameters for high-level decryption
+        within the Parquet framework.
+
+        ParquetDecryptionConfig is designed to pass decryption-related parameters to
+        the appropriate decryption components within the Parquet library. It holds references to
+        objects that define the decryption strategy, Key Management Service (KMS) configuration,
+        and specific decryption configurations for reading encrypted Parquet data.
+
+        Parameters
+        ----------
+        crypto_factory : pyarrow.parquet.encryption.CryptoFactory
+            Shared pointer to a `CryptoFactory` object, pivotal in creating cryptographic
+            components for the decryption process.
+        kms_connection_config : pyarrow.parquet.encryption.KmsConnectionConfig
+            Shared pointer to a `KmsConnectionConfig` object, containing parameters necessary
+            for connecting to a Key Management Service (KMS) during decryption.
+        decryption_config : pyarrow.parquet.encryption.DecryptionConfiguration
+            Shared pointer to a `DecryptionConfiguration` object, specifying decryption settings
+            for reading encrypted Parquet data.
+
+        Raises
+        ------
+        ValueError
+            Raised if `decryption_config` is None.
+        """
+
+        cdef:
+            shared_ptr[CParquetDecryptionConfig] c_config
+
+        # Avoid mistakingly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, CryptoFactory crypto_factory, KmsConnectionConfig kms_connection_config,
+                      DecryptionConfiguration decryption_config):
+
+            cdef shared_ptr[CDecryptionConfiguration] c_decryption_config
+
+            if decryption_config is None:
+                raise ValueError(
+                    "decryption_config cannot be None")
+
+            self.c_config.reset(new CParquetDecryptionConfig())
+
+            c_decryption_config = ParquetDecryptionConfig.unwrap_decryptionconfig(
+                decryption_config)
+
+            self.c_config.get().crypto_factory = ParquetDecryptionConfig.unwrap_cryptofactory(crypto_factory)
+            self.c_config.get().kms_connection_config = ParquetDecryptionConfig.unwrap_kmsconnectionconfig(
+                kms_connection_config)
+            self.c_config.get().decryption_config = c_decryption_config
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetDecryptionConfig] c_config):
+            cdef ParquetDecryptionConfig python_config = ParquetDecryptionConfig.__new__(ParquetDecryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetDecryptionConfig] unwrap(self):
+            return self.c_config
+
+        @staticmethod
+        cdef shared_ptr[CCryptoFactory] unwrap_cryptofactory(object crypto_factory) except *:
+            if isinstance(crypto_factory, CryptoFactory):
+                pycf = (<CryptoFactory> crypto_factory).unwrap()
+                return static_pointer_cast[CCryptoFactory, CPyCryptoFactory](pycf)
+            raise TypeError("Expected CryptoFactory, got %s" % type(crypto_factory))
+
+        @staticmethod
+        cdef shared_ptr[CKmsConnectionConfig] unwrap_kmsconnectionconfig(object kmsconnectionconfig) except *:
+            if isinstance(kmsconnectionconfig, KmsConnectionConfig):
+                return (<KmsConnectionConfig> kmsconnectionconfig).unwrap()
+            raise TypeError("Expected KmsConnectionConfig, got %s" %
+                            type(kmsconnectionconfig))
+
+        @staticmethod
+        cdef shared_ptr[CDecryptionConfiguration] unwrap_decryptionconfig(object decryptionconfig) except *:
+            if isinstance(decryptionconfig, DecryptionConfiguration):
+                return (<DecryptionConfiguration> decryptionconfig).unwrap()
+
+            raise TypeError("Expected DecryptionConfiguration, got %s" %
+                            type(decryptionconfig))

Review Comment:
   Ok





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1762709649

   FYI, I checked the reports above and they are all flakes (the benchmark reports are stable beyond this commit).




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1352701223


##########
python/pyarrow/tests/test_dataset_encryption.py:
##########
@@ -0,0 +1,156 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from datetime import timedelta
+import pyarrow.fs as fs
+import pyarrow as pa
+import pytest
+
+encryption_unavailable = False
+
+try:
+    import pyarrow.dataset as ds
+except ImportError:
+    ds = None
+
+try:
+    from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+    import pyarrow.parquet.encryption as pe
+except ImportError:
+    encryption_unavailable = True
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+
+def create_sample_table():
+    return pa.table(
+        {
+            "year": [2020, 2022, 2021, 2022, 2019, 2021],
+            "n_legs": [2, 2, 4, 4, 5, 100],
+            "animal": [
+                "Flamingo",
+                "Parrot",
+                "Dog",
+                "Horse",
+                "Brittle stars",
+                "Centipede",
+            ],
+        }
+    )
+
+
+def create_encryption_config():
+    return pe.EncryptionConfiguration(
+        footer_key=FOOTER_KEY_NAME,
+        plaintext_footer=False,
+        column_keys={COL_KEY_NAME: ["n_legs", "animal"]},
+        encryption_algorithm="AES_GCM_V1",
+        # requires timedelta or an assertion is raised
+        cache_lifetime=timedelta(minutes=5.0),
+        data_key_length_bits=256,
+    )
+
+
+def create_decryption_config():
+    return pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def create_kms_connection_config():
+    return pe.KmsConnectionConfig(
+        custom_kms_conf={
+            FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+            COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+        }
+    )
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+@pytest.mark.skipif(
+    encryption_unavailable, reason="Parquet Encryption is not currently enabled"
+)
+def test_dataset_encryption_decryption():
+    table = create_sample_table()
+
+    encryption_config = create_encryption_config()
+    decryption_config = create_decryption_config()
+    kms_connection_config = create_kms_connection_config()
+
+    crypto_factory = pe.CryptoFactory(kms_factory)
+    parquet_encryption_cfg = ds.ParquetEncryptionConfig(
+        crypto_factory, kms_connection_config, encryption_config
+    )
+    parquet_decryption_cfg = ds.ParquetDecryptionConfig(
+        crypto_factory, kms_connection_config, decryption_config
+    )
+
+    # create write_options with dataset encryption config
+    pformat = pa.dataset.ParquetFileFormat()
+    write_options = pformat.make_write_options(encryption_config=parquet_encryption_cfg)
+
+    mockfs = fs._MockFileSystem()
+    mockfs.create_dir("/")
+
+    ds.write_dataset(
+        data=table,
+        base_dir="sample_dataset",
+        format=pformat,
+        file_options=write_options,
+        filesystem=mockfs,
+    )
+
+    # read without decryption config -> should error if dataset was properly encrypted
+    pformat = pa.dataset.ParquetFileFormat()
+    with pytest.raises(IOError, match=r"no decryption"):
+        ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    # set decryption config for parquet fragment scan options
+    pq_scan_opts = ds.ParquetFragmentScanOptions(
+        decryption_config=parquet_decryption_cfg
+    )
+    pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)
+    dataset = ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    assert table.equals(dataset.to_table())
+
+    # try to read dataset without decryption config to verify encryption is enabled
+    with pytest.raises(OSError, match="no decryption found"):
+        ds.dataset("sample_dataset", filesystem=mockfs)

Review Comment:
   ```suggestion
   ```



##########
python/pyarrow/tests/test_dataset_encryption.py:
##########
@@ -0,0 +1,156 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from datetime import timedelta
+import pyarrow.fs as fs
+import pyarrow as pa
+import pytest
+
+encryption_unavailable = False
+
+try:
+    import pyarrow.dataset as ds
+except ImportError:
+    ds = None
+
+try:
+    from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+    import pyarrow.parquet.encryption as pe
+except ImportError:
+    encryption_unavailable = True
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+
+def create_sample_table():
+    return pa.table(
+        {
+            "year": [2020, 2022, 2021, 2022, 2019, 2021],
+            "n_legs": [2, 2, 4, 4, 5, 100],
+            "animal": [
+                "Flamingo",
+                "Parrot",
+                "Dog",
+                "Horse",
+                "Brittle stars",
+                "Centipede",
+            ],
+        }
+    )
+
+
+def create_encryption_config():
+    return pe.EncryptionConfiguration(
+        footer_key=FOOTER_KEY_NAME,
+        plaintext_footer=False,
+        column_keys={COL_KEY_NAME: ["n_legs", "animal"]},
+        encryption_algorithm="AES_GCM_V1",
+        # requires timedelta or an assertion is raised
+        cache_lifetime=timedelta(minutes=5.0),
+        data_key_length_bits=256,
+    )
+
+
+def create_decryption_config():
+    return pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def create_kms_connection_config():
+    return pe.KmsConnectionConfig(
+        custom_kms_conf={
+            FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+            COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+        }
+    )
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+@pytest.mark.skipif(
+    encryption_unavailable, reason="Parquet Encryption is not currently enabled"
+)
+def test_dataset_encryption_decryption():
+    table = create_sample_table()
+
+    encryption_config = create_encryption_config()
+    decryption_config = create_decryption_config()
+    kms_connection_config = create_kms_connection_config()
+
+    crypto_factory = pe.CryptoFactory(kms_factory)
+    parquet_encryption_cfg = ds.ParquetEncryptionConfig(
+        crypto_factory, kms_connection_config, encryption_config
+    )
+    parquet_decryption_cfg = ds.ParquetDecryptionConfig(
+        crypto_factory, kms_connection_config, decryption_config
+    )
+
+    # create write_options with dataset encryption config
+    pformat = pa.dataset.ParquetFileFormat()
+    write_options = pformat.make_write_options(encryption_config=parquet_encryption_cfg)
+
+    mockfs = fs._MockFileSystem()
+    mockfs.create_dir("/")
+
+    ds.write_dataset(
+        data=table,
+        base_dir="sample_dataset",
+        format=pformat,
+        file_options=write_options,
+        filesystem=mockfs,
+    )
+
+    # read without decryption config -> should error if dataset was properly encrypted
+    pformat = pa.dataset.ParquetFileFormat()
+    with pytest.raises(IOError, match=r"no decryption"):
+        ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    # set decryption config for parquet fragment scan options
+    pq_scan_opts = ds.ParquetFragmentScanOptions(
+        decryption_config=parquet_decryption_cfg
+    )
+    pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)
+    dataset = ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    assert table.equals(dataset.to_table())
+
+    # try to read dataset without decryption config to verify encryption is enabled
+    with pytest.raises(OSError, match="no decryption found"):
+        ds.dataset("sample_dataset", filesystem=mockfs)

Review Comment:
   ```suggestion
   ```
   
   (duplicated now with some lines above)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317175343


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"

Review Comment:
   And if we do that, it probably means we can move the `ParquetEncryptionConfig` and `ParquetDecryptionConfig` directly into `dataset/file_parquet.h`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317599683


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,13 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           dataset_encryption_test.cc

Review Comment:
   Sure thing



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1318705038


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -622,10 +640,36 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
   std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   I updated to return a Status::NotImplemented



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1352280113


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -639,7 +775,9 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             write_batch_size=None,
             dictionary_pagesize_limit=None,
             write_page_index=False,
+            encryption_config=None,

Review Comment:
   Your assumption is correct. I'll update the test as you suggested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1756095800

   @github-actions crossbow submit -g python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "scoder (via GitHub)" <gi...@apache.org>.
scoder commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753336591

   > I was passing the `ParquetFragmentScanOptions` as the C++ reference, but I can probably pass that as a python object as well (and unwrap it inside the function). And so when writing it as a `def` function instead, I don't need the `cimport` and can do a normal python import?
   
   Yes, that was the idea. That way, you avoid any C++ compile time dependency on the types and can use the normal Python mechanisms to deal with optional dependencies.
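   
   For illustration, a minimal sketch of that pattern (the attribute path and unwrap call here are hypothetical, not the PR's final API):
   
   ```
   # Sketch: lives in the encryption-enabled module, which cimports what it
   # needs; callers reach it through a normal runtime import, so they carry
   # no compile-time dependency on the encryption types.
   def set_decryption_config(ParquetFragmentScanOptions opts,
                             ParquetDecryptionConfig config):
       # unwrap the Python wrappers to their C++ counterparts internally
       opts.parquet_options.parquet_decryption_config = config.unwrap()
   ```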


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753226304

   So I have a somewhat working implementation where I moved the conditional parts to a `_dataset_parquet_encryption.pyx` file, with a shim `_dataset_parquet_no_encryption.pyx` that contains just dummies with a runtime error message in case encryption is not available. During compilation, we switch the source files for the cythonize step of a `_dataset_parquet_encryption` module between those two. 
   
   The `pxd` file for this essentially looks like:
   
   ```
   from pyarrow.includes.libarrow_dataset_parquet cimport CParquetFragmentScanOptions
   
   cdef bint is_encryption_enabled()
   cdef set_decryption_config(CParquetFragmentScanOptions * parquet_options, config)
   ```
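   
   A rough sketch of the matching no-encryption shim body, per the description above (just dummies that raise at runtime):
   
   ```
   # _dataset_parquet_no_encryption.pyx (sketch)
   from pyarrow.includes.libarrow_dataset_parquet cimport CParquetFragmentScanOptions
   
   cdef bint is_encryption_enabled():
       # Encryption support was not compiled into this build.
       return False
   
   cdef set_decryption_config(CParquetFragmentScanOptions* parquet_options, config):
       raise NotImplementedError(
           "This pyarrow build does not support Parquet encryption")
   ```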
   
   However, what I cannot yet figure out is how to deal with this pxd file. Should I have two versions of it, `_dataset_parquet_encryption.pxd` and `_dataset_parquet_no_encryption.pxd`? (although they would be exactly the same)
   But how does the cythonize step deal with that?
   
   If I only have a single `_dataset_parquet_encryption.pxd`, then when compiling `_dataset_parquet_no_encryption.pyx` into a `_dataset_parquet_encryption` module, I get errors like "pyarrow/_dataset_parquet_encryption.pxd not found".
   However, when I add a `_dataset_parquet_no_encryption.pxd` with the same content, I get a huge amount of seemingly unrelated compiler errors like:
   
   ```
   Error compiling Cython file:
   ------------------------------------------------------------
   ...
           bint operator>= (const string&)
           bint operator>= (const char*)
   
   
       string to_string(int val) except +
       string to_string(long val) except +
                       ^
   ------------------------------------------------------------
   
   /home/joris/miniconda3/envs/arrow-dev/lib/python3.10/site-packages/Cython/Includes/libcpp/string.pxd:290:20: Function signature does not match previous declaration
   ....
   
   Error compiling Cython file:
   ------------------------------------------------------------
   ...
   
   
   cdef extern from "arrow/util/iterator.h" namespace "arrow" nogil:
       cdef cppclass CIterator" arrow::Iterator"[T]:
           CResult[T] Next()
           CStatus Visit[Visitor](Visitor&& visitor)
                                         ^
   ------------------------------------------------------------
   
   pyarrow/includes/libarrow.pxd:2721:38: Expected ')', found '&&'
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1607573905

   Could you please rebase on the latest main branch? It seems there are a lot of C++ failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1182782841


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -166,6 +171,9 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET) 
+  add_arrow_dataset_test(dataset_encryption_test)
+endif()

Review Comment:
   ```suggestion
     if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET) 
       add_arrow_dataset_test(dataset_encryption_test)
     endif()
   ```



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -96,6 +96,11 @@ if(ARROW_BUILD_STATIC AND WIN32)
   target_compile_definitions(arrow_dataset_static PUBLIC ARROW_DS_STATIC)
 endif()
 
+if(ARROW_PARQUET AND PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)  
+    # Add parquet to ARROW_TEST_STATIC_LINK_LIBS
+    list(APPEND ARROW_TEST_STATIC_LINK_LIBS parquet)  
+endif()

Review Comment:
   Got this when I ran `cmake-format cpp/src/arrow/dataset/CMakeLists.txt`:
   ```suggestion
   if(ARROW_PARQUET
      AND PARQUET_REQUIRE_ENCRYPTION
      AND ARROW_DATASET)
     # Add parquet to ARROW_TEST_STATIC_LINK_LIBS
     list(APPEND ARROW_TEST_STATIC_LINK_LIBS parquet)
   endif()
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1184269560


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########


Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1191383767


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########


Review Comment:
   Ok
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1250784248


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"

Review Comment:
   I do not have it currently defined in arrow/util/config.h



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253526818


##########
cpp/src/parquet/properties.h:
##########
@@ -218,7 +218,23 @@ class PARQUET_EXPORT WriterProperties {
           data_page_version_(ParquetDataPageVersion::V1),
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
-          page_checksum_enabled_(false) {}
+          page_checksum_enabled_(false),
+          default_column_properties_() {}

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1678591192

   :warning: GitHub issue #29238 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1169133250


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   I agree it would. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1512108780

   FYI here are the directions for formatting the Python / Cython source code: https://arrow.apache.org/docs/developers/python.html#coding-style


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1306215999


##########
python/pyarrow/dataset_api.pxi:
##########
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from libcpp.memory cimport shared_ptr
+
+
+cdef api bint pyarrow_is_cryptofactory(object crypto_factory):
+    return isinstance(crypto_factory, CryptoFactory)
+
+cdef api shared_ptr[CCryptoFactory] pyarrow_unwrap_cryptofactory(object crypto_factory):

Review Comment:
   I removed the pxi file as it was no longer needed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] AlenkaF commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "AlenkaF (via GitHub)" <gi...@apache.org>.
AlenkaF commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1316976912


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.

Review Comment:
   Could we name the specific classes in the parameter descriptions for `ParquetEncryptionConfig` and `ParquetDecryptionConfig`, like `pyarrow.parquet.encryption.CryptoFactory` or `CryptoFactory`, instead of just `object`?



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if encryption_config is None:
+                raise ValueError(
+                    "encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = pyarrow_unwrap_encryptionconfig(encryption_config)
+
+            self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                      pyarrow_unwrap_kmsconnectionconfig(
+                kms_connection_config),
+                c_encryption_config)
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):

Review Comment:
   ```suggestion
   
   
       cdef class ParquetDecryptionConfig(_Weakrefable):
   ```
   
   Here also.



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):

Review Comment:
   ```suggestion
   
   
       cdef class ParquetEncryptionConfig(_Weakrefable):
   ```
   
   Small formatting nit as suggested by Joris.



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if encryption_config is None:
+                raise ValueError(
+                    "encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = pyarrow_unwrap_encryptionconfig(encryption_config)
+
+            self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                      pyarrow_unwrap_kmsconnectionconfig(
+                kms_connection_config),
+                c_encryption_config)
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Decryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        decryption_config : object
+            Configuration for decryption settings.
+
+        Raises
+        ------
+        ValueError
+            If decryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetDecryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object decryption_config):
+
+            cdef shared_ptr[CDecryptionConfiguration] c_decryption_config
+
+            if decryption_config is None:
+                raise ValueError(
+                    "decryption_config cannot be None")
+
+            self.c_config.reset(new CParquetDecryptionConfig())
+
+            c_decryption_config = pyarrow_unwrap_decryptionconfig(decryption_config)
+
+            self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                      pyarrow_unwrap_kmsconnectionconfig(
+                kms_connection_config),
+                c_decryption_config)
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetDecryptionConfig] c_config):
+            cdef ParquetDecryptionConfig python_config = ParquetDecryptionConfig.__new__(ParquetDecryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetDecryptionConfig] unwrap(self):
+            return self.c_config

Review Comment:
   ```suggestion
               return self.c_config
   
   ```
   And here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317167793


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +69,24 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   Not as critical as for encryption, but if `PARQUET_REQUIRE_ENCRYPTION` is not defined we might want to disallow passing a `parquet_decryption_config`? Or is there a reason to be lenient?



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -236,6 +241,9 @@ class ARROW_DS_EXPORT ParquetFileWriteOptions : public FileWriteOptions {
   /// \brief Parquet Arrow writer properties.
   std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties;
 
+  // A configuration structure that provides per file encryption properties for a dataset

Review Comment:
   Same.



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,13 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           dataset_encryption_test.cc

Review Comment:
   I assume this is specific to Parquet, so can we perhaps call it `file_parquet_encryption_test.cc`?
   



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"

Review Comment:
   I don't think we need to include these, forward declarations would be enough, right?
   
   Perhaps we can create and populate a `parquet/encryption/type_fwd.h` that we would include here.
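   
   For instance, a minimal sketch (assuming these are the kinds declared in the encryption headers):
   
   ```
   // parquet/encryption/type_fwd.h (sketch)
   namespace parquet {
   namespace encryption {
   
   class CryptoFactory;
   struct KmsConnectionConfig;
   struct EncryptionConfiguration;
   struct DecryptionConfiguration;
   
   }  // namespace encryption
   }  // namespace parquet
   ```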



##########
python/pyarrow/_parquet_encryption.pxd:
##########
@@ -131,3 +131,22 @@ cdef extern from "arrow/python/parquet_encryption.h" \
             SafeGetFileDecryptionProperties(
             const CKmsConnectionConfig& kms_connection_config,
             const CDecryptionConfiguration& decryption_config)
+
+cdef extern from "arrow/dataset/parquet_encryption_config.h" namespace "arrow::dataset" nogil:
+    cdef cppclass CParquetEncryptionConfig "arrow::dataset::ParquetEncryptionConfig":
+        CParquetEncryptionConfig() except +
+        void Setup(shared_ptr[CCryptoFactory] crypto_factory,
+                   shared_ptr[CKmsConnectionConfig] kms_connection_config,
+                   shared_ptr[CEncryptionConfiguration] encryption_config)
+
+    cdef cppclass CParquetDecryptionConfig "arrow::dataset::ParquetDecryptionConfig":
+        CParquetDecryptionConfig() except +
+        void Setup(shared_ptr[CCryptoFactory] crypto_factory,
+                   shared_ptr[CKmsConnectionConfig] kms_connection_config,
+                   shared_ptr[CDecryptionConfiguration] decryption_config)
+
+
+cdef public shared_ptr[CCryptoFactory] pyarrow_unwrap_cryptofactory(object crypto_factory)

Review Comment:
   It doesn't seem necessary to create and expose public Cython/C APIs for this. We could just add the necessary Cython declarations for the caller code to unwrap those objects directly, like we do for example here:
   https://github.com/apache/arrow/blob/ad7f6ef3a97be6a3eb6fe37d0df1b1634db057d2/python/pyarrow/_parquet.pxd#L550-L564
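   
   For example, a sketch of declaring the existing class and its `unwrap` in `_parquet_encryption.pxd`, so callers can simply cimport it:
   
   ```
   cdef class CryptoFactory(_Weakrefable):
       cdef:
           shared_ptr[CPyCryptoFactory] factory
   
       cdef shared_ptr[CPyCryptoFactory] unwrap(self)
   ```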
   
   
   



##########
python/pyarrow/_parquet_encryption.pyx:
##########
@@ -466,3 +466,47 @@ cdef class CryptoFactory(_Weakrefable):
 
     def remove_cache_entries_for_all_tokens(self):
         self.factory.get().RemoveCacheEntriesForAllTokens()
+
+    cdef inline shared_ptr[CPyCryptoFactory] unwrap(self) nogil:

Review Comment:
   This is accessing a Python object (`self`) so `nogil` doesn't look right. Also, it's not productive to pay the cost of releasing the GIL for such a trivial method.
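   
   i.e., presumably just:
   
   ```
   cdef inline shared_ptr[CPyCryptoFactory] unwrap(self):
       return self.factory
   ```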



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -33,6 +33,12 @@ from pyarrow.includes.libarrow_dataset cimport *
 from pyarrow.includes.libarrow_dataset_parquet cimport *
 from pyarrow._fs cimport FileSystem
 
+IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   Oh, interesting. I didn't know that Cython allowed compile-time conditionals and that definitely makes things easier here :-)
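   
   For reference, a sketch of how such a flag gets wired up (the `cythonize` call here is an assumption about the build setup, not the PR's exact code):
   
   ```
   # In the .pyx file, the branch is resolved at Cython compile time:
   IF PARQUET_ENCRYPTION_ENABLED:
       from pyarrow._parquet_encryption cimport *
   
   # In setup.py (sketch):
   # cythonize(ext_modules,
   #           compile_time_env={"PARQUET_ENCRYPTION_ENABLED": with_encryption})
   ```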



##########
python/pyarrow/_parquet_encryption.pxd:
##########
@@ -131,3 +131,22 @@ cdef extern from "arrow/python/parquet_encryption.h" \
             SafeGetFileDecryptionProperties(
             const CKmsConnectionConfig& kms_connection_config,
             const CDecryptionConfiguration& decryption_config)
+
+cdef extern from "arrow/dataset/parquet_encryption_config.h" namespace "arrow::dataset" nogil:
+    cdef cppclass CParquetEncryptionConfig "arrow::dataset::ParquetEncryptionConfig":
+        CParquetEncryptionConfig() except +

Review Comment:
   You don't need to declare a default constructor, I think.
   Also, I don't think the constructor would raise any C++ exception, so the `except +` seems pointless.
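   
   i.e., the declaration could presumably shrink to:
   
   ```
   cdef cppclass CParquetEncryptionConfig "arrow::dataset::ParquetEncryptionConfig":
       void Setup(shared_ptr[CCryptoFactory] crypto_factory,
                  shared_ptr[CKmsConnectionConfig] kms_connection_config,
                  shared_ptr[CEncryptionConfiguration] encryption_config)
   ```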



##########
python/pyarrow/includes/libarrow_parquet_readwrite.pxd:
##########
@@ -0,0 +1,32 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:

Review Comment:
   Hmm... I'm not sure what the point is of moving these declarations to a new file? 



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"

Review Comment:
   `api.h` files are generally heavy-weight and can blow up compile times, so it would be preferable to use more specific inclusions depending on what is really needed here.
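   
   For instance, depending on what the test actually exercises, something like:
   
   ```
   #include "arrow/dataset/dataset.h"
   #include "arrow/dataset/file_parquet.h"
   #include "arrow/table.h"
   ```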



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -693,6 +825,10 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         if thrift_container_size_limit is not None:
             self.thrift_container_size_limit = thrift_container_size_limit
 
+        IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   Is the parameter ignored otherwise? That doesn't seem like a good idea.



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"

Review Comment:
   And if we do that, it probably means we can move the `ParquetEncryptionConfig` and `ParquetDecryptionConfig` directly into `dataset/file_parquet.h`?



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -597,6 +709,15 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             data_page_version=self._properties["data_page_version"],
         )
 
+        IF PARQUET_ENCRYPTION_ENABLED:
+            cdef shared_ptr[CParquetEncryptionConfig] c_config
+            if self._properties["encryption_config"]:
+                config = self._properties["encryption_config"]
+                if not isinstance(config, ParquetEncryptionConfig):
+                    raise ValueError("config must be a ParquetEncryptionConfig")

Review Comment:
   1. raise TypeError
   2. make the error message more precise (which config?)
   ```suggestion
                       raise TypeError("encryption_config must be a ParquetEncryptionConfig")
   ```



##########
cpp/src/parquet/properties.h:
##########
@@ -770,6 +791,11 @@ class PARQUET_EXPORT WriterProperties {
     }
   }
 
+  // \brief Returns the default column properties

Review Comment:
   Nit: use infinitive not imperative
   ```suggestion
     // \brief Return the default column properties
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,93 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os

Review Comment:
   Hmm, please let's follow the conventional import order (even though we might not always observe it faithfully :-)):
   1. stdlib imports
   2. a blank line
   3. third-party imports
   4. a blank line
   5. pyarrow imports
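   
   For instance, the example's imports above, reordered per this convention (there happen to be no third-party imports here):
   
   ```
   import os
   import shutil
   from datetime import timedelta
   
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet.encryption as pe
   from pyarrow.tests.parquet.encryption import InMemoryKmsClient
   ```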
   



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object

Review Comment:
   It would be more informative to give the expected type here, for example:
   ```suggestion
           crypto_factory : CryptoFactory
   ```
   
   (same for other parameters below)



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,8 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// A configuration structure that provides per file encryption properties for a dataset

Review Comment:
   "per file" seems misleading since `ParquetDecryptionConfig` is file-agnostic. The per-file part is derived by the dataset machinery, not provided by the user.



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// core class, that translates the parameters of high level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
+  void Setup(

Review Comment:
   I don't understand what this `Setup` method is for, since:
   1. this seems trivially the same as [aggregate initialization](https://en.cppreference.com/w/cpp/language/aggregate_initialization), right?
   2. one can also set the attribute values individually anyway
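   
   For example, aggregate initialization would presumably give the same effect:
   
   ```
   // Sketch: same effect as calling Setup(), assuming the three members above
   ParquetEncryptionConfig config{crypto_factory, kms_connection_config,
                                  encryption_config};
   ```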
   



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -622,10 +640,36 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
   std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   Hmm, so what happens if `PARQUET_REQUIRE_ENCRYPTION` is false and the user sets a `parquet_encrypt_config`? We certainly shouldn't silently produce unencrypted files, so it should perhaps return a `Status::NotImplemented`.
   
   Also, it should probably be tested somewhere.



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if encryption_config is None:
+                raise ValueError(
+                    "encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = pyarrow_unwrap_encryptionconfig(encryption_config)
+
+            self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                      pyarrow_unwrap_kmsconnectionconfig(
+                kms_connection_config),
+                c_encryption_config)
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):
+        """

Review Comment:
   Similar comments below as for `ParquetEncryptionConfig`.



##########
python/pyarrow/includes/libarrow_parquet_readwrite_encryption.pxd:
##########
@@ -0,0 +1,35 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from pyarrow.includes.libarrow_dataset cimport *
+from pyarrow._parquet cimport *
+from pyarrow._parquet_encryption cimport *
+
+cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
+    cdef cppclass CParquetFileWriteOptions \

Review Comment:
   Same question: why not declare these in `libarrow_dataset_parquet.pxd`?



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.

Review Comment:
   But why only for encryption_config? Why not check all input args for None?



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]
+     ])"});
+
+    // Use a Hive-style partitioning scheme.
+    partitioning = std::make_shared<HivePartitioning>(schema({field("part", utf8())}));
+
+    // Prepare encryption properties.
+    std::unordered_map<std::string, std::string> key_map;
+    key_map.emplace(kColumnMasterKeysId, kColumnMasterKey);
+    key_map.emplace(kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+    crypto_factory = std::make_shared<parquet::encryption::CryptoFactory>();
+    auto kms_client_factory =
+        std::make_shared<parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            /*wrap_locally=*/true, key_map);
+    crypto_factory->RegisterKmsClientFactory(std::move(kms_client_factory));
+    kms_connection_config = std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    // Set write options with encryption configuration.
+    auto encryption_config =
+        std::make_shared<parquet::encryption::EncryptionConfiguration>(
+            std::string(kFooterKeyName));
+    encryption_config->column_keys = kColumnKeyMapping;
+    auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
+    parquet_encryption_config->Setup(crypto_factory, kms_connection_config,
+                                     std::move(encryption_config));
+
+    auto file_format = std::make_shared<ParquetFileFormat>();
+    auto parquet_file_write_options =
+        checked_pointer_cast<ParquetFileWriteOptions>(file_format->DefaultWriteOptions());
+    parquet_file_write_options->parquet_encryption_config =
+        std::move(parquet_encryption_config);
+
+    // Write dataset.
+    auto dataset = std::make_shared<InMemoryDataset>(table);
+    EXPECT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+    EXPECT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = kBaseDir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(FileSystemDataset::Write(write_options, std::move(scanner)));
+
+    // Verify that the files exist
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, file_system->GetFileInfo(file_path));
+      ASSERT_EQ(result.type(), fs::FileType::File);
+    }
+  }
+
+ protected:
+  inline static std::shared_ptr<fs::FileSystem> file_system;

Review Comment:
   Our convention is to append an underscore to data members of classes (except for plain light-weight structs). So `file_system_` here, for example.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";

Review Comment:
   Nit, but should this be `kColumnMasterKeyId` (not "keys")?



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):

Review Comment:
   "object" is superfluous in this context as it's the default.
   



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]

Review Comment:
   Perhaps we can make this non-trivial by having non-unique `part` values?
   (not sure that would change much for this test, but still)
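
   For example (a sketch, not a proposed final table), repeating `part` values so that each partition directory receives more than one row:
   ```cpp
   table = TableFromJSON(table_schema, {R"([
      [ 0, 9, 1, "a" ],
      [ 1, 8, 2, "a" ],
      [ 2, 7, 1, "b" ],
      [ 3, 6, 2, "b" ],
      [ 4, 5, 1, "c" ],
      [ 5, 4, 2, "c" ]
    ])"});
   ```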



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]
+     ])"});
+
+    // Use a Hive-style partitioning scheme.
+    partitioning = std::make_shared<HivePartitioning>(schema({field("part", utf8())}));
+
+    // Prepare encryption properties.
+    std::unordered_map<std::string, std::string> key_map;
+    key_map.emplace(kColumnMasterKeysId, kColumnMasterKey);
+    key_map.emplace(kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+    crypto_factory = std::make_shared<parquet::encryption::CryptoFactory>();
+    auto kms_client_factory =
+        std::make_shared<parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            /*wrap_locally=*/true, key_map);
+    crypto_factory->RegisterKmsClientFactory(std::move(kms_client_factory));
+    kms_connection_config = std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    // Set write options with encryption configuration.
+    auto encryption_config =
+        std::make_shared<parquet::encryption::EncryptionConfiguration>(
+            std::string(kFooterKeyName));
+    encryption_config->column_keys = kColumnKeyMapping;
+    auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
+    parquet_encryption_config->Setup(crypto_factory, kms_connection_config,
+                                     std::move(encryption_config));
+
+    auto file_format = std::make_shared<ParquetFileFormat>();
+    auto parquet_file_write_options =
+        checked_pointer_cast<ParquetFileWriteOptions>(file_format->DefaultWriteOptions());
+    parquet_file_write_options->parquet_encryption_config =
+        std::move(parquet_encryption_config);
+
+    // Write dataset.
+    auto dataset = std::make_shared<InMemoryDataset>(table);
+    EXPECT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+    EXPECT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = kBaseDir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(FileSystemDataset::Write(write_options, std::move(scanner)));
+
+    // Verify that the files exist
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, file_system->GetFileInfo(file_path));
+      ASSERT_EQ(result.type(), fs::FileType::File);
+    }
+  }
+
+ protected:
+  inline static std::shared_ptr<fs::FileSystem> file_system;
+  inline static std::shared_ptr<Table> table;
+  inline static std::shared_ptr<HivePartitioning> partitioning;
+  inline static std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  inline static std::shared_ptr<parquet::encryption::KmsConnectionConfig>
+      kms_connection_config;
+};
+
+// This test demonstrates the process of writing a partitioned Parquet file with the same
+// encryption properties applied to each file within the dataset. The encryption
+// properties are determined based on the selected columns. After writing the dataset, the
+// test reads the data back and verifies that it can be successfully decrypted and
+// scanned.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  // Create decryption properties.
+  auto decryption_config =
+      std::make_shared<parquet::encryption::DecryptionConfiguration>();
+  auto parquet_decryption_config = std::make_shared<ParquetDecryptionConfig>();
+  parquet_decryption_config->Setup(crypto_factory, kms_connection_config,
+                                   std::move(decryption_config));
+
+  // Set scan options.
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->parquet_decryption_config = std::move(parquet_decryption_config);
+
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  file_format->default_fragment_scan_options = std::move(parquet_scan_options);
+
+  // Get FileInfo objects for all files under the base directory
+  fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       FileSystemDatasetFactory::Make(file_system, selector, file_format,
+                                                      factory_options));
+
+  // Read dataset into table
+  ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+  ASSERT_OK_AND_ASSIGN(auto read_table, scanner->ToTable());
+
+  // Verify the data was read correctly
+  ASSERT_OK_AND_ASSIGN(auto combined_table, read_table->CombineChunks());

Review Comment:
   Let's also validate the table.
   ```suggestion
     ASSERT_OK_AND_ASSIGN(auto combined_table, read_table->CombineChunks());
     ASSERT_OK(combined_table->ValidateFull());
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.

Review Comment:
   "Configuration for encryption settings" is redundant and also a bit vague.
   
   



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -631,7 +752,13 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             coerce_timestamps=None,
             allow_truncated_timestamps=False,
             use_compliant_nested_type=True,
+            encryption_config=None,
         )
+        IF PARQUET_ENCRYPTION_ENABLED:
+            if self._properties["encryption_config"] is not None:
+                print("Encryption is not enabled in this build of pyarrow. "
+                      "Please reinstall pyarrow with encryption enabled.")
+

Review Comment:
   Also, let's make sure this is tested to avoid any regressions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1314161810


##########
python/CMakeLists.txt:
##########
@@ -706,6 +706,14 @@ endif()
 # Error on any warnings not already explicitly ignored.
 set(CYTHON_FLAGS "${CYTHON_FLAGS}" "--warning-errors")
 
+if(PYARROW_BUILD_PARQUET_ENCRYPTION)
+  message(STATUS "Parquet Encryption Enabled")
+  list(APPEND CYTHON_FLAGS "-E PARQUET_ENCRYPTION_ENABLED=1")
+else()
+  message(STATUS "Parquet Encryption is NOT Enabled")
+  list(APPEND CYTHON_FLAGS "-E PARQUET_ENCRYPTION_ENABLED=0")

Review Comment:
   ```suggestion
     list(APPEND CYTHON_FLAGS "-E" "PARQUET_ENCRYPTION_ENABLED=0")
   ```



##########
python/CMakeLists.txt:
##########
@@ -706,6 +706,14 @@ endif()
 # Error on any warnings not already explicitly ignored.
 set(CYTHON_FLAGS "${CYTHON_FLAGS}" "--warning-errors")
 
+if(PYARROW_BUILD_PARQUET_ENCRYPTION)
+  message(STATUS "Parquet Encryption Enabled")
+  list(APPEND CYTHON_FLAGS "-E PARQUET_ENCRYPTION_ENABLED=1")

Review Comment:
   ```suggestion
     list(APPEND CYTHON_FLAGS "-E" "PARQUET_ENCRYPTION_ENABLED=1")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267119859


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;

Review Comment:
   Ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1642233441

   @wgtmac I've updated the naming convention


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269484663


##########
python/pyarrow/includes/libarrow_dataset_parquet.pxd:
##########
@@ -62,6 +64,8 @@ cdef extern from "arrow/dataset/api.h" namespace "arrow::dataset" nogil:
             "arrow::dataset::ParquetFragmentScanOptions"(CFragmentScanOptions):
         shared_ptr[CReaderProperties] reader_properties
         shared_ptr[ArrowReaderProperties] arrow_reader_properties
+        shared_ptr[CParquetDecryptionConfig] GetDatasetDecryptionConfig()

Review Comment:
   I'll also take another look to see if I can find any more naming issues.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1261868174


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   Hmm, what I described was the read path.  I'll check how write properties are specified later (on mobile right now)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1640890926

   @wgtmac I think I got everything except the renaming of DatasetEncryptionConfiguration / DatasetDecryptionConfiguration.  I wanted to let you take a look at the changes first, and if you're good with them I can tackle that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270266377


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -236,11 +252,25 @@ class ARROW_DS_EXPORT ParquetFileWriteOptions : public FileWriteOptions {
   /// \brief Parquet Arrow writer properties.
   std::shared_ptr<parquet::ArrowWriterProperties> arrow_writer_properties;
 
+  /// \brief A getter function to retrieve the parquet encryption configuration

Review Comment:
   ditto



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,19 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// \brief A getter function to retrieve the parquet decryption configuration
+  std::shared_ptr<ParquetDecryptionConfig> GetParquetDecryptionConfig() const {
+    return parquet_decryption_config_;
+  }
+  /// \brief A setter for ParquetDecryptionConfig
+  void SetParquetDecryptionConfig(
+      std::shared_ptr<ParquetDecryptionConfig> parquet_decryption_config) {
+    parquet_decryption_config_ = std::move(parquet_decryption_config);
+  }

Review Comment:
   Is it better to be consistent and not use setter and getter here?
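
   Consistent with the neighboring fields, that would be a plain public member rather than an accessor pair, e.g. (sketch; the doc comment wording is illustrative):
   ```cpp
   /// A configuration structure that provides decryption properties for a dataset
   std::shared_ptr<ParquetDecryptionConfig> parquet_decryption_config;
   ```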



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270936312


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create filesystem
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.

Review Comment:
   Yes, feel free to edit as you see fit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270853366


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create filesystem
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption.
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each
+// written file, based on the columns selected for encryption.

Review Comment:
   Thanks for confirming! It seems the test can be simplified a little bit. Would you be comfortable with me directly editing dataset_encryption_test.cc in your PR?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1265705188


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   Ok, but we may want to keep "dataset" in the name as well. Maybe ParquetDatasetEncryptionConfiguration? That might be getting too long, though.





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1607966888

   will do




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1144998766


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }

Review Comment:
   Ok, I updated the function that code resides in to take an integer for the number of columns, and use that instead of a hard-coded value.





[GitHub] [arrow] wjones127 commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1142628397


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }
+    // add footer key
+    key_map.insert({footer_id, footer_key});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping
+   *
+   */
+  std::string BuildColumnKeyMapping() {
+    std::ostringstream stream;
+    stream << dsColumnMasterKeys[0] << ":"
+           << "a"
+           << ";";
+    return stream.str();
+  }
+  /** Write dataset to disk
+   *
+   */
+  void WriteDataset() {
+    auto base_path = "";
+    ASSERT_OK(file_system_->CreateDir(base_path));
+    // Write it using Datasets
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format_->DefaultWriteOptions();
+    write_options.filesystem = file_system_;
+    write_options.base_dir = base_path;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system_);
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+  }
+
+  /** A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  void ReadDataset() {
+    // File format
+    // Partitioning
+    auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
+    auto partitioning =
+        std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+    ASSERT_OK_AND_ASSIGN(auto files, file_system_->GetFileInfo(selector));
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             file_system_, files, file_format_, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+
+    // Create a ScannerBuilder
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+
+    // Create a Scanner
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

Review Comment:
   This test seems incomplete. The scanner is just configured, but isn't actually used to read data. We should probably read the data and then verify it matches expected values.
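   
   For example, something along these lines (just a sketch, using `Scanner::ToTable()`; the exact assertions would need to match the data written by `BuildTable()`):
   
   ```cpp
   // Materialize the scan and check we got the expected number of rows back.
   ASSERT_OK_AND_ASSIGN(auto read_table, scanner->ToTable());
   ASSERT_EQ(read_table->num_rows(), 10);
   // If row order isn't guaranteed across fragments, sort before comparing
   // against the original table.
   ```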
   



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -594,10 +608,33 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
   std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration>
+      dataset_encrypt_config = GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config.get(),
+            *dataset_encrypt_config->encryption_config.get(), destination_locator.path,
+            destination_locator.filesystem);
+
+    auto writer_properties =
+        parquet::WriterProperties::Builder(*parquet_options->writer_properties.get())
+            .encryption(file_encryption_prop)
+            ->build();
+
+    ARROW_ASSIGN_OR_RAISE(
+        parquet_writer, parquet::arrow::FileWriter::Open(
+                            *schema, default_memory_pool(), destination,

Review Comment:
   This is a pre-existing issue, but I don't think we should be using `default_memory_pool()` here. (Correct me if I am wrong @westonpace.)
   ```suggestion
                               *schema, writer_properties->memory_pool(), destination,
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet as pq
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a table that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },
+    encryption_algorithm="AES_GCM_V1",
+    cache_lifetime=timedelta(minutes=5.0),
+    data_key_length_bits=256)
+
+decryption_config = pe.DecryptionConfiguration(cache_lifetime=300)

Review Comment:
   In one place `cache_lifetime` is taking a `timedelta` and in another an integer. Can both APIs take either, or is there an inconsistency here?



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {

Review Comment:
   Here's a behavior we should probably test: if we request that a column be encrypted, but use that column as a partition column, what happens? Are the serialized values encrypted, or is an error returned?
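   
   A rough sketch of how that could be probed (hypothetical; the key/column names follow the fixture above):
   
   ```cpp
   // Also map the column key to the partition column "part".
   dataset_encryption_config_.encryption_config->column_keys = "col_key: a, part";
   // After writing, either an error should have been raised, or the partition
   // directory names (which embed the "part" values in plaintext) should be
   // checked -- whichever behavior is intended, the test should pin it down.
   ```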



##########
cpp/write_dataset_example.py:
##########


Review Comment:
   Should this file be removed?



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }

Review Comment:
   This seems to only use the first value, which is fine for the existing case but might be confusing for future devs.



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa

Review Comment:
   Could you transform this into some unit tests for dataset encryption in Python? We should make sure that it passes configuration down and propagates errors up correctly.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};

Review Comment:
   I think going forward we'll prefer `constexpr std::string_view` (there are some places left over from when we supported C++11):
   
   ```suggestion
   constexpr std::string_view dsFooterMasterKey = "0123456789012345";
   constexpr std::string_view dsFooterMasterKeyId = "footer_key";
   constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
   constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }
+    // add footer key
+    key_map.insert({footer_id, footer_key});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping
+   *
+   */
+  std::string BuildColumnKeyMapping() {
+    std::ostringstream stream;
+    stream << dsColumnMasterKeys[0] << ":"
+           << "a"
+           << ";";
+    return stream.str();
+  }
+  /** Write dataset to disk
+   *
+   */
+  void WriteDataset() {
+    auto base_path = "";
+    ASSERT_OK(file_system_->CreateDir(base_path));
+    // Write it using Datasets
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format_->DefaultWriteOptions();
+    write_options.filesystem = file_system_;
+    write_options.base_dir = base_path;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system_);
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+  }
+
+  /** A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  void ReadDataset() {
+    // File format
+    // Partitioning
+    auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
+    auto partitioning =
+        std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+    ASSERT_OK_AND_ASSIGN(auto files, file_system_->GetFileInfo(selector));
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             file_system_, files, file_format_, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+
+    // Create a ScannerBuilder
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+
+    // Create a Scanner
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+  }
+
+  /** Build a dummy table
+   *
+   */
+  void BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ASSERT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ASSERT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ASSERT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ASSERT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ASSERT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ASSERT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ASSERT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ASSERT_OK(string_builder.Finish(&arrays[3]));
+    auto table = arrow::Table::Make(schema, arrays);
+    // Write it using Datasets
+    dataset_ = std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  /** Helper function to create crypto factory and setup
+   */
+  void SetupCryptoFactory(bool wrap_locally,
+                          const std::unordered_map<std::string, std::string>& key_list) {
+    crypto_factory_ = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory_->RegisterKmsClientFactory(kms_client_factory);
+  }
+};
+
+TEST_F(DatasetEncryptionTest, WriteDatasetEncrypted) { this->WriteDataset(); }
+TEST_F(DatasetEncryptionTest, ReadDatasetEncrypted) { this->ReadDataset(); }

Review Comment:
   IIUC, these tests depend on each other. Perhaps we should combine them into a single `RoundTripDatasetEncrypted` test.
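   
   A minimal combined test could be as simple as (sketch, reusing the fixture helpers above):
   
   ```cpp
   TEST_F(DatasetEncryptionTest, RoundTripDatasetEncrypted) {
     this->WriteDataset();
     this->ReadDataset();
   }
   ```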



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet as pq
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a table that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,
+    column_keys={
+        COL_KEY_NAME: ["n_legs", "animal"],
+    },

Review Comment:
   Would be good to add comments to the example explaining what this means. For example:
   
   ```suggestion
       # Use COL_KEY_NAME to encrypt `n_legs` and `animal` columns.
       column_keys={
           COL_KEY_NAME: ["n_legs", "animal"],
       },
   ```



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -594,10 +608,33 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
   std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration>
+      dataset_encrypt_config = GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config.get(),
+            *dataset_encrypt_config->encryption_config.get(), destination_locator.path,
+            destination_locator.filesystem);
+
+    auto writer_properties =
+        parquet::WriterProperties::Builder(*parquet_options->writer_properties.get())
+            .encryption(file_encryption_prop)
+            ->build();
+
+    ARROW_ASSIGN_OR_RAISE(
+        parquet_writer, parquet::arrow::FileWriter::Open(
+                            *schema, default_memory_pool(), destination,
+                            writer_properties, parquet_options->arrow_writer_properties));
+
+  } else {
+    ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
+                                              *schema, default_memory_pool(), destination,

Review Comment:
   Same here.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########


Review Comment:
   Another test to add: If we write a dataset with encryption, can we use a single file reader to read just one file out of it as long as we pass the same encryption configuration?
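   
   Roughly (a sketch, assuming the `CryptoFactory::GetFileDecryptionProperties` and `parquet::arrow::FileReaderBuilder` APIs plus the fixture members above; the relevant parquet headers would need to be included):
   
   ```cpp
   // Build per-file decryption properties from the same crypto factory.
   auto decryption_properties = crypto_factory_->GetFileDecryptionProperties(
       *dataset_decryption_config_.kms_connection_config,
       *dataset_decryption_config_.decryption_config);
   auto reader_properties = parquet::default_reader_properties();
   reader_properties.file_decryption_properties(decryption_properties);
   
   // Open a single file of the dataset with a plain Parquet file reader.
   ASSERT_OK_AND_ASSIGN(auto input, file_system_->OpenInputFile("part=a/part0.parquet"));
   parquet::arrow::FileReaderBuilder builder;
   ASSERT_OK(builder.Open(input, reader_properties));
   std::unique_ptr<parquet::arrow::FileReader> file_reader;
   ASSERT_OK(builder.Build(&file_reader));
   std::shared_ptr<arrow::Table> single_file_table;
   ASSERT_OK(file_reader->ReadTable(&single_file_table));
   ```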



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }
+  // create a setter for DatasetEncryptionConfiguration

Review Comment:
   Should this and the comment below be transformed into a doc comment?
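   
   E.g. a sketch of the doc-comment form (the body shown is assumed to just store the pointer):
   
   ```cpp
   /// \brief Set the dataset encryption configuration.
   void SetDatasetEncryptionConfig(
       std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> config) {
     dataset_encryption_config_ = std::move(config);
   }
   ```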



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }
+    // add footer key
+    key_map.insert({footer_id, footer_key});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping
+   *
+   */
+  std::string BuildColumnKeyMapping() {
+    std::ostringstream stream;
+    stream << dsColumnMasterKeys[0] << ":"
+           << "a"
+           << ";";
+    return stream.str();
+  }
+  /** Write dataset to disk
+   *
+   */
+  void WriteDataset() {
+    auto base_path = "";
+    ASSERT_OK(file_system_->CreateDir(base_path));
+    // Write it using Datasets
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format_->DefaultWriteOptions();
+    write_options.filesystem = file_system_;
+    write_options.base_dir = base_path;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system_);
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+  }
+
+  /** A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  void ReadDataset() {
+    // File format
+    // Partitioning
+    auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
+    auto partitioning =
+        std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+    ASSERT_OK_AND_ASSIGN(auto files, file_system_->GetFileInfo(selector));
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             file_system_, files, file_format_, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+
+    // Create a ScannerBuilder
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+
+    // Create a Scanner
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

Review Comment:
   Perhaps we should also assert that attempting to read the dataset without passing any encryption properties results in an appropriate error.
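   
   For illustration, a minimal sketch of such a check from the Python side of this PR (a sketch only: the dataset path is a placeholder, and the exact exception type may vary with where the encrypted footer is first parsed):
   
   ```python
   import pyarrow.dataset as ds
   
   # "encrypted_dataset" is a placeholder path for a dataset written with an
   # encrypted footer. Without a decryption configuration, even dataset
   # discovery has to parse the footer, so this should fail rather than
   # silently return data.
   try:
       dataset = ds.dataset("encrypted_dataset", format=ds.ParquetFileFormat())
       dataset.to_table()
       raise AssertionError("expected the read to fail without decryption keys")
   except OSError as exc:
       print("read failed as expected:", exc)
   ```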



##########
cpp/write_dataset_example.py:
##########
@@ -0,0 +1,71 @@
+import sys
+sys.path.append('/home/ubuntu/projects/tolleybot_arrow/python')

Review Comment:
   ```suggestion
   ```



##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet as pq
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a table that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,

Review Comment:
   Is this showing footer encryption, or a plaintext footer (where the footer key is only used for a signature)?
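   
   For reference, this is governed by the `plaintext_footer` option of `EncryptionConfiguration`; a hedged sketch reusing the names above (to my understanding the default is an encrypted footer):
   
   ```python
   encryption_config = pe.EncryptionConfiguration(
       footer_key=FOOTER_KEY_NAME,
       column_keys={COL_KEY_NAME: ["n_legs", "animal"]},
       # False (the default): the footer itself is encrypted with the footer
       # key. True: the footer stays readable and the footer key is only
       # used to sign it.
       plaintext_footer=False)
   ```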



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1142836269


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -31,6 +31,8 @@
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"
+#include "parquet/encryption/dataset_encryption_config.h"

Review Comment:
   Should we add macros to these header files so the compiler won't complain when encryption is not enabled (e.g. `PARQUET_REQUIRE_ENCRYPTION` is OFF)? Or at least make sure they compile without encryption enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1141456434


##########
cpp/src/parquet/properties.h:
##########
@@ -57,624 +57,699 @@ constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
 constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;
 
 class PARQUET_EXPORT ReaderProperties {
- public:
-  explicit ReaderProperties(MemoryPool* pool = ::arrow::default_memory_pool())
-      : pool_(pool) {}
-
-  MemoryPool* memory_pool() const { return pool_; }
-
-  std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
-                                              int64_t start, int64_t num_bytes);
-
-  /// Buffered stream reading allows the user to control the memory usage of
-  /// parquet readers. This ensure that all `RandomAccessFile::ReadAt` calls are
-  /// wrapped in a buffered reader that uses a fix sized buffer (of size
-  /// `buffer_size()`) instead of the full size of the ReadAt.
-  ///
-  /// The primary reason for this control knobs is for resource control and not
-  /// performance.
-  bool is_buffered_stream_enabled() const { return buffered_stream_enabled_; }
-  /// Enable buffered stream reading.
-  void enable_buffered_stream() { buffered_stream_enabled_ = true; }
-  /// Disable buffered stream reading.
-  void disable_buffered_stream() { buffered_stream_enabled_ = false; }
-
-  /// Return the size of the buffered stream buffer.
-  int64_t buffer_size() const { return buffer_size_; }
-  /// Set the size of the buffered stream buffer in bytes.
-  void set_buffer_size(int64_t size) { buffer_size_ = size; }
-
-  /// \brief Return the size limit on thrift strings.
-  ///
-  /// This limit helps prevent space and time bombs in files, but may need to
-  /// be increased in order to read files with especially large headers.
-  int32_t thrift_string_size_limit() const { return thrift_string_size_limit_; }
-  /// Set the size limit on thrift strings.
-  void set_thrift_string_size_limit(int32_t size) { thrift_string_size_limit_ = size; }
-
-  /// \brief Return the size limit on thrift containers.
-  ///
-  /// This limit helps prevent space and time bombs in files, but may need to
-  /// be increased in order to read files with especially large headers.
-  int32_t thrift_container_size_limit() const { return thrift_container_size_limit_; }
-  /// Set the size limit on thrift containers.
-  void set_thrift_container_size_limit(int32_t size) {
-    thrift_container_size_limit_ = size;
-  }
-
-  /// Set the decryption properties.
-  void file_decryption_properties(std::shared_ptr<FileDecryptionProperties> decryption) {
-    file_decryption_properties_ = std::move(decryption);
-  }
-  /// Return the decryption properties.
-  const std::shared_ptr<FileDecryptionProperties>& file_decryption_properties() const {
-    return file_decryption_properties_;
-  }
-
-  bool page_checksum_verification() const { return page_checksum_verification_; }
-  void set_page_checksum_verification(bool check_crc) {
-    page_checksum_verification_ = check_crc;
-  }
-
- private:
-  MemoryPool* pool_;
-  int64_t buffer_size_ = kDefaultBufferSize;
-  int32_t thrift_string_size_limit_ = kDefaultThriftStringSizeLimit;
-  int32_t thrift_container_size_limit_ = kDefaultThriftContainerSizeLimit;
-  bool buffered_stream_enabled_ = false;
-  bool page_checksum_verification_ = false;
-  std::shared_ptr<FileDecryptionProperties> file_decryption_properties_;
+  public:

Review Comment:
   Sorry about that.  I used the incorrect formatter through vscode.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1343567198


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -33,6 +33,12 @@ from pyarrow.includes.libarrow_dataset cimport *
 from pyarrow.includes.libarrow_dataset_parquet cimport *
 from pyarrow._fs cimport FileSystem
 
+IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   Yes, I think it's fine to use it here, we can look into a replacement later if it becomes actually necessary



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1349241809


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   @anjakefala do I need to do anything with this or are you handling it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1352679463


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -639,7 +775,9 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             write_batch_size=None,
             dictionary_pagesize_limit=None,
             write_page_index=False,
+            encryption_config=None,

Review Comment:
   Ok will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753494646

   OK, the missing C++ compilation flag was because we enable this flag explicitly in cmake with `set_source_files_properties(${module_SRC} PROPERTIES CYTHON_IS_CXX TRUE)`, and I forgot to do this for the renamed src file.
   
   After fixing that, the compilation works fine now, but at runtime I get the following error:
   
   ```
   ImportError: dynamic module does not define module export function (PyInit__dataset_parquet_encryption)
   ```
   
   So the extension module doesn't load. Checking the generated C++ code, it actually defines a `PyInit__dataset_parquet_no_encryption` (so using the original file name, not the module target name), and so it is not finding `PyInit__dataset_parquet_encryption`. 
   @scoder Is that a limitation of cython when your pyx file has a different name than the module target name? Or is there some extra flag I need to pass to the cythonization step? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1757530251

   Thanks, everyone for all the reviews and work on this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1762594950

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 0793432ad0ef5cb598b7b1e61071cd4991bd1b8b.
   
   There were 7 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-i9-9960x` at [2023-10-12 08:25:25Z](https://conbench.ursa.dev/compare/runs/2fd42faf48404478affd58f9bcfbb26e...bb99d7aba6ce4bd79c1b9bbf0ace732f/)
     - [`tpch` (R) with engine=arrow, format=parquet, language=R, memory_map=False, query_id=TPCH-02, scale_factor=1](https://conbench.ursa.dev/compare/benchmarks/0652799a766b7ec78000b7d4a5048681...06527bdedaf4756c80004b95ebc5bc2c)
     - [`file-read` (R) with compression=uncompressed, dataset=nyctaxi_2010-01, file_type=feather, language=R, output_type=table](https://conbench.ursa.dev/compare/benchmarks/0652794a1132763180004ee540fe19ef...06527b8b4ad77f9c80004a63833916b4)
   - and 5 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/17698267889) has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1164199206


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** set up the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Set up our dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys (this test uses a single column key)
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }
+    // add footer key
+    key_map.insert({footer_id, footer_key});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping
+   *
+   */
+  std::string BuildColumnKeyMapping() {
+    std::ostringstream stream;
+    stream << dsColumnMasterKeys[0] << ":"
+           << "a"
+           << ";";
+    return stream.str();
+  }
+  /** Write dataset to disk
+   *
+   */
+  void WriteDataset() {
+    auto base_path = "";
+    ASSERT_OK(file_system_->CreateDir(base_path));
+    // Write it using Datasets
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format_->DefaultWriteOptions();
+    write_options.filesystem = file_system_;
+    write_options.base_dir = base_path;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system_);
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+  }
+
+  /** A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  void ReadDataset() {
+    // File format
+    // Partitioning
+    auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
+    auto partitioning =
+        std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+    ASSERT_OK_AND_ASSIGN(auto files, file_system_->GetFileInfo(selector));
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             file_system_, files, file_format_, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+
+    // Create a ScannerBuilder
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+
+    // Create a Scanner
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

Review Comment:
   Added a test to verify whether the Parquet metadata can be read without decryption properties when the footer is encrypted.
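   
   A hedged sketch of that check at the file level (the path refers to one of the files written by this test; the exception type is illustrative):
   
   ```python
   import pyarrow.parquet as pq
   
   # With an encrypted footer and no decryption properties, even reading the
   # file metadata should raise instead of succeeding.
   try:
       pq.read_metadata("part=a/part0.parquet")
   except OSError as exc:
       print("metadata read failed as expected:", exc)
   ```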
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267153469


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   Ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1641415251

   Thanks @tolleybot, these changes on the cpp side so far look good. I'll go ahead to review the python code with my limited knowledge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1583883587

   It looks like there is an issue in the Python build that needs to be addressed: 
   
   https://github.com/apache/arrow/actions/runs/5217685646/jobs/9417770763?pr=34616#step:6:3804


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1583889768

   I'll take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1261882097


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   @westonpace Thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1644893434

   Could you please fix the CI failure:
   ```
   pyarrow._dataset_parquet.ParquetFragmentScanOptions
   -> pyarrow._dataset_parquet.ParquetFragmentScanOptions(bool use_buffered_stream=False, *, buffer_size=8192, bool pre_buffer=False, thrift_string_size_limit=None, thrift_container_size_limit=None, decryption_config=None)
   PR01: Parameters {'decryption_config'} not documented
   ```
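   
   i.e. the new `decryption_config` parameter needs an entry in the docstring's Parameters section, along these lines:
   
   ```
   decryption_config : ParquetDecryptionConfig, default None
       If not None, use the provided ParquetDecryptionConfig to decrypt the
       Parquet file.
   ```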


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270809281


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create filesystem
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.

Review Comment:
   Yes, the comment is incorrect; the test applies the same encryption properties to every file.
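   
   (Even with identical configuration, each file still ends up with its own data encryption keys, because file encryption properties are produced per file by the crypto factory. A hedged Python sketch, assuming `crypto_factory`, `kms_connection_config` and `encryption_config` objects set up as elsewhere in this PR:)
   
   ```python
   # Each call mints fresh file encryption properties, and thus fresh data
   # encryption keys, even though the configuration objects are unchanged.
   props_a = crypto_factory.file_encryption_properties(
       kms_connection_config, encryption_config)
   props_b = crypto_factory.file_encryption_properties(
       kms_connection_config, encryption_config)
   assert props_a is not props_b
   ```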



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269620908


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -637,10 +717,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    dataset_decryption_config : ParquetDecryptionConfig, default None
+        If not None, use the provided ParquetDecryptionConfig to decrypt the
+        Parquet file.
     """
 
     cdef:
         CParquetFragmentScanOptions* parquet_options
+        ParquetDecryptionConfig _dataset_decryption_config

Review Comment:
   Ok



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1474345856

   :warning: GitHub issue #29238 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1474347146

   :warning: GitHub issue #29238 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1295892937


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *
+
 
 cdef Expression _true = Expression._scalar(True)
 
 
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+cdef class ParquetEncryptionConfig(_Weakrefable):

Review Comment:
   I'll take a look to refresh my memory.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1697617348

   Revision: 801abe69b2b187bcc8ab8cb2de0f804f92c9f2a6
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-e1e7bc085f](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e1e7bc085f)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e1e7bc085f-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6014018042/job/16312793446)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] anjakefala commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1696159709

   AFAIU, it seems as of commit https://github.com/apache/arrow/commit/7812921acbe5104eb1c74105389b9a70ead16d31 we are still getting run-time errors.
   
   From https://github.com/ursacomputing/crossbow/actions/runs/6003326499/job/16281542940 :
   
   ```
   /opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/parquet/encryption.py:19: in <module>
       from pyarrow._parquet_encryption import (CryptoFactory,   # noqa
   E   ModuleNotFoundError: No module named 'pyarrow._parquet_encryption'
   ```
   
   It is still failing to run with this configuration:
   
   ```
   --   ARROW_DATASET=ON [default=OFF]
   --       Build the Arrow Dataset Modules
   --   ARROW_PARQUET=ON [default=OFF]
   --       Build the Parquet libraries
   --   PARQUET_REQUIRE_ENCRYPTION=OFF [default=OFF]
   --       Build support for encryption. Fail if OpenSSL is not found
   
   
   # Enable/disable optional PyArrow components
   export PYARROW_WITH_PARQUET=ON
   export PYARROW_WITH_DATASET=ON
   export PYARROW_WITH_PARQUET_ENCRYPTION=OFF
   ```
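   
   One way to make such a build degrade gracefully (a sketch only; the helper name is illustrative and not part of the PR) is to guard the optional import at runtime instead of failing at import time:
   
   ```python
   # Guarded import: if the encryption extension module was not built,
   # remember that and raise a clear error only when encryption is used.
   try:
       import pyarrow._parquet_encryption  # noqa: F401
       parquet_encryption_enabled = True
   except ImportError:
       parquet_encryption_enabled = False
   
   def _require_encryption():
       # Illustrative helper: call before any encryption-dependent code path.
       if not parquet_encryption_enabled:
           raise ImportError(
               "pyarrow was built without Parquet encryption support")
   ```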


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1707015601

   Revision: 1ce624df85d862d8e6e1e6b33242221e28c17588
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-e6d728e59e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-e6d728e59e)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-e6d728e59e-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6087779748/job/16517034985)|




[GitHub] [arrow] amoeba commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1316506645


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -631,7 +752,13 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             coerce_timestamps=None,
             allow_truncated_timestamps=False,
             use_compliant_nested_type=True,
+            encryption_config=None,
         )
+        IF PARQUET_ENCRYPTION_ENABLED:
+            if self._properties["encryption_config"] is not None:
+                print("Encryption is not enabled in this build of pyarrow. "
+                      "Please reinstall pyarrow with encryption enabled.")
+

Review Comment:
   Two questions here: (1) Am I reading this wrong, or is this logic not quite right? It doesn't look like this will print if someone passes `encryption_config` with a build of PyArrow that was not built with encryption enabled. (2) Should this be a print or a throw? Considering that a user might miss this message, and the potential consequences of thinking you've encrypted your data when you haven't, maybe this should throw instead.
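
   For illustration, here is a minimal sketch of that guard in plain Python, with a hypothetical `parquet_encryption_enabled` flag standing in for the compile-time `PARQUET_ENCRYPTION_ENABLED` constant:
   
   ```
   # Hypothetical sketch: raise rather than print when an encryption_config
   # is supplied but this build of pyarrow lacks Parquet encryption support.
   parquet_encryption_enabled = False  # stands in for the compile-time constant
   
   def check_encryption_config(properties):
       # The guard must fire when encryption is *not* enabled; note the
       # negation, which the branch under review appears to be missing.
       if not parquet_encryption_enabled and properties.get("encryption_config") is not None:
           raise NotImplementedError(
               "Encryption is not enabled in this build of pyarrow. "
               "Please reinstall pyarrow with encryption enabled."
           )
   ```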





[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329698095


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -395,6 +396,36 @@ TEST_F(TestParquetFileSystemDataset, WriteWithEmptyPartitioningSchema) {
   TestWriteWithEmptyPartitioningSchema();
 }
 
+TEST_F(TestParquetFileSystemDataset, WriteWithEncryptionConfigNotSupported) {
+#ifndef PARQUET_REQUIRE_ENCRYPTION
+  // Create a dummy ParquetEncryptionConfig
+  std::shared_ptr<ParquetEncryptionConfig> encryption_config =
+      std::make_shared<ParquetEncryptionConfig>();
+
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+  std::cout << "A" << std::endl;
+  ASSERT_NE(options, nullptr) << "Failed to cast to ParquetFileWriteOptions";

Review Comment:
   Do you mean to remove these lines? They look like temporary debug code.





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1717680692

   I wanted to follow up on this PR, which was created based on the specifications outlined in https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#.
   
   Are there any significant hurdles or concerns preventing these changes from being merged? I would greatly appreciate any feedback or specific issues that need to be addressed, considering the duration this PR has been open. I'm available to provide any required clarifications or assistance.
   
   Your insights and reviews so far and any feedback would be highly valued and appreciated.




[GitHub] [arrow] anjakefala commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1719887536

   Hey @tolleybot,
   
   While we wait for @pitrou, are you open to me taking a look at some of the outstanding review comments, and opening a PR on your branch?




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1318701919


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +69,24 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   I added an exception





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342775987


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -33,6 +33,12 @@ from pyarrow.includes.libarrow_dataset cimport *
 from pyarrow.includes.libarrow_dataset_parquet cimport *
 from pyarrow._fs cimport FileSystem
 
+IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   Looks like there's a lot of discussion in that thread, and there doesn't seem to be a real replacement in place yet.





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342960242


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.

Review Comment:
   It was updated





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1348644455


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   I think we typically put those docs just above the struct? (as you had initially)





[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1247267725


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +66,25 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   ```suggestion
   #ifdef PARQUET_REQUIRE_ENCRYPTION
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#if PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   ```suggestion
   #ifdef PARQUET_REQUIRE_ENCRYPTION
   ```



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +637,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   ```suggestion
   #ifdef PARQUET_REQUIRE_ENCRYPTION
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   ```suggestion
   #ifdef PARQUET_REQUIRE_ENCRYPTION
   ```



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -36,6 +36,10 @@ if(ARROW_PARQUET)
   string(APPEND ARROW_DATASET_PKG_CONFIG_REQUIRES " parquet")
 endif()
 
+if(PARQUET_REQUIRE_ENCRYPTION)
+  add_definitions(-DPARQUET_REQUIRE_ENCRYPTION=1)
+endif()
+

Review Comment:
   Could you use `cpp/src/arrow/util/config.h.cmake` instead of `add_definitions()`?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1248071177


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +637,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   done





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254571683


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config,
+            *dataset_encrypt_config->encryption_config, destination_locator.path,
+            destination_locator.filesystem);
+
+    auto writer_properties =
+        parquet::WriterProperties::Builder(*parquet_options->writer_properties)
+            .encryption(std::move(file_encryption_prop))
+            ->build();
+
+    ARROW_ASSIGN_OR_RAISE(
+        parquet_writer, parquet::arrow::FileWriter::Open(
+                            *schema, writer_properties->memory_pool(), destination,
+                            writer_properties, parquet_options->arrow_writer_properties));
+  }
+#endif
+
+  if (parquet_writer == NULLPTR) {

Review Comment:
   Ok. 





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254571957


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;

Review Comment:
   Ok





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253595917


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys, ds_num_columns,
+                         ds_footer_master_key_id, ds_footer_master_key);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",

Review Comment:
   Thanks for reviewing this. The aim of this test is to demonstrate writing a partitioned Parquet dataset while applying distinct file encryption properties to each file, based on the selected columns. One of the key objectives of this PR is indeed to enable varying file_encryption_properties for each individual file. Any feedback is appreciated!
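
   At the Python level, the per-column key mapping that drives this looks roughly as follows (a sketch only; the key names match the test's, and `pe` is `pyarrow.parquet.encryption`):
   
   ```
   from datetime import timedelta
   
   import pyarrow.parquet.encryption as pe
   
   # Sketch: one footer key plus one column key covering column "a"; the
   # CryptoFactory then derives distinct file encryption properties per file.
   encryption_config = pe.EncryptionConfiguration(
       footer_key="footer_key",
       column_keys={"col_key": ["a"]},
       cache_lifetime=timedelta(minutes=5),
   )
   ```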





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254756535


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {

Review Comment:
   done





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1257298546


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   @wgtmac That's something that has been discussed previously.  I would like to stay on track with the current implementation if at all possible and address translating these config(s) to the current writer/reader properties in another PR.  Let me know if this is an option.





[GitHub] [arrow] anjakefala commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1703973103

   @github-actions crossbow submit test-conda-python-3.10-pandas-latest




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317575557


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.

Review Comment:
   I have refined the type specifications for the parameters. Instead of using a generic `object` type, we've specified the concrete Cython classes. I have also ensured that the Cython classes have the appropriate declarations in their corresponding .pxd files, to avoid compilation issues and to make the C attributes accessible to other Cython modules.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317575384


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.

Review Comment:
   Done





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1752851786

   Well, the easiest is of course to just ignore the warnings. 
   
   But my suggestion of having a dummy version with the same interface but without functionality was not necessarily a suggestion to make it easy, but rather brainstorming about something that is _possible_ as a starter ;)
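
   As a rough illustration of that idea, here is only a sketch of the "same interface, no functionality" approach (the class name mirrors the real one; everything else is hypothetical):
   
   ```
   # Hypothetical no-op stand-in exposing the same constructor signature as
   # the real ParquetEncryptionConfig, for builds without encryption support.
   class ParquetEncryptionConfig:
       def __init__(self, crypto_factory, kms_connection_config, encryption_config):
           raise NotImplementedError(
               "Parquet encryption is not enabled in this build of pyarrow."
           )
   ```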
   
   




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "scoder (via GitHub)" <gi...@apache.org>.
scoder commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753289161

   > That's indeed more or less exactly what I did ([here](https://github.com/tolleybot/arrow/commit/f78a162ccc48cc6c9ae2ade31a0791d4d72b3b48#diff-01063ec38c5d3b6345a3a5933c68e9a60b3df177c643bbaf89bbbf3b8c768c8bR161-R166)).
   
   Why are you using a `cdef` function?




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1350943326


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -639,7 +775,9 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             write_batch_size=None,
             dictionary_pagesize_limit=None,
             write_page_index=False,
+            encryption_config=None,

Review Comment:
   This was a good catch. The encryption_config was not being propagated properly. I have updated the code and added a `_set_encryption_config` method to ensure the encryption config is propagated.
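
   Roughly, the propagation now works like this (a simplified Python sketch, not the actual Cython implementation):
   
   ```
   # Simplified sketch: the encryption config collected from make_write_options
   # is explicitly handed down instead of being silently dropped.
   class WriteOptionsSketch:
       def __init__(self):
           self._properties = {}
           self._encryption_config = None
   
       def update(self, **kwargs):
           self._properties.update(kwargs)
           encryption_config = self._properties.get("encryption_config")
           if encryption_config is not None:
               self._set_encryption_config(encryption_config)
   
       def _set_encryption_config(self, config):
           # In the real code this passes the config through to the C++ layer.
           self._encryption_config = config
   ```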





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1756092463

   I updated the branch with the latest `main`, so we can see if that helps with the unrelated failures.




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1352985823


##########
python/pyarrow/tests/test_dataset_encryption.py:
##########
@@ -0,0 +1,152 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from datetime import timedelta
+import pyarrow.fs as fs
+import pyarrow as pa
+import pytest
+
+encryption_unavailable = False
+
+try:
+    import pyarrow.dataset as ds
+except ImportError:
+    ds = None
+
+try:
+    from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+    import pyarrow.parquet.encryption as pe
+except ImportError:
+    encryption_unavailable = True
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+
+def create_sample_table():
+    return pa.table(
+        {
+            "year": [2020, 2022, 2021, 2022, 2019, 2021],
+            "n_legs": [2, 2, 4, 4, 5, 100],
+            "animal": [
+                "Flamingo",
+                "Parrot",
+                "Dog",
+                "Horse",
+                "Brittle stars",
+                "Centipede",
+            ],
+        }
+    )
+
+
+def create_encryption_config():
+    return pe.EncryptionConfiguration(
+        footer_key=FOOTER_KEY_NAME,
+        plaintext_footer=False,
+        column_keys={COL_KEY_NAME: ["n_legs", "animal"]},
+        encryption_algorithm="AES_GCM_V1",
+        # requires timedelta or an assertion is raised
+        cache_lifetime=timedelta(minutes=5.0),
+        data_key_length_bits=256,
+    )
+
+
+def create_decryption_config():
+    return pe.DecryptionConfiguration(cache_lifetime=300)
+
+
+def create_kms_connection_config():
+    return pe.KmsConnectionConfig(
+        custom_kms_conf={
+            FOOTER_KEY_NAME: FOOTER_KEY.decode("UTF-8"),
+            COL_KEY_NAME: COL_KEY.decode("UTF-8"),
+        }
+    )
+
+
+def kms_factory(kms_connection_configuration):
+    return InMemoryKmsClient(kms_connection_configuration)
+
+
+@pytest.mark.skipif(
+    encryption_unavailable, reason="Parquet Encryption is not currently enabled"
+)
+def test_dataset_encryption_decryption():
+    table = create_sample_table()
+
+    encryption_config = create_encryption_config()
+    decryption_config = create_decryption_config()
+    kms_connection_config = create_kms_connection_config()
+
+    crypto_factory = pe.CryptoFactory(kms_factory)
+    parquet_encryption_cfg = ds.ParquetEncryptionConfig(
+        crypto_factory, kms_connection_config, encryption_config
+    )
+    parquet_decryption_cfg = ds.ParquetDecryptionConfig(
+        crypto_factory, kms_connection_config, decryption_config
+    )
+
+    # create write_options with dataset encryption config
+    pformat = pa.dataset.ParquetFileFormat()
+    write_options = pformat.make_write_options(encryption_config=parquet_encryption_cfg)
+
+    mockfs = fs._MockFileSystem()
+    mockfs.create_dir("/")
+
+    ds.write_dataset(
+        data=table,
+        base_dir="sample_dataset",
+        format=pformat,
+        file_options=write_options,
+        filesystem=mockfs,
+    )
+
+    # read without decryption config -> should error if dataset was properly encrypted
+    pformat = pa.dataset.ParquetFileFormat()
+    with pytest.raises(IOError, match=r"no decryption"):
+        ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    # set decryption config for parquet fragment scan options
+    pq_scan_opts = ds.ParquetFragmentScanOptions(
+        decryption_config=parquet_decryption_cfg
+    )
+    pformat = pa.dataset.ParquetFileFormat(default_fragment_scan_options=pq_scan_opts)
+    dataset = ds.dataset("sample_dataset", format=pformat, filesystem=mockfs)
+
+    assert table.equals(dataset.to_table())
+
+
+@pytest.mark.skipif(
+    not encryption_unavailable, reason="Parquet Encryption is currently enabled"
+)
+def test_write_dataset_parquet_without_encryption():
+    """Test write_dataset with ParquetFileFormat and test if an exception is thrown
+    if you try to set encryption_config using make_write_options"""
+
+    # Set the encryption configuration using ParquetFileFormat
+    # and make_write_options
+    pformat = pa.dataset.ParquetFileFormat()
+
+    with pytest.raises(NotImplementedError):
+        _ = pformat.make_write_options(
+            # encryption_config=encryption_config_placeholder
+            # TODO
+            encryption_properties="some value"
+        )

Review Comment:
   ```suggestion
           _ = pformat.make_write_options(encryption_config="some value")
   ```
   
   (that was a leftover from my branch when the name was wrong)





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270814996


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create filesystem
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.

Review Comment:
   I've updated the comments.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267152172


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each
+// file, based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   You're correct.  I think I can eliminate the mock_fs variable from the tests





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267124405


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate writing a partitioned Parquet
+// dataset while applying distinct file encryption properties to each
+// file, based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);

Review Comment:
   Ok





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267120864


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out

Review Comment:
   Ok





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1294380534


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *

Review Comment:
   This is the import that will fail at runtime.
   
   The way we solved this before for the plain Parquet (non-dataset) support was to split it into a separate file (`_parquet_encryption.pyx` in addition to `_parquet.pyx`) and expose it in a submodule (`pyarrow.parquet.encryption`).
   However, I assume that approach will not work here, given that the ParquetFragmentScanOptions et al. classes really need to know about it (unwrapping it to create the proper C++ counterpart of the options class).
   
   We might need some compile-time switches here, similar to what is done in the C++ code?
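   For illustration, a minimal sketch of such a guard at the Python level, assuming the module split described above; the fallback flag and helper below are assumptions, not code from this PR:
   
   ```python
   # Degrade gracefully when pyarrow was built without Parquet encryption.
   try:
       from pyarrow._parquet_encryption import CryptoFactory  # noqa: F401
       _parquet_encryption_enabled = True
   except ImportError:
       _parquet_encryption_enabled = False
   
   
   def _require_parquet_encryption():
       # Call before touching any encryption-related scan or write option.
       if not _parquet_encryption_enabled:
           raise NotImplementedError(
               "This pyarrow build does not support Parquet encryption")
   ```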



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -729,6 +825,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
             other.thrift_container_size_limit)
         return attrs == other_attrs
 
+    def SetParquetDecryptionConfig(self, ParquetDecryptionConfig config):
+        cdef shared_ptr[CParquetDecryptionConfig] c_config
+        if not isinstance(config, ParquetDecryptionConfig):
+            raise ValueError("config must be a ParquetDecryptionConfig")
+        self._parquet_decryption_config = config
+        c_config = config.unwrap()
+        self.parquet_options.parquet_decryption_config = c_config

Review Comment:
   Is there a reason not to include this in the `@parquet_decryption_config.setter`? That way the property setter can just contain this code, instead of calling `SetParquetDecryptionConfig`.
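   A rough Python-level sketch of that consolidation (Cython specifics omitted; the stand-in class below exists only to keep the sketch self-contained):
   
   ```python
   class ParquetDecryptionConfig:
       # Stand-in for the Cython class defined earlier in this diff.
       pass
   
   
   class ParquetFragmentScanOptions:
       @property
       def parquet_decryption_config(self):
           return self._parquet_decryption_config
   
       @parquet_decryption_config.setter
       def parquet_decryption_config(self, config):
           if not isinstance(config, ParquetDecryptionConfig):
               raise ValueError("config must be a ParquetDecryptionConfig")
           self._parquet_decryption_config = config
           # In the Cython class, config.unwrap() would be assigned to
           # self.parquet_options.parquet_decryption_config here.
   ```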



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *
+
 
 cdef Expression _true = Expression._scalar(True)
 
 
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+cdef class ParquetEncryptionConfig(_Weakrefable):

Review Comment:
   Question on naming: is what is called "config" here more or less what maps to the "properties" in the single-file case?
   The keyword in `pq.read_table` is called `decryption_properties`, but here the keyword is `decryption_config` (below in the ParquetFragmentScanOptions). Could those two names be consolidated, or is that difference actually intentional?
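   For reference, the two spellings side by side (keywords as described in this thread; the calls are left commented out since they need a full KMS setup to run):
   
   ```python
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq
   
   # Single-file API spells it "properties":
   # table = pq.read_table("data.parquet",
   #                       decryption_properties=file_decryption_properties)
   
   # Dataset API in this PR spells it "config":
   # options = ds.ParquetFragmentScanOptions(
   #     decryption_config=parquet_decryption_config)
   ```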



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *
+
 
 cdef Expression _true = Expression._scalar(True)
 
 
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+cdef class ParquetEncryptionConfig(_Weakrefable):
+    cdef:
+        shared_ptr[CParquetEncryptionConfig] c_config
+
+    # Avoid mistakenly creating attributes
+    __slots__ = ()
+
+    def __cinit__(self, object crypto_factory, object kms_connection_config,
+                  object encryption_config):
+
+        cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+        if encryption_config is None:
+            raise ValueError(
+                "encryption_config cannot be None")
+
+        self.c_config.reset(new CParquetEncryptionConfig())
+
+        c_encryption_config = pyarrow_unwrap_encryptionconfig(encryption_config)
+
+        self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                  pyarrow_unwrap_kmsconnectionconfig(
+                                      kms_connection_config),
+                                  c_encryption_config)
+
+    @staticmethod
+    cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+        cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+        python_config.c_config = c_config
+        return python_config
+
+    cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+        return self.c_config
+
+cdef class ParquetDecryptionConfig(_Weakrefable):

Review Comment:
   ```suggestion
   
   
   cdef class ParquetDecryptionConfig(_Weakrefable):
   ```
   
   (small formatting nit: two blank lines between class definitions; same for other cases further in this file)



##########
python/pyarrow/dataset_api.pxi:
##########
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from libcpp.memory cimport shared_ptr
+
+
+cdef api bint pyarrow_is_cryptofactory(object crypto_factory):
+    return isinstance(crypto_factory, CryptoFactory)
+
+cdef api shared_ptr[CCryptoFactory] pyarrow_unwrap_cryptofactory(object crypto_factory):

Review Comment:
   Can you expand a bit on why those functions were put in this separate file? (and if it is needed, it would be good to add some context for it as a comment in the file)



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -52,12 +52,81 @@ from pyarrow._parquet cimport (
     FileMetaData,
 )
 
+from pyarrow._parquet_encryption cimport *
+
 
 cdef Expression _true = Expression._scalar(True)
 
 
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+cdef class ParquetEncryptionConfig(_Weakrefable):

Review Comment:
   Can you add a docstring to this class? (and same for ParquetDecryptionConfig)
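   A sketch of the kind of docstring being requested (wording is an assumption, shown as plain Python for brevity):
   
   ```python
   class ParquetEncryptionConfig:
       """Core configuration for writing an encrypted Parquet dataset.
   
       Bundles the ``CryptoFactory``, ``KmsConnectionConfig`` and
       ``EncryptionConfiguration`` objects so the dataset writer can build
       per-file encryption properties.
       """
   ```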





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1294824888


##########
python/pyarrow/dataset_api.pxi:
##########
@@ -0,0 +1,59 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from libcpp.memory cimport shared_ptr
+
+
+cdef api bint pyarrow_is_cryptofactory(object crypto_factory):
+    return isinstance(crypto_factory, CryptoFactory)
+
+cdef api shared_ptr[CCryptoFactory] pyarrow_unwrap_cryptofactory(object crypto_factory):

Review Comment:
   If I remember correctly, I was initially having some dependency issues. This may not be the case anymore. I can try to include those definitions in the pyx file and see if there are any issues.





[GitHub] [arrow] wjones127 commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1527950165

   Hi @tolleybot. Looks like there may have been some conflicts. I'd recommend squashing your commits and then rebasing on main. You can do that with:
   
   ```shell
   git checkout main && git pull
   git checkout dataset_encryption
   git rebase -i main
   ```
   
   Choose squash for all but the first of your commits. Then fix any merge conflicts and commit the changes.
   




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1191381664


##########
cpp/src/parquet/properties.h:
##########
@@ -513,16 +535,16 @@ class PARQUET_EXPORT WriterProperties {
       return this;
     }
 
-    /// Disable decimal logical type with 1 <= precision <= 18 to be stored as
-    /// integer physical type.
+    /// Disable decimal logical type with 1 <= precision <= 18 to be stored
+    /// as integer physical type.
     ///
     /// Default disabled.
     Builder* disable_store_decimal_as_integer() {
       store_decimal_as_integer_ = false;
       return this;
     }
 
-    /// Enable writing page index in general for all columns. Default disabled.

Review Comment:
   Ok



##########
cpp/src/parquet/properties.h:
##########
@@ -582,8 +604,6 @@ class PARQUET_EXPORT WriterProperties {
         get(item.first).set_dictionary_enabled(item.second);
       for (const auto& item : statistics_enabled_)
         get(item.first).set_statistics_enabled(item.second);
-      for (const auto& item : page_index_enabled_)
-        get(item.first).set_page_index_enabled(item.second);

Review Comment:
   Got it





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1191383767


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########


Review Comment:
   Will do





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1607552262

   I updated the Python build. It was successful locally and I have pushed up the changes.




[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1615163773

   The latest changes addressed the previous build issue.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1247920623


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +66,25 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   Good catch. I'll make that change.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1247920069


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -36,6 +36,10 @@ if(ARROW_PARQUET)
   string(APPEND ARROW_DATASET_PKG_CONFIG_REQUIRES " parquet")
 endif()
 
+if(PARQUET_REQUIRE_ENCRYPTION)
+  add_definitions(-DPARQUET_REQUIRE_ENCRYPTION=1)
+endif()
+

Review Comment:
   will do





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1248069932


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -36,6 +36,10 @@ if(ARROW_PARQUET)
   string(APPEND ARROW_DATASET_PKG_CONFIG_REQUIRES " parquet")
 endif()
 
+if(PARQUET_REQUIRE_ENCRYPTION)
+  add_definitions(-DPARQUET_REQUIRE_ENCRYPTION=1)
+endif()
+

Review Comment:
   done



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +66,25 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#if PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   done





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254656599


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,70 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = std::move(crypto_factory);
+    this->kms_connection_config = std::move(kms_connection_config);
+    this->encryption_config = std::move(encryption_config);
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};

Review Comment:
   I recall now the reason the Setup function exists: it resolves a compile-time issue in the Python/Cython bindings. Previously, setting the values directly caused compiler errors due to undefined types when the PARQUET_REQUIRE_ENCRYPTION flag wasn't set to "ON". Introducing the Setup function resolved those errors.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1249752764


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"

Review Comment:
   will do





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1265522216


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   Probably rename to `ParquetEncryptionConfiguration` and `ParquetDecryptionConfiguration`?



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +228,19 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   I am sorry, but could you please revert the consolidation of `DatasetEncryptionConfiguration` and `DatasetDecryptionConfiguration` which I suggested previously? I thought they should be merged if added to ParquetFileFormat, but now this has moved to the Options. My apologies again!





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1265705188


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   Ok. There is only one structure now, DatasetEncryptionConfiguration. I can rename it.





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1142834923


##########
cpp/src/parquet/encryption/dataset_encryption_config.h:
##########
@@ -0,0 +1,31 @@
+#pragma once
+
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet {
+namespace encryption {
+
+struct PARQUET_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   Is it better not to expose dataset concepts within the parquet module? For example, line 6 looks strange to me. `ARROW_PARQUET` can be built without `ARROW_DATASET` enabled. We may add a translation layer to pass these configs to the parquet writer.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1164295889


##########
python/examples/dataset/write_dataset_encrypted.py:
##########
@@ -0,0 +1,85 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+
+import pyarrow as pa
+import pyarrow.dataset as ds
+import pyarrow.parquet as pq
+import pyarrow.parquet.encryption as pe
+from pyarrow.tests.parquet.encryption import InMemoryKmsClient
+from datetime import timedelta
+import shutil
+import os
+
+""" A sample to demostrate dataset encryption and decryption"""
+
+# create a list of dictionaries that will represent our dataset
+table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+                  'n_legs': [2, 2, 4, 4, 5, 100],
+                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+                             "Brittle stars", "Centipede"]})
+
+# create a PyArrow dataset from the table
+dataset = ds.dataset(table)
+
+FOOTER_KEY = b"0123456789112345"
+FOOTER_KEY_NAME = "footer_key"
+COL_KEY = b"1234567890123450"
+COL_KEY_NAME = "col_key"
+
+encryption_config = pe.EncryptionConfiguration(
+    footer_key=FOOTER_KEY_NAME,

Review Comment:
   set footer to non-plaintext
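   Concretely, a minimal sketch of that change (assuming the truncated sample had enabled ``plaintext_footer``): keep the footer encrypted by leaving ``plaintext_footer`` at its default of False, or set it explicitly:
   
   ```python
   import pyarrow.parquet.encryption as pe
   
   FOOTER_KEY_NAME = "footer_key"
   COL_KEY_NAME = "col_key"
   
   # Keep the footer encrypted rather than plaintext.
   encryption_config = pe.EncryptionConfiguration(
       footer_key=FOOTER_KEY_NAME,
       column_keys={COL_KEY_NAME: ["n_legs", "animal"]},
       plaintext_footer=False,
   )
   ```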





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1164226053


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########


Review Comment:
   I added this test 





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1170154371


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,403 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key,
+      const std::string_view& footer_key_name = "footer_key",
+      const std::string_view& column_key_mapping = "col_key: a") {
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    ::parquet::encryption::KmsConnectionConfig kms_connection_config;
+
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_encryption_config.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+
+    dataset_encryption_config.encryption_config->column_keys = column_key_mapping;
+
+    if (footer_key_name.size() > 0) {
+      dataset_encryption_config.encryption_config->footer_key = footer_key_name;
+    }
+
+    dataset_decryption_config.crypto_factory = crypto_factory;
+    dataset_decryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_decryption_config.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write the dataset to disk with encryption, then read it back
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                         dsFooterMasterKeyId, dsFooterMasterKey);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+
+    // ----- Read the Dataset -----
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    factory_options.partition_base_dir = "";
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             mock_fs, selector, file_format, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+    // Define the callback function
+    std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+        visitor =
+            [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+      return arrow::Status::OK();
+    };
+
+    // Scan the dataset and process the record batches using the callback function
+    arrow::Status status = scanner->Scan(visitor);

Review Comment:
   You're correct! I made the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1165970375


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,403 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key,
+      const std::string_view& footer_key_name = "footer_key",
+      const std::string_view& column_key_mapping = "col_key: a") {
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    ::parquet::encryption::KmsConnectionConfig kms_connection_config;

Review Comment:
   nit: instead of declaring these at the top, could you declare each of them right where it is first initialized?



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,403 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key,
+      const std::string_view& footer_key_name = "footer_key",
+      const std::string_view& column_key_mapping = "col_key: a") {
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    ::parquet::encryption::KmsConnectionConfig kms_connection_config;
+
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_encryption_config.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+
+    dataset_encryption_config.encryption_config->column_keys = column_key_mapping;
+
+    if (footer_key_name.size() > 0) {
+      dataset_encryption_config.encryption_config->footer_key = footer_key_name;
+    }
+
+    dataset_decryption_config.crypto_factory = crypto_factory;
+    dataset_decryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_decryption_config.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write the dataset to disk with encryption, then read it back
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                         dsFooterMasterKeyId, dsFooterMasterKey);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+
+    // ----- Read the Dataset -----
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    factory_options.partition_base_dir = "";
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             mock_fs, selector, file_format, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+    // Define the callback function
+    std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+        visitor =
+            [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+      return arrow::Status::OK();
+    };
+
+    // Scan the dataset and process the record batches using the callback function
+    arrow::Status status = scanner->Scan(visitor);

Review Comment:
   I think you need to create a new scanner here
   
   ```suggestion
       ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_in->NewScan());
       ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
       arrow::Status status = scanner->Scan(visitor);
   ```
   
   



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,403 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key,
+      const std::string_view& footer_key_name = "footer_key",
+      const std::string_view& column_key_mapping = "col_key: a") {
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    ::parquet::encryption::KmsConnectionConfig kms_connection_config;
+
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    // Set up our Dataset encryption configurations
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_encryption_config.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+
+    dataset_encryption_config.encryption_config->column_keys = column_key_mapping;
+
+    if (footer_key_name.size() > 0) {
+      dataset_encryption_config.encryption_config->footer_key = footer_key_name;
+    }
+
+    dataset_decryption_config.crypto_factory = crypto_factory;
+    dataset_decryption_config.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config);
+    dataset_decryption_config.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write the dataset to disk with encryption, then read it back
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                         dsFooterMasterKeyId, dsFooterMasterKey);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+
+    // ----- Read the Dataset -----
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    factory_options.partition_base_dir = "";
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             mock_fs, selector, file_format, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+    // Define the callback function
+    std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+        visitor =
+            [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+      return arrow::Status::OK();
+    };
+
+    // Scan the dataset and process the record batches using the callback function
+    arrow::Status status = scanner->Scan(visitor);
+
+    // Check if there was an error during iteration
+    ASSERT_OK(status);
+  }
+
+  // Write dataset to disk with encryption and then read in a single parquet file
+  void WriteReadSingleFile() {
+    auto file_format =
+        CreateFileFormat(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                         dsFooterMasterKeyId, dsFooterMasterKey);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    // ----- Read Single File -----
+
+    // Define the path to the encrypted Parquet file
+    std::string file_path = "part=a/part0.parquet";
+
+    auto dataset_decryption_config = file_format->GetDatasetDecryptionConfig();
+
+    auto crypto_factory = dataset_decryption_config->crypto_factory;
+
+    // Get the FileDecryptionProperties object using the CryptoFactory object
+    auto file_decryption_properties = crypto_factory->GetFileDecryptionProperties(
+        *dataset_decryption_config->kms_connection_config,
+        *dataset_decryption_config->decryption_config);
+
+    // Create the ReaderProperties object using the FileDecryptionProperties object
+    auto reader_properties = std::make_shared<parquet::ReaderProperties>();
+    reader_properties->file_decryption_properties(file_decryption_properties);
+
+    // Open the Parquet file using the MockFileSystem
+    std::shared_ptr<arrow::io::RandomAccessFile> input;
+    ASSERT_OK_AND_ASSIGN(input, mock_fs->OpenInputFile(file_path));
+
+    parquet::arrow::FileReaderBuilder reader_builder;
+    ASSERT_OK(reader_builder.Open(input, *reader_properties));
+
+    ASSERT_OK_AND_ASSIGN(auto arrow_reader, reader_builder.Build());
+
+    // Read entire file as a single Arrow table
+    std::shared_ptr<arrow::Table> table;
+    ASSERT_OK(arrow_reader->ReadTable(&table));
+
+    // Add assertions to check the contents of the table
+    ASSERT_EQ(table->num_rows(), 1);
+    ASSERT_EQ(table->num_columns(), 3);
+  }
+
+  // Verify that Parquet metadata cannot be read without decryption
+  // properties when the footer is encrypted
+  void CannotReadMetadataWithEncryptedFooter() {
+    auto file_format =
+        CreateFileFormat(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                         dsFooterMasterKeyId, dsFooterMasterKey);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create filesystem
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    // ----- Read Single File -----
+
+    // Define the path to the encrypted Parquet file
+    std::string file_path = "part=a/part0.parquet";
+
+    auto dataset_decryption_config = file_format->GetDatasetDecryptionConfig();
+
+    auto crypto_factory = dataset_decryption_config->crypto_factory;
+
+    // Get the FileDecryptionProperties object using the CryptoFactory object
+    auto file_decryption_properties = crypto_factory->GetFileDecryptionProperties(
+        *dataset_decryption_config->kms_connection_config,
+        *dataset_decryption_config->decryption_config);
+
+    // Create the ReaderProperties object using the FileDecryptionProperties object
+    auto reader_properties = std::make_shared<parquet::ReaderProperties>();
+    reader_properties->file_decryption_properties(file_decryption_properties);
+
+    // Open the Parquet file using the MockFileSystem
+    std::shared_ptr<arrow::io::RandomAccessFile> input;
+    ASSERT_OK_AND_ASSIGN(input, mock_fs->OpenInputFile(file_path));
+
+    parquet::arrow::FileReaderBuilder reader_builder;
+    ASSERT_OK(reader_builder.Open(input, *reader_properties));
+
+    // Try to read metadata without providing decryption properties
+    try {
+      auto reader_properties_without_decryption = parquet::ReaderProperties();
+      auto file_metadata = parquet::ReadMetaData(input);
+      // If this point is reached, the metadata was read successfully, which is not
+      // expected
+      FAIL()
+          << "Expected an exception when reading metadata without decryption properties";
+    } catch (const std::exception& e) {
+      // An exception is expected, so the test passes
+      SUCCEED();
+    }

Review Comment:
   You should be able to use ASSERT_THROW from GTest. See similar examples throughout the codebase.
   
   ```suggestion
       auto reader_properties_without_decryption = parquet::ReaderProperties();
       ASSERT_THROW(parquet::ReadMetaData(input), ParquetException);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1145582094


##########
cpp/src/parquet/CMakeLists.txt:
##########
@@ -230,7 +230,8 @@ if(PARQUET_REQUIRE_ENCRYPTION)
       encryption/key_metadata.cc
       encryption/key_toolkit.cc
       encryption/key_toolkit_internal.cc
-      encryption/local_wrap_kms_client.cc)
+      encryption/local_wrap_kms_client.cc
+      encryption/test_in_memory_kms.cc)

Review Comment:
   `encryption/test_in_memory_kms.cc` is missing.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1313600702


##########
cpp/cmake_modules/UseCython.cmake:
##########
@@ -82,6 +82,9 @@ function(compile_pyx
   set(extension ${CYTHON_C_EXTENSION})
   set(pyx_lang "C")
   set(comment "Compiling Cython C source for ${_name}...")
+  #define compile time env variable for cython conditional compilation
+  set(CYTHON_COMPILE_TIME_ENV "" CACHE STRING "Compile time environment for Cython.") 
+

Review Comment:
   Could you revert this?
   
   ```suggestion
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317575057


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -693,6 +825,10 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         if thrift_container_size_limit is not None:
             self.thrift_container_size_limit = thrift_container_size_limit
 
+        IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   I enhanced the handling of the `decryption_config` parameter. Previously, if `PARQUET_ENCRYPTION_ENABLED` was not set, the `decryption_config` parameter might have been silently ignored. To address this, I added explicit checks: if `PARQUET_ENCRYPTION_ENABLED` is False and a `decryption_config` is provided, an exception is now raised to tell the user that encryption support is not enabled. This ensures users are made aware when they supply a decryption configuration to a build without encryption support, preventing silent failures and unexpected behavior.
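   For illustration, the user-visible behavior being described could look like this (the placeholder object and the exact exception type/message are assumptions; the point is that the config is rejected rather than dropped):
   
    ```python
    import pyarrow.dataset as ds

    # Placeholder standing in for a real decryption config; on a build
    # compiled without Parquet encryption support, passing any value
    # should now fail loudly instead of being silently ignored.
    my_decryption_config = object()

    try:
        ds.ParquetFragmentScanOptions(decryption_config=my_decryption_config)
    except Exception as exc:  # exact exception type is an assumption
        print("rejected instead of ignored:", exc)
    ```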



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object
+            Factory for creating cryptographic instances.
+        kms_connection_config : object
+            Configuration for connecting to Key Management Service.
+        encryption_config : object
+            Configuration for encryption settings.
+
+        Raises
+        ------
+        ValueError
+            If encryption_config is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, object crypto_factory, object kms_connection_config,
+                      object encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if encryption_config is None:
+                raise ValueError(
+                    "encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = pyarrow_unwrap_encryptionconfig(encryption_config)
+
+            self.c_config.get().Setup(pyarrow_unwrap_cryptofactory(crypto_factory),
+                                      pyarrow_unwrap_kmsconnectionconfig(
+                kms_connection_config),
+                c_encryption_config)
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):
+        """

Review Comment:
   Done



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +62,113 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+
+    from pyarrow._parquet_encryption cimport *
+
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : object

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1318702270


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,8 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// A configuration structure that provides per file encryption properties for a dataset

Review Comment:
   I updated the docstring



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1704040011

   Could you rebase on main to include #37481?
   It will fix the pandas-related CI failures.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1309469438


##########
python/CMakeLists.txt:
##########
@@ -336,6 +343,8 @@ if(PYARROW_BUILD_PARQUET_ENCRYPTION)
   else()
     message(FATAL_ERROR "You must build Arrow C++ with PARQUET_REQUIRE_ENCRYPTION=ON")
   endif()
+else()
+  set(CYTHON_COMPILE_TIME_ENV "PARQUET_ENCRYPTION_ENABLED=0")
 endif()

Review Comment:
   How about using the existing `CYTHON_FLAGS` variable instead of adding a new `CYTHON_COMPILE_TIME_ENV` variable?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329110215


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the high-level encryption parameters
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
+  void Setup(

Review Comment:
   It was originally there to be used by the Python bindings, but it's removed now as it's no longer needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "scoder (via GitHub)" <gi...@apache.org>.
scoder commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753153130

   > but then we still need to use those objects conditionally in `_dataset_parquet.pyx`, since we have to unwrap those and pass the C++ objects down
   
   From a quick look, it seems trivial to write a `def` function in the optional module that does this for you. Why do you think this makes it more complex or more difficult to maintain?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1352282178


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -639,7 +775,9 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             write_batch_size=None,
             dictionary_pagesize_limit=None,
             write_page_index=False,
+            encryption_config=None,

Review Comment:
   Can you check https://github.com/tolleybot/arrow/pull/4 first? In the last commit there, I added a read check without decryption, asserting that it should error.
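   For reference, the shape of that check could be sketched like this (the fixture name and the broad exception catch are assumptions for illustration, not the actual test code):
   
    ```python
    import pyarrow.dataset as ds
    import pytest

    # Reading back an encrypted dataset *without* a decryption config
    # must fail rather than return data.
    def test_read_without_decryption_errors(encrypted_dataset_path):
        dataset = ds.dataset(encrypted_dataset_path, format="parquet")
        with pytest.raises(Exception):
            dataset.to_table()
    ```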



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1350940054


##########
python/pyarrow/tests/test_dataset_encryption.py:
##########
@@ -0,0 +1,167 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+from datetime import timedelta
+import pyarrow.fs as fs
+import pyarrow as pa
+import pytest
+import numpy as np
+import tempfile
+import os
+
+encryption_enabled = False

Review Comment:
   I updated the name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1755877537

   @github-actions crossbow submit test-conda-python-3.10-pandas-latest




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1752866098

   No, I think only one of those modules would be compiled from CMake, and the target extension name should be the same for both (I hope that's possible using Cython?).




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1350504695


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -639,7 +775,9 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             write_batch_size=None,
             dictionary_pagesize_limit=None,
             write_page_index=False,
+            encryption_config=None,

Review Comment:
   @tolleybot This is called `encryption_config`, but above in `_set_properties` the key that is being set is `self._properties["encryption_properties"]`, so this was actually being ignored?
   
   Also, in the test_dataset_encryption.py you added, it is passed to the `encryption_config` keyword.
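   
   A toy illustration of the mismatch being described here, independent of the actual Cython source: if the accepted keyword and the stored key differ, the value is silently dropped.
   
   ```python
   # Stand-in for ParquetFileWriteOptions._properties; purely illustrative.
   _properties = {}
   
   def update(**kwargs):
       # The public keyword and the internal key must agree. Storing the value
       # under "encryption_properties" while callers pass "encryption_config"
       # would leave the setting unread.
       if "encryption_config" in kwargs:
           _properties["encryption_config"] = kwargs.pop("encryption_config")
   
   update(encryption_config=object())
   assert "encryption_config" in _properties
   ```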





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "scoder (via GitHub)" <gi...@apache.org>.
scoder commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1753583137

   > it actually defined a `PyInit__dataset_parquet_no_encryption` (so using the original file name, not the module target name), and so it is not finding `PyInit__dataset_parquet_encryption`.
   
   Do you really need such a module? I'd just try to drop it completely. If you use Python references, you can just check them for `None`.
   
   > pyx file has a different name than the module target name
   
   I keep forgetting the rules and options for this, but there are definitely ways to define the target name. In any case, Cython needs to know the final module name when generating the C/C++ code because it's part of the module's C-API. Renaming a module between its source file and its built target is usually not a good idea because it adds complexity to the build and makes things harder to find.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340518972


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,163 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Configuration for Parquet Encryption.
+
+        Parameters
+        ----------
+        crypto_factory : CryptoFactory
+            Factory for creating cryptographic instances.
+        kms_connection_config : KmsConnectionConfig
+            Configuration for connecting to Key Management Service.
+        encryption_config : EncryptionConfiguration

Review Comment:
   Ok
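   
   As a usage sketch for the class documented above, here is roughly how the three pieces could be assembled with pyarrow's existing `pyarrow.parquet.encryption` helpers and a test-only in-memory KMS; the names and constructor shape are hedged against whatever the final API looks like.
   
   ```python
   import base64
   
   import pyarrow.dataset as ds
   import pyarrow.parquet.encryption as pe
   
   
   class InMemoryKmsClient(pe.KmsClient):
       """Toy KMS for tests: 'wraps' a key by prepending the master key and
       base64-encoding the result. Not secure; illustration only."""
   
       def __init__(self, config):
           super().__init__()
           self.master_keys = config.custom_kms_conf
   
       def wrap_key(self, key_bytes, master_key_identifier):
           master = self.master_keys[master_key_identifier].encode("utf-8")
           return base64.b64encode(master + key_bytes)
   
       def unwrap_key(self, wrapped_key, master_key_identifier):
           master = self.master_keys[master_key_identifier].encode("utf-8")
           return base64.b64decode(wrapped_key)[len(master):]
   
   
   kms_config = pe.KmsConnectionConfig(
       custom_kms_conf={"footer_key": "0123456789012345",
                        "col_key": "1234567890123450"})
   crypto_factory = pe.CryptoFactory(lambda cfg: InMemoryKmsClient(cfg))
   encryption_config = pe.EncryptionConfiguration(
       footer_key="footer_key",
       column_keys={"col_key": ["a"]})
   
   # The class under review bundles the three pieces together.
   parquet_encryption_cfg = ds.ParquetEncryptionConfig(
       crypto_factory, kms_config, encryption_config)
   ```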





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340525553


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   
   Thank you for your comment. Raising a NotImplementedError might not be the optimal approach in this scenario. PARQUET_ENCRYPTION_ENABLED is a compile-time flag that is set intentionally, so having the code throw an exception at runtime could lead to confusion.
   
   Users who deliberately disable the encryption feature presumably have no intention of using Parquet encryption. Throwing an error in this case might be seen as unexpected behavior, since users are unlikely to reach for a feature they knowingly disabled. This could lead to unnecessary troubleshooting and frustration.
   
   Looking forward to your thoughts on this.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1340529141


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -78,7 +239,8 @@ cdef class ParquetFileFormat(FileFormat):
         CParquetFileFormat* parquet_format
 
     def __init__(self, read_options=None,
-                 default_fragment_scan_options=None, **kwargs):
+                 default_fragment_scan_options=None,
+                 **kwargs):

Review Comment:
   will do





[GitHub] [arrow] ianmcook commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "ianmcook (via GitHub)" <gi...@apache.org>.
ianmcook commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1735753269

   @pitrou could you please re-review? Thank you




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342771373


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
+  void Setup(

Review Comment:
   I removed the Setup() function





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1347378803


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   Got it.  Thanks for the clarification. 





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1346455441


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -424,6 +425,34 @@ TEST_F(TestParquetFileSystemDataset, WriteWithEmptyPartitioningSchema) {
   TestWriteWithEmptyPartitioningSchema();
 }
 
+TEST_F(TestParquetFileSystemDataset, WriteWithEncryptionConfigNotSupported) {
+#ifndef PARQUET_REQUIRE_ENCRYPTION
+  // Create a dummy ParquetEncryptionConfig
+  std::shared_ptr<ParquetEncryptionConfig> encryption_config =
+      std::make_shared<ParquetEncryptionConfig>();
+
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+
+  // Set the encryption config in the options
+  options->parquet_encryption_config = encryption_config;
+
+  // Setup mock filesystem and test data
+  auto mock_fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
+  std::shared_ptr<Schema> test_schema = schema({field("x", int32())});
+  std::shared_ptr<RecordBatch> batch = RecordBatchFromJSON(test_schema, "[[0]]");
+  ASSERT_OK_AND_ASSIGN(std::shared_ptr<io::OutputStream> out_stream,
+                       mock_fs->OpenOutputStream("/foo.parquet"));
+  std::cout << "B" << std::endl;

Review Comment:
   Done





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1528055669

   It's my bad, I didn't use the squash option.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1162931316


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <arrow/api.h>

Review Comment:
   changed





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1187656336


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,391 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";

Review Comment:
   done



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,391 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key,
+      const std::string_view& footer_key_name = "footer_key",
+      const std::string_view& column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+    ::parquet::encryption::KmsConnectionConfig kms_connection_config;
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    DatasetEncryptionConfiguration dataset_encryption_config{
+        crypto_factory,
+        std::make_shared<parquet::encryption::KmsConnectionConfig>(kms_connection_config),

Review Comment:
   done





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1151400017


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <arrow/api.h>

Review Comment:
   The include style is inconsistent here.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include <string_view>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view dsFooterMasterKey = "0123456789012345";
+constexpr std::string_view dsFooterMasterKeyId = "footer_key";
+constexpr std::string_view dsColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view dsColumnMasterKeyIds[] = {"col_key"};
+const int dsNumColumns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys, dsNumColumns,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Setup our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping

Review Comment:
   The comment style looks inconsistent with the code base.



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Setup our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }
+    // add footer key
+    key_map.insert({footer_id, footer_key});
+
+    return key_map;
+  }
+
+  /** utility to build column key mapping
+   *
+   */
+  std::string BuildColumnKeyMapping() {
+    std::ostringstream stream;
+    stream << dsColumnMasterKeys[0] << ":"
+           << "a"
+           << ";";
+    return stream.str();
+  }
+  /** Write dataset to disk
+   *
+   */
+  void WriteDataset() {
+    auto base_path = "";
+    ASSERT_OK(file_system_->CreateDir(base_path));
+    // Write it using Datasets
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format_->DefaultWriteOptions();
+    write_options.filesystem = file_system_;
+    write_options.base_dir = base_path;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system_);
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(mock_fs, files);
+  }
+
+  /** A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  void ReadDataset() {
+    // File format
+    // Partitioning
+    auto partition_schema = arrow::schema({arrow::field("part", arrow::utf8())});
+    auto partitioning =
+        std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
+
+    // Get FileInfo objects for all files under the base directory
+    arrow::fs::FileSelector selector;
+    selector.base_dir = "";
+    selector.recursive = true;
+    ASSERT_OK_AND_ASSIGN(auto files, file_system_->GetFileInfo(selector));
+
+    // Create a FileSystemDatasetFactory
+    arrow::dataset::FileSystemFactoryOptions factory_options;
+    factory_options.partitioning = partitioning;
+    ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                         arrow::dataset::FileSystemDatasetFactory::Make(
+                             file_system_, files, file_format_, factory_options));
+
+    // Create a Dataset
+    ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+
+    // Create a ScannerBuilder
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+
+    // Create a Scanner
+    ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());

Review Comment:
   +1. 
   
   In addition to validating the read values, we probably need to verify whether the parquet metadata can be read without decryption properties when the footer is encrypted.
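   
   A sketch of that check in Python, assuming a file written by this dataset test with an encrypted footer; the path is hypothetical and the exact exception type may differ:
   
   ```python
   import pyarrow.parquet as pq
   import pytest
   
   # Reading the footer of "part=a/part0.parquet" without decryption
   # properties should fail rather than expose the metadata.
   with pytest.raises(OSError):
       pq.read_metadata("part=a/part0.parquet")
   ```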



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1145579316


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   Instead of putting these properties on `ParquetFileFormat` can you put the decryption config on `ParquetFragmentScanOptions` and the encryption config on `ParquetFileWriteOptions`?
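   
   Sketch of the user-facing shape this placement implies (keyword names are assumed; `enc_cfg` and `dec_cfg` stand in for configs built with a CryptoFactory as elsewhere in this PR):
   
   ```python
   import pyarrow.dataset as ds
   
   enc_cfg = None  # placeholder: a dataset-level encryption config
   dec_cfg = None  # placeholder: the matching decryption config
   
   # Read side: decryption settings travel with the fragment scan options.
   file_format = ds.ParquetFileFormat(
       default_fragment_scan_options=ds.ParquetFragmentScanOptions(
           decryption_config=dec_cfg))
   
   # Write side: encryption settings travel with the file write options.
   write_options = file_format.make_write_options(encryption_config=enc_cfg)
   ```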
   





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1144918924


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(file_system_,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // build our dummy table
+    BuildTable();
+
+    auto key_list = BuildKeyMap(dsColumnMasterKeyIds, dsColumnMasterKeys,
+                                dsFooterMasterKeyId, dsFooterMasterKey);
+
+    SetupCryptoFactory(true, key_list);
+
+    column_key_mapping_ = "col_key: a";
+
+    // Setup our Dataset encryption configurations
+    dataset_encryption_config_.crypto_factory = crypto_factory_;
+    dataset_encryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_encryption_config_.encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            footer_key_name_);
+    dataset_encryption_config_.encryption_config->column_keys = column_key_mapping_;
+    dataset_encryption_config_.encryption_config->footer_key = footer_key_name_;
+
+    dataset_decryption_config_.crypto_factory = crypto_factory_;
+    dataset_decryption_config_.kms_connection_config =
+        std::make_shared<::parquet::encryption::KmsConnectionConfig>(
+            kms_connection_config_);
+    dataset_decryption_config_.decryption_config =
+        std::make_shared<::parquet::encryption::DecryptionConfiguration>();
+
+    // create our Parquet file format object
+    file_format_ = std::make_shared<ParquetFileFormat>();
+
+    file_format_->SetDatasetEncryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetEncryptionConfiguration>(
+            dataset_encryption_config_));
+    file_format_->SetDatasetDecryptionConfig(
+        std::make_shared<::parquet::encryption::DatasetDecryptionConfiguration>(
+            dataset_decryption_config_));
+  }
+
+  /** utility to build the key map
+   *
+   */
+  std::unordered_map<std::string, std::string> BuildKeyMap(const char* const* column_ids,
+                                                           const char* const* column_keys,
+                                                           const char* footer_id,
+                                                           const char* footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < 1; i++) {
+      key_map.insert({column_ids[i], column_keys[i]});
+    }

Review Comment:
   Ok let me look into that





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1164541457


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   Do you think this is a blocker for this PR or could it be done as a separate PR?





[GitHub] [arrow] westonpace commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1169125808


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   I wouldn't want to add new pyarrow properties as part of the 12.0.0 release that we then have to do a deprecation cycle for.  However, it seems like the user-facing python API wouldn't have to change if we changed this internally.  If you agree then I think this can be a follow-up.





[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1474341953

   :warning: GitHub issue #29238 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1140735362


##########
cpp/src/parquet/properties.h:
##########
@@ -209,6 +209,22 @@ class PARQUET_EXPORT WriterProperties {
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
           page_checksum_enabled_(false) {}
+    
+    Builder(const WriterProperties& properties)

Review Comment:
   No problem. Just made the change





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1257303272


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   This may involve an API change, so I think it would be good to finalize it this time.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253520666


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys, ds_num_columns,
+                         ds_footer_master_key_id, ds_footer_master_key);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create the base directory
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";

Review Comment:
   done





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254376420


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,75 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  /// Core class that translates the parameters of high-level encryption
+
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = crypto_factory;
+    this->kms_connection_config = kms_connection_config;
+    this->encryption_config = encryption_config;
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {

Review Comment:
   I can take a look





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253413519


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,12 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc

Review Comment:
   Let me verify



##########
cpp/src/parquet/properties.h:
##########
@@ -218,7 +218,23 @@ class PARQUET_EXPORT WriterProperties {
           data_page_version_(ParquetDataPageVersion::V1),
           created_by_(DEFAULT_CREATED_BY),
           store_decimal_as_integer_(false),
-          page_checksum_enabled_(false) {}
+          page_checksum_enabled_(false),
+          default_column_properties_() {}

Review Comment:
   Ok





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254530922


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();

Review Comment:
   ```suggestion
     auto dataset_encrypt_config = GetDatasetEncryptionConfig();
   ```



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {
+    auto file_encryption_prop =
+        dataset_encrypt_config->crypto_factory->GetFileEncryptionProperties(
+            *dataset_encrypt_config->kms_connection_config,
+            *dataset_encrypt_config->encryption_config, destination_locator.path,
+            destination_locator.filesystem);
+
+    auto writer_properties =
+        parquet::WriterProperties::Builder(*parquet_options->writer_properties)
+            .encryption(std::move(file_encryption_prop))
+            ->build();
+
+    ARROW_ASSIGN_OR_RAISE(
+        parquet_writer, parquet::arrow::FileWriter::Open(
+                            *schema, writer_properties->memory_pool(), destination,
+                            writer_properties, parquet_options->arrow_writer_properties));
+  }
+#endif
+
+  if (parquet_writer == NULLPTR) {

Review Comment:
   ```suggestion
     if (parquet_writer == nullptr) {
   ```
   
   nit: it would be good to use `nullptr` consistently in the source file.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {

Review Comment:
   It will not be nullptr, right? We can simply remove this check.



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,70 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    this->crypto_factory = std::move(crypto_factory);
+    this->kms_connection_config = std::move(kms_connection_config);
+    this->encryption_config = std::move(encryption_config);
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};

Review Comment:
   ```suggestion
   struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
     std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
     std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
     std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
   };
   ```
   
   Have you considered simply defining it as the above? From your test cases, it seems that not all parameters are required, so the `Setup` function is not that helpful if no checks are done inside. Defining it as a plain aggregate also enables aggregate initialization with designated initializers in C++20, like this:
   ```
   DatasetEncryptionConfiguration conf = {
     .crypto_factory = xxx,
     .kms_connection_config = std::make_shared<parquet::encryption::KmsConnectionConfig>()
   };
   ```
   
   I am not saying we have to do this. Just a suggestion.
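   
   Under the same plain-aggregate assumption, a pre-C++20 build can fall back to plain field assignment, which is the pattern the test code in this thread already uses:
   
   ```
   // Minimal sketch, assuming the simplified aggregate form above.
   DatasetEncryptionConfiguration conf;
   conf.crypto_factory = crypto_factory;
   conf.kms_connection_config =
       std::make_shared<parquet::encryption::KmsConnectionConfig>();
   conf.encryption_config = encryption_config;
   ```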



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,373 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+namespace arrow {

Review Comment:
   ```suggestion
   const int kNumColumns = 1;
   
   namespace arrow {
   ```



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,7 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"

Review Comment:
   I think so; it is generally the convention to use forward declarations whenever possible.



##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -175,6 +175,12 @@ endif()
 
 if(ARROW_PARQUET)
   add_arrow_dataset_test(file_parquet_test)
+  if(PARQUET_REQUIRE_ENCRYPTION AND ARROW_DATASET)
+    add_arrow_dataset_test(dataset_encryption_test
+                           SOURCES
+                           ${PROJECT_SOURCE_DIR}/src/parquet/encryption/test_in_memory_kms.cc

Review Comment:
   Thanks for confirming!



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;

Review Comment:
   ```suggestion
     std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
   ```
   
   That's redundant.
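   
   For context, a `std::unique_ptr` default-constructs to null, so the explicit `NULLPTR` initializer is a no-op:
   
   ```
   #include <cassert>
   #include <memory>
   
   int main() {
     std::unique_ptr<int> p;  // default-constructed: already null
     assert(p == nullptr);    // so "= NULLPTR" on the declaration adds nothing
   }
   ```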



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_decryption_config_;
+#else
+    return NULLPTR;
+#endif

Review Comment:
   It seems that these two functions are always called in the scope of `PARQUET_REQUIRE_ENCRYPTION`. So we can remove the branch inside, right?
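   
   Under that assumption, the getters would reduce to plain accessors, along the lines of:
   
   ```
   std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
     return dataset_encryption_config_;
   }
   std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
     return dataset_decryption_config_;
   }
   ```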



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our ParquetFileFormat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Wrap the table in an in-memory dataset
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {
+    auto file_format =
+        CreateFileFormat(ds_column_master_key_ids, ds_column_master_keys, ds_num_columns,
+                         ds_footer_master_key_id, ds_footer_master_key);
+
+    // create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ASSERT_OK_AND_ASSIGN(auto file_system,
+                         ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // create the base directory
+    ASSERT_OK(file_system->CreateDir(""));
+
+    auto mock_fs =
+        std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+    ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = file_format->DefaultWriteOptions();
+    write_options.filesystem = file_system;
+    write_options.base_dir = "";
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",

Review Comment:
   Thanks for the explanation! Would you mind adding these as a comment?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254756535


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {

Review Comment:
   Looks like I need to keep that check in there, as it can be null.





[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1251356745


##########
python/pyarrow/dataset.py:
##########
@@ -85,6 +85,8 @@
         ParquetFragmentScanOptions,
         ParquetReadOptions,
         RowGroupInfo,
+        DatasetDecryptionConfiguration,
+        DatasetEncryptionConfiguration

Review Comment:
   It seems that we sort this list in alphabetical order.





[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1248986239


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -145,7 +145,6 @@ function(ADD_ARROW_DATASET_TEST REL_TEST_NAME)
                  EXTRA_LINK_LIBS
                  ${ARROW_DATASET_TEST_LINK_LIBS}
                  PREFIX
-                 ${PREFIX}

Review Comment:
   Is it an intentional change?



##########
cpp/src/parquet/properties.h:
##########
@@ -722,6 +744,10 @@ class PARQUET_EXPORT WriterProperties {
       return NULLPTR;
     }
   }
+  // \brief Returns the default column properties

Review Comment:
   ```suggestion
   
     // \brief Returns the default column properties
   ```



##########
cpp/src/arrow/util/config.h.cmake:
##########
@@ -59,3 +59,6 @@
 #cmakedefine ARROW_WITH_UCX
 
 #cmakedefine GRPCPP_PP_INCLUDE
+#cmakedefine PARQUET_REQUIRE_ENCRYPTION
+
+

Review Comment:
   We may want to group Parquet related definitions in the future:
   
   ```suggestion
   
   #cmakedefine PARQUET_REQUIRE_ENCRYPTION
   ```



##########
python/sample_dataset/2019/part-0.parquet:
##########


Review Comment:
   Do we need to add this file?
   It seems that this is generated automatically.



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"

Review Comment:
   We may need to include `arrow/util/config.h` explicitly?
   Could you check whether `PARQUET_REQUIRE_ENCRYPTION` is defined or not in this file?
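   
   For reference, the macro comes from the CMake template in this diff, so making it visible here should only require:
   
   ```
   #include "arrow/util/config.h"  // defines PARQUET_REQUIRE_ENCRYPTION when enabled
   ```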



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,8 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+

Review Comment:
   ```suggestion
   ```





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1262599723


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   That makes sense.  I'll move forward with the user supplying the file_options.
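   
   That direction would look roughly like the sketch below; the `parquet_encryption_config` member name is hypothetical at this point in the thread:
   
   ```
   // Hypothetical sketch: attach the encryption config to the per-file
   // write options instead of storing it on the format object.
   auto file_format = std::make_shared<ParquetFileFormat>();
   auto file_write_options = checked_pointer_cast<ParquetFileWriteOptions>(
       file_format->DefaultWriteOptions());
   file_write_options->parquet_encryption_config = parquet_encryption_config;
   write_options.file_write_options = file_write_options;
   ```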





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1270484757


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create filesystem
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Wrap the table in an in-memory dataset
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.

Review Comment:
   I just took another pass on this test, and my understanding may be wrong. Only column `a` is encrypted, and all partitioned files share the same encryption properties. How should I understand `applying distinct file encryption properties to each file`?





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1271303414


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,412 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  ::arrow::Result<std::shared_ptr<::arrow::fs::FileSystem>>
+  CreateMockFileSystemAndWriteData(
+      const std::string& base_dir,
+      const std::shared_ptr<FileWriteOptions>& parquet_file_write_options) {
+    // Create our mock file system
+    ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+    ARROW_ASSIGN_OR_RAISE(auto file_system,
+                          ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+    // Create the base directory
+    RETURN_NOT_OK(file_system->CreateDir(base_dir));
+
+    auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+    auto partitioning =
+        std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+    // ----- Write the Dataset ----
+    auto dataset_out = BuildTable();
+    ARROW_ASSIGN_OR_RAISE(auto scanner_builder_out, dataset_out->NewScan());
+    ARROW_ASSIGN_OR_RAISE(auto scanner_out, scanner_builder_out->Finish());
+
+    ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = base_dir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    RETURN_NOT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    ValidateFilesExist(file_system, files);
+
+    return file_system;
+  }
+
+  // Create dataset encryption properties
+  std::pair<std::shared_ptr<ParquetEncryptionConfig>,
+            std::shared_ptr<ParquetDecryptionConfig>>
+  CreateParquetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    auto kms_connection_config =
+        std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    ParquetEncryptionConfig parquet_encryption_config;
+    parquet_encryption_config.Setup(crypto_factory, kms_connection_config,
+                                    encryption_config);
+    auto decryption_config =
+        std::make_shared<parquet::encryption::DecryptionConfiguration>();
+    ParquetDecryptionConfig parquet_decryption_config;
+    parquet_decryption_config.Setup(crypto_factory, kms_connection_config,
+                                    decryption_config);
+    return std::make_pair(
+        std::make_shared<ParquetEncryptionConfig>(parquet_encryption_config),
+        std::make_shared<ParquetDecryptionConfig>(parquet_decryption_config));
+  }
+
+  // Utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // Add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // Add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate that our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::FileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_EQ(result.type(), arrow::fs::FileType::File);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Wrap the table in an in-memory dataset
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    auto kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.

Review Comment:
   Thanks, I have pushed some changes related to the cpp test.





[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1696164862

   Revision: 7812921acbe5104eb1c74105389b9a70ead16d31
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-163bbcbe9e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-163bbcbe9e)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-163bbcbe9e-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6003636985/job/16282505288)|




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1251384212


##########
python/pyarrow/dataset.py:
##########
@@ -85,6 +85,8 @@
         ParquetFragmentScanOptions,
         ParquetReadOptions,
         RowGroupInfo,
+        DatasetDecryptionConfiguration,
+        DatasetEncryptionConfiguration

Review Comment:
   Ok





[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1266938905


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create our parquetfileformat with encryption properties
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   ```suggestion
         internal::checked_pointer_cast<ParquetFileWriteOptions>(file_write_options);
   ```





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1755672898

   Revision: d7a6b55e49c50f219cb2a0e07d5e5904b457d2af
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-f652f7cf6d](https://github.com/ursacomputing/crossbow/branches/all?query=actions-f652f7cf6d)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10)](https://github.com/ursacomputing/crossbow/actions/runs/6471348597/job/17569587868)|
   |test-conda-python-3.10-cython2|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-cython2)](https://github.com/ursacomputing/crossbow/actions/runs/6471348499/job/17569587526)|
   |test-conda-python-3.10-hdfs-2.9.2|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-hdfs-2.9.2)](https://github.com/ursacomputing/crossbow/actions/runs/6471348579/job/17569587527)|
   |test-conda-python-3.10-hdfs-3.2.1|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-hdfs-3.2.1)](https://github.com/ursacomputing/crossbow/actions/runs/6471348598/job/17569587548)|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6471348558/job/17569587514)|
   |test-conda-python-3.10-pandas-nightly|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-pandas-nightly)](https://github.com/ursacomputing/crossbow/actions/runs/6471348810/job/17569588774)|
   |test-conda-python-3.10-spark-v3.4.1|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-spark-v3.4.1)](https://github.com/ursacomputing/crossbow/actions/runs/6471348643/job/17569588485)|
   |test-conda-python-3.10-substrait|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.10-substrait)](https://github.com/ursacomputing/crossbow/actions/runs/6471348829/job/17569588584)|
   |test-conda-python-3.11|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11)](https://github.com/ursacomputing/crossbow/actions/runs/6471348812/job/17569588548)|
   |test-conda-python-3.11-dask-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11-dask-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6471348977/job/17569589297)|
   |test-conda-python-3.11-dask-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11-dask-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/6471348953/job/17569589298)|
   |test-conda-python-3.11-hypothesis|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11-hypothesis)](https://github.com/ursacomputing/crossbow/actions/runs/6471348965/job/17569589322)|
   |test-conda-python-3.11-pandas-upstream_devel|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11-pandas-upstream_devel)](https://github.com/ursacomputing/crossbow/actions/runs/6471349079/job/17569590053)|
   |test-conda-python-3.11-spark-master|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.11-spark-master)](https://github.com/ursacomputing/crossbow/actions/runs/6471349067/job/17569589908)|
   |test-conda-python-3.8|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.8)](https://github.com/ursacomputing/crossbow/actions/runs/6471349048/job/17569589823)|
   |test-conda-python-3.8-pandas-1.0|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.8-pandas-1.0)](https://github.com/ursacomputing/crossbow/actions/runs/6471349043/job/17569589791)|
   |test-conda-python-3.8-spark-v3.4.1|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.8-spark-v3.4.1)](https://github.com/ursacomputing/crossbow/actions/runs/6471349058/job/17569589801)|
   |test-conda-python-3.9|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.9)](https://github.com/ursacomputing/crossbow/actions/runs/6471349513/job/17569591317)|
   |test-conda-python-3.9-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-conda-python-3.9-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6471349258/job/17569591018)|
   |test-cuda-python|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-cuda-python)](https://github.com/ursacomputing/crossbow/actions/runs/6471349249/job/17569590639)|
   |test-debian-11-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f652f7cf6d-azure-test-debian-11-python-3)](https://github.com/ursacomputing/crossbow/runs/17569592761)|
   |test-fedora-35-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f652f7cf6d-azure-test-fedora-35-python-3)](https://github.com/ursacomputing/crossbow/runs/17569590712)|
   |test-ubuntu-20.04-python-3|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f652f7cf6d-azure-test-ubuntu-20.04-python-3)](https://github.com/ursacomputing/crossbow/runs/17569591524)|
   |test-ubuntu-22.04-python-3|[![Github Actions](https://github.com/ursacomputing/crossbow/actions/workflows/crossbow.yml/badge.svg?branch=actions-f652f7cf6d-github-test-ubuntu-22.04-python-3)](https://github.com/ursacomputing/crossbow/actions/runs/6471349331/job/17569591305)|




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1755157127

   > Do you really need such a module? I'd just try to drop it completely. If you use Python references, you can just check them for `None`.
   
   It turns out that it is indeed not needed at all (I originally had it as a cimport because I was using a C++ type, but after making it a def, I should also have removed the pxd ..). Thanks for the help @scoder!
   
   (got it working after some more fun runtime errors, like "AttributeError: module 'pyarrow._parquet_encryption' has no attribute '__pyx_capi__'", clean rebuilds!)
   
   Version that seems to be working locally -> https://github.com/tolleybot/arrow/pull/4
   
   > I keep forgetting the rules and options for this, but there are definitely ways to define the target name. In any case, Cython needs to know the final module name when generating the C/C++ code because it's part of the module's C-API. Renaming modules between in and out is usually not a good idea because it adds complexity to the build and makes things harder to find.
   
   AFAIK I was using standard cython features to specify the result and source file names, like `python -m cython --gdb --warning-errors --output-file .../_dataset_parquet_encryption.c .../_dataset_parquet_no_encryption.pyx`. That is only using `--output-file`, though, and should probably also have set `--module-name`, which wasn't done by our cmake wrapper around `python -m cython`.
   (anyway, not needed here anymore, but it is a confusing error)




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1316582555


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -631,7 +752,13 @@ cdef class ParquetFileWriteOptions(FileWriteOptions):
             coerce_timestamps=None,
             allow_truncated_timestamps=False,
             use_compliant_nested_type=True,
+            encryption_config=None,
         )
+        IF PARQUET_ENCRYPTION_ENABLED:
+            if self._properties["encryption_config"] is not None:
+                print("Encryption is not enabled in this build of pyarrow. "
+                      "Please reinstall pyarrow with encryption enabled.")
+

Review Comment:
   Good point. I'll make that change.





[GitHub] [arrow] kou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1310965664


##########
python/CMakeLists.txt:
##########
@@ -336,6 +343,8 @@ if(PYARROW_BUILD_PARQUET_ENCRYPTION)
   else()
     message(FATAL_ERROR "You must build Arrow C++ with PARQUET_REQUIRE_ENCRYPTION=ON")
   endif()
+else()
+  set(CYTHON_COMPILE_TIME_ENV "PARQUET_ENCRYPTION_ENABLED=0")
 endif()

Review Comment:
   Could you try `list(APPEND CYTHON_FLAGS "-E" "PARQUET_ENCRYPTION_ENABLED=0")`?





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1347376880


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   > Do we want to define these functions with an ELSE statement and then throw an exception if they are called?
   
   Yes, because right now the user would probably get a generic error such as `AttributeError("ParquetFragmentScanOptions has no attribute 'parquet_decryption_config'")`, which may also occur because of a typo in the attribute name, while it would be more informative to have something like `NotImplementedError("Trying to access the 'parquet_decryption_config' attribute, but encryption is not enabled in this build")`
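   
   For illustration, a minimal plain-Python sketch of this pattern (the property name is real, but the stand-in class below and the exact message text are hypothetical, not the actual Cython binding):
   
   ```python
   class ParquetFragmentScanOptions:
       # Illustrative stand-in for the real Cython class, showing only the
       # error-reporting pattern. In an actual build, whether encryption is
       # available would be fixed at compile time, not a class attribute.
       _encryption_enabled = False
       _decryption_config = None
   
       @property
       def parquet_decryption_config(self):
           if not self._encryption_enabled:
               # A dedicated error makes a disabled build distinguishable
               # from a typo, which raises a plain AttributeError instead.
               raise NotImplementedError(
                   "Trying to access the 'parquet_decryption_config' "
                   "attribute, but encryption is not enabled in this build")
           return self._decryption_config
   ```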





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342878330


##########
python/pyarrow/_parquet_encryption.pyx:
##########
@@ -466,3 +466,47 @@ cdef class CryptoFactory(_Weakrefable):
 
     def remove_cache_entries_for_all_tokens(self):
         self.factory.get().RemoveCacheEntriesForAllTokens()
+
+    cdef inline shared_ptr[CPyCryptoFactory] unwrap(self) nogil:

Review Comment:
   I removed this





[GitHub] [arrow] jorisvandenbossche commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1678630935

   It does _build_, but with the current changes to the dataset cython code I would still expect it to raise an error at run-time.
   
   Testing that locally while fetching this branch to review it, I first built it with my normal setup that has parquet enabled but not encryption; the build indeed passes, but trying to use Parquet through the dataset API raises an error:
   
   ```
   In [1]: import pyarrow.dataset as ds
   
   In [2]: ds.dataset("test.parquet")
   ...
   
   File ~/scipy/repos/arrow/python/pyarrow/dataset.py:298, in _ensure_format(obj)
       296 elif obj == "parquet":
       297     if not _parquet_available:
   --> 298         raise ValueError(_parquet_msg)
       299     return ParquetFileFormat()
       300 elif obj in {"ipc", "arrow"}:
   
   ValueError: The pyarrow installation is not built with support for the Parquet file format.
   ```
   
   It thinks that Parquet is not available, because it cannot import the ``pyarrow._dataset_parquet`` cython module. And this is because that module now tries to import the encryption module, which isn't available in my installation:
   
   ```
   In [3]: import pyarrow._dataset_parquet
   ...
   ModuleNotFoundError: No module named 'pyarrow._parquet_encryption'
   ```
   
   To verify this (and in general to ensure this is covered), we should check with one of the CI builds that has parquet encryption disabled (although given all builds are green, we might lack such a CI build).
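   
   One way to keep the dataset module importable would be to guard the run-time import; a rough sketch (hypothetical surrounding code, and note this only helps for `import` -- `cimport`ed declarations cannot be guarded this way, which is the harder part):
   
   ```python
   # Sketch of a guarded import at the top of pyarrow/_dataset_parquet:
   try:
       # only present when pyarrow was built with Parquet encryption
       from pyarrow import _parquet_encryption
       parquet_encryption_enabled = True
   except ImportError:
       _parquet_encryption = None
       parquet_encryption_enabled = False
   ```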




[GitHub] [arrow] jorisvandenbossche commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1678632325

   @github-actions crossbow submit test-conda-python-3.10-pandas-latest




[GitHub] [arrow] jorisvandenbossche commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1678681097

   @github-actions crossbow submit test-conda-python-3.10-pandas-latest




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1269770486


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -637,10 +717,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
         If not None, override the maximum total size of containers allocated
         when decoding Thrift structures. The default limit should be
         sufficient for most Parquet files.
+    dataset_decryption_config : ParquetDecryptionConfig, default None

Review Comment:
   I did this in a couple of places where I thought it made sense. Let me know if you have any issues with this.
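   
   For reference, the user-facing shape this option gives the Python API is roughly the following (a sketch only; the keyword name matches this revision of the PR, and `decryption_config` is assumed to be a `ParquetDecryptionConfig` already built elsewhere from a `CryptoFactory`, `KmsConnectionConfig` and `DecryptionConfiguration`):
   
   ```python
   import pyarrow.dataset as ds
   
   # scan options carrying the per-dataset decryption settings;
   # `decryption_config` is an assumed, pre-built ParquetDecryptionConfig
   scan_options = ds.ParquetFragmentScanOptions(
       dataset_decryption_config=decryption_config)
   
   file_format = ds.ParquetFileFormat(
       default_fragment_scan_options=scan_options)
   
   # every Parquet fragment read through this dataset will be decrypted
   dataset = ds.dataset("sample_dataset/", format=file_format)
   ```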





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1348781965


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   Definitely





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1349250607


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   https://github.com/tolleybot/arrow/pull/3





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1349255962


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -29,44 +29,50 @@ struct DecryptionConfiguration;
 namespace arrow {
 namespace dataset {
 
-/// \brief Core configuration class encapsulating parameters for high-level encryption
-/// within Parquet framework.
-///
-/// ParquetEncryptionConfig serves as a bridge, passing encryption-related
-/// parameters to appropriate components within the Parquet library. It holds references
-/// to objects defining encryption strategy, Key Management Service (KMS) configuration,
-/// and specific encryption configurations for Parquet data.
-///
-/// \member crypto_factory Shared pointer to CryptoFactory object, responsible for
-/// creating cryptographic components like encryptors and decryptors. \member
-/// kms_connection_config Shared pointer to KmsConnectionConfig object, holding
-/// configuration parameters for connecting to a Key Management Service (KMS).
-/// \member encryption_config Shared pointer to EncryptionConfiguration object, defining
-/// specific encryption settings for Parquet data, like keys for different columns.
 struct ARROW_DS_EXPORT ParquetEncryptionConfig {
   std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
   std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
   std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   Merged it





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1752607149

   Unfortunately, the Python CI is failing. It seems that Cython 3.0.3 _really_ does not want us to use the compile-time `IF` construct, for example:
   ```
     FAILED: CMakeFiles/_dataset_parquet_pyx /arrow/python/build/temp.linux-x86_64-cpython-39/CMakeFiles/_dataset_parquet_pyx
     cd /arrow/python/build/temp.linux-x86_64-cpython-39 && /opt/conda/envs/arrow/bin/python -m cython --cplus --gdb --warning-errors -E PARQUET_ENCRYPTION_ENABLED=1 --directive embedsignature=True --working /arrow/python --output-file /arrow/python/build/temp.linux-x86_64-cpython-39/_dataset_parquet.cpp /arrow/python/pyarrow/_dataset_parquet.pyx
   
     Error compiling Cython file:
     ------------------------------------------------------------
     ...
     from pyarrow.includes.libarrow cimport *
     from pyarrow.includes.libarrow_dataset cimport *
     from pyarrow.includes.libarrow_dataset_parquet cimport *
     from pyarrow._fs cimport FileSystem
   
     IF PARQUET_ENCRYPTION_ENABLED:
     ^
     ------------------------------------------------------------
   
     pyarrow/_dataset_parquet.pyx:36:0: The 'IF' statement is deprecated and will be removed in a future Cython version. Consider using runtime conditions or C macros instead. See https://github.com/cython/cython/issues/4310
   ```
   
   https://github.com/apache/arrow/actions/runs/6435881563/job/17519423323?pr=34616#step:6:3833
   
   @jorisvandenbossche What do you think we should do here? Pin Cython back to 2.x and refactor the encryption code later? Or refactor it right now?




Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1752858688

   I think the "dummy version with the same interface" will have to be at the Cython level, not the C++ level.
   
   In other words, we could have both a `_dataset_parquet_encryption.pyx` and a `_dataset_parquet_no_encryption.pyx` providing the same `cdef api` functions to handle options (wrap/unwrap them, etc.).
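   
   In plain-Python terms the idea would look roughly like this (a sketch only; the real files would be Cython modules exposing `cdef api` functions, the helper name here is made up, and the build would compile exactly one of the two):
   
   ```python
   # _dataset_parquet_encryption.pyx -- built when encryption is enabled
   def set_decryption_config(scan_options, config):
       # real implementation: unwrap the config and set it on the
       # underlying C++ scan options
       ...
   
   # _dataset_parquet_no_encryption.pyx -- same interface, stub behaviour
   def set_decryption_config(scan_options, config):
       raise NotImplementedError(
           "encryption is not enabled in this build of pyarrow")
   ```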
   




[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329701359


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +72,28 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  auto parquet_decrypt_config = parquet_scan_options->parquet_decryption_config;
+
+  if (parquet_decrypt_config != nullptr) {
+    auto file_decryption_prop =
+        parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties(
+            *parquet_decrypt_config->kms_connection_config,
+            *parquet_decrypt_config->decryption_config, path, filesystem);
+
+    parquet_scan_options->reader_properties->file_decryption_properties(
+        std::move(file_decryption_prop));
+  }
+#else
+  if (parquet_scan_options->parquet_decryption_config != nullptr) {
+    throw std::runtime_error("Encryption is not supported in this build.");

Review Comment:
   We don't use exceptions in Arrow (as opposed to the base Parquet APIs), so this function should return a `Result<parquet::ReaderProperties>` instead.





[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329699037


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -395,6 +396,36 @@ TEST_F(TestParquetFileSystemDataset, WriteWithEmptyPartitioningSchema) {
   TestWriteWithEmptyPartitioningSchema();
 }
 
+TEST_F(TestParquetFileSystemDataset, WriteWithEncryptionConfigNotSupported) {
+#ifndef PARQUET_REQUIRE_ENCRYPTION
+  // Create a dummy ParquetEncryptionConfig
+  std::shared_ptr<ParquetEncryptionConfig> encryption_config =
+      std::make_shared<ParquetEncryptionConfig>();
+
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+  std::cout << "A" << std::endl;
+  ASSERT_NE(options, nullptr) << "Failed to cast to ParquetFileWriteOptions";

Review Comment:
   Same for other accesses to `std::cout` below.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1338910693


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -33,6 +33,12 @@ from pyarrow.includes.libarrow_dataset cimport *
 from pyarrow.includes.libarrow_dataset_parquet cimport *
 from pyarrow._fs cimport FileSystem
 
+IF PARQUET_ENCRYPTION_ENABLED:

Review Comment:
   > Oh, interesting. I didn't know that Cython allowed compile-time conditionals and that definitely makes things easier here :-)
   
   Well, they are planning to deprecate it (https://github.com/cython/cython/issues/4310). But as long as that discussion isn't settled, we can happily use this IMO





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1162941467


##########
cpp/src/parquet/encryption/dataset_encryption_config.h:
##########
@@ -0,0 +1,31 @@
+#pragma once
+
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet {
+namespace encryption {
+
+struct PARQUET_EXPORT DatasetEncryptionConfiguration {

Review Comment:
   I moved dataset_encryption_config.h to src/arrow/dataset/parquet_encryption_config.h. In addition, I changed the namespace to arrow::dataset and the export macro to ARROW_DS_EXPORT.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1163134049


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/testing/gtest_util.h"
+#include "gtest/gtest.h"
+
+#include <arrow/api.h>
+#include <arrow/dataset/api.h>
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+const char dsFooterMasterKey[] = "0123456789012345";
+const char dsFooterMasterKeyId[] = "footer_key";
+const char* const dsColumnMasterKeys[] = {"1234567890123450"};
+const char* const dsColumnMasterKeyIds[] = {"col_key"};
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  std::unique_ptr<arrow::internal::TemporaryDir> temp_dir_;
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset_;
+  std::string footer_key_name_ = "footer_key";
+
+  ::parquet::encryption::DatasetEncryptionConfiguration dataset_encryption_config_;
+  ::parquet::encryption::DatasetDecryptionConfiguration dataset_decryption_config_;
+  std::string column_key_mapping_;
+  ::parquet::encryption::KmsConnectionConfig kms_connection_config_;
+  std::shared_ptr<::parquet::encryption::CryptoFactory> crypto_factory_;
+  std::shared_ptr<ParquetFileFormat> file_format_;
+  std::shared_ptr<::arrow::fs::FileSystem> file_system_;
+
+  /** setup the test
+   *
+   */
+  void SetUp() {

Review Comment:
   I get an error: "failed with IOError: Encrypted column part not in file schema". "part" is what I used both as the encryption column and as the partition schema field.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1162919658


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -31,6 +31,8 @@
 #include "arrow/dataset/type_fwd.h"
 #include "arrow/dataset/visibility.h"
 #include "arrow/io/caching.h"
+#include "parquet/encryption/dataset_encryption_config.h"
+#include "parquet/encryption/dataset_encryption_config.h"

Review Comment:
   Done





[GitHub] [arrow] westonpace commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1170274093


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,33 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+   
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<parquet::encryption::DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<parquet::encryption::DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+    return dataset_decryption_config_;
+  }

Review Comment:
   I've filed #35211 for the follow-up.





[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1528120166

   I'll fix it




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1249753263


##########
cpp/src/arrow/dataset/CMakeLists.txt:
##########
@@ -145,7 +145,6 @@ function(ADD_ARROW_DATASET_TEST REL_TEST_NAME)
                  EXTRA_LINK_LIBS
                  ${ARROW_DATASET_TEST_LINK_LIBS}
                  PREFIX
-                 ${PREFIX}

Review Comment:
   I'll take a look





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253520985


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,388 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view ds_footer_master_key = "0123456789012345";
+constexpr std::string_view ds_footer_master_key_id = "footer_key";
+constexpr std::string_view ds_column_master_keys[] = {"1234567890123450"};
+constexpr std::string_view ds_column_master_key_ids[] = {"col_key"};
+const int ds_num_columns = 1;
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // setup the test
+  void SetUp() {}
+
+  // Create our parquetfileformat with encryption properties
+  std::shared_ptr<ParquetFileFormat> CreateFileFormat(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, std::string_view footer_id, std::string_view footer_key,
+      std::string_view footer_key_name = "footer_key",
+      std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    // create our Parquet file format object
+    auto file_format = std::make_shared<ParquetFileFormat>();
+
+    file_format->SetDatasetEncryptionConfig(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config));
+    file_format->SetDatasetDecryptionConfig(
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+
+    return file_format;
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    std::shared_ptr<::arrow::dataset::InMemoryDataset> dataset =
+        std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+
+    return dataset;
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+
+  // Write dataset to disk with encryption
+  void WriteReadDatasetWithEncryption() {

Review Comment:
   done





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254657664


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +137,40 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_encryption_config_;
+#else
+    return NULLPTR;
+#endif
+  }
+  /// \brief A getter function to retrieve the dataset decryption configuration
+  std::shared_ptr<DatasetDecryptionConfiguration> GetDatasetDecryptionConfig() const {
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+    return dataset_decryption_config_;
+#else
+    return NULLPTR;
+#endif

Review Comment:
   Ok, let me give that a try





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1261840015


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   Per the document 
   "Add a new method MakeWriterProperties() called by ParquetFileFormat::MakeWriter() to create writer_properties with file_encryption_properties per file based on the dataset_encryption_config , if it is not NULL"
   
   So I am assuming MakeWriterProperties(...) would use the ParquetFragmentScanOptions to retrieve the DatasetEncryptionConfiguration for that method.
   
   I appreciate the feedback.  It helps to ensure I'm on the same page before breaking what I have in place :).
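   
   As a rough sketch (the signature and member names here are assumptions drawn from the design doc, not final code), MakeWriterProperties() could look something like this:
   
   ```cpp
   // Hypothetical sketch: derive per-file writer properties from the
   // dataset-level encryption config; fall back to the plain writer
   // properties when no config is set.
   std::shared_ptr<parquet::WriterProperties> MakeWriterProperties(
       const ParquetFileWriteOptions& options,
       const std::shared_ptr<DatasetEncryptionConfiguration>& config) {
     if (config == nullptr) {
       return options.writer_properties;
     }
     // CryptoFactory resolves the high-level EncryptionConfiguration into
     // concrete FileEncryptionProperties.
     auto file_encryption_properties =
         config->crypto_factory->GetFileEncryptionProperties(
             *config->kms_connection_config, *config->encryption_config);
     // Copy the existing writer properties and attach the encryption
     // properties (using the new Builder(const WriterProperties&) ctor).
     parquet::WriterProperties::Builder builder(*options.writer_properties);
     builder.encryption(file_encryption_properties);
     return builder.build();
   }
   ```
   
   MakeWriter() could then call this instead of using `parquet_options->writer_properties` directly.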
   





[GitHub] [arrow] westonpace commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1261979449


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -136,6 +138,20 @@ class ARROW_DS_EXPORT ParquetFileFormat : public FileFormat {
       fs::FileLocator destination_locator) const override;
 
   std::shared_ptr<FileWriteOptions> DefaultWriteOptions() override;
+
+  /// \brief A getter function to retrieve the dataset encryption configuration
+  std::shared_ptr<DatasetEncryptionConfiguration> GetDatasetEncryptionConfig() const {
+    return dataset_encryption_config_;
+  }
+  /// \brief A setter for DatasetEncryptionConfiguration
+  void SetDatasetEncryptionConfig(
+      std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config) {
+    dataset_encryption_config_ = std::move(dataset_encryption_config);
+  }
+
+ private:
+  // A configuration structure that provides per file encryption properties for a dataset
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encryption_config_ = NULLPTR;

Review Comment:
   The user could (ideally) specify the write-options on each call to write_dataset this way:
   
   ```
   write_opts = ds.ParquetFileFormat().make_write_options(encryption=de)
   ds.write_dataset(..., file_options=write_opts, ...)
   ```
   
   This will require some additional code [here](https://github.com/apache/arrow/blob/apache-arrow-12.0.1/python/pyarrow/_dataset_parquet.pyx#L529-L581).  These write options are then eventually accessible [here](https://github.com/apache/arrow/pull/34616/files#diff-c49cc60c1aa3a00a84b72c89ce2c8f5eb5d034ed8383741def1c4ae377d62854R655) (as `parquet_options`) and so I think you can even simplify that logic.
   
   It looks like we don't have any real way for the user to set the default write options today.  We could probably add logic in `_dataset_parquet.pyx` in `ParquetFileFormat.make_write_options` to check and see if the default read options have an encryption config and, if they do, to use that.  Longer term it might make more sense to have a `default_write_options` which is the same as the `default_fragment_scan_options`.  For this PR I think it would be ok if there was no default and the user has to supply `file_options` to each call to `write_dataset` if they want to use encryption?





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1253802007


##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -28,6 +28,7 @@
 
 #include "arrow/dataset/discovery.h"
 #include "arrow/dataset/file_base.h"
+#include "arrow/dataset/parquet_encryption_config.h"

Review Comment:
   Ok





[GitHub] [arrow] wgtmac commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1626602758

   > @wgtmac The original inspiration for the DatasetEncryptionConfiguration & DatasetDecryptionConfiguration came from the google doc https://docs.google.com/document/d/13EysCNC6-Nu9wnJ8YpdzmD-aMLn4i2KXUJTNqIihy7A/edit#heading=h.wdoy8yup3nk9 I am not sure if the two structures were defined originally for further expansion or for other reasons I am not aware of.
   > 
   > I'll go ahead and merge the two.
   
   Thanks for the info! 
   
   @westonpace Could you please confirm that the current design and use of `DatasetEncryptionConfiguration` & `DatasetDecryptionConfiguration` is good enough? I am not that familiar with the dataset API.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1254756535


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -621,11 +639,38 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
 
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
-  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+  std::unique_ptr<parquet::arrow::FileWriter> parquet_writer = NULLPTR;
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  std::shared_ptr<DatasetEncryptionConfiguration> dataset_encrypt_config =
+      GetDatasetEncryptionConfig();
+
+  if (dataset_encrypt_config != nullptr) {

Review Comment:
   Looks like I need to keep that check, as dataset_encrypt_config can be null.





[GitHub] [arrow] kou commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1619292297

   +1 for the CMake related part.
   




[GitHub] [arrow] wgtmac commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1266970730


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create the dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);

Review Comment:
   ```suggestion
     auto parquet_file_write_options =
         internal::checked_pointer_cast<ParquetFileWriteOptions>(
             file_format->DefaultWriteOptions());
   ```



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create the dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out */
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+  // ----- Read Single File -----

Review Comment:
   It seems that the lines above overlap a lot with the previous test. Could we make a new utility function, at least for the write path?
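   
   For instance, something along these lines (a sketch only, with assumed names; it reuses `BuildTable()` and `kBaseDir` from the fixture):
   
   ```cpp
   // Hypothetical helper: write BuildTable() to a fresh mock filesystem,
   // hive-partitioned on "part", and return the filesystem so each test
   // can continue with its own read path.
   Result<std::shared_ptr<fs::internal::MockFileSystem>> WriteDataset(
       const std::shared_ptr<ParquetFileWriteOptions>& parquet_file_write_options) {
     fs::TimePoint mock_now = std::chrono::system_clock::now();
     ARROW_ASSIGN_OR_RAISE(auto file_system,
                           fs::internal::MockFileSystem::Make(mock_now, {}));
     RETURN_NOT_OK(file_system->CreateDir(""));
   
     auto partitioning = std::make_shared<HivePartitioning>(
         schema({field("part", utf8())}));
   
     auto dataset = BuildTable();
     ARROW_ASSIGN_OR_RAISE(auto scanner_builder, dataset->NewScan());
     ARROW_ASSIGN_OR_RAISE(auto scanner, scanner_builder->Finish());
   
     FileSystemDatasetWriteOptions write_options;
     write_options.file_write_options = parquet_file_write_options;
     write_options.filesystem = file_system;
     write_options.base_dir = std::string(kBaseDir);
     write_options.partitioning = partitioning;
     write_options.basename_template = "part{i}.parquet";
     RETURN_NOT_OK(FileSystemDataset::Write(write_options, scanner));
     return std::dynamic_pointer_cast<fs::internal::MockFileSystem>(file_system);
   }
   ```
   
   Each test would then only differ in its read path.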



##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,402 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/array/builder_primitive.h"
+#include "arrow/builder.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kColumnMasterKeys[] = {"1234567890123450"};
+constexpr std::string_view kColumnMasterKeysIds[] = {"col_key"};
+constexpr std::string_view kBaseDir = "";
+const int kNumColumns = 1;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ protected:
+  // Create the dataset encryption and decryption configurations
+  std::pair<std::shared_ptr<DatasetEncryptionConfiguration>,
+            std::shared_ptr<DatasetDecryptionConfiguration>>
+  CreateDatasetEncryptionConfig(const std::string_view* column_ids,
+                                const std::string_view* column_keys, int num_columns,
+                                std::string_view footer_id, std::string_view footer_key,
+                                std::string_view footer_key_name = "footer_key",
+                                std::string_view column_key_mapping = "col_key: a") {
+    auto key_list =
+        BuildKeyMap(column_ids, column_keys, num_columns, footer_id, footer_key);
+
+    auto crypto_factory = SetupCryptoFactory(/*wrap_locally=*/true, key_list);
+
+    auto encryption_config =
+        std::make_shared<::parquet::encryption::EncryptionConfiguration>(
+            std::string(footer_key_name));
+    encryption_config->column_keys = column_key_mapping;
+    if (footer_key_name.size() > 0) {
+      encryption_config->footer_key = footer_key_name;
+    }
+
+    // DatasetEncryptionConfiguration
+    DatasetEncryptionConfiguration dataset_encryption_config;
+    dataset_encryption_config.crypto_factory = crypto_factory;
+    dataset_encryption_config.encryption_config = encryption_config;
+
+    // DatasetDecryptionConfiguration
+    DatasetDecryptionConfiguration dataset_decryption_config;
+    dataset_decryption_config.crypto_factory = crypto_factory;
+
+    return std::make_pair(
+        std::make_shared<DatasetEncryptionConfiguration>(dataset_encryption_config),
+        std::make_shared<DatasetDecryptionConfiguration>(dataset_decryption_config));
+  }
+
+  // utility to build the key map
+  std::unordered_map<std::string, std::string> BuildKeyMap(
+      const std::string_view* column_ids, const std::string_view* column_keys,
+      int num_columns, const std::string_view& footer_id,
+      const std::string_view& footer_key) {
+    std::unordered_map<std::string, std::string> key_map;
+    // add column keys
+    for (int i = 0; i < num_columns; i++) {
+      key_map.insert({std::string(column_ids[i]), std::string(column_keys[i])});
+    }
+    // add footer key
+    key_map.insert({std::string(footer_id), std::string(footer_key)});
+
+    return key_map;
+  }
+
+  // A utility function to validate our files were written out
+  void ValidateFilesExist(const std::shared_ptr<arrow::fs::internal::MockFileSystem>& fs,
+                          const std::vector<std::string>& files) {
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, fs->GetFileInfo(file_path));
+
+      ASSERT_NE(result.type(), arrow::fs::FileType::NotFound);
+    }
+  }
+
+  // Build a dummy table
+  std::shared_ptr<::arrow::dataset::InMemoryDataset> BuildTable() {
+    // Create an Arrow Table
+    auto schema = arrow::schema(
+        {arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
+         arrow::field("c", arrow::int64()), arrow::field("part", arrow::utf8())});
+
+    std::vector<std::shared_ptr<arrow::Array>> arrays(4);
+    arrow::NumericBuilder<arrow::Int64Type> builder;
+    ARROW_EXPECT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[0]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[1]));
+    builder.Reset();
+    ARROW_EXPECT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
+    ARROW_EXPECT_OK(builder.Finish(&arrays[2]));
+    arrow::StringBuilder string_builder;
+    ARROW_EXPECT_OK(
+        string_builder.AppendValues({"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}));
+    ARROW_EXPECT_OK(string_builder.Finish(&arrays[3]));
+
+    auto table = arrow::Table::Make(schema, arrays);
+
+    // Write it using Datasets
+    return std::make_shared<::arrow::dataset::InMemoryDataset>(table);
+  }
+
+  // Helper function to create crypto factory and setup
+  std::shared_ptr<::parquet::encryption::CryptoFactory> SetupCryptoFactory(
+      bool wrap_locally, const std::unordered_map<std::string, std::string>& key_list) {
+    auto crypto_factory = std::make_shared<::parquet::encryption::CryptoFactory>();
+
+    std::shared_ptr<::parquet::encryption::KmsClientFactory> kms_client_factory =
+        std::make_shared<::parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            wrap_locally, key_list);
+
+    crypto_factory->RegisterKmsClientFactory(kms_client_factory);
+
+    return crypto_factory;
+  }
+};
+// Write dataset to disk with encryption
+// The aim of this test is to demonstrate the process of writing a partitioned
+// Parquet file while applying distinct file encryption properties to each
+// file within the test. This is based on the selected columns.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_out, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_out, scanner_builder_out->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner_out));
+
+  std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                    "part=c/part0.parquet", "part=d/part0.parquet",
+                                    "part=e/part0.parquet", "part=f/part0.parquet",
+                                    "part=g/part0.parquet", "part=h/part0.parquet",
+                                    "part=i/part0.parquet", "part=j/part0.parquet"};
+  ValidateFilesExist(mock_fs, files);
+
+  // ----- Read the Dataset -----
+
+  // Get FileInfo objects for all files under the base directory
+  arrow::fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  // Create a FileSystemDatasetFactory
+  arrow::dataset::FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       arrow::dataset::FileSystemDatasetFactory::Make(
+                           mock_fs, selector, file_format, factory_options));
+  // Create a Dataset
+  ASSERT_OK_AND_ASSIGN(auto dataset_in, dataset_factory->Finish());
+
+  // Define the callback function
+  std::function<arrow::Status(arrow::dataset::TaggedRecordBatch tagged_record_batch)>
+      visitor =
+          [](arrow::dataset::TaggedRecordBatch tagged_record_batch) -> arrow::Status {
+    return arrow::Status::OK();
+  };
+
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder_in, dataset_in->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner_in, scanner_builder_in->Finish());
+
+  // Scan the dataset and process the record batches using the callback function
+  arrow::Status status = scanner_in->Scan(visitor);
+
+  // Check if there was an error during iteration
+  ASSERT_OK(status);
+}
+
+// Write dataset to disk with encryption and then read in a single parquet file
+TEST_F(DatasetEncryptionTest, WriteReadSingleFile) {
+  auto [dataset_encryption_config, dataset_decryption_config] =
+      CreateDatasetEncryptionConfig(kColumnMasterKeysIds, kColumnMasterKeys, kNumColumns,
+                                    kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->SetDatasetDecryptionConfig(dataset_decryption_config);
+
+  // create our Parquet file format object
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  // update default scan options
+  file_format->default_fragment_scan_options = parquet_scan_options;
+
+  // set write options
+  auto file_write_options = file_format->DefaultWriteOptions();
+  std::shared_ptr<ParquetFileWriteOptions> parquet_file_write_options =
+      std::static_pointer_cast<ParquetFileWriteOptions>(file_write_options);
+  parquet_file_write_options->SetDatasetEncryptionConfig(dataset_encryption_config);
+
+  // create our mock file system
+  ::arrow::fs::TimePoint mock_now = std::chrono::system_clock::now();
+  ASSERT_OK_AND_ASSIGN(auto file_system,
+                       ::arrow::fs::internal::MockFileSystem::Make(mock_now, {}));
+  // create filesystem
+  ASSERT_OK(file_system->CreateDir(""));
+
+  auto mock_fs =
+      std::dynamic_pointer_cast<::arrow::fs::internal::MockFileSystem>(file_system);
+
+  auto partition_schema = ::arrow::schema({::arrow::field("part", ::arrow::utf8())});
+  auto partitioning =
+      std::make_shared<::arrow::dataset::HivePartitioning>(partition_schema);
+
+  // ----- Write the Dataset ----
+  auto dataset_out = BuildTable();
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset_out->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+  ::arrow::dataset::FileSystemDatasetWriteOptions write_options;
+  write_options.file_write_options = parquet_file_write_options;
+  write_options.filesystem = file_system;
+  write_options.base_dir = kBaseDir;
+  write_options.partitioning = partitioning;
+  write_options.basename_template = "part{i}.parquet";
+  ASSERT_OK(::arrow::dataset::FileSystemDataset::Write(write_options, scanner));
+
+  // ----- Read Single File -----
+
+  // Define the path to the encrypted Parquet file
+  std::string file_path = "part=a/part0.parquet";
+
+  auto crypto_factory = dataset_decryption_config->crypto_factory;
+
+  // Get the FileDecryptionProperties object using the CryptoFactory object
+  auto file_decryption_properties = crypto_factory->GetFileDecryptionProperties(
+      *dataset_decryption_config->kms_connection_config,
+      *dataset_decryption_config->decryption_config);
+
+  // Create the ReaderProperties object using the FileDecryptionProperties object
+  auto reader_properties = std::make_shared<parquet::ReaderProperties>();
+  reader_properties->file_decryption_properties(file_decryption_properties);
+
+  // Open the Parquet file using the MockFileSystem
+  std::shared_ptr<arrow::io::RandomAccessFile> input;
+  ASSERT_OK_AND_ASSIGN(input, mock_fs->OpenInputFile(file_path));
+
+  parquet::arrow::FileReaderBuilder reader_builder;
+  ASSERT_OK(reader_builder.Open(input, *reader_properties));
+
+  ASSERT_OK_AND_ASSIGN(auto arrow_reader, reader_builder.Build());
+
+  // Read entire file as a single Arrow table
+  std::shared_ptr<arrow::Table> table;
+  ASSERT_OK(arrow_reader->ReadTable(&table));

Review Comment:
   Should we test the correctness of the data?
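   
   Something like the following could work (a sketch; it assumes the original table built in `BuildTable()` is kept around as `expected_table`, which the fixture doesn't currently expose):
   
   ```cpp
   // Partition "part=a" should hold exactly the first row of the original
   // table; the "part" column itself is hive-partitioned away, so drop it
   // from the expected side before comparing.
   std::shared_ptr<arrow::Table> table;
   ASSERT_OK(arrow_reader->ReadTable(&table));
   ASSERT_OK_AND_ASSIGN(auto expected,
                        expected_table->Slice(0, 1)->RemoveColumn(3));
   // Chunk layout may differ after the write/read round trip.
   AssertTablesEqual(*expected, *table, /*same_chunk_layout=*/false);
   ```
   
   That would catch both decryption and data-integrity regressions in one go.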





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1267109710


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(

Review Comment:
   Ok



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,77 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "arrow/util/logging.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
+  DatasetEncryptionConfiguration()
+      : kms_connection_config(
+            std::make_shared<parquet::encryption::KmsConnectionConfig>()) {}
+
+  void Setup(
+      std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory,
+      std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config,
+      std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config) {
+    ARROW_CHECK(crypto_factory != NULLPTR);
+    ARROW_CHECK(kms_connection_config != NULLPTR);
+    ARROW_CHECK(encryption_config != NULLPTR);
+    this->crypto_factory = std::move(crypto_factory);
+    this->kms_connection_config = std::move(kms_connection_config);
+    this->encryption_config = std::move(encryption_config);
+  }
+
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+};
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT DatasetDecryptionConfiguration {
+  DatasetDecryptionConfiguration()

Review Comment:
   Ok





[GitHub] [arrow] github-actions[bot] commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1696123196

   Revision: 7812921acbe5104eb1c74105389b9a70ead16d31
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-da15a51704](https://github.com/ursacomputing/crossbow/branches/all?query=actions-da15a51704)
   
   |Task|Status|
   |----|------|
   |test-conda-python-3.10-pandas-latest|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-da15a51704-github-test-conda-python-3.10-pandas-latest)](https://github.com/ursacomputing/crossbow/actions/runs/6003326499/job/16281542940)|




[GitHub] [arrow] tolleybot commented on pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on PR #34616:
URL: https://github.com/apache/arrow/pull/34616#issuecomment-1696313283

   > AFAIU, it seems as of commit https://github.com/apache/arrow/commit/7812921acbe5104eb1c74105389b9a70ead16d31 we are still getting run-time errors.
   >
   > From https://github.com/ursacomputing/crossbow/actions/runs/6003326499/job/16281542940 :
   >
   > ```
   > opt/conda/envs/arrow/lib/python3.10/site-packages/pyarrow/parquet/encryption.py:19: in <module>
   >     from pyarrow._parquet_encryption import (CryptoFactory,   # noqa
   > E   ModuleNotFoundError: No module named 'pyarrow._parquet_encryption'
   > ```
   >
   > It is still failing to run with this configuration:
   >
   > ```
   > --   ARROW_DATASET=ON [default=OFF]
   > --       Build the Arrow Dataset Modules
   > --   ARROW_PARQUET=ON [default=OFF]
   > --       Build the Parquet libraries
   > --   PARQUET_REQUIRE_ENCRYPTION=OFF [default=OFF]
   > --       Build support for encryption. Fail if OpenSSL is not found
   >
   > # Enable/disable optional PyArrow components
   > export PYARROW_WITH_PARQUET=ON
   > export PYARROW_WITH_DATASET=ON
   > export PYARROW_WITH_PARQUET_ENCRYPTION=OFF
   > ```
   
   I'll take a look at what's going on.




[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1314029217


##########
python/CMakeLists.txt:
##########
@@ -336,6 +343,8 @@ if(PYARROW_BUILD_PARQUET_ENCRYPTION)
   else()
     message(FATAL_ERROR "You must build Arrow C++ with PARQUET_REQUIRE_ENCRYPTION=ON")
   endif()
+else()
+  set(CYTHON_COMPILE_TIME_ENV "PARQUET_ENCRYPTION_ENABLED=0")
 endif()

Review Comment:
   Ok, I just need to move my append lower down in the CMakeLists.txt. It seems something is resetting CYTHON_FLAGS after I set it.





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1306215901


##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -729,6 +825,14 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
             other.thrift_container_size_limit)
         return attrs == other_attrs
 
+    def SetParquetDecryptionConfig(self, ParquetDecryptionConfig config):
+        cdef shared_ptr[CParquetDecryptionConfig] c_config
+        if not isinstance(config, ParquetDecryptionConfig):
+            raise ValueError("config must be a ParquetDecryptionConfig")
+        self._parquet_decryption_config = config
+        c_config = config.unwrap()
+        self.parquet_options.parquet_decryption_config = c_config

Review Comment:
   I removed the SetParquetDecryptionConfig





[GitHub] [arrow] pitrou commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1317161077


##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -622,10 +640,36 @@ Result<std::shared_ptr<FileWriter>> ParquetFileFormat::MakeWriter(
   auto parquet_options = checked_pointer_cast<ParquetFileWriteOptions>(options);
 
   std::unique_ptr<parquet::arrow::FileWriter> parquet_writer;
-  ARROW_ASSIGN_OR_RAISE(parquet_writer, parquet::arrow::FileWriter::Open(
-                                            *schema, default_memory_pool(), destination,
-                                            parquet_options->writer_properties,
-                                            parquet_options->arrow_writer_properties));
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION

Review Comment:
   Hmm, so what happens if `PARQUET_REQUIRE_ENCRYPTION` is false and the user sets a `parquet_encrypt_config`? We should not silently produce unencrypted files, so it should perhaps return a `Status::NotImplemented`.
   
   Also, it should probably be tested somewhere.
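
   For illustration, a minimal standalone sketch of the guard being suggested here, mirroring the `Status::NotImplemented` pattern proposed for the decryption path later in this thread. The helper name `CheckEncryptionSupported` is hypothetical, not the PR's actual code:
   
   ```cpp
   // Hypothetical sketch: fail loudly instead of silently writing plaintext
   // when the build lacks Parquet encryption support.
   #include <memory>
   
   #include "arrow/dataset/parquet_encryption_config.h"
   #include "arrow/status.h"
   
   namespace arrow::dataset {
   
   Status CheckEncryptionSupported(
       const std::shared_ptr<ParquetEncryptionConfig>& config) {
   #ifndef PARQUET_REQUIRE_ENCRYPTION
     if (config != nullptr) {
       return Status::NotImplemented("Encryption is not supported in this build.");
     }
   #endif
     return Status::OK();
   }
   
   }  // namespace arrow::dataset
   ```
   
   MakeWriter could call such a helper before constructing the Parquet writer, so a configured `parquet_encryption_config` is rejected rather than silently ignored.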





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1336198058


##########
python/pyarrow/_parquet_encryption.pxd:
##########
@@ -131,3 +131,22 @@ cdef extern from "arrow/python/parquet_encryption.h" \
             SafeGetFileDecryptionProperties(
             const CKmsConnectionConfig& kms_connection_config,
             const CDecryptionConfiguration& decryption_config)
+
+cdef extern from "arrow/dataset/parquet_encryption_config.h" namespace "arrow::dataset" nogil:
+    cdef cppclass CParquetEncryptionConfig "arrow::dataset::ParquetEncryptionConfig":
+        CParquetEncryptionConfig() except +
+        void Setup(shared_ptr[CCryptoFactory] crypto_factory,
+                   shared_ptr[CKmsConnectionConfig] kms_connection_config,
+                   shared_ptr[CEncryptionConfiguration] encryption_config)
+
+    cdef cppclass CParquetDecryptionConfig "arrow::dataset::ParquetDecryptionConfig":
+        CParquetDecryptionConfig() except +
+        void Setup(shared_ptr[CCryptoFactory] crypto_factory,
+                   shared_ptr[CKmsConnectionConfig] kms_connection_config,
+                   shared_ptr[CDecryptionConfiguration] decryption_config)
+
+
+cdef public shared_ptr[CCryptoFactory] pyarrow_unwrap_cryptofactory(object crypto_factory)

Review Comment:
   Ok, I moved those functions into the CParquetDecryptionConfig and CParquetEncryptionConfig classes





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1329108082


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"

Review Comment:
   I've updated this to use forward declarations. I didn't move the definitions into file_parquet.h, to avoid adding encryption definitions to that header





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342903182


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"

Review Comment:
   Ok, I removed both of those.





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342951374


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]
+     ])"});
+
+    // Use a Hive-style partitioning scheme.
+    partitioning = std::make_shared<HivePartitioning>(schema({field("part", utf8())}));
+
+    // Prepare encryption properties.
+    std::unordered_map<std::string, std::string> key_map;
+    key_map.emplace(kColumnMasterKeysId, kColumnMasterKey);
+    key_map.emplace(kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+    crypto_factory = std::make_shared<parquet::encryption::CryptoFactory>();
+    auto kms_client_factory =
+        std::make_shared<parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            /*wrap_locally=*/true, key_map);
+    crypto_factory->RegisterKmsClientFactory(std::move(kms_client_factory));
+    kms_connection_config = std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    // Set write options with encryption configuration.
+    auto encryption_config =
+        std::make_shared<parquet::encryption::EncryptionConfiguration>(
+            std::string(kFooterKeyName));
+    encryption_config->column_keys = kColumnKeyMapping;
+    auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
+    parquet_encryption_config->Setup(crypto_factory, kms_connection_config,
+                                     std::move(encryption_config));
+
+    auto file_format = std::make_shared<ParquetFileFormat>();
+    auto parquet_file_write_options =
+        checked_pointer_cast<ParquetFileWriteOptions>(file_format->DefaultWriteOptions());
+    parquet_file_write_options->parquet_encryption_config =
+        std::move(parquet_encryption_config);
+
+    // Write dataset.
+    auto dataset = std::make_shared<InMemoryDataset>(table);
+    EXPECT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+    EXPECT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = kBaseDir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(FileSystemDataset::Write(write_options, std::move(scanner)));
+
+    // Verify that the files exist
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, file_system->GetFileInfo(file_path));
+      ASSERT_EQ(result.type(), fs::FileType::File);
+    }
+  }
+
+ protected:
+  inline static std::shared_ptr<fs::FileSystem> file_system;
+  inline static std::shared_ptr<Table> table;
+  inline static std::shared_ptr<HivePartitioning> partitioning;
+  inline static std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  inline static std::shared_ptr<parquet::encryption::KmsConnectionConfig>
+      kms_connection_config;
+};
+
+// This test demonstrates the process of writing a partitioned Parquet file with the same
+// encryption properties applied to each file within the dataset. The encryption
+// properties are determined based on the selected columns. After writing the dataset, the
+// test reads the data back and verifies that it can be successfully decrypted and
+// scanned.
+TEST_F(DatasetEncryptionTest, WriteReadDatasetWithEncryption) {
+  // Create decryption properties.
+  auto decryption_config =
+      std::make_shared<parquet::encryption::DecryptionConfiguration>();
+  auto parquet_decryption_config = std::make_shared<ParquetDecryptionConfig>();
+  parquet_decryption_config->Setup(crypto_factory, kms_connection_config,
+                                   std::move(decryption_config));
+
+  // Set scan options.
+  auto parquet_scan_options = std::make_shared<ParquetFragmentScanOptions>();
+  parquet_scan_options->parquet_decryption_config = std::move(parquet_decryption_config);
+
+  auto file_format = std::make_shared<ParquetFileFormat>();
+  file_format->default_fragment_scan_options = std::move(parquet_scan_options);
+
+  // Get FileInfo objects for all files under the base directory
+  fs::FileSelector selector;
+  selector.base_dir = kBaseDir;
+  selector.recursive = true;
+
+  FileSystemFactoryOptions factory_options;
+  factory_options.partitioning = partitioning;
+  factory_options.partition_base_dir = kBaseDir;
+  ASSERT_OK_AND_ASSIGN(auto dataset_factory,
+                       FileSystemDatasetFactory::Make(file_system, selector, file_format,
+                                                      factory_options));
+
+  // Read dataset into table
+  ASSERT_OK_AND_ASSIGN(auto dataset, dataset_factory->Finish());
+  ASSERT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+  ASSERT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+  ASSERT_OK_AND_ASSIGN(auto read_table, scanner->ToTable());
+
+  // Verify the data was read correctly
+  ASSERT_OK_AND_ASSIGN(auto combined_table, read_table->CombineChunks());

Review Comment:
   done





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342947063


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]
+     ])"});
+
+    // Use a Hive-style partitioning scheme.
+    partitioning = std::make_shared<HivePartitioning>(schema({field("part", utf8())}));
+
+    // Prepare encryption properties.
+    std::unordered_map<std::string, std::string> key_map;
+    key_map.emplace(kColumnMasterKeysId, kColumnMasterKey);
+    key_map.emplace(kFooterKeyMasterKeyId, kFooterKeyMasterKey);
+
+    crypto_factory = std::make_shared<parquet::encryption::CryptoFactory>();
+    auto kms_client_factory =
+        std::make_shared<parquet::encryption::TestOnlyInMemoryKmsClientFactory>(
+            /*wrap_locally=*/true, key_map);
+    crypto_factory->RegisterKmsClientFactory(std::move(kms_client_factory));
+    kms_connection_config = std::make_shared<parquet::encryption::KmsConnectionConfig>();
+
+    // Set write options with encryption configuration.
+    auto encryption_config =
+        std::make_shared<parquet::encryption::EncryptionConfiguration>(
+            std::string(kFooterKeyName));
+    encryption_config->column_keys = kColumnKeyMapping;
+    auto parquet_encryption_config = std::make_shared<ParquetEncryptionConfig>();
+    parquet_encryption_config->Setup(crypto_factory, kms_connection_config,
+                                     std::move(encryption_config));
+
+    auto file_format = std::make_shared<ParquetFileFormat>();
+    auto parquet_file_write_options =
+        checked_pointer_cast<ParquetFileWriteOptions>(file_format->DefaultWriteOptions());
+    parquet_file_write_options->parquet_encryption_config =
+        std::move(parquet_encryption_config);
+
+    // Write dataset.
+    auto dataset = std::make_shared<InMemoryDataset>(table);
+    EXPECT_OK_AND_ASSIGN(auto scanner_builder, dataset->NewScan());
+    EXPECT_OK_AND_ASSIGN(auto scanner, scanner_builder->Finish());
+
+    FileSystemDatasetWriteOptions write_options;
+    write_options.file_write_options = parquet_file_write_options;
+    write_options.filesystem = file_system;
+    write_options.base_dir = kBaseDir;
+    write_options.partitioning = partitioning;
+    write_options.basename_template = "part{i}.parquet";
+    ASSERT_OK(FileSystemDataset::Write(write_options, std::move(scanner)));
+
+    // Verify that the files exist
+    std::vector<std::string> files = {"part=a/part0.parquet", "part=b/part0.parquet",
+                                      "part=c/part0.parquet", "part=d/part0.parquet",
+                                      "part=e/part0.parquet", "part=f/part0.parquet",
+                                      "part=g/part0.parquet", "part=h/part0.parquet",
+                                      "part=i/part0.parquet", "part=j/part0.parquet"};
+    for (const auto& file_path : files) {
+      ASSERT_OK_AND_ASSIGN(auto result, file_system->GetFileInfo(file_path));
+      ASSERT_EQ(result.type(), fs::FileType::File);
+    }
+  }
+
+ protected:
+  inline static std::shared_ptr<fs::FileSystem> file_system;

Review Comment:
   I updated this





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342974480


##########
cpp/src/arrow/dataset/dataset_encryption_test.cc:
##########
@@ -0,0 +1,216 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <string_view>
+
+#include "gtest/gtest.h"
+
+#include "arrow/api.h"
+#include "arrow/dataset/api.h"
+#include "arrow/dataset/parquet_encryption_config.h"
+#include "arrow/dataset/partition.h"
+#include "arrow/filesystem/mockfs.h"
+#include "arrow/io/api.h"
+#include "arrow/status.h"
+#include "arrow/table.h"
+#include "arrow/testing/gtest_util.h"
+#include "parquet/arrow/reader.h"
+#include "parquet/encryption/test_in_memory_kms.h"
+
+constexpr std::string_view kFooterKeyMasterKey = "0123456789012345";
+constexpr std::string_view kFooterKeyMasterKeyId = "footer_key";
+constexpr std::string_view kFooterKeyName = "footer_key";
+constexpr std::string_view kColumnMasterKey = "1234567890123450";
+constexpr std::string_view kColumnMasterKeysId = "col_key";
+constexpr std::string_view kColumnKeyMapping = "col_key: a";
+constexpr std::string_view kBaseDir = "";
+
+using arrow::internal::checked_pointer_cast;
+
+namespace arrow {
+namespace dataset {
+
+class DatasetEncryptionTest : public ::testing::Test {
+ public:
+  // This function creates a mock file system using the current time point, creates a
+  // directory with the given base directory path, and writes a dataset to it using
+  // provided Parquet file write options. The dataset is partitioned using a Hive
+  // partitioning scheme. The function also checks if the written files exist in the file
+  // system.
+  static void SetUpTestSuite() {
+    // Creates a mock file system using the current time point.
+    EXPECT_OK_AND_ASSIGN(file_system, fs::internal::MockFileSystem::Make(
+                                          std::chrono::system_clock::now(), {}));
+    ASSERT_OK(file_system->CreateDir(std::string(kBaseDir)));
+
+    // Prepare table data.
+    auto table_schema = schema({field("a", int64()), field("b", int64()),
+                                field("c", int64()), field("part", utf8())});
+    table = TableFromJSON(table_schema, {R"([
+       [ 0, 9, 1, "a" ],
+       [ 1, 8, 2, "b" ],
+       [ 2, 7, 1, "c" ],
+       [ 3, 6, 2, "d" ],
+       [ 4, 5, 1, "e" ],
+       [ 5, 4, 2, "f" ],
+       [ 6, 3, 1, "g" ],
+       [ 7, 2, 2, "h" ],
+       [ 8, 1, 1, "i" ],
+       [ 9, 0, 2, "j" ]

Review Comment:
   Ok, I updated this to add some non-unique part values
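
   For illustration, a hedged guess at what the revised rows could look like (the exact values in the merged PR may differ); the point is that several rows now share a `part` value, so each partition directory receives more than one row:
   
   ```cpp
   // Hypothetical revision of the table built in SetUpTestSuite above:
   // "part" values repeat, so e.g. part=a holds two rows instead of one.
   table = TableFromJSON(table_schema, {R"([
       [ 0, 9, 1, "a" ],
       [ 1, 8, 2, "a" ],
       [ 2, 7, 1, "c" ],
       [ 3, 6, 2, "c" ],
       [ 4, 5, 1, "e" ],
       [ 5, 4, 2, "e" ],
       [ 6, 3, 1, "g" ],
       [ 7, 2, 2, "g" ],
       [ 8, 1, 1, "i" ],
       [ 9, 0, 2, "i" ]
     ])"});
   ```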





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1342975249


##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,65 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet::encryption {
+class CryptoFactory;
+struct KmsConnectionConfig;
+struct EncryptionConfiguration;
+struct DecryptionConfiguration;
+}  // namespace parquet::encryption
+
+namespace arrow {
+namespace dataset {
+
+/// Core class that translates the parameters of high-level encryption
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {

Review Comment:
   Sure thing





[GitHub] [arrow] tolleybot commented on a diff in pull request #34616: GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

Posted by "tolleybot (via GitHub)" <gi...@apache.org>.
tolleybot commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1160936373


##########
cpp/src/parquet/encryption/dataset_encryption_config.h:
##########
@@ -0,0 +1,31 @@
+#pragma once
+
+#include "parquet/encryption/crypto_factory.h"
+#include "parquet/encryption/encryption.h"
+#include "parquet/encryption/kms_client.h"
+#include "arrow/dataset/type_fwd.h"

Review Comment:
   Ok, this is also related to your previous comment.  I'll make that change, move the namespace to arrow::dataset, and switch PARQUET_EXPORT to ARROW_DS_EXPORT
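
   In other words, a sketch of the relocation described here (member names elided; the quoted diffs elsewhere in this thread show the real contents, and the include path for the export macro is an assumption):
   
   ```cpp
   // Hypothetical before/after of the change described above.
   // Before: declared in a parquet/ header with PARQUET_EXPORT.
   // After: declared under arrow::dataset with the dataset export macro.
   #include "arrow/dataset/visibility.h"  // assumed location of ARROW_DS_EXPORT
   
   namespace arrow::dataset {
   
   struct ARROW_DS_EXPORT DatasetEncryptionConfiguration {
     // members elided
   };
   
   }  // namespace arrow::dataset
   ```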





Re: [PR] GH-29238 [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #34616:
URL: https://github.com/apache/arrow/pull/34616#discussion_r1346108609


##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -424,6 +425,34 @@ TEST_F(TestParquetFileSystemDataset, WriteWithEmptyPartitioningSchema) {
   TestWriteWithEmptyPartitioningSchema();
 }
 
+TEST_F(TestParquetFileSystemDataset, WriteWithEncryptionConfigNotSupported) {
+#ifndef PARQUET_REQUIRE_ENCRYPTION
+  // Create a dummy ParquetEncryptionConfig
+  std::shared_ptr<ParquetEncryptionConfig> encryption_config =
+      std::make_shared<ParquetEncryptionConfig>();
+
+  auto options =
+      checked_pointer_cast<ParquetFileWriteOptions>(format_->DefaultWriteOptions());
+
+  // Set the encryption config in the options
+  options->parquet_encryption_config = encryption_config;
+
+  // Setup mock filesystem and test data
+  auto mock_fs = std::make_shared<fs::internal::MockFileSystem>(fs::kNoTime);
+  std::shared_ptr<Schema> test_schema = schema({field("x", int32())});
+  std::shared_ptr<RecordBatch> batch = RecordBatchFromJSON(test_schema, "[[0]]");
+  ASSERT_OK_AND_ASSIGN(std::shared_ptr<io::OutputStream> out_stream,
+                       mock_fs->OpenOutputStream("/foo.parquet"));
+  std::cout << "B" << std::endl;

Review Comment:
   I suppose these are debug prints; could you remove them (there's another one below)? :-)



##########
cpp/src/arrow/dataset/file_parquet.h:
##########
@@ -226,6 +229,8 @@ class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
   /// ScanOptions. Additionally, dictionary columns come from
   /// ParquetFileFormat::ReaderOptions::dict_columns.
   std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
+  /// A configuration structure that provides encryption properties for a dataset

Review Comment:
   ```suggestion
     /// A configuration structure that provides decryption properties for a dataset
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -711,6 +889,20 @@ cdef class ParquetFragmentScanOptions(FragmentScanOptions):
     cdef ArrowReaderProperties* arrow_reader_properties(self):
         return self.parquet_options.arrow_reader_properties.get()
 
+    IF PARQUET_ENCRYPTION_ENABLED:
+        @property
+        def parquet_decryption_config(self):

Review Comment:
   > PARQUET_ENCRYPTION_ENABLED is a compile-time flag that would be set intentionally, having the code throw an exception at runtime could lead to confusion.
   
   With the current code, a runtime exception would be raised anyway; it would just be a generic `AttributeError`. A specific exception could have a descriptive error message telling the user why the attributes are not available.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -67,8 +72,24 @@ parquet::ReaderProperties MakeReaderProperties(
     properties.disable_buffered_stream();
   }
   properties.set_buffer_size(parquet_scan_options->reader_properties->buffer_size());
+
+#ifdef PARQUET_REQUIRE_ENCRYPTION
+  auto parquet_decrypt_config = parquet_scan_options->parquet_decryption_config;
+
+  if (parquet_decrypt_config != nullptr) {
+    auto file_decryption_prop =
+        parquet_decrypt_config->crypto_factory->GetFileDecryptionProperties(
+            *parquet_decrypt_config->kms_connection_config,
+            *parquet_decrypt_config->decryption_config, path, filesystem);
+
+    parquet_scan_options->reader_properties->file_decryption_properties(
+        std::move(file_decryption_prop));
+  }
+#endif

Review Comment:
   Should we raise `NotImplemented` in the same way as for encryption if Parquet encryption was not enabled?
   For example (untested):
   ```suggestion
   #else
     if (parquet_scan_options->parquet_decryption_config != nullptr) {
       return Status::NotImplemented("Encryption is not supported in this build.");
     }
   #endif
   ```



##########
python/pyarrow/_dataset_parquet.pyx:
##########
@@ -56,9 +63,180 @@ from pyarrow._parquet cimport (
 
 cdef Expression _true = Expression._scalar(True)
 
-
 ctypedef CParquetFileWriter* _CParquetFileWriterPtr
 
+IF PARQUET_ENCRYPTION_ENABLED:
+    cdef class ParquetEncryptionConfig(_Weakrefable):
+        """
+        Core configuration class encapsulating parameters for high-level encryption
+        within the Parquet framework.
+
+        The ParquetEncryptionConfig class serves as a bridge for passing encryption-related
+        parameters to the appropriate components within the Parquet library. It maintains references
+        to objects that define the encryption strategy, Key Management Service (KMS) configuration,
+        and specific encryption configurations for Parquet data.
+
+        Parameters
+        ----------
+        crypto_factory : pyarrow.parquet.encryption.CryptoFactory
+            Shared pointer to a `CryptoFactory` object. The `CryptoFactory` is responsible for
+            creating cryptographic components, such as encryptors and decryptors.
+        kms_connection_config : pyarrow.parquet.encryption.KmsConnectionConfig
+            Shared pointer to a `KmsConnectionConfig` object. This object holds the configuration
+            parameters necessary for connecting to a Key Management Service (KMS).
+        encryption_config : pyarrow.parquet.encryption.EncryptionConfiguration
+            Shared pointer to an `EncryptionConfiguration` object. This object defines specific
+            encryption settings for Parquet data, including the keys assigned to different columns.
+
+        Raises
+        ------
+        ValueError
+            Raised if `encryption_config` is None.
+        """
+        cdef:
+            shared_ptr[CParquetEncryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, CryptoFactory crypto_factory, KmsConnectionConfig kms_connection_config,
+                      EncryptionConfiguration encryption_config):
+
+            cdef shared_ptr[CEncryptionConfiguration] c_encryption_config
+
+            if crypto_factory is None:
+                raise ValueError("crypto_factory cannot be None")
+
+            if kms_connection_config is None:
+                raise ValueError("kms_connection_config cannot be None")
+
+            if encryption_config is None:
+                raise ValueError("encryption_config cannot be None")
+
+            self.c_config.reset(new CParquetEncryptionConfig())
+
+            c_encryption_config = ParquetEncryptionConfig.unwrap_encryptionconfig(
+                encryption_config)
+
+            self.c_config.get().crypto_factory = ParquetEncryptionConfig.unwrap_cryptofactory(crypto_factory)
+            self.c_config.get().kms_connection_config = ParquetEncryptionConfig.unwrap_kmsconnectionconfig(
+                kms_connection_config)
+            self.c_config.get().encryption_config = c_encryption_config
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetEncryptionConfig] c_config):
+            cdef ParquetEncryptionConfig python_config = ParquetEncryptionConfig.__new__(ParquetEncryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetEncryptionConfig] unwrap(self):
+            return self.c_config
+
+        @staticmethod
+        cdef shared_ptr[CCryptoFactory] unwrap_cryptofactory(object crypto_factory) except *:
+            if isinstance(crypto_factory, CryptoFactory):
+                pycf = (<CryptoFactory> crypto_factory).unwrap()
+                return static_pointer_cast[CCryptoFactory, CPyCryptoFactory](pycf)
+            raise TypeError("Expected CryptoFactory, got %s" % type(crypto_factory))
+
+        @staticmethod
+        cdef shared_ptr[CKmsConnectionConfig] unwrap_kmsconnectionconfig(object kmsconnectionconfig):
+            if isinstance(kmsconnectionconfig, KmsConnectionConfig):
+                return (<KmsConnectionConfig> kmsconnectionconfig).unwrap()
+            raise TypeError("Expected KmsConnectionConfig, got %s" %
+                            type(kmsconnectionconfig))
+
+        @staticmethod
+        cdef shared_ptr[CEncryptionConfiguration] unwrap_encryptionconfig(object encryptionconfig):
+            if isinstance(encryptionconfig, EncryptionConfiguration):
+                return (<EncryptionConfiguration> encryptionconfig).unwrap()
+            raise TypeError("Expected EncryptionConfiguration, got %s" %
+                            type(encryptionconfig))
+
+    cdef class ParquetDecryptionConfig(_Weakrefable):
+        """
+        Core configuration class encapsulating parameters for high-level decryption
+        within the Parquet framework.
+
+        ParquetDecryptionConfig is designed to pass decryption-related parameters to
+        the appropriate decryption components within the Parquet library. It holds references to
+        objects that define the decryption strategy, Key Management Service (KMS) configuration,
+        and specific decryption configurations for reading encrypted Parquet data.
+
+        Parameters
+        ----------
+        crypto_factory : pyarrow.parquet.encryption.CryptoFactory
+            Shared pointer to a `CryptoFactory` object, pivotal in creating cryptographic
+            components for the decryption process.
+        kms_connection_config : pyarrow.parquet.encryption.KmsConnectionConfig
+            Shared pointer to a `KmsConnectionConfig` object, containing parameters necessary
+            for connecting to a Key Management Service (KMS) during decryption.
+        decryption_config : pyarrow.parquet.encryption.DecryptionConfiguration
+            Shared pointer to a `DecryptionConfiguration` object, specifying decryption settings
+            for reading encrypted Parquet data.
+
+        Raises
+        ------
+        ValueError
+            Raised if `decryption_config` is None.
+        """
+
+        cdef:
+            shared_ptr[CParquetDecryptionConfig] c_config
+
+        # Avoid mistakenly creating attributes
+        __slots__ = ()
+
+        def __cinit__(self, CryptoFactory crypto_factory, KmsConnectionConfig kms_connection_config,
+                      DecryptionConfiguration decryption_config):
+
+            cdef shared_ptr[CDecryptionConfiguration] c_decryption_config
+
+            if decryption_config is None:
+                raise ValueError(
+                    "decryption_config cannot be None")
+
+            self.c_config.reset(new CParquetDecryptionConfig())
+
+            c_decryption_config = ParquetDecryptionConfig.unwrap_decryptionconfig(
+                decryption_config)
+
+            self.c_config.get().crypto_factory = ParquetDecryptionConfig.unwrap_cryptofactory(crypto_factory)
+            self.c_config.get().kms_connection_config = ParquetDecryptionConfig.unwrap_kmsconnectionconfig(
+                kms_connection_config)
+            self.c_config.get().decryption_config = c_decryption_config
+
+        @staticmethod
+        cdef wrap(shared_ptr[CParquetDecryptionConfig] c_config):
+            cdef ParquetDecryptionConfig python_config = ParquetDecryptionConfig.__new__(ParquetDecryptionConfig)
+            python_config.c_config = c_config
+            return python_config
+
+        cdef shared_ptr[CParquetDecryptionConfig] unwrap(self):
+            return self.c_config
+
+        @staticmethod
+        cdef shared_ptr[CCryptoFactory] unwrap_cryptofactory(object crypto_factory) except *:
+            if isinstance(crypto_factory, CryptoFactory):
+                pycf = (<CryptoFactory> crypto_factory).unwrap()
+                return static_pointer_cast[CCryptoFactory, CPyCryptoFactory](pycf)
+            raise TypeError("Expected CryptoFactory, got %s" % type(crypto_factory))
+
+        @staticmethod
+        cdef shared_ptr[CKmsConnectionConfig] unwrap_kmsconnectionconfig(object kmsconnectionconfig) except *:
+            if isinstance(kmsconnectionconfig, KmsConnectionConfig):
+                return (<KmsConnectionConfig> kmsconnectionconfig).unwrap()
+            raise TypeError("Expected KmsConnectionConfig, got %s" %
+                            type(kmsconnectionconfig))
+
+        @staticmethod
+        cdef shared_ptr[CDecryptionConfiguration] unwrap_decryptionconfig(object decryptionconfig) except *:
+            if isinstance(decryptionconfig, DecryptionConfiguration):
+                return (<DecryptionConfiguration> decryptionconfig).unwrap()
+
+            raise TypeError("Expected DecryptionConfiguration, got %s" %
+                            type(decryptionconfig))

Review Comment:
   I may be mistaken, but it seems these three static methods are the same as in `ParquetEncryptionConfig`? To avoid duplication, I think you can simply make them top-level functions instead of methods.



##########
cpp/src/arrow/dataset/parquet_encryption_config.h:
##########
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "arrow/dataset/type_fwd.h"
+
+namespace parquet::encryption {
+class CryptoFactory;
+struct KmsConnectionConfig;
+struct EncryptionConfiguration;
+struct DecryptionConfiguration;
+}  // namespace parquet::encryption
+
+namespace arrow {
+namespace dataset {
+
+struct ARROW_DS_EXPORT ParquetEncryptionConfig {
+  std::shared_ptr<parquet::encryption::CryptoFactory> crypto_factory;
+  std::shared_ptr<parquet::encryption::KmsConnectionConfig> kms_connection_config;
+  std::shared_ptr<parquet::encryption::EncryptionConfiguration> encryption_config;
+  /// \brief Core configuration class encapsulating parameters for high-level encryption

Review Comment:
   Nice docstrings, thank you!


