Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/08 18:20:37 UTC

[GitHub] [arrow] nevi-me opened a new pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

nevi-me opened a new pull request #7917:
URL: https://github.com/apache/arrow/pull/7917


   This will allow Arrow-specific metadata to be preserved when writing or reading Parquet files created from C++ or Rust.
   If the schema can't be deserialised, the normal Parquet-to-Arrow schema conversion is performed.


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467497445



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       I assume the compression types needed could depend on what the Parquet source being read uses. In that case, could a user run the risk of not being able to read a file because the relevant compression feature isn't enabled? I like the `core` name; `arrow` would have been more intuitive, but I suppose it clashes with the `arrow` crate.




[GitHub] [arrow] vertexclique commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
vertexclique commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467496882



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       Would be super nice. Meanwhile, while looking at parquet in [this PR](https://github.com/apache/arrow/pull/7873), I realized that the `"snap", "brotli", "flate2", "lz4", "zstd"` part is not always needed, since these compressions are not additive features (I am not sure, we might need to check this).
   
    What about feature gating like this? WDYT:
   ```
   default = ["core"]
   core = ["arrow", "base64"]
   all = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]
   ```




[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467491183



##########
File path: rust/parquet/Cargo.toml
##########
@@ -40,6 +40,7 @@ zstd = { version = "0.5", optional = true }
 chrono = "0.4"
 num-bigint = "0.3"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT", optional = true }
+base64 = { version = "*", optional = true }

Review comment:
       I'm not pinning a specific version, as we already transitively depend on an older version; I'll change this when the feature changes are merged.




[GitHub] [arrow] nevi-me commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-674100806


   Current failures will be resolved by https://github.com/alexcrichton/socket2-rs/pull/96


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470636985



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       I'm not sure; removing the compression features and running the tests doesn't throw any failures, so I can't see what the behaviour should be.




[GitHub] [arrow] sunchao commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r471972373



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(
+    schema: &Schema,
+    props: &mut WriterProperties,
+) {
+    let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&schema);
+    let schema_len = serialized_schema.len();
+    let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+    len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);

Review comment:
       I see. Thanks!




[GitHub] [arrow] sunchao commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-674170257


   @nevi-me sg - I think this is stacked on another PR though: do we need to get the other one reviewed first?


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r472105674



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(

Review comment:
       I can change the visibility to crate-only. We'll likely only ever use the function in one place, which is why I hadn't split it.
   
    I'd also prefer replacing `Vec<KeyValue>` with a `HashMap`, partly because it's a `parquet_format` detail, and it's more convenient to work with hashmaps. I can open a JIRA for this.
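    For illustration, with a `HashMap` the overwrite becomes trivial. A minimal sketch, assuming a simplified metadata map rather than the actual `parquet_format` types:
    
    ```
    use std::collections::HashMap;
    
    // Hypothetical sketch: with HashMap-backed metadata, replacing an
    // existing ARROW:schema entry is a single insert, which overwrites
    // any previous value stored under the key.
    fn set_arrow_schema(
        metadata: &mut HashMap<String, Option<String>>,
        encoded_schema: String,
    ) {
        metadata.insert("ARROW:schema".to_string(), Some(encoded_schema));
    }
    ```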




[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470097911



##########
File path: rust/parquet/src/file/properties.rs
##########
@@ -99,7 +99,7 @@ pub struct WriterProperties {
     max_row_group_size: usize,
     writer_version: WriterVersion,
     created_by: String,
-    key_value_metadata: Option<Vec<KeyValue>>,
+    pub(crate) key_value_metadata: Option<Vec<KeyValue>>,

Review comment:
       @sunchao this was the easiest way of making the metadata mutable without a lot of effort. I tried to create a `WriterPropertiesBuilder` from the `WriterProperties`, but there wasn't a way of retaining some properties (especially if the user has specified compression/encoding options for some columns).
   
    The mutability is required in order to encode the `ARROW:schema`. I followed the C++ approach, so that if a user writes Arrow data to Parquet, the schema information is retained when they read that file back in.




[GitHub] [arrow] nevi-me closed pull request #7917: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #7917:
URL: https://github.com/apache/arrow/pull/7917


   


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467500876



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       I think that's fine: if users expect to read Parquet data which they didn't write (and thus might not know whether it's compressed), they can turn on the various compression features.




[GitHub] [arrow] github-actions[bot] commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-670959690


   https://issues.apache.org/jira/browse/ARROW-8243


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r472106770



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(
+    schema: &Schema,
+    props: &mut WriterProperties,
+) {
+    let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&schema);
+    let schema_len = serialized_schema.len();
+    let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+    len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);
+    len_prefix_schema.append((schema_len as u32).to_le_bytes().to_vec().as_mut());
+    len_prefix_schema.append(&mut serialized_schema);
+    let encoded = base64::encode(&len_prefix_schema);
+
+    let schema_kv = KeyValue {
+        key: super::ARROW_SCHEMA_META_KEY.to_string(),
+        value: Some(encoded),
+    };
+
+    let mut meta = props.key_value_metadata.clone().unwrap_or_default();
+    // check if ARROW:schema exists, and overwrite it
+    let schema_meta = meta
+        .iter()
+        .enumerate()
+        .find(|(_, kv)| kv.key.as_str() == super::ARROW_SCHEMA_META_KEY);

Review comment:
       Yes, I'm presuming that it'd only exist once. I don't expect that it'd ever be populated beforehand, so this check is somewhat redundant. If we change `Vec<KeyValue>` to a `HashMap`, it'll address this.




[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470102183



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -674,8 +712,16 @@ mod tests {
         )
         .unwrap();
 
+        let props = WriterProperties::builder()
+            .set_key_value_metadata(Some(vec![KeyValue {
+                key: "test_key".to_string(),
+                value: Some("test_value".to_string()),
+            }]))
+            .build();
+
         let file = get_temp_file("test_arrow_writer_complex.parquet", &[]);

Review comment:
       TODO: this is stateful, and might fail as this file might not exist. Create a small file for the test

##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -53,13 +56,48 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     pub fn try_new(
         writer: W,
         arrow_schema: SchemaRef,
-        props: Option<Rc<WriterProperties>>,
+        props: Option<WriterProperties>,
     ) -> Result<Self> {
         let schema = crate::arrow::arrow_to_parquet_schema(&arrow_schema)?;
+        // add serialized arrow schema
+        let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&arrow_schema);
+        let schema_len = serialized_schema.len();
+        let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+        len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);

Review comment:
       Our IPC implementation is still based on the legacy format, so we have to manually prepend the continuation marker and message length.
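    
    To illustrate, a minimal sketch of that framing (the helper name `frame_ipc_message` is hypothetical): a 4-byte continuation marker of `0xFFFFFFFF`, then the little-endian message length, then the flatbuffer bytes.
    
    ```
    // Hypothetical helper sketching the legacy IPC framing described above.
    fn frame_ipc_message(mut message: Vec<u8>) -> Vec<u8> {
        let len = message.len() as u32;
        let mut framed = Vec::with_capacity(message.len() + 8);
        framed.extend_from_slice(&[0xFF, 0xFF, 0xFF, 0xFF]); // continuation marker
        framed.extend_from_slice(&len.to_le_bytes()); // message length, little-endian
        framed.append(&mut message); // flatbuffer-encoded schema bytes
        framed
    }
    ```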

##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -55,40 +55,75 @@ pub fn parquet_to_arrow_schema_by_columns<T>(
 where
     T: IntoIterator<Item = usize>,
 {
-    let mut base_nodes = Vec::new();
-    let mut base_nodes_set = HashSet::new();
-    let mut leaves = HashSet::new();
-
-    for c in column_indices {
-        let column = parquet_schema.column(c).self_type() as *const Type;
-        let root = parquet_schema.get_column_root(c);
-        let root_raw_ptr = root as *const Type;
-
-        leaves.insert(column);
-        if !base_nodes_set.contains(&root_raw_ptr) {
-            base_nodes.push(root);
-            base_nodes_set.insert(root_raw_ptr);
+    let mut metadata = parse_key_value_metadata(key_value_metadata).unwrap_or_default();

Review comment:
       @sunchao the main change here is that if the `ARROW:schema` exists, we try to decode it and pass it back as the schema to read.
    Unlike C++, if there's a problem reading the schema, we do not return an error, but continue to reconstruct the schema with some information loss. The error is logged to the console.




[GitHub] [arrow] nevi-me commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-675177564


   @sunchao I forgot to mention, I made the changes, so this is ready for review


[GitHub] [arrow] github-actions[bot] commented on pull request #7917: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-675430328


   https://issues.apache.org/jira/browse/ARROW-8423


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470650350



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -34,16 +34,26 @@ use crate::schema::types::{ColumnDescriptor, SchemaDescriptor, Type, TypePtr};
 use arrow::datatypes::TimeUnit;
 use arrow::datatypes::{DataType, DateUnit, Field, Schema};
 
-/// Convert parquet schema to arrow schema including optional metadata.
+/// Convert Parquet schema to Arrow schema including optional metadata.
+/// Attempts to decode any existing Arrow schema metadata, falling back
+/// to converting the Parquet schema column-wise
 pub fn parquet_to_arrow_schema(
     parquet_schema: &SchemaDescriptor,
-    metadata: &Option<Vec<KeyValue>>,
+    key_value_metadata: &Option<Vec<KeyValue>>,
 ) -> Result<Schema> {
-    parquet_to_arrow_schema_by_columns(
-        parquet_schema,
-        0..parquet_schema.columns().len(),
-        metadata,
-    )
+    let mut metadata = parse_key_value_metadata(key_value_metadata).unwrap_or_default();
+    let arrow_schema_metadata = metadata
+        .remove(super::ARROW_SCHEMA_META_KEY)
+        .map(|encoded| get_arrow_schema_from_metadata(&encoded));
+
+    match arrow_schema_metadata {
+        Some(Some(schema)) => Ok(schema),
+        _ => parquet_to_arrow_schema_by_columns(
+            parquet_schema,
+            0..parquet_schema.columns().len(),
+            key_value_metadata,
+        ),
+    }
 }
 
 /// Convert parquet schema to arrow schema including optional metadata, only preserving some leaf columns.

Review comment:
       @sunchao @liurenjie1024 I think I'll need help with correctly filtering an Arrow schema if the user supplies column indices. Not sure if there's a 1:1 mapping between Arrow's fields and Parquet leaves.
   
   We can either do it as part of this PR or as a follow-up
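    
    For what it's worth, a rough sketch of the projection (hypothetical: it assumes top-level Arrow fields map 1:1 to Parquet column roots, which nested columns would break):
    
    ```
    use arrow::datatypes::Schema;
    
    // Hypothetical sketch: keep only the Arrow fields whose root indices
    // were requested. Nested types break the 1:1 assumption, which is the
    // open question above.
    fn project_schema(schema: &Schema, root_indices: &[usize]) -> Schema {
        let fields = root_indices
            .iter()
            .map(|&i| schema.field(i).clone())
            .collect::<Vec<_>>();
        Schema::new(fields)
    }
    ```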




[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467491490



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       @vertexclique `arrow` and `base64` would have to go together; should we create a feature that refers to both crates?




[GitHub] [arrow] sunchao commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470327828



##########
File path: rust/parquet/src/file/properties.rs
##########
@@ -99,7 +99,7 @@ pub struct WriterProperties {
     max_row_group_size: usize,
     writer_version: WriterVersion,
     created_by: String,
-    key_value_metadata: Option<Vec<KeyValue>>,
+    pub(crate) key_value_metadata: Option<Vec<KeyValue>>,

Review comment:
       +1. I think this is fine.

##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
       Sounds fine. What will the error message be like in this case?

##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -55,40 +55,75 @@ pub fn parquet_to_arrow_schema_by_columns<T>(
 where
     T: IntoIterator<Item = usize>,
 {
-    let mut base_nodes = Vec::new();
-    let mut base_nodes_set = HashSet::new();
-    let mut leaves = HashSet::new();
-
-    for c in column_indices {
-        let column = parquet_schema.column(c).self_type() as *const Type;
-        let root = parquet_schema.get_column_root(c);
-        let root_raw_ptr = root as *const Type;
-
-        leaves.insert(column);
-        if !base_nodes_set.contains(&root_raw_ptr) {
-            base_nodes.push(root);
-            base_nodes_set.insert(root_raw_ptr);
+    let mut metadata = parse_key_value_metadata(key_value_metadata).unwrap_or_default();

Review comment:
       👍 




[GitHub] [arrow] nevi-me commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-674176859


   > ... I think this is stacked on another PR though: do we need to get the other one reviewed first?
   
   @sunchao it's a bit messy, but it's one PR.
   
   The changes are below:
   
   * arrow_reader.rs - added a roundtrip test (https://github.com/apache/arrow/pull/7917/files?file-filters%5B%5D=.toml#diff-4c103051156d7b901fad8d9e26104932R310) to avoid importing the reader and writer into the `schema.rs` file. I opted to put the test in reader because it results in the fewest imports (only need the `ArrowWriter`).
   * arrow_writer.rs - implements the schema serialisation
   * schema.rs - implements schema deserialisation
   * properties.rs - change to allow mutating metadata
   ____
   
   Let me refactor the above quickly, as I now see that I can move both serialisation and deserialisation to `schema.rs`
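    
    For context, the roundtrip test is along these lines (a sketch, not the actual test; module paths and the `get_temp_file` helper are taken from the crate's test utilities and may differ slightly):
    
    ```
    use std::rc::Rc;
    
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
    use parquet::file::reader::SerializedFileReader;
    use parquet::util::test_common::get_temp_file;
    
    // Sketch of the roundtrip under test: write a batch with ArrowWriter,
    // read the file back, and check the decoded schema matches the original.
    fn roundtrip_schema(batch: RecordBatch) {
        let file = get_temp_file("test_roundtrip.parquet", &[]);
        let mut writer =
            ArrowWriter::try_new(file.try_clone().unwrap(), batch.schema(), None)
                .unwrap();
        writer.write(&batch).unwrap();
        writer.close().unwrap();
    
        let reader = SerializedFileReader::new(file).unwrap();
        let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(reader));
        assert_eq!(arrow_reader.get_schema().unwrap(), *batch.schema());
    }
    ```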


[GitHub] [arrow] nevi-me commented on pull request #7917: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-675595547


   merged in the parquet branch


[GitHub] [arrow] sunchao commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r471927282



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(

Review comment:
       Not sure if this should be public.
   
   It might be useful to split this into util methods:
   - encode a `Schema`
    - replace the value for a given key in a vector.
   
   But perhaps a better way is to replace `Vec<KeyValue>` with `HashMap` in `WriterProperties`, but that is another story.

##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(
+    schema: &Schema,
+    props: &mut WriterProperties,
+) {
+    let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&schema);
+    let schema_len = serialized_schema.len();
+    let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+    len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);

Review comment:
       Curious why we need to write the 4-byte payload here? Is it for C++ compatibility?

##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(
+    schema: &Schema,
+    props: &mut WriterProperties,
+) {
+    let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&schema);
+    let schema_len = serialized_schema.len();
+    let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+    len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);
+    len_prefix_schema.append((schema_len as u32).to_le_bytes().to_vec().as_mut());
+    len_prefix_schema.append(&mut serialized_schema);
+    let encoded = base64::encode(&len_prefix_schema);
+
+    let schema_kv = KeyValue {
+        key: super::ARROW_SCHEMA_META_KEY.to_string(),
+        value: Some(encoded),
+    };
+
+    let mut meta = props.key_value_metadata.clone().unwrap_or_default();
+    // check if ARROW:schema exists, and overwrite it
+    let schema_meta = meta
+        .iter()
+        .enumerate()
+        .find(|(_, kv)| kv.key.as_str() == super::ARROW_SCHEMA_META_KEY);

Review comment:
       this only finds/replaces the first occurrence, but I guess the chance is pretty low of having multiple `KeyValue`s with `ARROW:schema` in the vector...
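    
    A replace-all variant is cheap, for what it's worth. A sketch with a simplified stand-in for the `parquet_format` `KeyValue`:
    
    ```
    // Simplified stand-in for parquet_format's KeyValue.
    struct KeyValue {
        key: String,
        value: Option<String>,
    }
    
    // Drop every existing entry under the key, then append the new one,
    // so duplicates can never survive the update.
    fn upsert(meta: &mut Vec<KeyValue>, kv: KeyValue) {
        meta.retain(|existing| existing.key != kv.key);
        meta.push(kv);
    }
    ```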




[GitHub] [arrow] sunchao commented on a change in pull request #7917: ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r472294646



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(

Review comment:
       np - my comments are mostly cosmetic - it will be good to change the visibility though.




[GitHub] [arrow] vertexclique commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
vertexclique commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r467500372



##########
File path: rust/parquet/Cargo.toml
##########
@@ -52,4 +53,4 @@ zstd = "0.5"
 arrow = { path = "../arrow", version = "2.0.0-SNAPSHOT" }
 
 [features]
-default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd"]
+default = ["arrow", "snap", "brotli", "flate2", "lz4", "zstd", "base64"]

Review comment:
    > I assume the compression types needed could depend on what the Parquet source being read uses. In that case, could a user run the risk of not being able to read a file because the relevant compression feature isn't enabled?
   
    Yes, that's true. To keep the default build dependencies lighter, yet another idea (FWIW, bikeshedding):
   
   ```
   default = ["core"]
   core = ["arrow", "base64"]
   
   comp_snap = ["snap"]
   comp_brotli = ["brotli"]
   comp_flate = ["flate2"]
   comp_lz4 = ["lz4"]
   comp_zstd = ["zstd"]
   ```
   
    and users enable their desired compression features. I just tested this; the parquet tests pass on master like that (without base64).




[GitHub] [arrow] maxburke commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
maxburke commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470117845



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -53,13 +56,48 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     pub fn try_new(
         writer: W,
         arrow_schema: SchemaRef,
-        props: Option<Rc<WriterProperties>>,
+        props: Option<WriterProperties>,

Review comment:
       :+1:




[GitHub] [arrow] sunchao commented on pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#issuecomment-675178642


   @nevi-me ah OK - I'll take a look soon!


[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r470099290



##########
File path: rust/parquet/src/arrow/arrow_writer.rs
##########
@@ -53,13 +56,48 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     pub fn try_new(
         writer: W,
         arrow_schema: SchemaRef,
-        props: Option<Rc<WriterProperties>>,
+        props: Option<WriterProperties>,

Review comment:
       @maxburke I've had to change this to take the `WriterProperties` directly so I can mutate it to add the Arrow serialized schema. This affects your branch.




[GitHub] [arrow] nevi-me commented on a change in pull request #7917: ARROW-8243: [Rust] [Parquet] Serialize Arrow schema metadata

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #7917:
URL: https://github.com/apache/arrow/pull/7917#discussion_r471964910



##########
File path: rust/parquet/src/arrow/schema.rs
##########
@@ -83,12 +90,77 @@ where
         .map(|fields| Schema::new_with_metadata(fields, metadata))
 }
 
+/// Try to convert Arrow schema metadata into a schema
+fn get_arrow_schema_from_metadata(encoded_meta: &str) -> Option<Schema> {
+    let decoded = base64::decode(encoded_meta);
+    match decoded {
+        Ok(bytes) => {
+            let slice = if bytes[0..4] == [255u8; 4] {
+                &bytes[8..]
+            } else {
+                bytes.as_slice()
+            };
+            let message = arrow::ipc::get_root_as_message(slice);
+            message
+                .header_as_schema()
+                .map(arrow::ipc::convert::fb_to_schema)
+        }
+        Err(err) => {
+            // The C++ implementation returns an error if the schema can't be parsed.
+            // To prevent this, we explicitly log this, then compute the schema without the metadata
+            eprintln!(
+                "Unable to decode the encoded schema stored in {}, {:?}",
+                super::ARROW_SCHEMA_META_KEY,
+                err
+            );
+            None
+        }
+    }
+}
+
+/// Mutates writer metadata by encoding the Arrow schema and storing it in the metadata.
+/// If there is an existing Arrow schema metadata, it is replaced.
+pub fn add_encoded_arrow_schema_to_metadata(
+    schema: &Schema,
+    props: &mut WriterProperties,
+) {
+    let mut serialized_schema = arrow::ipc::writer::schema_to_bytes(&schema);
+    let schema_len = serialized_schema.len();
+    let mut len_prefix_schema = Vec::with_capacity(schema_len + 8);
+    len_prefix_schema.append(&mut vec![255u8, 255, 255, 255]);

Review comment:
       The metadata length prefix has been part of the IPC format since Arrow 0.15, but Rust fell behind on the implementation. Without it, pyarrow fails to read the file.
   
   I opened a JIRA to address the IPC issues, so I'll change this when I get to the JIRA (hoping before the next release)
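    
    For reference, the read side has to undo the same framing; a sketch mirroring the slicing in `get_arrow_schema_from_metadata` above (with an added length guard):
    
    ```
    // Strip the legacy framing if present: a 4-byte 0xFFFFFFFF continuation
    // marker plus a 4-byte little-endian length precede the flatbuffer bytes.
    fn strip_ipc_framing(bytes: &[u8]) -> &[u8] {
        if bytes.len() > 8 && bytes[0..4] == [0xFF; 4] {
            &bytes[8..]
        } else {
            bytes
        }
    }
    ```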



