You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/29 22:35:11 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #2985: Alamb/ignore metadata fields too

alamb opened a new pull request, #2985:
URL: https://github.com/apache/arrow-datafusion/pull/2985

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/2982
   
   
    # Rationale for this change
   See https://github.com/apache/arrow-datafusion/issues/2982, basically mismatched metadata can cause parquet files to not be read
   
   # What changes are included in this PR?
   1. Add a option to skip metadata when reading from parquet (defaults to true)
   2. Tests for same
   
   # Are there any user-facing changes?
   By default metadata will be skipped when reading parquet files
   
   # TODOs
   - [ ] Fix tests
   - [ ] File follow on ticket / PR in arrow-rs for better error message of metadata mismatch
   - [ ] File follow on ticket / PR in arrow-rs for breaking up `Schema` into its parts
   - [ ] File follow on ticket / PR in datafusion bout the akwardness / duplication of parquet options (parquet format / parquet option


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2985: Optionally skip metadata from schema when merging parquet files

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #2985:
URL: https://github.com/apache/arrow-datafusion/pull/2985#discussion_r933781120


##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -173,58 +173,6 @@ async fn parquet_list_columns() {
     assert_eq!(result.value(3), "xyz");
 }
 
-#[tokio::test]
-async fn schema_merge_ignores_metadata() {
-    // Create two parquet files in same table with same schema but different metadata
-    let tmp_dir = TempDir::new().unwrap();
-    let table_dir = tmp_dir.path().join("parquet_test");
-    let table_path = Path::new(&table_dir);
-
-    let mut non_empty_metadata: HashMap<String, String> = HashMap::new();
-    non_empty_metadata.insert("testing".to_string(), "metadata".to_string());
-
-    let fields = vec![
-        Field::new("id", DataType::Int32, true),
-        Field::new("name", DataType::Utf8, true),
-    ];
-    let schemas = vec![
-        Arc::new(Schema::new_with_metadata(
-            fields.clone(),
-            non_empty_metadata.clone(),
-        )),
-        Arc::new(Schema::new(fields.clone())),
-    ];
-
-    if let Ok(()) = fs::create_dir(table_path) {
-        for (i, schema) in schemas.iter().enumerate().take(2) {
-            let filename = format!("part-{}.parquet", i);
-            let path = table_path.join(&filename);
-            let file = fs::File::create(path).unwrap();
-            let mut writer = ArrowWriter::try_new(file, schema.clone(), None).unwrap();
-
-            // create mock record batch
-            let ids = Arc::new(Int32Array::from_slice(&[i as i32]));
-            let names = Arc::new(StringArray::from_slice(&["test"]));
-            let rec_batch =
-                RecordBatch::try_new(schema.clone(), vec![ids, names]).unwrap();
-
-            writer.write(&rec_batch).unwrap();
-            writer.close().unwrap();
-        }
-    }
-
-    // Read the parquet files into a dataframe to confirm results
-    // (no errors)
-    let ctx = SessionContext::new();
-    let df = ctx
-        .read_parquet(table_dir.to_str().unwrap(), ParquetReadOptions::default())
-        .await
-        .unwrap();
-    let result = df.collect().await.unwrap();
-
-    assert_eq!(result[0].schema().metadata(), result[1].schema().metadata());

Review Comment:
   Note that this does not validate the contents of the metadata, just that it is the same. Turns out the metadata is actually empty.... 



##########
datafusion/core/src/physical_plan/file_format/mod.rs:
##########
@@ -130,7 +130,9 @@ impl FileScanConfig {
             column_statistics: Some(table_cols_stats),
         };
 
-        let table_schema = Arc::new(Schema::new(table_fields));
+        let table_schema = Arc::new(
+            Schema::new(table_fields).with_metadata(self.file_schema.metadata().clone()),

Review Comment:
   Here is the change that is required to preserve the metadata in the merged schema



##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -173,58 +173,6 @@ async fn parquet_list_columns() {
     assert_eq!(result.value(3), "xyz");
 }
 
-#[tokio::test]
-async fn schema_merge_ignores_metadata() {
-    // Create two parquet files in same table with same schema but different metadata
-    let tmp_dir = TempDir::new().unwrap();
-    let table_dir = tmp_dir.path().join("parquet_test");
-    let table_path = Path::new(&table_dir);
-
-    let mut non_empty_metadata: HashMap<String, String> = HashMap::new();
-    non_empty_metadata.insert("testing".to_string(), "metadata".to_string());
-
-    let fields = vec![
-        Field::new("id", DataType::Int32, true),
-        Field::new("name", DataType::Utf8, true),
-    ];
-    let schemas = vec![
-        Arc::new(Schema::new_with_metadata(
-            fields.clone(),
-            non_empty_metadata.clone(),

Review Comment:
   The schemas in this test have different (but compatible) metadata so merging works.



##########
datafusion/core/src/execution/options.rs:
##########
@@ -142,17 +142,23 @@ pub struct ParquetReadOptions<'a> {
     pub file_extension: &'a str,
     /// Partition Columns
     pub table_partition_cols: Vec<String>,
-    /// Should DataFusion parquet reader using the predicate to prune data,
+    /// Should DataFusion parquet reader use the predicate to prune data,
     /// overridden by value on execution::context::SessionConfig
     pub parquet_pruning: bool,
+    /// Tell the parquet reader to ignore any Metadata that may be in

Review Comment:
   The duplication of options on `ParquetReadOptions` and `ParquetFormat` is non ideal, and I hope to fix it, but for this PR I just followed the existing pattern



##########
datafusion/core/tests/sql/parquet_schema.rs:
##########
@@ -0,0 +1,219 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Tests for parquet schema handling
+use std::{
+    collections::{BTreeMap, HashMap},
+    fs,
+    path::Path,
+};
+
+use ::parquet::arrow::ArrowWriter;
+use tempfile::TempDir;
+
+use super::*;
+
+#[tokio::test]
+async fn schema_merge_ignores_metadata_by_default() {
+    // Create several parquet files in same directoty / table with
+    // same schema but different metadata
+    let tmp_dir = TempDir::new().unwrap();
+    let table_dir = tmp_dir.path().join("parquet_test");
+
+    let options = ParquetReadOptions::default();
+
+    let f1 = Field::new("id", DataType::Int32, true);
+    let f2 = Field::new("name", DataType::Utf8, true);
+
+    let schemas = vec![
+        // schema level metadata
+        Schema::new(vec![f1.clone(), f2.clone()]).with_metadata(make_meta("foo", "bar")),

Review Comment:
   Lots of different metadata



##########
datafusion/core/tests/sql/parquet_schema.rs:
##########
@@ -0,0 +1,219 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Tests for parquet schema handling
+use std::{
+    collections::{BTreeMap, HashMap},
+    fs,
+    path::Path,
+};
+
+use ::parquet::arrow::ArrowWriter;
+use tempfile::TempDir;
+
+use super::*;
+
+#[tokio::test]
+async fn schema_merge_ignores_metadata_by_default() {
+    // Create several parquet files in same directoty / table with
+    // same schema but different metadata
+    let tmp_dir = TempDir::new().unwrap();
+    let table_dir = tmp_dir.path().join("parquet_test");
+
+    let options = ParquetReadOptions::default();
+
+    let f1 = Field::new("id", DataType::Int32, true);
+    let f2 = Field::new("name", DataType::Utf8, true);
+
+    let schemas = vec![
+        // schema level metadata
+        Schema::new(vec![f1.clone(), f2.clone()]).with_metadata(make_meta("foo", "bar")),
+        // schema different (incompatible) metadata
+        Schema::new(vec![f1.clone(), f2.clone()]).with_metadata(make_meta("foo", "baz")),
+        // schema with no meta
+        Schema::new(vec![f1.clone(), f2.clone()]),
+        // field level metadata
+        Schema::new(vec![
+            f1.clone().with_metadata(make_b_meta("blarg", "bar")),
+            f2.clone(),
+        ]),
+        // incompatible field level metadata
+        Schema::new(vec![
+            f1.clone().with_metadata(make_b_meta("blarg", "baz")),
+            f2.clone(),
+        ]),
+        // schema with no meta
+        Schema::new(vec![f1, f2]),
+    ];
+    write_files(table_dir.as_path(), schemas);
+
+    // can be any order
+    let expected = vec![
+        "+----+------+",
+        "| id | name |",
+        "+----+------+",
+        "| 1  | test |",
+        "| 2  | test |",
+        "| 3  | test |",
+        "| 0  | test |",
+        "| 5  | test |",
+        "| 4  | test |",
+        "+----+------+",
+    ];
+
+    // Read the parquet files into a dataframe to confirm results
+    // (no errors)
+    let table_path = table_dir.to_str().unwrap().to_string();
+
+    let ctx = SessionContext::new();
+    let df = ctx
+        .read_parquet(&table_path, options.clone())
+        .await
+        .unwrap();
+    let actual = df.collect().await.unwrap();
+
+    assert_batches_sorted_eq!(expected, &actual);
+    assert_no_metadata(&actual);
+
+    // also validate it works via SQL interface as well
+    ctx.register_parquet("t", &table_path, options)
+        .await
+        .unwrap();
+
+    let actual = execute_to_batches(&ctx, "SELECT * from t").await;
+    assert_batches_sorted_eq!(expected, &actual);
+    assert_no_metadata(&actual);
+}
+
+#[tokio::test]
+async fn schema_merge_can_preserve_metadata() {
+    // Create several parquet files in same directoty / table with
+    // same schema but different metadata
+    let tmp_dir = TempDir::new().unwrap();
+    let table_dir = tmp_dir.path().join("parquet_test");
+
+    // explicitly disable schema clearing
+    let options = ParquetReadOptions::default().skip_metadata(false);
+
+    let f1 = Field::new("id", DataType::Int32, true);
+    let f2 = Field::new("name", DataType::Utf8, true);
+
+    let schemas = vec![
+        // schema level metadata
+        Schema::new(vec![f1.clone(), f2.clone()]).with_metadata(make_meta("foo", "bar")),
+        // schema different (compatible) metadata
+        Schema::new(vec![f1.clone(), f2.clone()]).with_metadata(make_meta("foo2", "baz")),
+        // schema with no meta
+        Schema::new(vec![f1.clone(), f2.clone()]),
+    ];
+    write_files(table_dir.as_path(), schemas);
+
+    // can be any order
+    let expected = vec![
+        "+----+------+",
+        "| id | name |",
+        "+----+------+",
+        "| 1  | test |",
+        "| 2  | test |",
+        "| 0  | test |",
+        "+----+------+",
+    ];
+
+    let mut expected_metadata = make_meta("foo", "bar");
+    expected_metadata.insert("foo2".into(), "baz".into());
+
+    // Read the parquet files into a dataframe to confirm results
+    // (no errors)
+    let table_path = table_dir.to_str().unwrap().to_string();
+
+    let ctx = SessionContext::new();
+    let df = ctx
+        .read_parquet(&table_path, options.clone())
+        .await
+        .unwrap();
+    let actual = df.collect().await.unwrap();
+
+    assert_batches_sorted_eq!(expected, &actual);
+    assert_metadata(&actual, &expected_metadata);

Review Comment:
   this test now validates that the merged metadata is as expected



##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -173,58 +173,6 @@ async fn parquet_list_columns() {
     assert_eq!(result.value(3), "xyz");
 }
 
-#[tokio::test]

Review Comment:
   I moved this to its own module `parquet_format` as it was getting somewhat complicated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] yjshen merged pull request #2985: Optionally skip metadata from schema when merging parquet files

Posted by GitBox <gi...@apache.org>.

yjshen merged PR #2985:
URL: https://github.com/apache/arrow-datafusion/pull/2985


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ursabot commented on pull request #2985: Optionally skip metadata from schema when merging parquet files

Posted by GitBox <gi...@apache.org>.

ursabot commented on PR #2985:
URL: https://github.com/apache/arrow-datafusion/pull/2985#issuecomment-1200610833

   Benchmark runs are scheduled for baseline = c7fa789e85025a631ed634881e60c1ed71e8d269 and contender = 833f588c26bcb91a765c266b32f836a8cdc4785e. 833f588c26bcb91a765c266b32f836a8cdc4785e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4bad206b6dcf405b8b60b46f4829a699...6e30aba48f8a454cbd4fa723abbf7961/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/76d2ab1d2ef84f8f972f4cf1e61f5bbf...2afa889566d447278617bdaa1298f000/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/579fc966ed6b43348df0bc7887056433...5e2e146423d74fe2b7a8d151e17c16e3/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/2a838d07599b45878d0e78322e6bc092...68ef3220bfad4593b1c6a76f5e4f889e/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2985: Optionally skip metadata from schema when merging parquet files

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #2985:
URL: https://github.com/apache/arrow-datafusion/pull/2985#issuecomment-1200132058

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2985?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2985](https://codecov.io/gh/apache/arrow-datafusion/pull/2985?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (c04faaa) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/3d1de1557143efa7514e73839f5c54a6808d388c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (3d1de15) will **increase** coverage by `0.01%`.
   > The diff coverage is `99.06%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2985      +/-   ##
   ==========================================
   + Coverage   85.78%   85.79%   +0.01%     
   ==========================================
     Files         281      282       +1     
     Lines       51580    51652      +72     
   ==========================================
   + Hits        44246    44316      +70     
   - Misses       7334     7336       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2985?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/core/tests/sql/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9tb2QucnM=) | `98.26% <ø> (ø)` | |
   | [datafusion/core/tests/sql/parquet.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9wYXJxdWV0LnJz) | `100.00% <ø> (ø)` | |
   | [datafusion/core/tests/sql/parquet\_schema.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9wYXJxdWV0X3NjaGVtYS5ycw==) | `98.79% <98.79%> (ø)` | |
   | [...afusion/core/src/datasource/file\_format/parquet.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhc291cmNlL2ZpbGVfZm9ybWF0L3BhcnF1ZXQucnM=) | `86.26% <100.00%> (+0.37%)` | :arrow_up: |
   | [datafusion/core/src/execution/options.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9leGVjdXRpb24vb3B0aW9ucy5ycw==) | `64.00% <100.00%> (+4.57%)` | :arrow_up: |
   | [...tafusion/core/src/physical\_plan/file\_format/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL2ZpbGVfZm9ybWF0L21vZC5ycw==) | `97.36% <100.00%> (ø)` | |
   | [datafusion/core/src/physical\_plan/metrics/value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21ldHJpY3MvdmFsdWUucnM=) | `86.93% <0.00%> (-0.51%)` | :arrow_down: |
   | [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2985/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `77.43% <0.00%> (-0.18%)` | :arrow_down: |
   
   Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org