You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/01 21:00:19 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2671: If statistics of column Max/Min value does not exists in parquet file, sent Min/Max to None

alamb commented on code in PR #2671:
URL: https://github.com/apache/arrow-datafusion/pull/2671#discussion_r887292893


##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -219,13 +219,76 @@ async fn schema_merge_ignores_metadata() {
     // (no errors)
     let ctx = SessionContext::new();
     let df = ctx
-        .read_parquet(
-            table_dir.to_str().unwrap().to_string(),
-            ParquetReadOptions::default(),
-        )
+        .read_parquet(table_dir.to_str().unwrap(), ParquetReadOptions::default())
         .await
         .unwrap();
     let result = df.collect().await.unwrap();
 
     assert_eq!(result[0].schema().metadata(), result[1].schema().metadata());
 }
+
+#[tokio::test]
+async fn parquet_query_with_max_min() {
+    let tmp_dir = TempDir::new().unwrap();
+    let table_dir = tmp_dir.path().join("parquet_test");
+    let table_path = Path::new(&table_dir);
+
+    let fields = vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+    ];
+
+    let schema = Arc::new(Schema::new(fields.clone()));
+
+    if let Ok(()) = fs::create_dir(table_path) {
+        let filename = "foo.parquet";
+        let path = table_path.join(&filename);
+        let file = fs::File::create(path).unwrap();
+        let mut writer =
+            ArrowWriter::try_new(file.try_clone().unwrap(), schema.clone(), None)
+                .unwrap();
+
+        // create mock record batch
+        let c1s = Arc::new(Int32Array::from_slice(&[1, 2, 3]));
+        let c2s = Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"]));

Review Comment:
   Since there is code for other data types (e.g. int64) can you please also add such columns to this test?



##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -219,13 +219,76 @@ async fn schema_merge_ignores_metadata() {
     // (no errors)
     let ctx = SessionContext::new();
     let df = ctx
-        .read_parquet(
-            table_dir.to_str().unwrap().to_string(),
-            ParquetReadOptions::default(),
-        )
+        .read_parquet(table_dir.to_str().unwrap(), ParquetReadOptions::default())
         .await
         .unwrap();
     let result = df.collect().await.unwrap();
 
     assert_eq!(result[0].schema().metadata(), result[1].schema().metadata());
 }
+
+#[tokio::test]
+async fn parquet_query_with_max_min() {

Review Comment:
   I verified that this test fails without the other changes in this PR 
   
   ```
   failures:
   
   ---- sql::parquet::parquet_query_with_max_min stdout ----
   thread 'sql::parquet::parquet_query_with_max_min' panicked at 'assertion failed: `(left == right)`
     left: `["+-------------+", "| MIN(foo.c2) |", "+-------------+", "| aaa         |", "+-------------+"]`,
    right: `["+-------------+", "| MIN(foo.c2) |", "+-------------+", "|             |", "+-------------+"]`: 
   
   expected:
   
   [
       "+-------------+",
       "| MIN(foo.c2) |",
       "+-------------+",
       "| aaa         |",
       "+-------------+",
   ]
   actual:
   
   [
       "+-------------+",
       "| MIN(foo.c2) |",
       "+-------------+",
       "|             |",
       "+-------------+",
   ]
   ```



##########
datafusion/core/tests/sql/parquet.rs:
##########
@@ -219,13 +219,76 @@ async fn schema_merge_ignores_metadata() {
     // (no errors)
     let ctx = SessionContext::new();
     let df = ctx
-        .read_parquet(
-            table_dir.to_str().unwrap().to_string(),
-            ParquetReadOptions::default(),
-        )
+        .read_parquet(table_dir.to_str().unwrap(), ParquetReadOptions::default())
         .await
         .unwrap();
     let result = df.collect().await.unwrap();
 
     assert_eq!(result[0].schema().metadata(), result[1].schema().metadata());
 }
+
+#[tokio::test]
+async fn parquet_query_with_max_min() {
+    let tmp_dir = TempDir::new().unwrap();
+    let table_dir = tmp_dir.path().join("parquet_test");
+    let table_path = Path::new(&table_dir);
+
+    let fields = vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+    ];
+
+    let schema = Arc::new(Schema::new(fields.clone()));
+
+    if let Ok(()) = fs::create_dir(table_path) {

Review Comment:
   I recommend failing fast here if we can't make the directory:
   
   ```suggestion
        fs::create_dir(table_path).unwrap()
   ```
   
   Thus if that fails then the test will immediately stop rather whatever happens to be in the directory



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org