You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/23 19:50:24 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #5710: Fix parquet pruning when column names have periods

alamb opened a new pull request, #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/5708
   
   # Rationale for this change
   
   We were seeing wrong results with parquet file that had columns with `.` in their names
   
   # What changes are included in this PR?
   
   Bug fix + test
   
   # Are these changes tested?
   
   yes
   
   # Are there any user-facing changes?
   bug fix


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #5710: Fix parquet pruning when column names have periods

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#discussion_r1147806697


##########
datafusion/core/tests/parquet/mod.rs:
##########
@@ -454,6 +455,25 @@ fn make_date_batch(offset: Duration) -> RecordBatch {
     .unwrap()
 }
 
+/// returns a batch with two columns (note "service.name" is the name
+/// of the column. It is *not* a table named service.name
+///
+/// name | service.name
+fn make_names_batch(name: &str, service_name_values: Vec<&str>) -> RecordBatch {
+    let num_rows = service_name_values.len();
+    let name: StringArray = std::iter::repeat(Some(name)).take(num_rows).collect();
+    let service_name: StringArray = service_name_values.iter().map(Some).collect();
+
+    let schema = Schema::new(vec![
+        Field::new("name", name.data_type().clone(), true),
+        // note the column name has a period in it!
+        Field::new("service.name", service_name.data_type().clone(), true),

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] viirya commented on a diff in pull request #5710: Fix parquet pruning when column names have periods

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#discussion_r1147807988


##########
datafusion/core/tests/parquet/row_group_pruning.rs:
##########
@@ -483,3 +483,36 @@ async fn prune_decimal_in_list() {
     )
     .await;
 }
+
+#[tokio::test]
+async fn prune_periods_in_column_names() {
+    // There are three row groups for "service.name", each with 5 rows = 15 rows total
+    // name = "HTTP GET / DISPATCH", service.name = ['frontend', 'frontend'],
+    // name = "HTTP PUT / DISPATCH", service.name = ['backend',  'frontend'],
+    // name = "HTTP GET / DISPATCH", service.name = ['backend',  'backend' ],
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        // use double quotes to use column named "service.name"
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend'",
+        Some(0),
+        Some(1), // prune out last row group
+        7,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out first and last row group
+        5,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend' AND \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out  middle and last row group

Review Comment:
   super nit:
   ```suggestion
           Some(2), // prune out middle and last row group
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5710: Fix parquet pruning when column names have periods

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5710: Fix parquet pruning when column names have periods

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#discussion_r1147935693


##########
datafusion/core/tests/parquet/row_group_pruning.rs:
##########
@@ -483,3 +483,36 @@ async fn prune_decimal_in_list() {
     )
     .await;
 }
+
+#[tokio::test]
+async fn prune_periods_in_column_names() {
+    // There are three row groups for "service.name", each with 5 rows = 15 rows total
+    // name = "HTTP GET / DISPATCH", service.name = ['frontend', 'frontend'],
+    // name = "HTTP PUT / DISPATCH", service.name = ['backend',  'frontend'],
+    // name = "HTTP GET / DISPATCH", service.name = ['backend',  'backend' ],
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        // use double quotes to use column named "service.name"
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend'",
+        Some(0),
+        Some(1), // prune out last row group
+        7,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out first and last row group
+        5,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend' AND \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out  middle and last row group

Review Comment:
   Included in https://github.com/apache/arrow-datafusion/pull/5726



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5710: Fix parquet pruning when column names have periods

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#discussion_r1147810808


##########
datafusion/core/tests/parquet/row_group_pruning.rs:
##########
@@ -483,3 +483,36 @@ async fn prune_decimal_in_list() {
     )
     .await;
 }
+
+#[tokio::test]
+async fn prune_periods_in_column_names() {
+    // There are three row groups for "service.name", each with 5 rows = 15 rows total
+    // name = "HTTP GET / DISPATCH", service.name = ['frontend', 'frontend'],
+    // name = "HTTP PUT / DISPATCH", service.name = ['backend',  'frontend'],
+    // name = "HTTP GET / DISPATCH", service.name = ['backend',  'backend' ],
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        // use double quotes to use column named "service.name"
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend'",
+        Some(0),
+        Some(1), // prune out last row group
+        7,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out first and last row group
+        5,
+    )
+    .await;
+    test_prune(
+        Scenario::PeriodsInColumnNames,
+        "SELECT \"name\", \"service.name\" FROM t WHERE \"service.name\" = 'frontend' AND \"name\" != 'HTTP GET / DISPATCH'",
+        Some(0),
+        Some(2), // prune out  middle and last row group

Review Comment:
   sorry -- missed that -- will fix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5710: Fix parquet pruning when column names have periods

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#issuecomment-1485653068

   > I wonder if we should stop using strings for columns at all to prevent this type of confusion in the future. IIRC this isn't the first bug introduces by it.
   
   For what it is worth, postgres uses all indexes (numbers) as I recall and that has its own set of horrible bugs (when the output indexes get mixed up). 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5710: Fix parquet pruning when column names have periods

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5710:
URL: https://github.com/apache/arrow-datafusion/pull/5710#discussion_r1146755111


##########
datafusion/core/src/physical_optimizer/pruning.rs:
##########
@@ -384,7 +384,7 @@ fn build_statistics_record_batch<S: PruningStatistics>(
     let mut arrays = Vec::<ArrayRef>::new();
     // For each needed statistics column:
     for (column, statistics_type, stat_field) in required_columns.iter() {
-        let column = Column::from_qualified_name(column.name());
+        let column = Column::from_name(column.name());

Review Comment:
   This line is the bugfix
   
   The rest of the PR is a test



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org