You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/11/14 15:59:29 UTC

[PR] Minor: simplify DataSource statistics [arrow-datafusion]

alamb opened a new pull request, #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172

   ## Which issue does this PR close?
   
   This is part of https://github.com/apache/arrow-datafusion/issues/8078
   
   ## Rationale for this change
   
   I am trying to improve the handling of statistics by changing the internal representation, so I would like less to change. 
   
   My longer term goal is to keep all the calculation and comparison code for `Statistics` within the `Statistics` class itself, which will make testing and changing implementation details
   
   
   ## What changes are included in this PR?
   
   1. remove some direct manipulation of `Precision` and use pre-existing methods `min`, `max` and `add`
   
   ## Are these changes tested?
   Covered by existing tests
   
   ## Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1814910151

   > Improve statistics test coverage #8228
   
   Yes, thank you -- I plan to improve the situation via https://github.com/apache/arrow-datafusion/issues/8228


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#discussion_r1392836240


##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -211,49 +199,3 @@ pub(crate) fn get_col_stats(
         })
         .collect()
 }
-
-/// If the given value is numerically greater than the original maximum value,
-/// return the new maximum value with appropriate exactness information.
-fn set_max_if_greater(
-    max_nominee: Precision<ScalarValue>,
-    max_values: Precision<ScalarValue>,
-) -> Precision<ScalarValue> {
-    match (&max_values, &max_nominee) {
-        (Precision::Exact(val1), Precision::Exact(val2)) if val1 < val2 => max_nominee,
-        (Precision::Exact(val1), Precision::Inexact(val2))
-        | (Precision::Inexact(val1), Precision::Inexact(val2))
-        | (Precision::Inexact(val1), Precision::Exact(val2))
-            if val1 < val2 =>
-        {
-            max_nominee.to_inexact()
-        }
-        (Precision::Exact(_), Precision::Absent) => max_values.to_inexact(),
-        (Precision::Absent, Precision::Exact(_)) => max_nominee.to_inexact(),
-        (Precision::Absent, Precision::Inexact(_)) => max_nominee,
-        (Precision::Absent, Precision::Absent) => Precision::Absent,
-        _ => max_values,
-    }
-}
-
-/// If the given value is numerically lesser than the original minimum value,

Review Comment:
   This is the same as https://github.com/apache/arrow-datafusion/blob/eef654c3b0c22b1f845b1441320b8bb718ddd605/datafusion/common/src/stats.rs#L92-L103



##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -211,49 +199,3 @@ pub(crate) fn get_col_stats(
         })
         .collect()
 }
-
-/// If the given value is numerically greater than the original maximum value,
-/// return the new maximum value with appropriate exactness information.
-fn set_max_if_greater(

Review Comment:
   This is the same as https://github.com/apache/arrow-datafusion/blob/eef654c3b0c22b1f845b1441320b8bb718ddd605/datafusion/common/src/stats.rs#L75-L87



##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn add_row_stats(

Review Comment:
   This is the same as https://github.com/apache/arrow-datafusion/blob/eef654c3b0c22b1f845b1441320b8bb718ddd605/datafusion/common/src/stats.rs#L119-L126



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#discussion_r1393733385


##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -211,49 +199,3 @@ pub(crate) fn get_col_stats(
         })
         .collect()
 }
-
-/// If the given value is numerically greater than the original maximum value,
-/// return the new maximum value with appropriate exactness information.
-fn set_max_if_greater(
-    max_nominee: Precision<ScalarValue>,
-    max_values: Precision<ScalarValue>,
-) -> Precision<ScalarValue> {
-    match (&max_values, &max_nominee) {
-        (Precision::Exact(val1), Precision::Exact(val2)) if val1 < val2 => max_nominee,
-        (Precision::Exact(val1), Precision::Inexact(val2))
-        | (Precision::Inexact(val1), Precision::Inexact(val2))
-        | (Precision::Inexact(val1), Precision::Exact(val2))
-            if val1 < val2 =>
-        {
-            max_nominee.to_inexact()
-        }
-        (Precision::Exact(_), Precision::Absent) => max_values.to_inexact(),
-        (Precision::Absent, Precision::Exact(_)) => max_nominee.to_inexact(),
-        (Precision::Absent, Precision::Inexact(_)) => max_nominee,
-        (Precision::Absent, Precision::Absent) => Precision::Absent,
-        _ => max_values,
-    }
-}
-
-/// If the given value is numerically lesser than the original minimum value,

Review Comment:
   Same here



##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn add_row_stats(

Review Comment:
   I think they don't share the same behavior, but it didn't show up due to lack of testing. The `add()` function operates in the safest way; if one of the operands is absent, the result will also be absent. On the other hand, the remove() function keeps the non-absent value by changing its exactness (absent + exact(value) => inexact(value)).



##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -211,49 +199,3 @@ pub(crate) fn get_col_stats(
         })
         .collect()
 }
-
-/// If the given value is numerically greater than the original maximum value,
-/// return the new maximum value with appropriate exactness information.
-fn set_max_if_greater(

Review Comment:
   There is a similar difference here as well. Max values are conserved by relaxing the exactness when an absent statistic is read from the file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#discussion_r1394872809


##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn add_row_stats(

Review Comment:
   It is interesting that there is very similar code in https://github.com/apache/arrow-datafusion/blob/a892300a5a56c97b5b4ddc9aa4a421aaf412d0fe/datafusion/core/src/datasource/file_format/parquet.rs#L503-L525 that uses `add` 🤔 



##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn add_row_stats(

Review Comment:
   It is interesting that there is very similar code in https://github.com/apache/arrow-datafusion/blob/a892300a5a56c97b5b4ddc9aa4a421aaf412d0fe/datafusion/core/src/datasource/file_format/parquet.rs#L503-L525 that uses `add` 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1810610702

   cc @berkaysynnada  and @ozankabak 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1813887176

   > How embarassing -- I even noted the lack of unit tests in this file on https://github.com/apache/arrow-datafusion/pull/7793/files#r1358805175 🤦
   
   Sadly, while I was working on the changes, I noticed that in many parts where we use statistics, we don't test them enough. You have done really well starting the Epic. It definitely requires quite a bit of work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1814911008

   I'll keep working on statistics under the aegis of a different issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#discussion_r1395223341


##########
datafusion/core/src/datasource/statistics.rs:
##########
@@ -160,17 +159,6 @@ pub(crate) fn create_max_min_accs(
     (max_values, min_values)
 }
 
-fn add_row_stats(

Review Comment:
   It seems like it is always adding exact values, so there should be no difference on the results whether it is used `add()` or `estimated_add()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed pull request #8172: Minor: simplify DataSource statistics code
URL: https://github.com/apache/arrow-datafusion/pull/8172


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1812869506

   > What are your thoughts on modifying the functionalities of the precision methods instead? I mean these precision methods' logic could be replaced by the functions you removed.
   
   This is an excellent catch @berkaysynnada  -- thank you. My thoughts are that I would like to initially not change any behavior unless it is clearly a bug. So in this case, I will:
   
   1.  add tests to the datasource statistics calculations so there is coverage for this behavior
   2. move this code into `Precision` (as a method like `estimated_add` or something) 
   
   The reason to move the code into `Precision` is to consolidate how this structure is being used (otherwise we can end up with surprises like this)
   
   > @berkaysynnada, good catch! Let's consider changing the add/max/min methods to create Inexact types when they encounter an Absent. @alamb, can you think of a case where we would need to propagate Absent if any operand is Absent?
   
   I can not. One test would be to try to change `Precision::add` and see if any tests break, though given our lack of coverage maybe this doesn't tell us as much as we would like. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Minor: simplify DataSource statistics code [arrow-datafusion]

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on PR #8172:
URL: https://github.com/apache/arrow-datafusion/pull/8172#issuecomment-1812118566

   @berkaysynnada, good catch! Let's consider changing the `add`/`max`/`min` methods to create `Inexact` types when they encounter an `Absent`. @alamb, can you think of a case where we would need to propagate `Absent` if any operand is `Absent`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org