Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/09/19 22:45:04 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #4840: Physical null and logical null are confusing concepts

alamb opened a new issue, #4840:
URL: https://github.com/apache/arrow-rs/issues/4840

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   The arrow-rs library (now) makes a distinction between physical nulls and logical nulls, as the same distinction is made in the Arrow specification (though the terms physical and logical nulls are not used there, to my knowledge).
   
   The issue is that for certain array types computing whether an element is null is very fast (consult a pre-existing bitmap), but for others it can be quite slow (e.g. a dictionary where both the keys and values must be consulted for nullness).
   
   The method named `Array::is_null` returns the (fast) physical nullness, but is deeply confusing for certain types -- see https://github.com/apache/arrow-rs/issues/4835 and https://github.com/apache/arrow-rs/pull/4838#discussion_r1330002357 from @crepererum.
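   
   As a rough illustration of the distinction, here is a minimal sketch using a toy dictionary type. `ToyDictionary`, `is_physical_null` and `is_logical_null` are hypothetical names invented for this example and are not the arrow-rs API; the sketch only assumes that a dictionary element can be null either through its key or through the value the key points at.
   
   ```rust
   // Toy model only: these types are hypothetical and are not the arrow-rs API.
   struct ToyDictionary {
       /// One entry per element; `None` models a cleared bit in the keys' validity bitmap.
       keys: Vec<Option<usize>>,
       /// Dictionary values; a value slot can itself be null.
       values: Vec<Option<&'static str>>,
   }
   
   impl ToyDictionary {
       /// Physical null: consults only the keys' validity (cheap).
       fn is_physical_null(&self, i: usize) -> bool {
           self.keys[i].is_none()
       }
   
       /// Logical null: also follows a valid key into the values (more work).
       fn is_logical_null(&self, i: usize) -> bool {
           match self.keys[i] {
               None => true,
               Some(k) => self.values[k].is_none(),
           }
       }
   }
   
   fn main() {
       let dict = ToyDictionary {
           keys: vec![Some(0), None, Some(1)],
           values: vec![Some("a"), None],
       };
       // Element 1 has a null key, so both checks agree.
       assert!(dict.is_physical_null(1));
       assert!(dict.is_logical_null(1));
       // Element 2 has a valid key that points at a null value:
       // not a physical null, but a logical one.
       assert!(!dict.is_physical_null(2));
       assert!(dict.is_logical_null(2));
   }
   ```
   
   In this toy model the two checks disagree only for element 2, which is exactly the case where relying on the fast check by default can surprise a caller.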
   
   We have tried to clarify the difference in https://github.com/apache/arrow-rs/pull/4838 but it is still confusing
   
   **Describe the solution you'd like**
   I am not sure -- @crepererum suggests in https://github.com/apache/arrow-rs/pull/4838#discussion_r1330002357
   
   > I would argue that at least this method should be called is_physical_null to force users to think about what kind of null they want, instead of tricking them into using the wrong implicit default for their use case.
   
   However, there are downsides to this too
   
   **Describe alternatives you've considered**
   The documentation changes may be enough, but I think the issue is important enough to track here
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Physical null and logical null are confusing concepts [arrow-rs]

Posted by "waynexia (via GitHub)" <gi...@apache.org>.
waynexia commented on issue #4840:
URL: https://github.com/apache/arrow-rs/issues/4840#issuecomment-1906616562

   I'd like to provide some thoughts from a user perspective. When processing data where null is very common, it's natural to look for a way to reduce the resources consumed by those null values. Since it's known that some part of the data is missing, we can of course make optimizations based on that.
   
   But this seems to be difficult at present. I can't use `NullArray`, not only because of the type problem and the "logical or physical" problem, but also because of the parquet side, which requires the array to have the same type as the schema. And in this scenario a `Null` type never occurs -- some other parts will have data. Here some data are "logical null", but I can't say whether they are "physical null" (or should I even consider it?).
   
   (BTW, if I want to write this part of the data to parquet, or pass/compute it under a given schema, I can only build a corresponding array and fill in `None` one by one. This is costly compared to how a `NullArray` works.)
   
   Based on whether the type is null and whether the value is null, we get four (!!) types of null. When the type is null, a test function like `is_null()` gives `true` when the value is present and is null (a), and gives `false` when the value is missing (b). And when the type is something else, a null value of course `is_null()` (c) and a non-null value is not `is_null()` (d). Please correct me if this is not correct.
   
   Listing them out, some questions come to mind:
   - Is it really necessary to distinguish cases (a) and (b)? I have to use a new word, "present", to describe the difference.
   - Comparing cases (a) and (c), does it mean we have a fifth type of null, where the type is not null but the value "isn't present"?
   - A null value should be a wildcard value, as it can fit into other types (case c). This is done by letting `None` be a valid value for an array.
   - We should have two kinds of null array: one for (a) and (b), where the type is null, and another one for (c), where the array is a compound array.
   
   Physical and logical null are truly confusing. But is there any way to make them intuitive and easy to use :thinking:




Re: [I] Physical null and logical null are confusing concepts [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4840:
URL: https://github.com/apache/arrow-rs/issues/4840#issuecomment-1908150493

   > Physical and logical null are truly confusing. 
   
   I agree @waynexia 
   
   I don't quite follow your examples with 4 different types of null. 
   
   If you need to quickly create an array of a single type that is entirely null, there is [`new_null_array`](https://docs.rs/arrow/latest/arrow/array/fn.new_null_array.html), but as you seem to imply, that certainly results in a larger array than strictly necessary.
   
   It is fast to check whether an array contains only nulls by using [`Array::nulls`](https://docs.rs/arrow/latest/arrow/array/trait.Array.html#tymethod.nulls), I think like:
   
   ```rust
   // Count the nulls recorded in the array's null buffer (0 if it has none)
   let num_nulls = array.nulls().map(|nulls| nulls.null_count()).unwrap_or(0);
   let is_all_null = num_nulls == array.len();
   ```
   
   Though as the docs say, this isn't correct for Dictionary or REE arrays
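   
   To sketch why a null-buffer-based count falls short for run-end encoded data, here is a toy model. `ToyRunArray` and its methods are hypothetical names made up for this example and are not the arrow-rs API; the sketch assumes, as in the Arrow run-end encoded layout, that the array has no validity bitmap of its own and that nulls live in the per-run values.
   
   ```rust
   // Toy model only: hypothetical types, not the arrow-rs API.
   struct ToyRunArray {
       /// Exclusive end index of each run, strictly increasing.
       run_ends: Vec<usize>,
       /// One value per run; `None` marks a run of nulls.
       values: Vec<Option<i32>>,
   }
   
   impl ToyRunArray {
       /// Logical length: the end of the last run (0 if there are no runs).
       fn len(&self) -> usize {
           self.run_ends.last().copied().unwrap_or(0)
       }
   
       /// There is no top-level validity bitmap, so this count is always 0.
       fn physical_null_count(&self) -> usize {
           0
       }
   
       /// Sums the lengths of the runs whose value is null.
       fn logical_null_count(&self) -> usize {
           let mut start = 0;
           let mut count = 0;
           for (end, value) in self.run_ends.iter().zip(&self.values) {
               if value.is_none() {
                   count += end - start;
               }
               start = *end;
           }
           count
       }
   }
   
   fn main() {
       // Five elements, all covered by a single run of nulls.
       let array = ToyRunArray { run_ends: vec![5], values: vec![None] };
       // A check based on the physical count would report "not all null" ...
       assert_eq!(array.physical_null_count(), 0);
       // ... even though every element is logically null.
       assert_eq!(array.logical_null_count(), array.len());
   }
   ```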
   
   
   




Re: [I] Physical null and logical null are confusing concepts [arrow-rs]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #4840:
URL: https://github.com/apache/arrow-rs/issues/4840#issuecomment-1964673511

   > This might be a little off-topic. I'd like to propose a new array similar to `RunArray` but only accepts one value in construction, like `SingularArray`. The difference to `RunArray` is logic as `value()` can be simplified, and can have different behavior on null-related APIs (e.g., override the default impl of `is_valid()`/`is_null()`). And it can express both `(a)` & `(b)` (though I think it's not necessary to distinguish these two types...).
   
   This sounds very much like what datafusion `ScalarValue` and [`Datum`](https://docs.rs/arrow/latest/arrow/array/trait.Datum.html) are designed to do 🤔  -- I bet there would be interest on the arrow mailing list as well (there may have been prior discussion about it too)
   
   > I don't know if it's still an option to not have "logical null" and "physical null". Maybe overriding `is_valid()` and `is_null()` can have a slight help toward it? Adding a new array means lots of work, I'm not sure if this is viable, please let me know your thought.
   
   I think the `logical` and `physical` nulls refer to how the nulls are encoded in the Arrow arrays themselves. Given that this library is designed as a low level API for Arrow arrays, I believe the rationale is that exposing the null buffers directly as arrow encodes them provides the most control
   
   I made https://github.com/apache/arrow-rs/pull/5434 to try and clarify this even more

